This repository contains the data and the code for the paper "Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning".
This project evaluates the robustness of vision-language models (VLMs) to domain shift in object captioning tasks. It provides:
- Segmentation pipeline using SAM2 for automatic object detection
- Captioning pipeline supporting multiple VLMs (BLIP2, Gemma, LLaMA Vision, Qwen, LLaVA, Mistral, SmolVLM2)
- Evaluation metrics including CIDEr, BERTScore, ROUGE-L, GPTScore, and CLIPScore
- Domain shift analysis comparing performance on real vs. synthetic (3D) data

Requirements:

- Python 3.10
- CUDA-compatible GPU (12GB or 24GB VRAM recommended)
- Conda package manager

Installation:

- Create and activate the conda environment, then install the dependencies:

  ```bash
  conda create -y -n vlm python=3.10
  conda activate vlm
  pip install -r requirements.txt
  ```
- Authenticate with Hugging Face (create a token from your Hugging Face account if needed):

  ```bash
  huggingface-cli login
  ```
- Download NLTK resources:

  ```python
  import nltk
  nltk.download('wordnet')
  ```
- Install and configure Ollama:

  ```bash
  curl https://ollama.ai/install.sh | sh
  ollama serve
  ```

- Download Ollama models (specify your GPU's VRAM size: 12 or 24):

  ```bash
  bash download_from_ollama.sh 12   # For 12GB VRAM
  # OR
  bash download_from_ollama.sh 24   # For 24GB VRAM
  ```
- Download the SAM2 checkpoint:

  ```bash
  wget -P checkpoints/ https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt
  ```
Run the complete pipeline (segmentation + captioning with all VLMs):

```bash
bash run_all.sh 12   # For 12GB VRAM
# OR
bash run_all.sh 24   # For 24GB VRAM
```

During the first run, the models are downloaded from Hugging Face and stored in ~/.cache/huggingface.

For evaluation, run the Jupyter notebook:

```bash
jupyter notebook first_frame_eval.ipynb
```

Project structure:

```
DomainShiftVLM/
├── captioning/ # Caption generation modules
│ ├── captioner.py # Abstract base class for all captioners
│ ├── blip2_captioner.py # BLIP2 model implementation
│ ├── ollama_captioner.py # Ollama-based VLM wrapper (llama, qwen, gemma, llava, mistral)
│ ├── gemma3n_captioner.py # Gemma 3N model implementation
│ └── smolvlm2_captioner.py # SmolVLM2 model implementation
│
├── segmentation/ # Segmentation modules
│ └── segmentor.py # SAM2-based segmentation implementation
│
├── evaluation/ # Evaluation metrics
│ ├── benchmark.py # Main evaluation pipeline with NLP metrics (CIDEr, BERTScore, ROUGE, GPTScore)
│ ├── clipscore.py # CLIP-based evaluation
│ └── gptscore.py # GPT-based scoring
│
├── data/ # Dataset storage (3d/, real/)
├── output/ # Output storage for segmentation and captioning results
├── checkpoints/ # Model checkpoints (SAM2)
│
├── caption_data.py # Main script for running captioning pipeline
├── configs.json # Configuration file for prompts and settings
├── ollama_map.json # Mapping of VLM models to Ollama model names by VRAM size
├── utils.py # Utility functions (VRAM info)
├── run_all.sh # Script to run full pipeline (segmentation + captioning)
├── download_from_ollama.sh # Script to download Ollama models
├── first_frame_eval.ipynb # Jupyter notebook for evaluation
└── requirements.txt         # Python dependencies
```
Captioning:

All captioners inherit from the abstract `Captioner` base class, which defines three key methods:
- `_init_models()`: Initialize the model and processor
- `caption(imgs, user_prompt=None)`: Generate captions for a list of images
- `stop()`: Clean up resources and free GPU memory
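
For orientation, a minimal sketch of this interface (the actual implementation in `captioning/captioner.py` may differ in details):

```python
# Sketch of the Captioner interface; the real captioning/captioner.py may differ.
from abc import ABC, abstractmethod

class Captioner(ABC):
    """Abstract base class that all captioners inherit from."""

    def __init__(self, device='cuda:0'):
        self.device = device

    @abstractmethod
    def _init_models(self):
        """Initialize the model and processor."""

    @abstractmethod
    def caption(self, imgs, user_prompt=None):
        """Generate and return a list of caption strings, one per image."""

    @abstractmethod
    def stop(self):
        """Clean up resources and free GPU memory."""
```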
Available Captioners:
- BLIP2 (`blip2_captioner.py`): Salesforce/blip2-opt-2.7b with 8-bit quantization
- Ollama VLMs (`ollama_captioner.py`): Wrapper for Ollama-based models (see the sketch after this list):
  - llama_vision (Llama 3.2 Vision)
  - qwen (Qwen 2.5 Vision)
  - gemma (Gemma 3)
  - llava (LLaVA)
  - mistral (Mistral Small)
- Gemma 3N (`gemma3n_captioner.py`): Google's Gemma 3N model
- SmolVLM2 (`smolvlm2_captioner.py`): Lightweight vision-language model
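
The Ollama-based captioners presumably wrap the Ollama Python client; a minimal, hedged sketch of such a call (model name, prompt, and image path are placeholders, not the repo's exact code):

```python
# Minimal Ollama vision call, roughly what ollama_captioner.py wraps (illustrative only).
import ollama

response = ollama.chat(
    model="llama3.2-vision:11b",   # any vision-capable model pulled with download_from_ollama.sh
    messages=[{
        "role": "user",
        "content": "Describe in detail the object in the image.",
        "images": ["data/real/frame0000.png"],   # local image path
    }],
)
print(response["message"]["content"])
```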

Segmentation:

- Uses SAM2 (Segment Anything 2) for automatic object segmentation
- Generates masks and bounding boxes for objects in frames
- Includes mask cleaning and blob removal for better quality
- Outputs: segmentation masks, bounding boxes, and visualization images
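
As an illustration, a hedged sketch of SAM2 automatic mask generation with the checkpoint downloaded above (the actual `segmentation/segmentor.py`, its config path, and its cleaning parameters may differ):

```python
# Illustrative SAM2 automatic mask generation; not the repo's exact segmentor.py logic.
import cv2
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

sam2_model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml",   # config name assumed
                        "checkpoints/sam2.1_hiera_large.pt",
                        device="cuda")
mask_generator = SAM2AutomaticMaskGenerator(sam2_model)

image = cv2.cvtColor(cv2.imread("data/real/frame0000.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)   # list of dicts with 'segmentation', 'bbox', 'area', ...

# Simple area-based blob removal, analogous to the mask cleaning step
threshold_area = 500   # pixels; tune for your dataset
masks = [m for m in masks if m["area"] >= threshold_area]
```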

Evaluation:

- NLP Metrics: CIDEr, BERTScore, ROUGE-L, GPTScore
- Vision Metrics: CLIPScore for image-text alignment
- Benchmark tools for comparing model performance across domains
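
For illustration, a sketch of how such NLP metrics can be computed (the exact packages and function names used in `evaluation/benchmark.py` may differ):

```python
# Illustrative metric computation; package choices are assumptions, not the repo's exact code.
from bert_score import score as bert_score
from rouge_score import rouge_scorer
from pycocoevalcap.cider.cider import Cider

references = ["a red mug on a wooden table"]
candidates = ["a red cup sitting on a table"]

# BERTScore (report the F1 component)
_, _, f1 = bert_score(candidates, references, lang="en")
print("BERTScore F1:", f1.mean().item())

# ROUGE-L
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print("ROUGE-L:", scorer.score(references[0], candidates[0])["rougeL"].fmeasure)

# CIDEr expects dicts mapping image ids to lists of captions
cider, _ = Cider().compute_score({0: references}, {0: candidates})
print("CIDEr:", cider)
```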
configs.json contains the prompts for each VLM and the device settings:

```json
{
  "general": {
    "device": "cuda"
  },
  "prompts": {
    "vlm": {
      "gemma": "Describe in detail the object...",
      ...
    }
  }
}
```

ollama_map.json maps each VLM name to a specific Ollama model variant based on the available VRAM:
- 12GB VRAM: Smaller models (e.g., llama3.2-vision:11b, qwen2.5vl:7b)
- 24GB VRAM: Larger models (e.g., qwen2.5vl:32b, gemma3:27b)
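
For illustration, a hedged sketch of how these files might be consumed (the real loading code in `caption_data.py`/`utils.py` and the exact structure of `ollama_map.json` are assumptions here):

```python
# Hypothetical loading sketch; the assumed ollama_map.json layout is for illustration only.
import json

with open("configs.json") as f:
    configs = json.load(f)
device = configs["general"]["device"]          # e.g. "cuda"
prompt = configs["prompts"]["vlm"]["gemma"]    # per-VLM prompt template

with open("ollama_map.json") as f:
    ollama_map = json.load(f)
tot_vram_gb = 24
# Assumed layout: {"qwen": {"12": "qwen2.5vl:7b", "24": "qwen2.5vl:32b"}, ...}
model_name = ollama_map["qwen"][str(tot_vram_gb)]
print(device, model_name)
```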
Expected dataset layout:

```
data/
├── 3d/
│   ├── frame0000.png
│   ├── frame0001.png
│   └── ...
└── real/
    ├── frame0000.png
    ├── frame0001.png
    └── ...
```
Output layout:

```
output/
├── 3d/
│   ├── sam2_segmentation/       # Segmentation visualizations
│   ├── sam2_tracking/           # Tracking data (masks, bboxes)
│   │   └── tracking_data.npz
│   └── caption/
│       ├── blip2/
│       │   ├── with_masks/
│       │   │   └── all_captions.csv
│       │   └── without_masks/
│       │       └── all_captions.csv
│       ├── gemma3n/
│       └── ...
└── real/
    └── (same structure as 3d/)
```
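
To inspect results programmatically, a hedged sketch (the array names inside `tracking_data.npz` and the CSV columns are not documented here, so only generic inspection is shown):

```python
# Inspect pipeline outputs; only the file paths are taken from the layout above.
import numpy as np
import pandas as pd

tracking = np.load("output/real/sam2_tracking/tracking_data.npz", allow_pickle=True)
print(tracking.files)   # names of the stored arrays (e.g., masks and bounding boxes)

captions = pd.read_csv("output/real/caption/blip2/without_masks/all_captions.csv")
print(captions.head())
```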
Run segmentation and captioning with all VLMs on all datasets:

```bash
bash run_all.sh 12   # For 12GB VRAM
bash run_all.sh 24   # For 24GB VRAM
```

Run segmentation only:

```bash
python segmentation/segmentor.py --dataset real --image data/real/frame0000.png
```

Run captioning without masks:

```bash
python caption_data.py --captioner blip2 --dataset real
```

Run captioning with masks:

```bash
python caption_data.py --captioner gemma3n --use_masks --dataset 3d --tot_vram_gb 24
```

Further examples:

```bash
python caption_data.py --captioner llama_vision --dataset real --tot_vram_gb 12
python caption_data.py --captioner qwen --dataset real --tot_vram_gb 24
```

Run the evaluation notebook to analyze the results:

```bash
jupyter notebook first_frame_eval.ipynb
```

The notebook computes the metrics (CIDEr, BERTScore, ROUGE-L, GPTScore, CLIPScore) and compares model performance across domains.
To add a new vision-language model to the codebase:
- Create a new captioner class in `captioning/`:

  ```python
  # captioning/your_model_captioner.py
  from captioning.captioner import Captioner

  class YourModelCaptioner(Captioner):
      def __init__(self, device='cuda:0'):
          super().__init__(device)
          self._init_models()

      def _init_models(self):
          # Initialize your model and processor
          pass

      def caption(self, imgs, user_prompt=None):
          # Generate captions for the images and return a list of caption strings
          pass

      def stop(self):
          # Clean up resources
          pass
  ```
- Import the captioner in `caption_data.py`:

  ```python
  from captioning.your_model_captioner import YourModelCaptioner
  ```
- Add it to the model selection in `caption_data.py`:

  ```python
  def select_captioner(captioner_name, tot_vram_gb, device):
      if captioner_name == "your_model":
          return YourModelCaptioner(device=device)
      # ... existing cases
  ```
- Add a prompt configuration in `configs.json`:

  ```json
  "prompts": {
      "vlm": {
          "your_model": "Your prompt template here..."
      }
  }
  ```
- Update the run script `run_all.sh`:

  ```bash
  VLMS=("smolvlm2" "gemma3n" "blip2" "your_model" ...)
  ```
To add a new evaluation metric:
- Create a metric function in `evaluation/benchmark.py` or a new file:

  ```python
  def calculate_your_metric(references, candidates):
      """Calculate your custom metric"""
      # Implementation
      return scores
  ```
- Integrate it into the benchmark:

  ```python
  def calculate_nlp_metrics(ground_truth, predictions):
      results = {
          'cider': calculate_cider(...),
          'your_metric': calculate_your_metric(...)
      }
      return results
  ```
- Update the evaluation notebook to include your new metric.

Code style conventions:

- Imports: Standard library, third-party, then local imports
- Docstrings: Use clear docstrings for classes and complex functions
- Error Handling: Include try-except blocks for model loading and inference
- Resource Management: Always implement the `stop()` method to free GPU memory (see the sketch after this list)
- Batch Processing: Use batch processing with tqdm for progress tracking
- Device Management: Support both CUDA and CPU, defaulting to CUDA
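
As an illustration of the resource-management convention, a minimal `stop()` sketch for a Hugging Face-style captioner (the `self.model`/`self.processor` attribute names are assumptions):

```python
import gc
import torch

def stop(self):
    """Illustrative stop() for a Captioner subclass: drop model references and free VRAM."""
    for attr in ("model", "processor"):
        if hasattr(self, attr):
            delattr(self, attr)     # drop references so the objects can be garbage-collected
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()    # release cached CUDA memory back to the driver
```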

Testing checklist:

- Test model loading: Ensure your captioner initializes without errors
- Test inference: Verify captions are generated correctly
- Test resource cleanup: Confirm GPU memory is freed after `stop()`
- Test integration: Run through the full pipeline with your changes
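
A hedged pytest-style sketch of these checks, reusing the hypothetical `YourModelCaptioner` from above (it assumes `caption()` accepts image file paths):

```python
# test_your_model_captioner.py -- illustrative only; class name and input format are assumptions.
import pytest
import torch

from captioning.your_model_captioner import YourModelCaptioner


@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA GPU")
def test_caption_and_cleanup():
    captioner = YourModelCaptioner(device="cuda:0")              # model loading
    captions = captioner.caption(["data/real/frame0000.png"])    # inference
    assert isinstance(captions, list)
    assert all(isinstance(c, str) for c in captions)

    allocated_before_stop = torch.cuda.memory_allocated()
    captioner.stop()                                             # resource cleanup
    assert torch.cuda.memory_allocated() <= allocated_before_stop
```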

Troubleshooting:

- Out of Memory (OOM):
  - Reduce the batch size in the captioner
  - Use 8-bit quantization
  - Select smaller model variants

- Model Download Issues:
  - Authenticate with Hugging Face: `huggingface-cli login`
  - For Ollama models: `ollama pull <model-name>`

- CUDA Compatibility:
  - Install the PyTorch build that matches your CUDA version
  - Check compatibility with `torch.cuda.is_available()` (see the snippet after this list)

- Segmentation Quality:
  - Adjust `threshold_area` in `segmentor.py` for your dataset
  - Modify the mask cleaning parameters for better results
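
For the CUDA compatibility check, a quick sanity snippet:

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
    print("VRAM (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1))
```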