Accurate vehicle damage assessment is critical for automotive eCommerce platforms, where buyers depend on transparent visual evidence to make informed purchasing decisions. Traditional object detectors (e.g., YOLO, Mask R-CNN) can localise damage but fail to provide contextual explanations such as type, location, and severity.
To address this gap, we introduce CarDVLM, a multimodal framework that integrates GroundingCarDD with a fine-tuned vision–language model (LLaVA). The system detects and localises damages, then generates structured, query-driven textual descriptions, enabling interpretable and user-centric assessments.
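The two-stage flow described above can be sketched as below. Note that all function and class names here (`detect_damage`, `describe_damage`, `Detection`) are illustrative placeholders, not the repository's actual API: stage 1 stands in for the phrase-grounded detector (GroundingCarDD) and stage 2 for the fine-tuned VLM that turns detections plus a user query into a structured description.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # damage type, e.g. "scratch", "dent"
    box: tuple    # (x1, y1, x2, y2) pixel coordinates
    score: float  # detector confidence

def detect_damage(image) -> list[Detection]:
    """Stage 1: phrase-grounded damage detection (placeholder output)."""
    # A real system would run the grounded detector on the image here.
    return [Detection("scratch", (120, 80, 260, 140), 0.91)]

def describe_damage(image, detections: list[Detection], query: str) -> str:
    """Stage 2: turn detections + a user query into a structured assessment.

    A real system would prompt the fine-tuned LLaVA-based model here;
    this stub only formats the detections.
    """
    parts = [f"{d.label} at box {d.box} (conf {d.score:.2f})" for d in detections]
    return f"Q: {query}\nFindings: " + "; ".join(parts)

report = describe_damage(None, detect_damage(None), "Describe all visible damage.")
print(report)
```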
Paper (Under Review)

- **CarDVLM Framework**: A domain-adapted system combining phrase-grounded object detection with multimodal reasoning for interpretable car damage analysis.
- **Structured Automotive Dataset**: Constructed from public (CarDD) and private sources, with bounding-box annotations and semantically aligned descriptions across 25 vehicle body parts.
- **CarDamageEval**: A two-tier evaluation framework that measures structured accuracy with precision, recall, and F1, alongside human-aligned evaluation.
- **Ablation Study under Real-World Scenarios**: Comprehensive testing across three challenging conditions:
  - Clearly visible damages
  - Spatially ambiguous damages (left/right or front/rear orientation)
  - Extremely subtle or partially obscured damages

  This validates CarDVLM's robustness and practical deployment readiness.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| ChatGPT | 59.5 | 59.6 | 65.5 | 61.5 |
| LLaMA | 70.6 | 77.6 | 75.5 | 75.3 |
| Qwen | 74.9 | 87.4 | 80.5 | 82.3 |
| CarDVLM | 86.7 | 88.8 | 90.2 | 89.5 |
CarDVLM delivers state-of-the-art structured accuracy, outperforming all baselines.
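The precision, recall, and F1 columns in the table above can be computed as in this minimal sketch, which micro-averages over sets of predicted versus reference damage attributes. The attribute encoding as `(part, damage_type)` pairs and the example data are illustrative, not the paper's exact protocol.

```python
def prf1(predicted: set, reference: set):
    """Precision, recall, and F1 over predicted vs. reference attribute sets."""
    tp = len(predicted & reference)                    # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative example: one correct pair, one wrong part, one missed damage.
pred = {("front bumper", "scratch"), ("hood", "dent")}
ref = {("front bumper", "scratch"), ("left door", "dent")}
p, r, f = prf1(pred, ref)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.5 0.5
```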
Figure: CarDVLM integrates GroundingCarDD with a fine-tuned VLM (CLIP + LLaMA-2 13B + LoRA).
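The LoRA component shown in the figure keeps the LLaMA-2 13B backbone frozen and trains only a low-rank update: the effective weight is `W_eff = W + (alpha / r) * (B @ A)`. The plain-Python sketch below illustrates only this arithmetic; the shapes, values, and scaling convention are illustrative assumptions, not the project's training code.

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """W_eff = W + (alpha / r) * (B @ A), with A: r x d_in, B: d_out x r."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# d_out = d_in = 2, rank r = 1: the adapter adds only r * (d_in + d_out)
# trainable parameters instead of d_out * d_in.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight
A = [[1.0, 2.0]]              # r x d_in
B = [[0.5], [0.0]]            # d_out x r
print(lora_effective_weight(W, A, B, alpha=2, r=1))  # [[2.0, 2.0], [0.0, 1.0]]
```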
- Download and install Miniconda from the official site:
👉 https://docs.conda.io/en/latest/miniconda.html
```bash
git clone https://github.com/HelloJahid/CarDVLM
cd CarDVLM

conda create -n llava python=3.10 -y
conda activate llava

pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn==2.2.0
pip install peft==0.10.0
pip install deepspeed
```

Follow the instructions here: GroundingCarDD Installation Guide
This project builds upon the LLaVA-1.5 vision–language model developed by Haotian Liu and contributors.
- 🌐 Original project: LLaVA GitHub Repository
- 📜 Licensed under: Apache License 2.0
For more details, see the LICENSE and NOTICE files.
