RF100-VL: Multi-Domain VLM Benchmark

Updated 14 November 2025
  • Roboflow100-VL is a comprehensive multi-domain object detection benchmark that aggregates 100 curated datasets spanning 564 unique object classes with exhaustive annotations.
  • It evaluates vision-language models using zero-shot, few-shot, semi-supervised, and fully supervised settings with standardized COCO metrics.
  • The benchmark’s rich multi-modal instructions and federated fine-tuning strategies provide actionable insights to improve out-of-distribution generalization in VLMs.

Roboflow100-VL (RF100-VL) is a large-scale, multi-domain object detection benchmark comprising 100 community-curated datasets from Roboflow Universe, assembled to evaluate and advance the capabilities of vision-language models (VLMs) on recognition tasks far beyond the “everyday” classes found in COCO or Objects365. RF100-VL is designed to expose the weaknesses of state-of-the-art VLMs, specifically their poor out-of-distribution (OOD) generalization, and to demonstrate the efficacy of multi-modal few-shot concept alignment as an alternative to further scaling, with evaluation spanning zero-shot, few-shot, semi-supervised, and fully supervised settings.

1. Benchmark Construction and Dataset Properties

RF100-VL aggregates 100 sub-datasets, grouped into seven super-categories determined via CLIP embeddings and human review:

| Super-Category | # Classes | # Images (K) | # Boxes (K) |
|---|---|---|---|
| Aerial | 29 | 11.6 | 186.8 |
| Document | 88 | 21.4 | 127.1 |
| Flora & Fauna | 70 | 46.7 | 441.7 |
| Industrial | 122 | 29.8 | 205.6 |
| Medical | 77 | 16.4 | 125.4 |
| Sports | 36 | 8.4 | 58.5 |
| Other | 142 | 29.8 | 210.3 |

Across all domains, the benchmark spans 564 unique object categories, 164,149 images, and 1.36 million bounding boxes. Compared to COCO, RF100-VL has ~50% fewer images, increasing accessibility for academic experimentation while greatly expanding domain and category diversity.

All images are exhaustively annotated for their respective class sets in standard COCO JSON format (bounding boxes as xmin, ymin, width, height). Each dataset is split for zero-shot, few-shot (1-, 5-, and 10-shot per class, following the protocol of Wang et al., 2020), semi-supervised (10% labeled, with the remaining 90% pseudo-labeled via STAC), and fully supervised regimes.
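For concreteness, a minimal loading sketch using pycocotools is shown below; the file path is a placeholder and not the benchmark's actual directory layout.

```python
# Minimal sketch: load one RF100-VL sub-dataset's COCO-format annotations.
# The path below is a placeholder, not the benchmark's actual layout.
from pycocotools.coco import COCO

coco = COCO("rf100-vl/some_dataset/train/_annotations.coco.json")  # hypothetical path

# Class names defined for this sub-dataset
categories = coco.loadCats(coco.getCatIds())
print([c["name"] for c in categories])

# Boxes are stored as [xmin, ymin, width, height], per the COCO convention
for img_id in coco.getImgIds()[:3]:
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        xmin, ymin, w, h = ann["bbox"]
        print(img_id, ann["category_id"], (xmin, ymin, w, h))
```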

RF100-VL further provides multi-modal annotation instructions for every class, mimicking human concept acquisition during annotation: each class has (a) a handful of visual exemplars and (b) a textual description detailing salient features, edge cases, and negative examples. Initial drafts of these instructions were generated by GPT-4o and manually verified.
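The exact schema is not reproduced here; purely as an illustrative sketch (field names and example content are assumptions, not the actual format), a per-class instruction record might look like the following.

```python
# Hypothetical structure of a per-class multi-modal instruction record.
# Field names and example content are illustrative assumptions, not the actual schema.
instruction = {
    "class_name": "solar panel",
    "visual_exemplars": ["exemplar_01.jpg", "exemplar_02.jpg"],   # few-shot images
    "text_description": (
        "Dark rectangular modules mounted on rooftops or ground racks. "
        "Annotate partially occluded panels by estimating their full extent. "
        "Do not label skylights or rooftop water heaters."
    ),
}
```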

2. Evaluation Protocols and Regimes

Evaluation in RF100-VL encompasses several data regimes:

  • Zero-shot: Models are prompted with class names or full instructions at test time, with no model updates, and must localize and classify novel objects.
  • Few-shot: For each novel class, K = 1, 5, or 10 annotated instances are provided. Two approaches are supported: (a) in-context prompting via concatenation of labeled images and instructions, and (b) gradient-based federated fine-tuning in which the classes annotated in each image serve as negatives (Madan et al., CVPR 2024); a rough sketch of this masking idea follows this list.
  • Semi-supervised: Models have access to only 10% of the ground-truth labels; the remainder are pseudo-labeled using a teacher-student framework (Sohn et al., STAC, 2020) with strong augmentation.
  • Fully supervised: All labeled data is available for training a dedicated detector.
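As a rough PyTorch-style sketch of the federated fine-tuning idea, the classification loss below is masked so that only the classes exhaustively annotated in a given image contribute as positives or negatives; the function name, shapes, and loss form are illustrative assumptions, not the benchmark's reference implementation.

```python
import torch
import torch.nn.functional as F

def federated_cls_loss(cls_logits, gt_classes, annotated_classes, num_classes):
    """Sigmoid classification loss restricted to classes annotated in this image.

    cls_logits: (N, C) per-proposal class logits
    gt_classes: (N,) ground-truth class index for each proposal
    annotated_classes: ids of classes exhaustively labeled in this image
    """
    targets = F.one_hot(gt_classes, num_classes).float()               # (N, C)
    mask = torch.zeros(num_classes, device=cls_logits.device)
    mask[list(annotated_classes)] = 1.0                                 # unannotated classes are ignored
    per_class = F.binary_cross_entropy_with_logits(
        cls_logits, targets, reduction="none")                          # (N, C)
    denom = (mask.sum() * cls_logits.shape[0]).clamp(min=1.0)
    return (per_class * mask).sum() / denom

# Example: 8 proposals over RF100-VL's 564 classes, 3 classes annotated in this image
loss = federated_cls_loss(torch.randn(8, 564), torch.randint(0, 564, (8,)),
                          annotated_classes=[3, 17, 42], num_classes=564)
```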

Canonical metrics follow COCO conventions: per-class average precision is the area under the precision-recall curve,

AP_c = \int_0^1 p_c(r)\,dr,

and the overall mean average precision (mAP) is computed as

mAP = \frac{1}{|C|} \sum_{c \in C} AP_c

across all classes. The implementation uses pycocotools and evaluates AP at IoU thresholds from 0.50 to 0.95 (with maxDets=500).

3. Baseline Model Evaluation and Results

The benchmark provides systematic evaluation of specialist open-vocabulary detectors (Detic, GroundingDINO), classic detectors (YOLO family), and generalist multi-modal LLMs (e.g., Qwen2.5-VL, Gemini 2.5 Pro) under zero/few-shot settings. Key findings are summarized in the following tables.

Zero-Shot mAP (50:95) by Domain:

| Method | Aerial | Doc | Flora | Indust. | Medical | Sports | Other | All |
|---|---|---|---|---|---|---|---|---|
| Detic | 12.2 | 4.5 | 17.9 | 6.0 | 0.8 | 7.6 | 11.2 | 9.5 |
| GroundingDINO | 21.8 | 7.9 | 28.2 | 10.3 | 2.1 | 13.0 | 18.1 | 15.7 |
| Qwen2.5-VL (names) | 4.6 | 3.8 | 10.1 | 3.8 | 1.6 | 5.9 | 5.4 | 5.4 |
| Gemini 2.5 Pro (names) | 8.4 | 13.3 | 22.4 | 9.7 | 3.5 | 11.2 | 17.1 | 13.3 |

Notably, medical and some industrial sub-domains exhibit catastrophic failure (often <2% mAP) in zero-shot settings.

Few-Shot (10-shot) mAP by Domain:

| Method | Aerial | Doc | Flora | Indust. | Medical | Sports | Other | All |
|---|---|---|---|---|---|---|---|---|
| Detic+FedLoss | 19.5 | 19.6 | 28.4 | 25.9 | 8.5 | 26.6 | 25.7 | 22.8 |
| YOLOv8n+FedLoss | 13.9 | 22.8 | 22.1 | 26.6 | 14.9 | 13.0 | 19.7 | 21.6 |
| GroundingDINO (fine-tune) | 31.8 | 29.6 | 40.8 | 37.5 | 17.9 | 33.1 | 32.6 | 33.3 |
| Qwen2.5-VL (inst+images) | 5.7 | 6.4 | 14.5 | 5.5 | 1.6 | 7.2 | 6.7 | 7.5 |

GroundingDINO fine-tuned on 10 shots achieves 33.3% mean AP, surpassing the best YOLO variant by more than 10 points.

In the CVPR 2025 Foundational FSOD Challenge (covering a 20-dataset subset), the best entry from NJUST KMG achieved 33.8% mAP (10-shot) compared to GroundingDINO’s baseline of 17.1% (zero-shot), i.e., a +16.7 point improvement.

4. Annotation, Instructions, and Concept Alignment

A distinguishing feature of RF100-VL is the inclusion of rich multi-modal instructions for annotators and models:

  1. Visual exemplars from the few-shot split illustrate category variability.
  2. Detailed textual descriptions specify discriminative features, instruct annotators to estimate occlusions, and enumerate categories or patterns to avoid.

Such class-level, instruction-driven context is intended to facilitate concept alignment, closely analogizing the human annotation process. However, incorporating these instructions yields inconsistent benefits for instruction-tuned MLLMs; rigid input formatting often limits the impact of additional context.

Prompt strategy ablations indicate that single-class prompting benefits Qwen2.5-VL, whereas Gemini 2.5 Pro performs better with multi-class prompts. The selection process for few-shot exemplars also impacts fine-tuning effectiveness; prioritizing images with larger, less occluded boxes improves downstream mAP, and semi-automated heuristics can approach the efficacy of model-guided best splits.
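A simple version of such a heuristic, sketched below, ranks candidate exemplar images by the mean area of their boxes for each class; the scoring rule is an illustrative assumption, not the authors' exact selection procedure.

```python
from collections import defaultdict

def select_exemplars(annotations, k=10):
    """Pick up to k exemplar image ids per class, preferring images whose boxes
    for that class are large (a rough proxy for clear, unoccluded instances)."""
    areas = defaultdict(lambda: defaultdict(list))       # class_id -> image_id -> box areas
    for ann in annotations:                              # COCO-style annotation dicts
        _, _, w, h = ann["bbox"]
        areas[ann["category_id"]][ann["image_id"]].append(w * h)

    exemplars = {}
    for cls_id, by_image in areas.items():
        ranked = sorted(by_image.items(),
                        key=lambda kv: sum(kv[1]) / len(kv[1]),   # mean box area
                        reverse=True)
        exemplars[cls_id] = [img_id for img_id, _ in ranked[:k]]
    return exemplars
```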

5. Insights on Out-of-Distribution Generalization and Model Scaling

RF100-VL exposes clear limitations in current VLMs and detectors:

  • Zero-shot OOD generalization is poor: In medical and select industrial imaging, surveyed models often deliver below 2% mAP.
  • Specialist open-vocabulary detectors outperform generalists: Detic and GroundingDINO consistently surpass large MLLMs despite using much less pre-training data, a result attributed to task-specific detectors’ per-box confidence scoring and non-maximum suppression (NMS).
  • Few-shot finetuning outpaces in-context prompting: Federated fine-tuning and gradient-based approaches improve mean AP by >15 points; large-scale detection pre-training (as in GroundingDINO) yields the highest few-shot accuracy.
  • Classic detector scaling stalls on fine-grained and rare categories: YOLOv11 outperforms YOLOv8 on COCO but does not yield gains on RF100-VL, and increased model size fails to overcome data scarcity or label imbalance.
  • Metric computation varies by implementation: Toolkits such as Ultralytics may inflate mAP by up to 3.4% relative to pycocotools when alternative AP integration methods are used (see the sketch after this list).
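To illustrate why integration choices matter, the sketch below computes AP from the same precision-recall points with 101-point sampling (COCO-style) and with trapezoidal integration; it is a simplified comparison, not the exact code of either toolkit.

```python
import numpy as np

def precision_envelope(precision):
    """Monotonically non-increasing precision envelope (right-to-left running max)."""
    return np.maximum.accumulate(precision[::-1])[::-1]

def ap_101_point(recall, precision):
    """Sample the envelope at 101 evenly spaced recall thresholds (COCO-style)."""
    prec = precision_envelope(precision)
    total = 0.0
    for t in np.linspace(0.0, 1.0, 101):
        above = recall >= t
        total += prec[above].max() if above.any() else 0.0
    return total / 101

def ap_trapezoid(recall, precision):
    """Trapezoidal area under the envelope over the observed recall range."""
    prec = precision_envelope(precision)
    return float(np.sum(np.diff(recall) * (prec[1:] + prec[:-1]) / 2.0))

recall = np.array([0.1, 0.4, 0.7, 0.9])
precision = np.array([1.0, 0.8, 0.6, 0.4])
print(f"101-point: {ap_101_point(recall, precision):.3f}")   # ~0.60
print(f"trapezoid: {ap_trapezoid(recall, precision):.3f}")   # ~0.58
```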

A plausible implication is that model scaling alone is insufficient for OOD generalization; progress will require principled multi-modal concept alignment strategies grounded in structured annotation and instructions analogous to those provided to human annotators.

6. Relevance, Accessibility, and Future Directions

RF100-VL’s design—broad category and modality coverage yet modest image counts—makes it well suited for academic research and controlled ablation studies. By providing comprehensive annotation, multi-modal instructions, and standardized splits for four learning regimes, the benchmark enables systematic investigation of emerging detection paradigms, especially at the intersection of vision, language, and semi-supervised learning.

The dataset and code are open-source and accessible at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/. RF100-VL forms the basis for recent community benchmarking initiatives, including the CVPR 2025 FSOD competition, and is positioned to inform the development and evaluation of future vision-language models targeting specialized, real-world applications.
