
RF100-VL Benchmark for Object Detection

Updated 14 November 2025
  • RF100-VL is a multi-domain object detection benchmark that challenges vision-language models on out-of-distribution classes and diverse imaging modalities.
  • It aggregates 100 carefully curated datasets across seven super-domains with exhaustively verified, re-annotated labels to ensure quality.
  • Evaluation across zero-shot, few-shot, semi-supervised, and fully-supervised regimes highlights VLM limitations and guides method enhancements.

Roboflow100-VL (RF100-VL) is a large-scale, multi-domain object detection benchmark designed to probe the generalization capabilities of vision-language models (VLMs) across out-of-distribution classes, rare domains, and diverse imaging modalities. Comprising 100 independently sourced detection datasets with exhaustively verified and, where needed, re-annotated labels, RF100-VL provides a rigorous testing ground for zero-shot, few-shot, semi-supervised, and fully-supervised learning regimes. The benchmark specifically targets shortcomings of current VLMs—such as CLIP, Qwen-VL, and Gemini—when confronted with novel categories and formats poorly represented in their pre-training data, and it establishes protocols and baselines that support exhaustive, scenario-driven comparison.

1. Motivation and Scope

VLMs have demonstrated strong zero-shot detection capability for ubiquitous objects (e.g., car, person), yet their performance sharply degrades on out-of-distribution classes, scientific acronyms, context-dependent terms, specialist domains, and non-RGB modalities such as X-ray or aerial imagery. RF100-VL was constructed with a twofold objective: (a) to assemble a diverse suite of 100 detection datasets—each representing real-world, underrepresented concepts not typically encountered by internet-scale pre-trained models; and (b) to facilitate systematic evaluation of VLMs when provided with only class name prompts (zero-shot), limited annotated examples plus instructions (few-shot), partial labels with pseudo-labeling (semi-supervised), or full supervision. By explicitly incorporating non-trivial categories and format variation—including nuanced material properties (e.g., soft vs. hard plastic) and ambiguous, context-dependent terms (“block” in volleyball)—RF100-VL prevents simple lexical prompting from artificially inflating model scores and instead tests a model’s ability for true concept alignment.

2. Dataset Composition and Domain Structure

RF100-VL sources 100 carefully vetted datasets from Roboflow Universe, ensuring label quality and category exhaustiveness through manual review and re-annotation where required. Datasets are grouped into seven “super-domains,” each reflecting a distinct combination of class complexity, imaging modality, and application field:

Domain           # Classes   # Images   # Annotations
Aerial                  29     11,627         186,789
Document                88     21,418         127,129
Flora & Fauna           70     46,718         441,677
Industrial             122     29,758         205,627
Medical                 77     16,369         125,433
Sports                  36      8,443          58,508
Other                  142     29,816         210,328
All                    564    164,149       1,355,491

Within this distribution, "Flora & Fauna" is the largest super-domain by image count, while "Other" contains the largest number of classes. The total image count (~164K) is approximately half the size of the COCO dataset, making RF100-VL computationally tractable in academic settings while vastly expanding the diversity of detection challenges.
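The table can also be read as a compact data structure; the sketch below (plain Python, with counts copied directly from the table) recomputes the "All" row as a consistency check.

```python
# Per-super-domain statistics from the composition table:
# domain -> (num_classes, num_images, num_annotations)
SUPER_DOMAINS = {
    "Aerial":        (29,  11_627, 186_789),
    "Document":      (88,  21_418, 127_129),
    "Flora & Fauna": (70,  46_718, 441_677),
    "Industrial":    (122, 29_758, 205_627),
    "Medical":       (77,  16_369, 125_433),
    "Sports":        (36,   8_443,  58_508),
    "Other":         (142, 29_816, 210_328),
}

# Recompute the "All" row: 564 classes, 164,149 images, 1,355,491 annotations.
totals = tuple(sum(col) for col in zip(*SUPER_DOMAINS.values()))
assert totals == (564, 164_149, 1_355_491)
```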

3. Annotation Protocols and Concept Alignment

RF100-VL employs a dual-stage annotation approach: automated generation of "multi-modal annotator instructions" using GPT-4o, followed by manual human verification. For each dataset, an instruction bundle encapsulates a class definition, per-class labeling guidelines (specifying both what to annotate and what to exclude), and visual reference examples. This mirrors professional annotation workflows, where labelers require precise, multi-modal instruction to capture subtle visual concepts. In benchmark settings, these same instructions, containing both text and images, are presented to VLMs as input. This operationalizes the "few-shot concept alignment" protocol and probes models' ability to interpret combined text-and-image guidance in both in-context and fine-tuning scenarios.
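As an illustration only, such an instruction bundle could be represented roughly as follows; the field names and values are hypothetical and not the benchmark's actual schema.

```python
# Hypothetical sketch of a multi-modal annotator-instruction bundle
# (field names are illustrative; RF100-VL's released format may differ).
instruction_bundle = {
    "dataset": "example-medical-dataset",  # placeholder dataset identifier
    "classes": {
        "lesion": {
            "definition": "A localized region of abnormal tissue visible in the scan.",
            "annotate": ["all clearly visible lesions, including partially occluded ones"],
            "exclude": ["surgical markers", "imaging artifacts"],
            "reference_images": ["refs/lesion_01.png", "refs/lesion_02.png"],
        },
    },
}

# In the few-shot concept-alignment protocol, the same text and reference
# images are given to a VLM as in-context guidance or used for fine-tuning.
```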

4. Task Definitions and Evaluation Regimes

Each constituent dataset in RF100-VL is independently subject to four standardized learning regimes:

  • Zero-Shot: Models are prompted using only class names or textual class descriptions, with no access to ground-truth visual examples and no gradient updates (model weights remain fixed).
  • Few-Shot (K=10): For each class, images containing a total of K=10 instances are selected and used either for visual prompting plus instructions (in-context learning) or for batch fine-tuning, following the protocol of Wang et al. (2020).
  • Semi-Supervised: Each dataset's train split is sub-divided, with 10% fully labeled and the remaining 90% treated as unlabeled; STAC pseudo-labeling is employed to exploit the unlabeled portion.
  • Fully-Supervised: Training is conducted with full access to all official labels using standard COCO detection losses.

All trained or prompted models are evaluated against the same, fully annotated test datasets, ensuring comparability across regimes and methods.
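To make the semi-supervised regime concrete, the sketch below partitions a train split into 10% labeled and 90% unlabeled images and filters teacher predictions by a confidence threshold in the spirit of STAC; the function names and the 0.9 threshold are assumptions, not the benchmark's reference implementation.

```python
import random

def split_semi_supervised(image_ids, labeled_frac=0.10, seed=0):
    """Split a train set into labeled (10%) and unlabeled (90%) image ids."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_labeled = max(1, int(labeled_frac * len(ids)))
    return ids[:n_labeled], ids[n_labeled:]

def select_pseudo_labels(teacher_detections, score_threshold=0.9):
    """Keep only high-confidence teacher boxes as STAC-style pseudo-labels.

    Assumes COCO-style result dicts with 'bbox', 'category_id', and 'score' keys.
    """
    return [d for d in teacher_detections if d["score"] >= score_threshold]
```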

5. Evaluation Metrics and Protocols

RF100-VL adopts common detection metrics, including Intersection over Union (IoU), Average Precision (AP), and mean Average Precision (mAP):

\mathrm{IoU}(B_p, B_{gt}) = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}

\mathrm{mAP} = \frac{1}{|C|} \sum_{c \in C} \mathrm{AP}_c
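For axis-aligned boxes in [x1, y1, x2, y2] format, the IoU definition above translates directly into code (a minimal sketch, not taken from the benchmark's tooling):

```python
def box_iou(box_p, box_gt):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0
```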

AP is computed as the area under the precision–recall curve at a given IoU threshold, averaged over the thresholds 0.50:0.05:0.95 in the primary COCO metric, and mAP averages AP across all classes within the respective dataset. The evaluation scripts use pycocotools with maxDets=500, in line with COCO evaluation protocols, to ensure scalability and consistency.
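For instance, a COCO-style evaluation with the per-image detection cap raised to 500 can be run with pycocotools as follows; the file paths are placeholders, and this is a sketch of standard pycocotools usage rather than the benchmark's exact evaluation script.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths for one RF100-VL dataset's ground truth and predictions.
coco_gt = COCO("annotations/test.json")
coco_dt = coco_gt.loadRes("predictions.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.params.maxDets = [1, 10, 500]  # raise the reporting cap to 500 detections
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # AP averaged over IoU 0.50:0.05:0.95 and over classes
```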

6. Baseline Model Performance Across Regimes

RF100-VL provides a comparative landscape of baseline and strong detector performance. Notable results include:

  • Zero-Shot: Models such as Detic (LVIS+COCO+IN21K), GroundingDINO (frozen), OWLv2, MQ-GLIP, Qwen 2.5-VL (72B), and Gemini 2.5 Pro achieve 5.4–15.7 mAP overall, but less than 2 mAP on medical imaging datasets, a near-catastrophic failure of state-of-the-art methods in novel clinical domains.
  • Few-Shot (10 shots/class): Fine-tuned GroundingDINO achieves 33.3 mAP overall (rising to 17.9 mAP on medical imagery, a gain of +15.8), Detic with federated fine-tuning reaches 22.8 mAP, and YOLOv8 variants reach approximately 21–23 mAP. In contrast, massive multi-modal LLMs (Qwen 2.5-VL, Gemini 2.5 Pro) achieve only 7.5–9.2 mAP even with annotator instructions, indicating that specialist detectors fine-tuned with limited data outperform in-context approaches by a 3×–4× margin.
  • Fully-Supervised: YOLOv8m and YOLOv11m yield mAPs of 56.4 and 56.5, and the overall upper bound (LW-DETRm) reaches 59.4 mAP.
  • CVPR 2025 Foundational FSOD Challenge: On the Roboflow20-VL subset, the best zero-shot baseline (GroundingDINO) reached 17.1 mAP, while the winning submission (NJUST KMG) achieved 33.8 mAP, a +16.7 mAP improvement over the baseline, with significant gains driven by few-shot split selection, augmentation, and ensembling.

7. Key Insights, Limitations, and Community Contributions

RF100-VL reveals core limitations of current VLMs. Zero-shot generalization remains limited, showing that scale and pre-training on internet data do not ensure transfer to rare or specialist domains. Few-shot fine-tuning of detectors (e.g., GroundingDINO, Detic) provides the greatest performance gains, far exceeding the in-context prompting capabilities of current multi-modal LLMs. Annotator instructions—including visual cues—are pivotal for human annotation, yet yield inconsistent improvements for models: in some cases (e.g., Gemini) performance degrades when such instructions are supplied, suggesting unresolved issues in multimodal context integration.

Semi-supervised learning with STAC, leveraging only 10% labeled data, attains up to 80% of fully-supervised performance, illustrating the efficacy of pseudo-labeling in low-label regimes. Notably, increasing detector scale (e.g., YOLOv11) or COCO performance does not guarantee cross-domain robustness, implying potential overfitting to conventional benchmarks. The CVPR 2025 FSOD Challenge substantiated that targeted split selection, aggressive data augmentation, and ensembling can further lift performance by 3–5 mAP over naive fine-tuning.

8. Resources, Availability, and Reproducibility

RF100-VL is released with comprehensive resources to support the research community, including the 100 constituent datasets, the multi-modal annotator instructions, and standardized evaluation protocols.
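For example, datasets hosted on Roboflow Universe can typically be pulled with the roboflow Python package; the workspace, project, and version identifiers below are placeholders rather than RF100-VL's official download interface.

```python
from roboflow import Roboflow

# Placeholder identifiers; substitute your own API key and the
# workspace/project/version of the specific RF100-VL dataset you need.
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("example-workspace").project("example-dataset")
dataset = project.version(1).download("coco")  # download in COCO format
print(dataset.location)  # local path of the downloaded dataset
```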

RF100-VL establishes a new standard for multi-modal, multi-domain object detection evaluation, foregrounding the need for models that can adapt to novel visual concepts via limited supervision, and robustly integrate multi-modal instructions as part of their real-world deployment.
