DatBench-Full: VLM Evaluation Suite
- DatBench-Full is a curated evaluation suite for vision-language models whose items require genuine multimodal reasoning rather than text-only shortcuts.
- It employs a rigorous transformation and filtering pipeline that converts MCQs to generative tasks and removes blind-solvable or ambiguous data.
- The suite achieves high discriminative power with up to 13× efficiency gains, reducing computational costs while preserving evaluation quality.
DatBench-Full is a curated, large-scale evaluation suite for vision-language models (VLMs) that enforces three central desiderata—faithfulness, discriminability, and computational efficiency. It comprises 33 cleaned datasets, partitioned across nine VLM capabilities, and is constructed through a rigorous transformation and filtering pipeline intended to address chronic evaluation failures in legacy benchmarks. The release of DatBench-Full is accompanied by DatBench, a highly discriminative subset that facilitates rapid iteration at a fraction of the computational cost (Joshi et al., 5 Jan 2026).
1. Foundational Desiderata: Faithfulness, Discriminability, Efficiency
DatBench-Full formalizes its evaluation philosophy by explicitly requiring:
- Faithfulness: Each evaluation item mandates genuine multimodal reasoning—correct answers are attainable only if both the image and the associated query are processed. Formally, the suite filters each dataset so that every retained item's text-only (blind) solve rate across the evaluation models stays below a fixed threshold $\tau$.
- Discriminability: Datasets are curated to maximize the separation between models of differing capabilities using the point-biserial correlation $r_{pb}$, computed per item between models' binary correctness on that item and their global performance scores. The aggregate discriminative power of an item set $\mathcal{S}$ is $D(\mathcal{S}) = \sum_{i \in \mathcal{S}} r_{pb}(i)$.
- Efficiency: The suite is organized to optimize discriminative power per unit of compute, $D(\mathcal{S})/C(\mathcal{S})$. Computational cost is quantified as $C(\mathcal{S}) = \sum_{i \in \mathcal{S}} \sum_{m} c_{i,m}$, where $c_{i,m}$ denotes the GPU-hours needed to score item $i$ with model $m$. A computational sketch of these quantities follows this list.
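A minimal NumPy sketch of these three quantities, assuming a correctness matrix `correct[i, m]` (1 if model m answers item i correctly), per-model global scores, and per-item, per-model GPU-hour costs; the names and toy values are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def point_biserial(correct_row: np.ndarray, global_score: np.ndarray) -> float:
    """Per-item point-biserial correlation r_pb between binary correctness
    across models and the models' global performance scores."""
    mask = correct_row.astype(bool)
    n1, n0 = mask.sum(), (~mask).sum()
    if n1 == 0 or n0 == 0:                 # solved by all models or by none:
        return 0.0                         # no discriminative signal
    m1, m0 = global_score[mask].mean(), global_score[~mask].mean()
    s, n = global_score.std(), len(global_score)   # population std
    return (m1 - m0) / s * np.sqrt(n1 * n0 / n**2)

def suite_metrics(correct: np.ndarray, global_score: np.ndarray,
                  gpu_hours: np.ndarray) -> dict:
    """Aggregate discriminative power D(S), compute cost C(S) in GPU-hours,
    and efficiency D(S)/C(S) for the item set S given by the rows of `correct`."""
    r_pb = np.array([point_biserial(row, global_score) for row in correct])
    return {"r_pb": r_pb,
            "D": r_pb.sum(),               # aggregate discriminative power
            "C": gpu_hours.sum(),          # total GPU-hours over items and models
            "efficiency": r_pb.sum() / gpu_hours.sum()}

# Toy usage: 5 items scored by 4 models of increasing overall capability.
correct = np.array([[0, 0, 1, 1],          # cleanly discriminative item
                    [1, 1, 1, 1],          # saturated item
                    [0, 1, 0, 1],          # noisy item
                    [0, 0, 0, 1],          # near-frontier item
                    [0, 0, 0, 0]])         # unsolved (frontier) item
global_score = np.array([0.35, 0.50, 0.65, 0.80])
gpu_hours = np.full(correct.shape, 0.01)   # hypothetical per-scoring cost
print(suite_metrics(correct, global_score, gpu_hours))
```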
These conditions are not merely aspirational but operationalized through the transformation pipeline, ensuring the evaluation suite remains relevant and sustainable as model cost and capability scale upward (Joshi et al., 5 Jan 2026).
2. Systematic Failure Modes in Legacy VLM Benchmarks
DatBench-Full’s motivation rests on three observed pathologies:
- Multiple-Choice Saturation: Standard N-way MCQs reward guessing, resulting in artificially high baseline scores and "ceiling" effects where gains vanish as models improve. Conversion of AI2D MCQs to open-ended generative tasks drops mean accuracy by 37 percentage points (77.56% to 40.53%), exposing suppressed capability gaps.
- Blind Solvability: Many legacy benchmarks contain items answerable without image context—over 70% in VQA-v2. By measuring each item's text-only (blind) solve rate across a pool of models (see the sketch following this list), such items are identified and discarded to ensure multimodality is genuinely required.
- Mislabeled or Ambiguous Items: Up to 42% of examples in certain datasets (e.g., Spatial Reasoning, ChartQA Pro, MMMU-Pro) are found to be mislabeled, ambiguous, or too low-resolution for expert or model consensus. These are flagged and eliminated via judge-model verification (GPT-5.2).
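A hedged sketch of the blind-solvability screen: items whose text-only solve rate across the model pool exceeds a threshold are dropped. The array names and the 0.5 threshold are illustrative assumptions, not the paper's reported settings:

```python
import numpy as np

def filter_blind_solvable(blind_correct: np.ndarray, tau: float = 0.5):
    """Split item indices into kept and dropped sets based on the fraction of
    models that answer each item correctly WITHOUT seeing the image."""
    blind_rate = blind_correct.mean(axis=1)        # per-item text-only solve rate
    keep = np.flatnonzero(blind_rate <= tau)
    drop = np.flatnonzero(blind_rate > tau)
    return keep, drop

# Toy usage: 4 items evaluated in text-only mode by 5 models.
blind_correct = np.array([[1, 1, 1, 0, 1],         # mostly solvable blind -> drop
                          [0, 0, 1, 0, 0],
                          [0, 0, 0, 0, 0],
                          [1, 1, 1, 1, 1]])        # fully solvable blind -> drop
keep, drop = filter_blind_solvable(blind_correct)
print("kept:", keep, "dropped:", drop)
```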
A plausible implication is that prior reported VLM progress may be inflated or misattributed, as evaluation artifacts did not reliably probe the intended reasoning capabilities (Joshi et al., 5 Jan 2026).
3. Transformation and Filtering Pipeline
DatBench-Full is the result of a four-step data curation protocol:
- MCQ Conversion and Circular Evaluation: MCQs are transformed to generative tasks ("Question → Model Answer"); where MCQs are retained, credit is awarded only if models select the correct answer under all N option orderings ("circular MCQ"), collapsing guessing baselines (a scoring sketch follows this list).
- Blind-Solvable Filtering: Each item is scored by 27 VLMs in text-only mode; any sample answered correctly by more than a designated fraction of these models is discarded. Suite-wide, this step removes roughly 54% of samples as blind-solvable.
- Automated Quality Filtering: Items that all model variants fail are flagged and inspected by a high-capacity VLM (GPT-5.2), which discards those judged ambiguous, mislabeled, or low-resolution.
- Discriminative Subset Selection (DatBench): The efficient subset is built by greedily maximizing aggregate discriminative power $D(\mathcal{S})$ per unit compute subject to a budget constraint $C(\mathcal{S}) \le B$, reserving up to 20% for "frontier" samples (correct, judge-verified items all current models fail), thereby preserving headroom for future model improvements; a selection sketch closes this section.
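The circular MCQ rule from the first pipeline step can be sketched as follows; the `model` callable and the example question are hypothetical, and this is not the authors' released scoring code. Under random guessing, the chance of passing all N rotations collapses from 1/N to (1/N)^N:

```python
from collections import deque
from typing import Callable, Sequence

def circular_mcq_correct(model: Callable[[str, Sequence[str]], str],
                         question: str,
                         options: Sequence[str],
                         answer: str) -> bool:
    """Credit the item only if the model returns the correct option under
    every cyclic rotation of the option order."""
    opts = deque(options)
    for _ in range(len(options)):
        if model(question, list(opts)) != answer:
            return False
        opts.rotate(1)                     # next cyclic ordering of the options
    return True

# Toy usage: a position-biased "model" that always picks the first option
# passes the original ordering by luck but fails once the options rotate.
first_option_model = lambda q, opts: opts[0]
print(circular_mcq_correct(first_option_model,
                           "Which organ pumps blood?",
                           ["heart", "lung", "liver", "kidney"],
                           "heart"))       # -> False
```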
The curation pipeline is designed to be "living"—benchmarks can be re-curated as capabilities advance, maintaining discriminative pressure and preventing redundancy (Joshi et al., 5 Jan 2026).
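A hedged sketch of the greedy budgeted selection in the final pipeline step: items are added in order of discriminative power per GPU-hour until a compute budget is exhausted, after first reserving part of the budget for frontier samples. The reservation mechanics, cost values, and tie-breaking here are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def select_subset(r_pb: np.ndarray, cost: np.ndarray, is_frontier: np.ndarray,
                  budget: float, frontier_frac: float = 0.2) -> list:
    """Greedily pick item indices maximizing r_pb per GPU-hour under `budget`,
    reserving up to `frontier_frac` of the budget for frontier items."""
    selected, spent = [], 0.0
    # 1) Reserve headroom: frontier items (r_pb ~ 0, since no model solves them)
    #    are admitted first, up to frontier_frac of the total budget.
    for i in np.flatnonzero(is_frontier):
        if spent + cost[i] <= frontier_frac * budget:
            selected.append(int(i))
            spent += cost[i]
    # 2) Greedy pass over the remaining items by discriminative power per cost.
    for i in np.argsort(-(r_pb / cost)):
        if is_frontier[i]:
            continue
        if spent + cost[i] <= budget:
            selected.append(int(i))
            spent += cost[i]
    return selected

# Toy usage: 6 items (two of them frontier) under a 1.0 GPU-hour budget.
r_pb = np.array([0.9, 0.6, 0.3, 0.1, 0.0, 0.0])
cost = np.array([0.3, 0.2, 0.2, 0.2, 0.1, 0.1])
is_frontier = np.array([False, False, False, False, True, True])
print(select_subset(r_pb, cost, is_frontier, budget=1.0))  # e.g. [4, 5, 0, 1, 2]
```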
4. Dataset Structure, Capability Taxonomy, and Suite Statistics
DatBench-Full spans 33 datasets, partitioned into nine VLM capabilities:
| Capability | Reported Sample Discard Rate |
|---|---|
| Chart Understanding | Up to 17.2% |
| Document Understanding | 5.13% |
| Scene OCR | Not specified |
| Math & Logic | Not specified |
| Spatial Reasoning | 42.07% |
| Grounding | Not specified |
| Counting | 41.6% |
| Diagrams & Tables | Not specified |
| General VQA | 72.07% (VQA-v2) |
The aggregate post-filter suite comprises approximately 460,000 cleaned examples (after elimination of the roughly 3% mislabeled and 54% blind-solvable samples), offering full coverage for deep-dive analysis. DatBench, the efficient subset, retains most of the full suite's discriminative power while requiring only a small fraction of the compute; per-capability retention and speedup figures are detailed below (Joshi et al., 5 Jan 2026).
5. Empirical Impacts of Data Transformation
The transformation steps have quantifiable effects:
- MCQ→Generative: In AI2D, converting MCQs to generative format drops mean accuracy from 77.56% to 40.53%. Weaker models suffer up to a 50% relative loss; the drop narrows among frontier models with >80% baseline MCQ accuracy.
- Blind-Solvable Filtering: In VQA-v2, 72.07% of samples are filtered out, lowering mean accuracy by ∼20 percentage points. Counting and Spatial Reasoning see 41.6% and 19% sample discard rates, respectively.
- Quality Filtering: Up to 42.07% of Spatial items are filtered due to ambiguity or ground-truth defects; Document Understanding loses 5.13% to mislabeling.
- Efficiency Gains: End-to-end evaluation of the full suite on a single 8B-parameter model can exceed 40 H100 GPU-hours; DatBench reduces this by 13-fold on average, lowering the compute barrier for model development and testing.
The discriminative subset selection (DatBench) achieves 0.90× the full-suite discrimination with only 40% of the Document Understanding data, whereas random sampling delivers only 0.45× at an equal budget. Other capabilities reach at least 0.80× full-suite discrimination with as little as 20% of the data (Joshi et al., 5 Jan 2026).
6. Recommendations and Future Directions in VLM Evaluation
DatBench-Full establishes that rigorous evaluation is an active data curation process rather than a static protocol. The suite’s recommendations include:
- Favor generative, open-ended evaluation over MCQs; where MCQs are unavoidable, use circular scoring to nullify guessing.
- Enforce multimodality by discarding blind-solvable samples based on empirical measurement across diverse models.
- Apply high-capacity verifier models to eliminate mislabeled and ambiguous instances at scale.
- Select subsets for discrimination, not just rank-correlation, ensuring sensitivity to capability improvements while sharply reducing compute cost.
- Design living benchmarks with periodic re-curation and explicit diversity strategies to maintain evaluative pressure and avoid redundancy.
- Release both comprehensive (DatBench-Full) and efficient (DatBench) suites to serve the needs of deep system analysis and rapid development.
This suggests that sustainable, faithful, and discriminative evaluation is possible only through continual data curation driven by empirical pathology identification and corrective transformation at scale (Joshi et al., 5 Jan 2026).
7. Contextual Significance and Adoption
DatBench-Full’s release marks a shift in VLM evaluation from passive reporting to active benchmark engineering, directly confronting legacy shortcomings—multiple-choice contamination, blind text-only solvability, and annotation errors. By focusing equally on faithfulness to vision-language reasoning, discriminative granularity, and pragmatic compute management, DatBench-Full sets a new standard for fair comparison and transparent progress in foundation models. A plausible implication is the necessity for the broader research community to treat benchmark suite design itself as an ongoing process, adapting its specifications as model capabilities and costs evolve.
By unifying modality-faithful design, discriminative power, and efficiency, DatBench-Full outlines an authoritative path toward rigorous VLM evaluation.