DatBench: VLM Evaluation Suite
- DatBench is an evaluation suite for vision-language models featuring rigorous filtering, psychometric selection, and discriminative subset methodologies.
- It achieves computational efficiency with a speedup of up to 50× while retaining over 90% of the original discriminative power.
- The framework systematically corrects legacy benchmark issues such as MCQ inflation and blind-solvability to enable high-fidelity empirical evaluations.
DatBench is a rigorously curated evaluation suite for vision-language models (VLMs), designed to enable discriminative, faithful, and computationally efficient assessment of multimodal model capabilities. It systematically rectifies artifacts and weaknesses present in legacy benchmarks by applying psychometric selection, generative transformation, blind-solvability filtering, and robust data cleaning, establishing a comprehensive framework for high-fidelity empirical evaluation as VLMs continue to scale (Joshi et al., 5 Jan 2026).
1. Definition, Releases, and Scope
DatBench comprises two principal artifacts: DatBench-Full, a cleaned amalgamation of 33 VLM evaluation datasets for exhaustive reporting, and DatBench, an efficiency-optimized discriminative subset. DatBench-Full spans nine VLM capability domains:
| Capability Domain | Representative Datasets |
|---|---|
| Chart Understanding | ChartQA, ChartQA Pro, CharXiv, InfoVQA |
| Document Understanding | CC-OCR KIE, OCR-VQA, OCRBench-V2, DocVQA |
| Scene OCR | TextVQA, MME-RW OCR, CC-OCR multi-scene |
| Math/Logic | MathVista, MathVerse, MathVision, LogicVista |
| Spatial Reasoning | RealWorldQA, MME-RW Autonomous Driving |
| Grounding | RefCOCO, RefCOCO+, RefCOCO-g, RefCOCO-M, Pixmo-Point |
| Counting | CountBench, TallyQA |
| Diagrams/Tables | AI2D, MME-RW Diagram |
| General VQA | MMMU-Pro, MMBench, VQA-v2 |
DatBench is constructed by retaining the top 20–40% of the most discriminative and high-quality items from each dataset. This subset preserves >90% of the original discriminative power, yet achieves an average evaluation speedup of 13× (and up to 50× for select capabilities) (Joshi et al., 5 Jan 2026).
2. Evaluation Principles and Quantitative Metrics
DatBench operationalizes three desiderata for model evaluations:
- Faithfulness: Benchmarks must require genuine visual reasoning and reflect authentic model deployment contexts (prefer open-ended, image-dependent queries over multiple-choice (MCQ)).
- Metric: Vision delta ($\Delta_{\text{vision}}$), calculated as $\Delta_{\text{vision}} = \mathrm{Acc}_{\text{with image}} - \mathrm{Acc}_{\text{image-free}}$. A high $\Delta_{\text{vision}}$ signifies notable accuracy degradation when images are withheld, indicating true modality dependence.
- Discriminability: Items should maximize separation between strong and weak models.
- Metric: Point-biserial correlation for item $i$, defined as $r_{\mathrm{pb},i} = \frac{\mu_i^{+} - \mu_i^{-}}{\sigma}\sqrt{p_i q_i}$,
  where $\mu_i^{+}$ and $\mu_i^{-}$ are the mean overall model scores for correct and incorrect responses respectively, $p_i$ and $q_i$ are their proportions, and $\sigma$ is the standard deviation of overall scores; total suite discriminative power is $D = \sum_i r_{\mathrm{pb},i}$.
- Efficiency: Maximize discriminative signal per unit compute.
- Metric: Speedup $S$, the ratio of full-suite to subset evaluation cost measured in H100 GPU-hours; DatBench achieves $S \approx 13\times$ on average.
Faithfulness filtering discards samples solvable by language priors above dataset-specific thresholds, leveraging blind-solvability metrics computed over 27 VLMs (Joshi et al., 5 Jan 2026).
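To make the item-level metrics concrete, the following minimal Python sketch computes the vision delta and per-item point-biserial correlations from a binary models × items score matrix; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def vision_delta(acc_with_image: float, acc_image_free: float) -> float:
    """Vision delta: accuracy drop when the image is withheld."""
    return acc_with_image - acc_image_free

def point_biserial(scores: np.ndarray) -> np.ndarray:
    """Per-item point-biserial correlation r_pb.

    scores: binary matrix of shape (n_models, n_items), where
    scores[m, i] = 1 iff model m answers item i correctly.
    Each item's success indicator is correlated with the models'
    overall mean scores (the ability proxy).
    """
    ability = scores.mean(axis=1)          # overall score per model
    sigma = ability.std()
    n_items = scores.shape[1]
    r_pb = np.zeros(n_items)
    for i in range(n_items):
        correct = scores[:, i] == 1
        p, q = correct.mean(), 1.0 - correct.mean()
        if p == 0.0 or q == 0.0 or sigma == 0.0:
            continue                        # item never discriminates; leave r_pb = 0
        mu_pos = ability[correct].mean()    # mean ability of models answering correctly
        mu_neg = ability[~correct].mean()   # mean ability of models answering incorrectly
        r_pb[i] = (mu_pos - mu_neg) / sigma * np.sqrt(p * q)
    return r_pb

# Toy example: 4 models x 3 items
scores = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 1]])
r = point_biserial(scores)
print(r)            # per-item discriminability
print(r.sum())      # suite-level discriminative power D
```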
3. Systematic Correction of Benchmark Pathologies
DatBench addresses pervasive failure modes documented in legacy VLM evaluation workflows:
- MCQ Inflation and Saturation: MCQ formats reward guessing and induce early score saturation, misrepresenting actual model capacities. The MCQ chance baseline inflates scores by 20–30%. For example, AI2D MCQ accuracy of 77.6% drops to 40.5% under generative evaluation (–37pp).
- Blindly-Solvable Samples: Items answerable via language priors (image-invariant) constitute up to 70% of some benchmarks (e.g., VQA-v2), undermining visual faithfulness. These include factual knowledge, visual stereotypes, and symbolic puzzles.
- Mislabeled/Ambiguous Ground Truth: Low resolution, ambiguous wording, or label errors affect up to 43.9% of samples in certain datasets (e.g., MME-RW Autonomous Driving). These are systematically identified and purged via judge-based verification.
- Computational Overload: Evaluating full-scale suites can consume >20% of total development compute; for instance, Qwen3-VL “Thinking” models use ~14× more tokens on incorrect answers.
Transformation of MCQ items to open-ended generative evaluation, combined with circular evaluation for option-based tasks, strips out chance floors and exposes true model deficiencies.
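To illustrate how circular evaluation removes the chance floor for inherently option-based items, here is a minimal sketch; the `query_model` callable, the prompt format, and the letter-parsing step are assumptions rather than details from the paper, and the rotate-all-correct convention follows common circular-evaluation practice.

```python
from typing import Callable, Sequence

LABELS = "ABCDEFGH"

def circular_eval(question: str,
                  options: Sequence[str],
                  ground_truth: str,
                  query_model: Callable[[str], str]) -> bool:
    """Score an option-based item under circular evaluation.

    The options are cycled through every rotation; the item counts as
    correct only if the model picks the ground-truth option under all
    rotations, which removes the 1/len(options) chance floor.
    """
    n = len(options)
    labels = LABELS[:n]
    for shift in range(n):
        rotated = [options[(j + shift) % n] for j in range(n)]
        prompt = question + "\n" + "\n".join(
            f"{lab}. {opt}" for lab, opt in zip(labels, rotated)
        )
        # Assumes the model replies with a single option letter; a real
        # harness would parse or judge the free-form answer more robustly.
        reply = query_model(prompt).strip().upper()[:1]
        if reply not in labels or rotated[labels.index(reply)] != ground_truth:
            return False
    return True
```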
4. Filtering Pipeline and Discriminative Subset Selection
DatBench employs a multi-stage pipeline:
- MCQ-to-Generative Conversion: Removal of all answer options, generative response required; circular evaluation applied for tasks inherently option-based.
- Blind-Solvability Filtering: Text-only inference across 27 models; samples solved above a dataset-type-specific threshold τ are discarded (e.g., 54% of VQA-v2, 48% of Chart, and 42% of Counting items eliminated).
- Judge-Based Quality Filtering: Items failed by every model are adjudicated by a robust VLM judge; ambiguous, mislabeled, or low-resolution items are excised (up to 42% purged for Spatial).
- Discriminative Subset Selection: $r_{\mathrm{pb},i}$ is calculated over a model grid; greedy selection maximizes $D$ under a token-budget constraint, with up to 20% “frontier” unsolved items retained to preserve evaluation headroom.
Each step is quantitatively justified; e.g., generative conversion exposes capability drops of up to 35pp (AI2D), blind-solvability filtering shifts General VQA scores down by 72pp, and the final subset retains >90% of $D$ at a 13× average speedup (Joshi et al., 5 Jan 2026).
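The blind-solvability filter and the greedy, budget-constrained selection can be sketched as follows; the threshold handling, token-cost model, and frontier-reservation heuristic shown here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def blind_solvability_filter(blind_scores: np.ndarray, tau: float) -> np.ndarray:
    """Boolean mask of items to keep: text-only solve rate below tau.

    blind_scores: (n_models, n_items) binary matrix from image-free inference.
    """
    return blind_scores.mean(axis=0) < tau

def greedy_select(r_pb: np.ndarray,
                  token_cost: np.ndarray,
                  budget: float,
                  frontier_mask: np.ndarray,
                  frontier_frac: float = 0.2) -> list[int]:
    """Greedy, budget-constrained subset selection.

    Items are added in order of decreasing r_pb until the token budget is
    spent; a fraction of the budget is reserved for currently unsolved
    "frontier" items to preserve evaluation headroom.
    """
    selected: list[int] = []
    spent = 0.0
    # Pass 1: reserve part of the budget for frontier (unsolved) items.
    frontier_budget = frontier_frac * budget
    for i in np.flatnonzero(frontier_mask):
        if spent + token_cost[i] <= frontier_budget:
            selected.append(int(i))
            spent += token_cost[i]
    # Pass 2: fill the remaining budget with the most discriminative items.
    for i in np.argsort(-r_pb):
        if frontier_mask[i]:
            continue
        if spent + token_cost[i] <= budget:
            selected.append(int(i))
            spent += token_cost[i]
    return selected

# Typical usage: first drop image-free-solvable items with the blind-solvability
# mask, then run greedy_select over the surviving items' r_pb values and costs.
```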
5. Experimental Setup and Comparative Analysis
DatBench’s evaluation protocol spans 27 frontier models: Qwen2.5/3-VL, InternVL2/2.5/3/3.5, GLM-4.1V, R-4B, SmolVLM2, Phi-3.5 vision, Gemma-3 (scale: 2B–10B parameters). Model outputs are capped at 4096 tokens and scored by dedicated LLM judges using semantic-equivalence approaches [Chandak et al. 2025, as cited in (Joshi et al., 5 Jan 2026)]. Metrics computed include the vision delta $\Delta_{\text{vision}}$, per-item $r_{\mathrm{pb}}$, rank correlation between subset and full-suite model rankings via Spearman’s ρ (high agreement achieved with <10% of items), and speedup $S$.
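The rank-fidelity check can be illustrated with a short sketch that computes Spearman’s ρ between per-model accuracies on the full suite and on the subset; `scipy.stats.spearmanr` and the toy numbers are assumptions for illustration, not the paper's exact harness.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_fidelity(full_acc: np.ndarray, subset_acc: np.ndarray) -> float:
    """Spearman's rho between model rankings induced by the full suite
    and by the discriminative subset (one accuracy per model in each array)."""
    rho, _ = spearmanr(full_acc, subset_acc)
    return float(rho)

# Toy example: six models, full-suite vs. subset accuracies
full = np.array([0.71, 0.64, 0.58, 0.52, 0.47, 0.39])
subset = np.array([0.66, 0.69, 0.55, 0.50, 0.48, 0.37])
print(rank_fidelity(full, subset))   # close to 1.0 => subset preserves the ranking
```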
Quantitative findings reveal domain-dependent vision-delta effects: Counting (60.2%) and Grounding (42.3%) are highly vision-dependent, whereas Math (13.0%) and Spatial (14.9%) are confounded by language priors. Reasoning-oriented tasks (Chart, Math, General VQA) form a positively correlated cluster and correlate negatively with perceptual tasks (OCR, Spatial). “Thinking” models gain 36.8% on Math but lose 53.5% on OCR at a 14× penalty in token cost.
6. Broader Implications for VLM Evaluation and Benchmarking
DatBench codifies efficient, rigorously discriminative, and faithful VLM evaluation practices. By transforming, filtering, and carefully selecting items from existing benchmarks, it produces a high-resolution evaluation suite that accurately reflects multimodal model strengths and weaknesses while optimizing for compute sustainability. The twin artifacts—DatBench-Full for exhaustive reporting and DatBench for rapid iteration—offer benchmarking consistency, reliability, and operational practicality, facilitating scalable VLM development and cross-model comparisons as underlying architectures and data modalities progress (Joshi et al., 5 Jan 2026).
This approach demonstrates that random sampling of evaluation items yields less than half the discriminative power at equal budget compared to psychometric selection, and that rank-only selection is a weak criterion for model comparison. A plausible implication is that such principled suite design will become standard in future foundation model evaluation protocols, as the scaling of model size and data modalities intensifies the need for robust, cost-effective evaluations.