
DatBench: VLM Evaluation Suite

Updated 6 January 2026
  • DatBench is an evaluation suite for vision-language models featuring rigorous filtering, psychometric selection, and discriminative subset methodologies.
  • It achieves computational efficiency with a speedup of up to 50× while retaining over 90% of the original discriminative power.
  • The framework systematically corrects legacy benchmark issues such as MCQ inflation and blind-solvability to enable high-fidelity empirical evaluations.

DatBench is a rigorously curated evaluation suite for vision-language models (VLMs), designed to enable discriminative, faithful, and computationally efficient assessment of multimodal model capabilities. It systematically rectifies artifacts and weaknesses in legacy benchmarks by applying psychometric selection, generative transformation, blind-solvability filtering, and robust data cleaning, establishing a comprehensive framework for high-fidelity empirical evaluation as VLMs continue to scale (Joshi et al., 5 Jan 2026).

1. Definition, Releases, and Scope

DatBench comprises two principal artifacts: DatBench-Full, a cleaned amalgamation of 33 VLM evaluation datasets for exhaustive reporting, and DatBench, an efficiency-optimized discriminative subset. DatBench-Full spans nine VLM capability domains:

| Capability Domain | Representative Datasets |
|---|---|
| Chart Understanding | ChartQA, ChartQA Pro, CharXiv, InfoVQA |
| Document Understanding | CC-OCR KIE, OCR-VQA, OCRBench-V2, DocVQA |
| Scene OCR | TextVQA, MME-RW OCR, CC-OCR multi-scene |
| Math/Logic | MathVista, MathVerse, MathVision, LogicVista |
| Spatial Reasoning | RealWorldQA, MME-RW Autonomous Driving |
| Grounding | RefCOCO, RefCOCO+, RefCOCO-g, RefCOCO-M, Pixmo-Point |
| Counting | CountBench, TallyQA |
| Diagrams/Tables | AI2D, MME-RW Diagram |
| General VQA | MMMU-Pro, MMBench, VQA-v2 |

DatBench is constructed by retaining the top 20–40% of the most discriminative and highest-quality items from each dataset. This subset preserves >90% of the original discriminative power, yet achieves an average evaluation speedup of 13× (and up to 50× for select capabilities) (Joshi et al., 5 Jan 2026).

2. Evaluation Principles and Quantitative Metrics

DatBench operationalizes three desiderata for model evaluations:

  • Faithfulness: Benchmarks must require genuine visual reasoning and reflect authentic model deployment contexts (prefer open-ended, image-dependent queries over multiple-choice (MCQ)).
    • Metric: Vision delta $V_{\Delta}$, calculated as $V_{\Delta} = \text{Acc}_{\text{multimodal}} - \text{Acc}_{\text{text-only}}$. A high $V_{\Delta}$ signifies notable accuracy degradation when images are withheld, indicating true modality dependence.
  • Discriminability: Items should maximize separation between strong and weak models.
    • Metric: Point-biserial correlation $r_{pb}(i)$ for item $i$, defined as

$$r_{pb}(i) = \frac{\mu_{1} - \mu_{0}}{\sigma} \sqrt{p_i q_i}$$

where $\mu_1$ and $\mu_0$ are the mean overall model scores among models that answer item $i$ correctly and incorrectly, respectively, $\sigma$ is the standard deviation of overall model scores, and $p_i$, $q_i$ are the corresponding proportions of correct and incorrect responses; total suite discriminative power is $D(S) = \sum_{i \in S} r_{pb}(i)$.

  • Efficiency: Maximize discriminative signal per unit compute.
    • Metric: Speedup $S = \frac{\text{Cost}_{\text{full}}}{\text{Cost}_{\text{subset}}}$, measured in H100 GPU-hours; DatBench achieves $S \approx 13$ on average. A computational sketch of these three metrics follows below.
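
The following is a minimal illustrative sketch, not the released DatBench code, of how these three metrics can be computed from a binary model-by-item correctness matrix; the function names and input layout are assumptions made for illustration.

```python
import numpy as np

def vision_delta(acc_multimodal: float, acc_text_only: float) -> float:
    """V_delta = Acc_multimodal - Acc_text_only for one benchmark."""
    return acc_multimodal - acc_text_only

def point_biserial(correct: np.ndarray, model_scores: np.ndarray) -> float:
    """r_pb for one item: `correct` is a 0/1 vector over models,
    `model_scores` holds each model's overall suite score."""
    p = correct.mean()                       # proportion of models answering correctly
    q = 1.0 - p
    sigma = model_scores.std()
    if sigma == 0 or p in (0.0, 1.0):        # degenerate item: no separation possible
        return 0.0
    mu1 = model_scores[correct == 1].mean()  # mean overall score, correct group
    mu0 = model_scores[correct == 0].mean()  # mean overall score, incorrect group
    return (mu1 - mu0) / sigma * np.sqrt(p * q)

def discriminative_power(correct_matrix: np.ndarray) -> float:
    """D(S) = sum over items of r_pb(i); rows = models, columns = items."""
    model_scores = correct_matrix.mean(axis=1)
    return sum(point_biserial(correct_matrix[:, i], model_scores)
               for i in range(correct_matrix.shape[1]))

def speedup(cost_full_gpu_hours: float, cost_subset_gpu_hours: float) -> float:
    """S = Cost_full / Cost_subset, e.g. in H100 GPU-hours."""
    return cost_full_gpu_hours / cost_subset_gpu_hours
```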

Faithfulness filtering discards samples solvable by language priors above dataset-specific thresholds, leveraging blind-solvability metrics computed over 27 VLMs (Joshi et al., 5 Jan 2026).
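
A minimal sketch of this blind-solvability filter is shown below, assuming per-sample text-only pass/fail results over the model pool are already available; the function name and the default threshold are illustrative, since the paper uses dataset-specific thresholds.

```python
def filter_blind_solvable(text_only_results: dict[str, list[bool]],
                          tau: float = 0.5) -> list[str]:
    """Keep sample ids whose text-only (image withheld) solve rate stays below tau."""
    kept = []
    for sample_id, results in text_only_results.items():
        blind_solve_rate = sum(results) / len(results)  # fraction of the model pool
        if blind_solve_rate < tau:
            kept.append(sample_id)  # the image is genuinely needed for this sample
    return kept
```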

3. Systematic Correction of Benchmark Pathologies

DatBench addresses pervasive failure modes documented in legacy VLM evaluation workflows:

  • MCQ Inflation and Saturation: MCQ formats reward guessing and induce early score saturation, misrepresenting actual model capacities. The MCQ chance baseline inflates scores by 20–30%. For example, AI2D MCQ accuracy of 77.6% drops to 40.5% under generative evaluation (–37pp).
  • Blindly-Solvable Samples: Items answerable via language priors (image-invariant) constitute up to 70% of some benchmarks (e.g., VQA-v2), undermining visual faithfulness. These include factual knowledge, visual stereotypes, and symbolic puzzles.
  • Mislabeled/Ambiguous Ground Truth: Low resolution, ambiguous wording, and label errors affect large fractions of some datasets (e.g., 43.9% of MME-RW Autonomous Driving items). These are systematically identified and purged via judge-based verification.
  • Computational Overload: Evaluating full-scale suites with LLMs may consume >20% of total development compute; for instance, Qwen3-VL “Thinking” models use ~14× more tokens on incorrect answers.

Transformation of MCQ items to open-ended generative evaluation, combined with circular evaluation for option-based tasks, strips out chance floors and exposes true model deficiencies.
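
The sketch below illustrates the general idea of circular evaluation for tasks that must remain option-based: options are rotated across passes, and the item counts as correct only if the model answers every rotation correctly, which removes the chance floor. The `ask_model` callable and the prompt format are hypothetical stand-ins, not the DatBench implementation.

```python
from typing import Callable, Sequence

def circular_eval(question: str, options: Sequence[str], correct_idx: int,
                  ask_model: Callable[[str], str]) -> bool:
    """Return True only if the model picks the correct option under every rotation."""
    letters = "ABCDEFGH"[:len(options)]
    n = len(options)
    for shift in range(n):
        rotated = list(options[shift:]) + list(options[:shift])
        prompt = question + "\n" + "\n".join(
            f"{letters[i]}. {opt}" for i, opt in enumerate(rotated))
        predicted = ask_model(prompt).strip().upper()
        correct_pos = (correct_idx - shift) % n  # where the right option landed
        if predicted != letters[correct_pos]:
            return False  # one failed rotation fails the whole item
    return True
```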

4. Filtering Pipeline and Discriminative Subset Selection

DatBench employs a multi-stage pipeline:

  1. MCQ-to-Generative Conversion: Removal of all answer options, generative response required; circular evaluation applied for tasks inherently option-based.
  2. Blind-Solvability Filtering: Text-only inference is run across 27 models; samples solved above a dataset-type-specific threshold τ are discarded (e.g., 54% of VQA-v2, 48% of Chart, and 42% of Counting items eliminated).
  3. Judge-Based Quality Filtering: Items on which all models fail are adjudicated by a robust VLM judge; ambiguous, mislabeled, or low-resolution items are excised (up to 42% purged for Spatial).
  4. Discriminative Subset Selection: $r_{pb}(i)$ is calculated over a model grid; greedy selection maximizes $D(S)$ subject to a token-budget constraint, with up to 20% “frontier” unsolved items retained to preserve evaluation headroom.

Each step is quantitatively justified; e.g., generative conversion exposes capability drops of up to 35pp (AI2D), blind-solvability filtering shifts General VQA scores down by 72pp, and the final subset retains >90% of $D_{\text{total}}$ at 13× speedup (Joshi et al., 5 Jan 2026).
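
As an illustration of the selection step, a simple greedy heuristic can rank items by $r_{pb}$ and add them until a token budget is exhausted, while reserving part of the budget for unsolved frontier items. This is a sketch under assumed item metadata ('r_pb', 'tokens', 'solved_by_any'), not the authors' exact procedure.

```python
def select_subset(items: list[dict], token_budget: int,
                  frontier_fraction: float = 0.2) -> list[dict]:
    """Greedy sketch: each item dict carries 'r_pb', 'tokens', and 'solved_by_any'."""
    selected, selected_ids, used = [], set(), 0

    # Reserve a slice of the budget for "frontier" items no current model solves,
    # so the subset keeps headroom for future, stronger models.
    frontier_budget = int(frontier_fraction * token_budget)
    for it in (x for x in items if not x["solved_by_any"]):
        if used + it["tokens"] <= frontier_budget:
            selected.append(it)
            selected_ids.add(id(it))
            used += it["tokens"]

    # Greedily add the most discriminative remaining items within the total budget.
    for it in sorted(items, key=lambda x: x["r_pb"], reverse=True):
        if id(it) in selected_ids:
            continue
        if used + it["tokens"] <= token_budget:
            selected.append(it)
            selected_ids.add(id(it))
            used += it["tokens"]
    return selected
```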

5. Experimental Setup and Comparative Analysis

DatBench’s evaluation protocol spans 27 frontier models: Qwen2.5/3-VL, InternVL2/2.5/3/3.5, GLM-4.1V, R-4B, SmolVLM2, Phi-3.5 vision, Gemma-3 (scale: 2B–10B parameters). Model outputs are capped at 4096 tokens and scored by dedicated LLM judges using semantic-equivalence approaches [Chandak et al. 2025, as cited in (Joshi et al., 5 Jan 2026)]. Metrics computed include $V_{\Delta}$, $r_{pb}$, rank correlation via Spearman’s $\rho$ ($>0.95$ achieved with <10% of items), and speedup $S$.
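
The rank-fidelity check can be reproduced in a few lines with SciPy by comparing model rankings on the reduced subset against the full suite; the model names and scores below are hypothetical placeholders.

```python
from scipy.stats import spearmanr

# Hypothetical per-model accuracies on the full suite vs. the reduced subset.
full_suite_scores = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.58}
subset_scores     = {"model_a": 0.69, "model_b": 0.66, "model_c": 0.55}

models = sorted(full_suite_scores)
rho, p_value = spearmanr([full_suite_scores[m] for m in models],
                         [subset_scores[m] for m in models])
print(f"Spearman rho between subset and full-suite rankings: {rho:.3f}")
```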

Quantitative findings reveal domain-dependent vision delta effects: Counting (60.2%) and Grounding (42.3%) are highly vision-dependent, while Math (13.0%) and Spatial (14.9%) are confounded by language priors. Reasoning-oriented tasks (Chart, Math, General VQA) form a correlated cluster ($r \geq 0.76$) that correlates negatively with perceptual tasks (OCR, Spatial; $r \approx -0.2$). “Thinking” models gain 36.8% on Math but lose 53.5% on OCR at a 14× penalty in token cost.

6. Broader Implications for VLM Evaluation and Benchmarking

DatBench codifies efficient, rigorously discriminative, and faithful VLM evaluation practices. By transforming, filtering, and carefully selecting items from existing benchmarks, it produces a high-resolution evaluation suite that accurately reflects multimodal model strengths and weaknesses while optimizing for compute sustainability. The twin artifacts—DatBench-Full for exhaustive reporting and DatBench for rapid iteration—offer benchmarking consistency, reliability, and operational practicality, facilitating scalable VLM development and cross-model comparisons as underlying architectures and data modalities progress (Joshi et al., 5 Jan 2026).

This approach demonstrates that random sampling of evaluation items yields less than half the discriminative power at equal budget compared to psychometric selection, and that rank-only selection is a weak criterion for model comparison. A plausible implication is that such principled suite design will become standard in future foundation model evaluation protocols, as the scaling of model size and data modalities intensifies the need for robust, cost-effective evaluations.
