VSI-Bench-Debiased: Enhancing Vision Evaluation
- VSI-Bench-Debiased is a vision-centric benchmark that systematically removes non-visual shortcuts via iterative bias pruning.
- It employs a Test-set Stress-Test (TsT) and Random Forest analysis to diagnose and minimize reliance on textual cues in model evaluations.
- The refined benchmark sharpens model assessment by widening the gap between vision-informed and text-only performance in multimodal reasoning.
VSI-Bench-Debiased is a vision-centric benchmark specifically constructed to address and mitigate non-visual shortcuts present in the original VSI-Bench. Developed via a systematic diagnostic and pruning framework, VSI-Bench-Debiased aims to provide a more faithful assessment of vision-driven reasoning capabilities in Multimodal LLMs (MLLMs) by reducing the potential for models to exploit textual or statistical biases intrinsic to the test set.
1. Diagnostic Principle and Motivation
The impetus for constructing VSI-Bench-Debiased stems from findings that many multimodal benchmarks, including vision-focused ones, are susceptible to exploitation via non-visual cues—allowing MLLMs to perform well even without true visual understanding. This phenomenon is particularly problematic in evaluation settings purported to require vision-centric reasoning. The underlying diagnostic principle is: if a model can "solve" a task using only non-visual information (question and answer text), then the benchmark is vulnerable to shortcut solving and fails to exclusively test the intended visual competencies. To systematically identify and address such vulnerabilities, the design process for VSI-Bench-Debiased mandates explicit probing and removal of high-bias samples based on model-predicted text-only answerability.
2. Test-set Stress-Test (TsT) Methodology
Central to the diagnostic workflow is the Test-set Stress-Test (TsT) methodology, which quantifies the degree to which visual questions can be answered using only textual cues. The TsT protocol is characterized by the following components:
- k-Fold Cross-Validation on Textual Inputs: The test set $\mathcal{D} = \{x_1, \ldots, x_N\}$ is partitioned into $K$ folds. For each held-out fold $\mathcal{D}_k$, a diagnostic model is fine-tuned on the remaining $K-1$ folds using exclusively the non-visual inputs (question and choices), and is then evaluated on $\mathcal{D}_k$.
- Sample-Level Bias Scores: For each sample $x$, the diagnostic model outputs the probability it assigns to the ground-truth answer given only the text, $s(x) = p_{\theta_k}\!\left(a^{*}(x) \mid q(x)\right)$. Each sample receives the score $s(x)$ from the fold in which it was held out.
- Global Non-Visual Solvability (TsT Accuracy): The aggregated proportion of answers predicted correctly from text alone, $\mathrm{Acc}_{\mathrm{TsT}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\hat{a}(x_i) = a^{*}(x_i)\big]$, where $\hat{a}(x_i)$ is the diagnostic model's text-only prediction, serves as an upper-bound estimate of performance achievable without visual input.
The principal instantiation of TsT uses the Qwen2.5-7B-Instruct LLM, fine-tuned with LoRA using a cosine learning-rate schedule and batch size 32, and trained for three epochs per fold on 4×A100 GPUs (approximately 20 minutes per benchmark with $K = 5$ folds).
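A minimal sketch of this scoring loop is shown below; `finetune_text_only` and `choice_probs` are hypothetical wrappers around the LoRA fine-tuning and inference of the diagnostic LLM, and the names and fold-handling details are illustrative rather than taken from the original setup.

```python
import numpy as np
from sklearn.model_selection import KFold

def tst_bias_scores(samples, k=5, seed=0):
    """Test-set Stress-Test sketch: score each sample's text-only solvability.

    `samples`: list of dicts with 'question', 'choices' (list), 'answer' (index).
    `finetune_text_only` and `choice_probs` are assumed helpers wrapping the
    text-only LoRA fine-tuning and inference of the diagnostic LLM.
    """
    n = len(samples)
    scores = np.zeros(n)            # s(x): probability assigned to the true answer
    correct = np.zeros(n, dtype=bool)
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(np.arange(n)):
        # Fine-tune the diagnostic model on the other folds, text inputs only.
        model = finetune_text_only([samples[i] for i in train_idx])
        for i in test_idx:
            q, choices, ans = samples[i]["question"], samples[i]["choices"], samples[i]["answer"]
            probs = choice_probs(model, q, choices)   # distribution over options
            scores[i] = probs[ans]
            correct[i] = int(np.argmax(probs)) == ans
    tst_accuracy = float(correct.mean())  # global non-visual solvability
    return scores, tst_accuracy
```

Because every sample is scored only by a model that never saw it during fine-tuning, the resulting $s(x)$ reflects exploitable regularities in the test set rather than memorization of individual items.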
3. Random Forest-Based Diagnostic Analysis
To complement the LLM-based TsT diagnostic, a lightweight Random Forest-based audit is conducted for rapid, interpretable examination of non-visual shortcut features. This diagnostic uses scikit-learn’s RandomForestClassifier (maximum tree depth 20) under the same 5-fold cross-validation regime.
Feature Set:
- Textual features: TF-IDF vectors, question length, spatial keyword counts
- Answer-space features: Number of options, per-option statistics (e.g., distance from global mean for MCQ tasks)
- Task-specific features: For object counting, instance category frequency and log-moments; for size estimation, average log-size and standard deviation per category; for spatial relations and appearance order, frequency and occurrence patterns of relevant entities
- Metadata: Question type, object categories
Interpretability: Feature importance values (Gini importance) are used to pinpoint which statistical cues primarily drive shortcut performance. For example, in Object Size Estimation, the feature encoding the category-level average object size can carry as much as $0.968$ of the total Gini importance, indicating severe susceptibility to category-level priors.
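This style of audit can be approximated in a few lines of scikit-learn. The sketch below is a simplified stand-in for the full feature set above: it combines TF-IDF with question length and an optional hook for task-specific features; `rf_shortcut_audit` is an illustrative name and the `n_estimators` value is an assumed placeholder (the original count is not given here).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict

def rf_shortcut_audit(questions, answers, extra_features=None, n_folds=5):
    """Lightweight non-visual shortcut audit (simplified feature set).

    `questions`: list of question strings; `answers`: array-like of answer labels;
    `extra_features`: optional (n_samples, d) array of task-specific features,
    e.g. per-category size statistics.
    """
    tfidf = TfidfVectorizer(max_features=2000)
    X_text = tfidf.fit_transform(questions)
    # Handcrafted textual feature: question length in tokens.
    lengths = csr_matrix(np.array([[len(q.split())] for q in questions], dtype=float))
    blocks = [X_text, lengths]
    if extra_features is not None:
        blocks.append(csr_matrix(np.asarray(extra_features, dtype=float)))
    X = hstack(blocks).tocsr()

    # n_estimators is an assumed placeholder; max_depth=20 follows the text.
    clf = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=0)
    # Out-of-fold predictions estimate per-sample text-only answerability.
    preds = cross_val_predict(clf, X, answers, cv=n_folds)
    accuracy = float(np.mean(preds == np.asarray(answers)))

    # Fit once on the full set to inspect Gini feature importances.
    clf.fit(X, answers)
    return accuracy, clf.feature_importances_
```

Inspecting the returned `feature_importances_` against the concatenated feature names is what surfaces dominant cues such as the category-level mean size noted above.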
4. Iterative Bias Pruning (IBP) Procedure
The construction of VSI-Bench-Debiased relies on the Iterative Bias Pruning (IBP) protocol, which iteratively removes the most textually-exploitable samples:
IBP Pseudocode:

```
Input:   dataset D, bias scorer ComputeSampleBiasScores(D) → {s(x)},
         removal budget B, batch size b, threshold τ

removed ← 0
while removed < B:
    {s(x)} ← ComputeSampleBiasScores(D)
    if max_x s(x) ≤ τ: break
    I ← top-b samples by s(x)        // b adjusted for the final batch
    D ← D \ I
    removed ← removed + |I|
return D as the debiased benchmark D′
```
Key Parameters and Strategy:
- Removal always targets the samples with the highest bias score $s(x)$ in each batch.
- Early stopping is triggered once every remaining sample satisfies $s(x) \le \tau$, with $\tau$ set slightly above the chance level for the task.
- For VSI-Bench-Debiased, the removal budget $B$ corresponded to approximately 30.7% of the dataset (937 samples in total).
A plausible implication is that early stopping at the threshold $\tau$ ensures that, after pruning, no residual question can be answered by the text-only model with excessive confidence, thereby reducing non-visual exploitability.
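A runnable Python sketch of the IBP loop is given below; `compute_sample_bias_scores` is a stand-in for either the TsT scorer or the Random Forest audit, and the function signature is illustrative rather than taken from the original implementation.

```python
def iterative_bias_pruning(dataset, compute_sample_bias_scores,
                           removal_budget, batch_size, threshold):
    """Iteratively remove the most text-exploitable samples (IBP sketch).

    `dataset` is a list of samples; `compute_sample_bias_scores` maps the
    current dataset to a list of bias scores s(x) aligned with it (the TsT
    and Random Forest scorers sketched above are assumed interfaces).
    """
    dataset = list(dataset)
    removed = 0
    while removed < removal_budget:
        scores = compute_sample_bias_scores(dataset)
        # Early stop: no remaining sample is confidently text-solvable.
        if max(scores) <= threshold:
            break
        # Remove the top-b highest-bias samples, shrinking b for the final batch.
        b = min(batch_size, removal_budget - removed)
        top_idx = set(sorted(range(len(dataset)),
                             key=lambda i: scores[i], reverse=True)[:b])
        dataset = [x for i, x in enumerate(dataset) if i not in top_idx]
        removed += len(top_idx)
    return dataset
```

Recomputing $s(x)$ after every batch, rather than once up front, is what makes the pruning iterative: removing the most exploitable samples can shift which residual statistical cues the text-only diagnostic relies on.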
5. Dataset Composition and Bias Reduction
Dataset Statistics:
| Task | Original | Removed | % Removed | Final |
|---|---|---|---|---|
| Object Counting | 764 | 281 | 36.8% | 483 |
| Object Size Estimation | 764 | 243 | 31.8% | 521 |
| Spatial Relation | 764 | 213 | 27.9% | 551 |
| Appearance Order | 764 | 200 | 26.2% | 564 |
| Total | 3,056 | 937 | 30.7% | 2,119 |
Prior to pruning, the bias-score distribution exhibited a heavy upper tail of samples with very high $s(x)$; after IBP, $s(x)$ is tightly concentrated at lower values, indicating the effective removal of trivially answerable cases.
Examples:
- Removed (high-bias):
  Size Estimation—“What is the length of the longest dimension of the dishwasher?” (a category with a narrow, highly predictable size distribution, yielding a high bias score $s(x)$)
- Retained (low-bias):
  Size Estimation—“What is the length of the longest dimension of the ceiling light?” (a category whose instance sizes vary widely, so the category prior is uninformative and $s(x)$ remains low)
This selection paradigm is consistently applied across other task types such as object counting and spatial relations, whereby canonically “obvious” or statistically predictable cases are pruned.
6. Evaluation and Benchmark Implications
Comprehensive evaluation using LLaVA-Video-7B quantifies the impact of bias pruning on both vision-enabled (Vis.) and vision-blind (Blind; text-only) configurations, as well as on the vision-blind gap $\Delta = \mathrm{Acc}_{\text{Vis.}} - \mathrm{Acc}_{\text{Blind}}$:
| Model Configuration | Original VSI-Bench: Vis. | Original VSI-Bench: Blind | VSI-Bench-Debiased: Vis. | VSI-Bench-Debiased: Blind |
|---|---|---|---|---|
| Base (no fine-tuning) | 36.7 | 25.9 | — | 20.3 |
| + VSI-Train-10k fine-tune (3 epochs) | 57.1 | 44.7 | — | 32.0 |
| Improvement from fine-tuning | +20.4 | +18.8 | — | — |
Key Observations:
- Blind accuracy after fine-tuning drops from 44.7% (original) to 32.0% (debiased), while vision-informed performance maintains a wider margin.
- The vision-blind gap increases from 12.4 percentage points (pp) in the original to 16.6 pp after bias pruning, indicating that gains attributable to genuine visual processing are now more discriminative relative to text-based shortcuts.
- The base model’s blind performance shows a decrease of 5.6 pp (25.9→20.3) on the debiased set, reinforcing successful bias mitigation.
A plausible implication is that such bias reduction sharpens the diagnostic power of the benchmark, making performance improvements reliant on visual input rather than on dataset artifacts or LLM priors.
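These figures are mutually consistent, as a quick worked check on the quoted numbers shows (pp = percentage points): the fine-tuned vision-blind gap on the original benchmark, the base model's blind drop under debiasing, and the blind gain from fine-tuning on the debiased set are, respectively,

$$
57.1 - 44.7 = 12.4\ \mathrm{pp}, \qquad
25.9 - 20.3 = 5.6\ \mathrm{pp}, \qquad
32.0 - 20.3 = 11.7\ \mathrm{pp}.
$$

The quoted 16.6 pp gap on the debiased set likewise implies a fine-tuned vision-enabled accuracy of roughly $32.0 + 16.6 = 48.6$ on VSI-Bench-Debiased.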
7. Significance and Research Impact
The development of VSI-Bench-Debiased embodies a systematic approach to diagnosing and mitigating non-visual shortcuts in vision-centric evaluation. By integrating the TsT methodology, interpretable Random Forest audit, and iterative sample pruning, it establishes a principled framework for benchmark designers to proactively fortify datasets against superficial exploitability.
VSI-Bench-Debiased retains approximately 70% of the original VSI-Bench samples, but through aggressive pruning of high-bias instances, it substantially reduces the shortcut opportunity space for multimodal models. The resultant benchmark more faithfully measures genuine vision-driven reasoning, supporting more robust and informative model comparisons and facilitating downstream research in benchmark design, evaluation methodology, and MLLM robustness.