VSI-Bench-Debiased: Enhancing Vision Evaluation
- VSI-Bench-Debiased is a vision-centric benchmark that systematically removes non-visual shortcuts via iterative bias pruning.
- It employs a Test-set Stress-Test (TsT) and Random Forest analysis to diagnose and minimize reliance on textual cues in model evaluations.
- The refined benchmark sharpens model assessment by widening the gap between vision-informed and text-only performance in multimodal reasoning.
VSI-Bench-Debiased is a vision-centric benchmark specifically constructed to address and mitigate non-visual shortcuts present in the original VSI-Bench. Developed via a systematic diagnostic and pruning framework, VSI-Bench-Debiased aims to provide a more faithful assessment of vision-driven reasoning capabilities in Multimodal LLMs (MLLMs) by reducing the potential for models to exploit textual or statistical biases intrinsic to the test set.
1. Diagnostic Principle and Motivation
The impetus for constructing VSI-Bench-Debiased stems from findings that many multimodal benchmarks, including vision-focused ones, are susceptible to exploitation via non-visual cues—allowing MLLMs to perform well even without true visual understanding. This phenomenon is particularly problematic in evaluation settings purported to require vision-centric reasoning. The underlying diagnostic principle is: if a model can "solve" a task using only non-visual information (question and answer text), then the benchmark is vulnerable to shortcut solving and fails to exclusively test the intended visual competencies. To systematically identify and address such vulnerabilities, the design process for VSI-Bench-Debiased mandates explicit probing and removal of high-bias samples based on model-predicted text-only answerability.
2. Test-set Stress-Test (TsT) Methodology
Central to the diagnostic workflow is the Test-set Stress-Test (TsT) methodology, which quantifies the degree to which visual questions can be answered using only textual cues. The TsT protocol is characterized by the following components:
- k-Fold Cross-Validation on Textual Inputs: The test set $\mathcal{D} = \{x_1, \ldots, x_N\}$ is partitioned into $K$ folds. For each held-out fold $\mathcal{D}_k$, a diagnostic model is fine-tuned on the remaining $K-1$ folds using exclusively the non-visual inputs (question and choices), and is then evaluated on $\mathcal{D}_k$.
- Sample-Level Bias Scores: For each sample $x$, the diagnostic model outputs the probability it assigns to the ground-truth answer given only the text, $s(x) = p_{\theta_k}\!\left(a^{*}(x) \mid q(x)\right)$. Each sample receives the score $s(x)$ from the fold in which it was held out.
- Global Non-Visual Solvability (TsT Accuracy): The aggregated proportion of answers predicted correctly from text alone, $\mathrm{Acc}_{\mathrm{TsT}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[\hat{a}(x_i) = a^{*}(x_i)\big]$, where $\hat{a}(x_i)$ is the diagnostic model's text-only prediction, serves as an upper-bound estimate of performance achievable without visual input.
The principal instantiation of TsT uses the Qwen2.5-7B-Instruct LLM, fine-tuned with LoRA using a cosine learning-rate schedule and batch size 32, and trained for three epochs per fold on 4×A100 GPUs (approximately 20 minutes per benchmark with $K = 5$ folds).
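A minimal sketch of this scoring loop is shown below; `finetune_text_only` and `choice_probs` are hypothetical wrappers around the LoRA fine-tuning and inference of the diagnostic LLM, and the names and fold-handling details are illustrative rather than taken from the original setup.

```python
import numpy as np
from sklearn.model_selection import KFold

def tst_bias_scores(samples, k=5, seed=0):
    """Test-set Stress-Test sketch: score each sample's text-only solvability.

    `samples`: list of dicts with 'question', 'choices' (list), 'answer' (index).
    `finetune_text_only` and `choice_probs` are assumed helpers wrapping the
    text-only LoRA fine-tuning and inference of the diagnostic LLM.
    """
    n = len(samples)
    scores = np.zeros(n)            # s(x): probability assigned to the true answer
    correct = np.zeros(n, dtype=bool)
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(np.arange(n)):
        # Fine-tune the diagnostic model on the other folds, text inputs only.
        model = finetune_text_only([samples[i] for i in train_idx])
        for i in test_idx:
            q, choices, ans = samples[i]["question"], samples[i]["choices"], samples[i]["answer"]
            probs = choice_probs(model, q, choices)   # distribution over options
            scores[i] = probs[ans]
            correct[i] = int(np.argmax(probs)) == ans
    tst_accuracy = float(correct.mean())  # global non-visual solvability
    return scores, tst_accuracy
```

Because every sample is scored only by a model that never saw it during fine-tuning, the resulting $s(x)$ reflects exploitable regularities in the test set rather than memorization of individual items.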
3. Random Forest-Based Diagnostic Analysis
To complement the LLM-based TsT diagnostic, a lightweight Random Forest-based audit is conducted for rapid, interpretable examination of non-visual shortcut features. This diagnostic uses scikit-learn’s RandomForestClassifier (maximum tree depth 20) under the same 5-fold cross-validation regime.
Feature Set:
- Textual features: TF-IDF vectors, question length, spatial keyword counts
- Answer-space features: Number of options, per-option statistics (e.g., distance from global mean for MCQ tasks)
- Task-specific features: For object counting, instance category frequency and log-moments; for size estimation, average log-size and standard deviation per category; for spatial relations and appearance order, frequency and occurrence patterns of relevant entities
- Metadata: Question type, object categories
Interpretability: Feature importance values (Gini importance) are used to pinpoint which statistical cues primarily drive shortcut performance. For example, in Object Size Estimation, the feature encoding the category-level average object size can carry as much as $0.968$ of the total Gini importance, indicating severe susceptibility to category-level priors.
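This style of audit can be approximated in a few lines of scikit-learn. The sketch below is a simplified stand-in for the full feature set above: it combines TF-IDF with question length and an optional hook for task-specific features; `rf_shortcut_audit` is an illustrative name and the `n_estimators` value is an assumed placeholder (the original count is not given here).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict

def rf_shortcut_audit(questions, answers, extra_features=None, n_folds=5):
    """Lightweight non-visual shortcut audit (simplified feature set).

    `questions`: list of question strings; `answers`: array-like of answer labels;
    `extra_features`: optional (n_samples, d) array of task-specific features,
    e.g. per-category size statistics.
    """
    tfidf = TfidfVectorizer(max_features=2000)
    X_text = tfidf.fit_transform(questions)
    # Handcrafted textual feature: question length in tokens.
    lengths = csr_matrix(np.array([[len(q.split())] for q in questions], dtype=float))
    blocks = [X_text, lengths]
    if extra_features is not None:
        blocks.append(csr_matrix(np.asarray(extra_features, dtype=float)))
    X = hstack(blocks).tocsr()

    # n_estimators is an assumed placeholder; max_depth=20 follows the text.
    clf = RandomForestClassifier(n_estimators=200, max_depth=20, random_state=0)
    # Out-of-fold predictions estimate per-sample text-only answerability.
    preds = cross_val_predict(clf, X, answers, cv=n_folds)
    accuracy = float(np.mean(preds == np.asarray(answers)))

    # Fit once on the full set to inspect Gini feature importances.
    clf.fit(X, answers)
    return accuracy, clf.feature_importances_
```

Inspecting the returned `feature_importances_` against the concatenated feature names is what surfaces dominant cues such as the category-level mean size noted above.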
4. Iterative Bias Pruning (IBP) Procedure
The construction of VSI-Bench-Debiased relies on the Iterative Bias Pruning (IBP) protocol, which iteratively removes the most textually-exploitable samples:
IBP Pseudocode:

```
Input:   dataset D, bias scorer ComputeSampleBiasScores(D) → {s(x)},
         removal budget B, batch size b, threshold τ

removed ← 0
while removed < B:
    {s(x)} ← ComputeSampleBiasScores(D)
    if max_x s(x) ≤ τ: break
    I ← top-b samples by s(x)        // b adjusted for the final batch
    D ← D \ I
    removed ← removed + |I|
return D as the debiased benchmark D′
```
Key Parameters and Strategy:
- Removal always targets the samples with the highest bias score $s(x)$ in each batch.
- Early stopping is triggered once every remaining sample satisfies $s(x) \le \tau$, with $\tau$ set slightly above the chance level for the task.
- For VSI-Bench-Debiased, the removal budget $B$ corresponded to approximately 30.7% of the dataset (937 samples in total).
A plausible implication is that early stopping at the threshold $\tau$ ensures that, after pruning, no residual question can be answered by the text-only model with excessive confidence, thereby reducing non-visual exploitability.
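A runnable Python sketch of the IBP loop is given below; `compute_sample_bias_scores` is a stand-in for either the TsT scorer or the Random Forest audit, and the function signature is illustrative rather than taken from the original implementation.

```python
def iterative_bias_pruning(dataset, compute_sample_bias_scores,
                           removal_budget, batch_size, threshold):
    """Iteratively remove the most text-exploitable samples (IBP sketch).

    `dataset` is a list of samples; `compute_sample_bias_scores` maps the
    current dataset to a list of bias scores s(x) aligned with it (the TsT
    and Random Forest scorers sketched above are assumed interfaces).
    """
    dataset = list(dataset)
    removed = 0
    while removed < removal_budget:
        scores = compute_sample_bias_scores(dataset)
        # Early stop: no remaining sample is confidently text-solvable.
        if max(scores) <= threshold:
            break
        # Remove the top-b highest-bias samples, shrinking b for the final batch.
        b = min(batch_size, removal_budget - removed)
        top_idx = set(sorted(range(len(dataset)),
                             key=lambda i: scores[i], reverse=True)[:b])
        dataset = [x for i, x in enumerate(dataset) if i not in top_idx]
        removed += len(top_idx)
    return dataset
```

Recomputing $s(x)$ after every batch, rather than once up front, is what makes the pruning iterative: removing the most exploitable samples can shift which residual statistical cues the text-only diagnostic relies on.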
5. Dataset Composition and Bias Reduction
Dataset Statistics:
| Task | Original | Removed | % Removed | Final |
|---|---|---|---|---|
| Object Counting | 764 | 281 | 36.8% | 483 |
| Object Size Estimation | 764 | 243 | 31.8% | 521 |
| Spatial Relation | 764 | 213 | 27.9% | 551 |
| Appearance Order | 764 | 200 | 26.2% | 564 |
| Total | 3,056 | 937 | 30.7% | 2,119 |
Prior to pruning, the bias-score distribution exhibited a heavy upper tail of samples with very high $s(x)$; after IBP, $s(x)$ is tightly concentrated at lower values, indicating the effective removal of trivially answerable cases.
Examples:
- Removed (high-bias):
  Size Estimation—“What is the length of the longest dimension of the dishwasher?” (a category with a narrow, highly predictable size distribution, yielding a high bias score $s(x)$)
- Retained (low-bias):
  Size Estimation—“What is the length of the longest dimension of the ceiling light?” (a category whose instance sizes vary widely, so the category prior is uninformative and $s(x)$ remains low)
This selection paradigm is consistently applied across other task types such as object counting and spatial relations, whereby canonically “obvious” or statistically predictable cases are pruned.
6. Evaluation and Benchmark Implications
Comprehensive evaluation using LLaVA-Video-7B quantifies the impact of bias pruning on both vision-enabled (Vis.) and vision-blind (Blind; text-only) configurations, as well as on the vision-blind gap $\Delta = \mathrm{Acc}_{\text{Vis.}} - \mathrm{Acc}_{\text{Blind}}$:
| Model Configuration | Original VSI-Bench: Vis. | Original VSI-Bench: Blind | VSI-Bench-Debiased: Vis. | VSI-Bench-Debiased: Blind |
|---|---|---|---|---|
| Base (no fine-tuning) | 36.7 | 25.9 | — | 20.3 |
| + VSI-Train-10k fine-tune (3 epochs) | 57.1 | 44.7 | — | 32.0 |
| Improvement from fine-tuning | +20.4 | +18.8 | — | — |
Key Observations:
- Blind accuracy after fine-tuning drops from 44.7% (original) to 32.0% (debiased), while vision-informed performance maintains a wider margin.
- The vision-blind gap increases from 12.4 percentage points (pp) in the original to 16.6 pp after bias pruning, indicating that gains attributable to genuine visual processing are now more discriminative relative to text-based shortcuts.
- The base model’s blind performance shows a decrease of 5.6 pp (25.9→20.3) on the debiased set, reinforcing successful bias mitigation.
A plausible implication is that such bias reduction sharpens the diagnostic power of the benchmark, making performance improvements reliant on visual input rather than on dataset artifacts or LLM priors.
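These figures are mutually consistent, as a quick worked check on the quoted numbers shows (pp = percentage points): the fine-tuned vision-blind gap on the original benchmark, the base model's blind drop under debiasing, and the blind gain from fine-tuning on the debiased set are, respectively,

$$
57.1 - 44.7 = 12.4\ \mathrm{pp}, \qquad
25.9 - 20.3 = 5.6\ \mathrm{pp}, \qquad
32.0 - 20.3 = 11.7\ \mathrm{pp}.
$$

The quoted 16.6 pp gap on the debiased set likewise implies a fine-tuned vision-enabled accuracy of roughly $32.0 + 16.6 = 48.6$ on VSI-Bench-Debiased.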
7. Significance and Research Impact
The development of VSI-Bench-Debiased embodies a systematic approach to diagnosing and mitigating non-visual shortcuts in vision-centric evaluation. By integrating the TsT methodology, interpretable Random Forest audit, and iterative sample pruning, it establishes a principled framework for benchmark designers to proactively fortify datasets against superficial exploitability.
VSI-Bench-Debiased retains approximately 70% of the original VSI-Bench samples, but through aggressive pruning of high-bias instances, it substantially reduces the shortcut opportunity space for multimodal models. The resultant benchmark more faithfully measures genuine vision-driven reasoning, supporting more robust and informative model comparisons and facilitating downstream research in benchmark design, evaluation methodology, and MLLM robustness.