- The paper introduces MVI-Bench, a benchmark that evaluates LVLM robustness using paired VQA instances with misleading visual inputs.
- It employs a hierarchical taxonomy of visual concepts, attributes, and relationships along with the MVI-Sensitivity metric to measure performance drops.
- Evaluation of 18 LVLMs reveals significant accuracy degradation, underscoring the need for enhanced visual perception and causal reasoning.
Introduction
Robustness of Large Vision-Language Models (LVLMs) to misleading visual cues remains a critical limitation for real-world deployment. "MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs" (arXiv:2511.14159) introduces MVI-Bench, a benchmark designed specifically to evaluate LVLM performance under misleading visual conditions, a setting largely neglected by prior work, which has focused instead on textual hallucination or adversarial attacks. MVI-Bench employs a hierarchical taxonomy rooted in three visual primitives (concept, attribute, and relationship), encompassing six representative misleading categories and 1,248 curated VQA instances. The paper further introduces MVI-Sensitivity, a metric quantifying fine-grained robustness degradation induced by misleading visual inputs, and evaluates 18 state-of-the-art open-source and closed-source LVLMs to uncover fundamental vulnerabilities and diagnostic insights.
Figure 1: Overview of misleading input types: (a) misleading textual queries; (b) misleading visual cues that induce model errors (e.g., stools mistaken for mushrooms).
Benchmark Design and Taxonomy
MVI-Bench is constructed from carefully paired VQA instances—each consisting of a normal image and its misleading counterpart, sharing near-identical semantics but differing only by the injection of subtle misleading visual cues. Six misleading categories are defined by grounding the taxonomy in three hierarchical visual levels:
- Visual Concept Level:
  - Visual Resemblance: Confusion of semantically distinct objects with similar appearance.
  - Representation Confusion: Failure to distinguish real-world objects from two-dimensional representations.
- Visual Attribute Level:
  - Material Confusion: Ambiguity in identifying objects with similar textures or materials.
- Visual Relationship Level:
  - Mirror Reflection: Misattribution of virtual objects as real due to reflections.
  - Occlusion Confusion: Errors in identifying or counting objects due to partial occlusion.
  - Visual Illusion: Susceptibility to optical illusions arising from geometry or context.
Each category is represented by a balanced set of instances drawn from three sources (natural, synthetic, and expert-edited images), covering diverse domains. This granular taxonomy enables detailed categorization and controlled analysis of LVLM robustness.
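To make the paired design concrete, the sketch below shows one plausible way such an instance could be represented; the field names and layout are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MVIPair:
    """Hypothetical record for one paired VQA instance (field names are assumed)."""
    question: str            # question shared by both images
    choices: List[str]       # answer options, including cue-exploiting distractors
    answer_normal: str       # ground truth for the normal image
    answer_misleading: str   # ground truth for the misleading counterpart
    normal_image: str        # path to the normal image
    misleading_image: str    # path to the image with the injected misleading cue
    category: str            # one of the six taxonomy categories
    source: str              # "natural", "synthetic", or "expert-edited"

# Illustrative instance modeled on the stools-vs-mushrooms example in Figure 1.
example = MVIPair(
    question="What objects are on the forest floor?",
    choices=["mushrooms", "wooden stools", "rocks", "flowers"],
    answer_normal="mushrooms",
    answer_misleading="wooden stools",
    normal_image="images/0001_normal.jpg",
    misleading_image="images/0001_misleading.jpg",
    category="Visual Resemblance",
    source="natural",
)
```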
Figure 2: Example normal/misleading image pairs from all categories; distractor choices are constructed to exploit misleading cues.
Figure 3: Benchmark composition statistics: balanced categories, image source diversity, broad topical coverage, and high semantic similarity across image pairs.
Evaluation Protocol and Metric
MVI-Bench employs two metrics for model evaluation: raw accuracy on normal and misleading images, and MVI-Sensitivity, defined as the normalized accuracy drop incurred by misleading cues; lower MVI-Sensitivity indicates stronger robustness. Systematic filtering and expert review keep the benchmark discriminative by discarding trivial cases on which models perform equally well under both conditions, and human annotation and verification maintain ground-truth consistency.
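The summary specifies MVI-Sensitivity only as a normalized accuracy drop; below is a minimal sketch, assuming the drop is normalized by accuracy on normal images. The reported GPT-5-Chat figures (63.78% accuracy on misleading images, 23.02% sensitivity) are consistent with this form if its normal-image accuracy is roughly 82.9%.

```python
def accuracy(predictions, labels):
    """Exact-match accuracy over a list of answers."""
    assert len(predictions) == len(labels) and labels
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def mvi_sensitivity(acc_normal, acc_misleading):
    """
    Normalized accuracy drop caused by misleading visual cues.
    Assumed form: (acc_normal - acc_misleading) / acc_normal.
    Lower values indicate stronger robustness.
    """
    if acc_normal == 0:
        return 0.0  # degenerate case: the model fails even on normal images
    return (acc_normal - acc_misleading) / acc_normal

# Sanity check against the assumed normalization:
print(f"{mvi_sensitivity(0.829, 0.6378):.2%}")  # ~23%
```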
Figure 4: Data curation pipeline: sequential stages of image collection, expert annotation, filtering, and human verification.
Empirical Analysis of LVLM Robustness
Across 18 LVLMs, substantial vulnerabilities are revealed:
- Performance Degradation: All models exhibit pronounced accuracy drops under misleading inputs, with MVI-Sensitivity exceeding 20% for every closed-source model and typically higher for open-source ones. Closed-source models benefit from proprietary data and advanced post-training alignment, yet none achieves robust performance across all categories.
- Best/Worst Performers: GPT-5-Chat delivers the strongest overall accuracy on misleading images (63.78%, sensitivity 23.02%), far surpassing the best open-source model, Qwen2-VL-72B (58.17%, sensitivity 31.52%). Open-source models such as Molmo remain particularly fragile, with nearly half of their responses affected by misleading cues.
Category-Level Insights
Diagnostic Investigation: Perception and Reasoning
The paper systematically evaluates the respective contributions of visual perception and reasoning to robustness failures.
Counterintuitive Model Behaviors
In a minority of cases (~4%), models succeed on the misleading image yet fail on its normal counterpart, often because they exploit spurious correlations between distractor cues and labels. Detailed attention visualization confirms this shortcut behavior and traces the errors to weak supervision paradigms that fail to enforce causal reasoning, motivating future work on rationale-grounded objectives and causally faithful evaluation criteria.
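The shortcut analysis can be reproduced in spirit with a simple heatmap overlay. The sketch below assumes an attention map has already been extracted from the model (e.g., answer-token-to-image-patch cross-attention averaged over heads); how that extraction is done depends on the specific LVLM and is not shown here.

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

def overlay_attention(image_path, attn, out_path="attn_overlay.png"):
    """Overlay a patch-level attention map (2D array) on an image for inspection."""
    img = Image.open(image_path).convert("RGB")
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)  # scale to [0, 1]
    # Map weights to an RGBA heatmap and upsample it to the image resolution.
    heat = Image.fromarray(np.uint8(plt.cm.jet(attn) * 255)).resize(img.size, Image.BILINEAR)
    plt.figure(figsize=(6, 6))
    plt.imshow(img)
    plt.imshow(heat, alpha=0.45)  # semi-transparent heatmap on top of the image
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()

# Hypothetical usage with a placeholder 24x24 attention grid:
overlay_attention("images/0001_misleading.jpg", np.random.rand(24, 24))
```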
Implications and Future Directions
The findings of MVI-Bench have both practical and theoretical implications:
- Practical: Improved robustness to misleading visual inputs is essential for trustworthy deployment in safety-critical, open-world contexts. Benchmarking against MVI-Bench exposes crucial failure modes that must be addressed before real-world adoption.
- Theoretical: Visual perception is foundational for robust multimodal reasoning. Training and evaluation paradigms in vision-language modeling should incorporate rationale-consistent supervision, address spurious correlations, and incentivize causal alignment within model architectures.
- Future Research: Directions include perceptual module enhancement through diverse, labeled datasets (incorporating illusions and rare visual artifacts), advanced multimodal reasoning via RL and explicit causal chain-of-thought training, and development of evaluative frameworks beyond answer correctness.
Conclusion
MVI-Bench establishes a rigorous standard for evaluating LVLM robustness to misleading visual inputs and provides actionable insights for model development. Its paired and taxonomy-driven design enables fine-grained, controlled analysis of both perceptual and reasoning weaknesses. Results from 18 LVLMs highlight persistent vulnerabilities and elucidate mechanisms of model failure, underscoring the necessity for next-generation benchmarks, training protocols, and evaluation metrics that jointly advance robustness, interpretability, and causal reasoning in multimodal AI systems.
Figure 7: Diverse paired examples from all misleading categories, illustrating the spectrum of visually induced reasoning errors.