- The paper shows that VLMs fail at authentic visual re-examination, with accuracy drops up to 60% when visual inputs are swapped mid-inference.
- Chain-of-thought reasoning variants are almost three times more vulnerable to contextual inertia than instruction-tuned models, leading to persistent errors.
- Explicit external instructions restore visual grounding, indicating latent perceptual ability suppressed during self-reflective autonomous decoding.
Investigating the Authenticity of Visual Re-examination in Vision-LLMs
Overview
The paper "Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination" (2605.15864) presents a thorough analysis of whether modern Vision-LLMs (VLMs) genuinely perform visual re-examination during the reasoning process, especially when generating self-reflective statements. The authors introduce the VISUALSWAP framework and VS-BENCH benchmark to systematically probe the grounding of reflective reasoning in state-of-the-art VLMs—Qwen3-VL, Kimi-VL, and ERNIE-VL—by replacing input images mid-inference and evaluating models' capacity to reconsider based on the altered visual evidence.
VISUALSWAP and Benchmark Construction
The VISUALSWAP diagnostic protocol centers on a two-stage inference process:
- The model is provided an image and a question, generating a reasoning chain and an answer.
- After this initial reasoning, the visual input is replaced by a semantically distinct but visually similar image, while the context—including prior reasoning and a self-reflection prompt ("Wait, let me check the figure again")—is retained. The model must then continue its response.
VS-BENCH comprises 800 carefully curated image pairs across benchmarks (Math Vista, Math Verse, Math Vision, MMMU-Pro), ensuring both high visual similarity and crucial semantic differences. This construction enables discrimination between true visual re-engagement and mere linguistic pattern reproduction.
Experimental Results
Catastrophic Failure of Visual Re-examination
Across 15 models—spanning family and parameter scale—the VISUALSWAP probe reveals a severe deficiency: VLMs overwhelmingly fail to detect the semantic swap, persisting in reasoning from the outdated visual context. Quantitatively, probe accuracy degrades by up to 60% compared to baseline performance. For example, ERNIE-4.5-VL-Thinking's accuracy falls precipitously from 79.9% to 19.6%, and Qwen3-VL-235B-Thinking drops from 88.8% to 34.1%. These failures occur despite explicit self-reflective statements intended to simulate visual re-examination.
Counterintuitive Fragility of Thinking Variants
A critical result is that chain-of-thought "thinking" variants are almost three times more vulnerable to this failure than instruction-tuned counterparts. Qwen3-VL-32B-Thinking degrades by 48.3%, while its Instruct version shows only a 17.9% drop. Extended reasoning chains in thinking models create substantial "reasoning inertia," anchoring the model to prior textual outputs and suppressing visual re-grounding.
Ineffectiveness of Model Scaling
Contrary to expectations, increasing model scale does not mitigate the illusion of re-examination. Larger models such as Qwen3-VL-235B-A22B-Thinking become even more prone to the inertia effect, with degradation rising from 39.4% (8B) to 54.6% (235B).
External Instruction Restores Grounding
Explicit multi-turn user instructions—where the new image and an instruction to re-examine are supplied via a fresh user turn—almost completely recover visual grounding. For instance, Qwen3-VL-235B-A22B-Thinking rebounds to 85.4% accuracy, nearly matching baseline (88.8%). Thus, the visual re-examination capability exists latently, but self-generated reflective statements in continuous decoding do not trigger it.
Mechanistic Analysis
Attention Dynamics
Attention analysis demonstrates that self-reflective statements result in negligible increases in attention to image tokens. In contrast, external user instructions substantially elevate attention on visual representations, triggering genuine regrounding. The roots of the illusion are therefore not architectural incapacity, but a failure of autonomous attentional control during autoregressive, context-heavy generation.
Reasoning Context Inertia
The effect becomes more severe as more of the original reasoning context is retained: longer chains further entrench the model in its prior textual trajectory, decoupling it from current visual evidence.
Prompt Robustness and Naturalistic Triggers
The phenomenon is robust across diverse prompt phrasings and persists even when swaps are aligned with naturally occurring self-reflective triggers produced by models themselves. The effect is thus not an artifact of artificial prompt injection.
Causal Effect of Visual Attention Manipulation
Externally amplifying attention to image tokens (by scaling attention weights) partially heals the blindness, most notably in thinking variants (+18.2% accuracy gain), supporting the causal role of attentional suppression in the failure.
Implications
Practical Implications
Failure to perform authentic visual re-examination carries substantial risk in safety-critical applications such as medical imaging and autonomous systems, where overreliance on asserted self-verification can lead to undetected errors and hallucinations. The study cautions against relying on self-reflective chain-of-thought as an indicator of trustworthiness in the absence of verifiable attentional engagement.
Theoretical Implications
The findings underscore a fundamental limitation of current CoT-based VLMs: intrinsic reasoning processes prioritize text-based coherence over dynamic sensory re-grounding. This exposes a mismatch between linguistic simulation of reasoning and true multimodal integration. The distinction between capability and control is made explicit; models can process new images competently if externally cued, but lack mechanisms for autonomous, context-driven attention resetting.
Future Directions
Improvements likely require architectural or training protocol innovations to incorporate mechanisms for dynamic attentional control. This could involve training with explicit visual verification steps, introducing auxiliary losses tied to visual attention activation, or restructuring decoding paradigms to facilitate visual re-engagement during long-form reasoning. The diagnostic protocol established here provides a robust framework for evaluating progress on this front.
Conclusion
This work rigorously demonstrates that state-of-the-art VLMs, when left to their own devices, merely "say" they are re-examining visual input rather than genuinely "seeing" again during self-reflection. The illusion of visual re-examination is robust to scaling, prompting, and model design. The attenuation of this effect by explicit user intervention or forced attention highlights the gap between multimodal reasoning as linguistic simulation and genuine perceptual verification. Addressing this gap constitutes a central challenge for advancing reliable, autonomous multimodal AI (2605.15864).