Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

Published 15 May 2026 in cs.CV and cs.CL | (2605.15864v1)

Abstract: Vision-LLMs (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper shows that VLMs fail at authentic visual re-examination, with accuracy drops up to 60% when visual inputs are swapped mid-inference.
Chain-of-thought reasoning variants are almost three times more vulnerable to contextual inertia than instruction-tuned models, leading to persistent errors.
Explicit external instructions restore visual grounding, indicating latent perceptual ability suppressed during self-reflective autonomous decoding.

Investigating the Authenticity of Visual Re-examination in Vision-LLMs

Overview

The paper "Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination" (2605.15864) presents a thorough analysis of whether modern Vision-LLMs (VLMs) genuinely perform visual re-examination during the reasoning process, especially when generating self-reflective statements. The authors introduce the VISUALSWAP framework and VS-BENCH benchmark to systematically probe the grounding of reflective reasoning in state-of-the-art VLMs—Qwen3-VL, Kimi-VL, and ERNIE-VL—by replacing input images mid-inference and evaluating models' capacity to reconsider based on the altered visual evidence.

VISUALSWAP and Benchmark Construction

The VISUALSWAP diagnostic protocol centers on a two-stage inference process:

The model is provided an image and a question, generating a reasoning chain and an answer.
After this initial reasoning, the visual input is replaced by a semantically distinct but visually similar image, while the context—including prior reasoning and a self-reflection prompt ("Wait, let me check the figure again")—is retained. The model must then continue its response.

VS-BENCH comprises 800 carefully curated image pairs across benchmarks (Math Vista, Math Verse, Math Vision, MMMU-Pro), ensuring both high visual similarity and crucial semantic differences. This construction enables discrimination between true visual re-engagement and mere linguistic pattern reproduction.

Experimental Results

Catastrophic Failure of Visual Re-examination

Across 15 models—spanning family and parameter scale—the VISUALSWAP probe reveals a severe deficiency: VLMs overwhelmingly fail to detect the semantic swap, persisting in reasoning from the outdated visual context. Quantitatively, probe accuracy degrades by up to 60% compared to baseline performance. For example, ERNIE-4.5-VL-Thinking's accuracy falls precipitously from 79.9% to 19.6%, and Qwen3-VL-235B-Thinking drops from 88.8% to 34.1%. These failures occur despite explicit self-reflective statements intended to simulate visual re-examination.

Counterintuitive Fragility of Thinking Variants

A critical result is that chain-of-thought "thinking" variants are almost three times more vulnerable to this failure than instruction-tuned counterparts. Qwen3-VL-32B-Thinking degrades by 48.3%, while its Instruct version shows only a 17.9% drop. Extended reasoning chains in thinking models create substantial "reasoning inertia," anchoring the model to prior textual outputs and suppressing visual re-grounding.

Ineffectiveness of Model Scaling

Contrary to expectations, increasing model scale does not mitigate the illusion of re-examination. Larger models such as Qwen3-VL-235B-A22B-Thinking become even more prone to the inertia effect, with degradation rising from 39.4% (8B) to 54.6% (235B).

External Instruction Restores Grounding

Explicit multi-turn user instructions—where the new image and an instruction to re-examine are supplied via a fresh user turn—almost completely recover visual grounding. For instance, Qwen3-VL-235B-A22B-Thinking rebounds to 85.4% accuracy, nearly matching baseline (88.8%). Thus, the visual re-examination capability exists latently, but self-generated reflective statements in continuous decoding do not trigger it.

Mechanistic Analysis

Attention Dynamics

Attention analysis demonstrates that self-reflective statements result in negligible increases in attention to image tokens. In contrast, external user instructions substantially elevate attention on visual representations, triggering genuine regrounding. The roots of the illusion are therefore not architectural incapacity, but a failure of autonomous attentional control during autoregressive, context-heavy generation.

Reasoning Context Inertia

The effect becomes more severe as more of the original reasoning context is retained: longer chains further entrench the model in its prior textual trajectory, decoupling it from current visual evidence.

Prompt Robustness and Naturalistic Triggers

The phenomenon is robust across diverse prompt phrasings and persists even when swaps are aligned with naturally occurring self-reflective triggers produced by models themselves. The effect is thus not an artifact of artificial prompt injection.

Causal Effect of Visual Attention Manipulation

Externally amplifying attention to image tokens (by scaling attention weights) partially heals the blindness, most notably in thinking variants (+18.2% accuracy gain), supporting the causal role of attentional suppression in the failure.

Implications

Practical Implications

Failure to perform authentic visual re-examination carries substantial risk in safety-critical applications such as medical imaging and autonomous systems, where overreliance on asserted self-verification can lead to undetected errors and hallucinations. The study cautions against relying on self-reflective chain-of-thought as an indicator of trustworthiness in the absence of verifiable attentional engagement.

Theoretical Implications

The findings underscore a fundamental limitation of current CoT-based VLMs: intrinsic reasoning processes prioritize text-based coherence over dynamic sensory re-grounding. This exposes a mismatch between linguistic simulation of reasoning and true multimodal integration. The distinction between capability and control is made explicit; models can process new images competently if externally cued, but lack mechanisms for autonomous, context-driven attention resetting.

Future Directions

Improvements likely require architectural or training protocol innovations to incorporate mechanisms for dynamic attentional control. This could involve training with explicit visual verification steps, introducing auxiliary losses tied to visual attention activation, or restructuring decoding paradigms to facilitate visual re-engagement during long-form reasoning. The diagnostic protocol established here provides a robust framework for evaluating progress on this front.

Conclusion

This work rigorously demonstrates that state-of-the-art VLMs, when left to their own devices, merely "say" they are re-examining visual input rather than genuinely "seeing" again during self-reflection. The illusion of visual re-examination is robust to scaling, prompting, and model design. The attenuation of this effect by explicit user intervention or forced attention highlights the gap between multimodal reasoning as linguistic simulation and genuine perceptual verification. Addressing this gap constitutes a central challenge for advancing reliable, autonomous multimodal AI (2605.15864).