- The paper introduces a tool-guided framework that uses immutable image resource versioning to systematically analyze VLM responses on visual illusions.
- It demonstrates structural generalization using per-category tool-use strategies and maintains performance across varied test cases on VI-Probe and VIA-Bench.
- It identifies key failure modes such as positive-detection bias, spatial precision without logical consequence, and noise sensitivity, highlighting data-induced calibration challenges.
Problem Statement and Motivation
"Seeing the Evidence, Missing the Answer: Tool-Guided Vision-LLMs on Visual Illusions" (2603.29428) systematically interrogates Vision-LLMs (VLMs) on visual illusions, a task that demands perceptually counterfactual reasoning. Despite excelling at broad visual question answering, VLMs exhibit a pronounced bias: when presented with canonical illusion images, even those surgically altered to remove the illusory effect, they persistently predict the illusion as "real." The authors trace this failure to training-data imbalance rather than weak representations: public web corpora and pre-training datasets are saturated with genuine illusions, so structural priors dominate over direct visual evidence.
Methodological Contributions
The paper presents a tool-guided framework operating exclusively in inference mode—no model weight updates are allowed, aligning with the DataCV 2026 challenge restrictions. The backbone is a Gemini Flash-family VLM deployed zero-shot, augmented by access to a constrained library of generic image manipulation tools: geometric annotation primitives (line, rectangle, circle), region cropping, side-by-side comparison, and, for real-world anomaly detection (Task II), extensions like channel isolation and color sampling.
A key architectural innovation is the immutable image resource versioning system: every tool invocation produces a new image resource that is recorded in a registry; no invocation ever overwrites its source. The full registry remains accessible throughout the multi-step agentic reasoning chain, enabling compositional analysis and the revisiting of intermediate hypotheses. The paradigm is analogous to immutable value semantics in functional programming, here applied to visual annotation.
The routing logic eschews code synthesis or module library expansion (cf. VisProg [gupta2023visprog]) and instead encodes per-category tool-use strategies in the system prompt. Each question is classified into a perceptual category (e.g., size comparison, boundary detection), mapped to a recommended tool sequence, and executed iteratively in a ReAct-style loop. This design yields robust structural generalization: the agent applies identical reasoning strategies to previously unseen variants (e.g., Mach Bands rotated horizontally) without prompt revision.
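The routing described above can be sketched as a category-to-strategy table executed step by step. The category labels, tool names, and toy classifier below are assumptions for illustration; in the paper the classification and observation steps are performed by the VLM itself, not by keyword matching.

```python
# Illustrative sketch of prompt-encoded routing: each perceptual category
# maps to a fixed tool sequence, applied in a ReAct-style loop
# (act, observe the new image resource, act on it in turn).

STRATEGIES = {
    "size_comparison":    ["draw_rectangle", "crop", "side_by_side"],
    "boundary_detection": ["crop", "side_by_side"],
    "straightness":       ["draw_line", "crop"],
}

def classify(question: str) -> str:
    """Toy stand-in for the VLM's perceptual-category judgment."""
    if "larger" in question or "bigger" in question:
        return "size_comparison"
    if "boundary" in question or "edge" in question:
        return "boundary_detection"
    return "straightness"

def run_agent(question: str, image_id: str, call_tool) -> list[str]:
    """Execute the category's recommended tool sequence iteratively."""
    trace, current = [], image_id
    for tool in STRATEGIES[classify(question)]:
        current = call_tool(tool, current)   # returns a new resource id
        trace.append(f"{tool} -> {current}")
    return trace

# Usage with a stub tool executor that just records the derivation:
trace = run_agent("Is the left circle larger?", "img_1",
                  lambda tool, rid: f"{rid}|{tool}")
```

Because the strategy is selected by category rather than by stimulus identity, an unseen variant (e.g., a rotated Mach Bands image) falls into an existing category and reuses the same tool sequence.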
Empirical Evaluation
The framework was evaluated on VI-Probe (Task I) and VIA-Bench (Task II), using the official validation and test splits. Accuracy held across structurally unfamiliar test cases, supporting the claim of cross-structural generalization. Analysis of tool-call logs revealed a clear separation between geometric annotation (dominant in classic illusions) and broader analysis operators (activated for real-world anomaly detection).
On the numbers, the paper reports that accuracy remains stable between validation and test sets, even when the test set introduces novel orientations or stimulus configurations. For boundary-detection questions, the prompt-embedded strategies transferred to horizontally stacked color regions, in contrast with more brittle module-specific approaches.
Analysis of Systematic Failure Modes
Three recurring failure modes emerged:
- Positive-Detection Bias: VLMs consistently overpredict positive illusion cases, especially when images are modified to break the illusion. The authors attribute this to extreme pre-training bias—illusory stimuli overwhelmingly dominate public datasets. When structural features resemble known illusion templates, the language generator's prior overrules direct visual input.
- Spatial Precision Without Logical Consequence: Despite high-fidelity spatial annotation—precise line placement, crop region selection—the models often fail to draw correct logical inferences from their own annotations. For instance, after overlaying a perfect reference line on a curved target line, the model asserts straightness, exemplifying a dissociation between perceptual precision and inference integrity.
- Compression and Noise Sensitivity: JPEG artifacts and chroma subsampling introduce small spurious differences at region boundaries, which models interpret as meaningful (either boundaries or color differences), producing false positives or negatives. The system's calibration around minimal pixel-level discrepancies is suboptimal; compression artifacts compound misclassifications.
Implications and Prospects
The paper argues that architectural enhancements alone are insufficient for counterfactual perceptual robustness in VLMs. The dissociation between precise spatial annotation and unreliable inferential judgment is not a uniform visual weakness; it is systematically tied to data-induced priors and calibration policies. Practical improvement requires targeted augmentation with negative illusion instances during data curation, prompt-level threshold guidance, and possibly pre-processing to normalize input image statistics.
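The suggested curation step, rebalancing a corpus dominated by genuine illusions with synthesized negatives, could look roughly like this. This is a hedged sketch: `break_illusion` is a hypothetical transform (e.g., straightening the curved line or equalizing segment lengths), and the label names are assumptions.

```python
import random

# Rebalance an illusion-heavy corpus by generating labeled negative
# (illusion-broken) variants until they reach a target fraction.

def rebalance(samples, break_illusion, target_ratio=0.5, seed=0):
    """Add illusion-broken negatives until they make up target_ratio."""
    rng = random.Random(seed)
    positives = [s for s in samples if s["label"] == "illusion"]
    negatives = [s for s in samples if s["label"] == "no_illusion"]
    while len(negatives) / (len(positives) + len(negatives)) < target_ratio:
        src = rng.choice(positives)
        negatives.append({"image": break_illusion(src["image"]),
                          "label": "no_illusion"})
    return positives + negatives

# Usage: 8 genuine illusions, 1 negative -> balanced to 8 of each.
corpus = [{"image": f"ill_{i}.png", "label": "illusion"} for i in range(8)]
corpus += [{"image": "plain.png", "label": "no_illusion"}]
balanced = rebalance(corpus, lambda img: img + "#broken")
```

Training or calibrating against such a balanced pool directly targets the positive-detection bias identified above, since the model can no longer profit from always predicting "illusion."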
Theoretical implications extend to agentic multimodal reasoning: compositional tool-use via prompt-defined strategies and immutable visual state enables generalization without code synthesis overhead, but inference reliability still bottlenecks on training data conventions and calibration logic.
Conclusion
This work advances a prompt-embedded, tool-guided framework for visual illusion analysis in VLMs, achieving robust structural generalization without retraining. Critical observations of systematic positive-detection bias, perceptual–inferential dissociation, and artifact sensitivity underscore that future progress requires not only algorithmic advances in multimodal reasoning but also dedicated counterfactual data augmentation and improved calibration strategies. The compositional annotation and registry-based analysis pipeline provides a scalable architecture for perceptually challenging VQA tasks, but inference reliability remains contingent on both data distribution and modeling conventions.