MIRAGE: The Illusion of Visual Understanding
This presentation reveals a fundamental flaw in how we evaluate multimodal AI systems. Frontier models consistently generate confident visual descriptions and achieve competitive benchmark scores even when given no images at all—a phenomenon called the 'mirage effect.' Through rigorous testing across medical and general-purpose benchmarks, this work exposes how models rely on hidden patterns rather than genuine visual reasoning, with dangerous implications for high-stakes domains like healthcare. The presentation introduces B-Clean, a framework for creating truly vision-dependent evaluations that challenge our assumptions about what these systems actually see and understand.

Script
Most visual AI benchmarks are broken, and the models know it. Frontier multimodal systems achieve competitive scores on vision tasks without ever seeing the images, generating confident descriptions of medical scans, diagrams, and photographs that were never provided.
The mirage effect occurs when models process visual questions without images yet behave as though they had received one. Testing revealed that GPT-5, Gemini-3-Pro, Claude Opus 4.5, and others all exhibited mirage rates exceeding 60 percent, escalating to near-total rates under standard evaluation prompts.
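The measurement described above can be sketched in a few lines: send each visual question to a model with the image withheld, and count how often the response behaves as though an image were present. This is a minimal illustrative sketch, not the authors' implementation; `ask_model`, the refusal-phrase heuristic, and the question format are all assumptions.

```python
# Hedged sketch of measuring a "mirage rate": the fraction of image-free
# queries a model answers as though an image had been provided.
# `ask_model` is a stand-in for any chat-completion call; the refusal
# phrases are an illustrative heuristic, not the paper's detector.

def is_mirage(response: str) -> bool:
    """True when the model answers about an image it never received."""
    refusals = ("no image", "cannot see", "not provided", "missing image")
    return not any(phrase in response.lower() for phrase in refusals)

def mirage_rate(questions, ask_model) -> float:
    """Send each visual question WITHOUT its image; count confident answers."""
    mirages = sum(is_mirage(ask_model(q["text"])) for q in questions)
    return mirages / len(questions)
```

In practice a real detector would need more than keyword matching (the paper's near-total mirage rates under standard prompts suggest models rarely volunteer a refusal at all), but the counting logic is the same.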
These mirages are not harmless errors.
In medical domains, mirages become dangerous. Models fabricated specific clinical findings, disproportionately favoring serious pathologies when queried about images they never received—generating actionable diagnoses without any epistemic foundation.
This figure demonstrates the B-Clean framework in action. By systematically removing every question that any model answered correctly without an image, the method exposes how deeply compromised standard benchmarks are. Between half and three-quarters of questions were eliminated, and critically, model rankings changed dramatically—revealing that competitive performance claims were built on invisible scaffolding rather than genuine visual reasoning.
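The filtering rule just described reduces to a simple predicate: drop every question that any model answered correctly with the image withheld. The sketch below illustrates that rule under assumed data shapes; the function name, the `id`/`gold` fields, and the answer mapping are hypothetical, not the B-Clean codebase.

```python
# Illustrative sketch of the B-Clean filtering idea: keep only questions
# that NO model solved blind (i.e., without the image), so the surviving
# benchmark is genuinely vision-dependent. Data shapes are assumptions.

def filter_vision_dependent(questions, blind_answers):
    """questions: list of {"id": ..., "gold": ...}.
    blind_answers: {model_name: {question_id: answer}} collected with
    images withheld. Returns the questions no model answered correctly."""
    kept = []
    for q in questions:
        solved_blind = any(
            per_model.get(q["id"]) == q["gold"]
            for per_model in blind_answers.values()
        )
        if not solved_blind:
            kept.append(q)
    return kept
```

The aggressive union over models matters: a question survives only if every model fails it blind, which is why the method eliminated between half and three-quarters of questions on standard benchmarks.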
The distinction between reasoning regimes is revealing. When explicitly asked to guess without images, models perform worse than when they silently assume image presence. This gap demonstrates that traditional benchmark curation fails to detect the deep statistical patterns models exploit—patterns invisible to manual artifact detection or prompt-based controls.
The mirage effect fundamentally undermines claims of visual understanding in multimodal AI. High accuracy is not evidence of seeing—it may be evidence of sophisticated pattern matching on invisible scaffolding. Visit EmergentMind.com to explore this work further and create your own research videos.