Single-Modality Visual Hallucination
- Single-modality visual hallucination is defined as the generation of incorrect visual outputs due solely to intra-modal interference, evident in both artificial and biological systems.
- In artificial systems, benchmarks isolate hallucinations by embedding distractor text in images, revealing vulnerabilities in vision-language models and guiding mitigation strategies.
- In biological systems, psychophysical experiments and neural-field models show that lateral interactions in early visual cortices drive perceptual distortions and pattern formations.
Single-modality visual hallucination refers to the phenomenon where a system operating exclusively within the visual stream—either artificial (e.g., Vision-LLMs, or VLMs) or biological (e.g., human cortex)—produces outputs or internal states that are at odds with the actual visual input, in the absence of cross-modal (e.g., auditory or textual) information. In computational systems, this typically manifests as erroneous visual attribute identification or the “hallucinated” generation of visual content from corrupted, ambiguous, or otherwise impaired visual input. In biological systems, it encompasses both percept-like subjective experiences triggered by visual stimulation and neural patterns in early visual cortices not corresponding to the stimulus. Recent research addresses single-modality visual hallucination in both the neuroscience and machine perception contexts, elucidating mechanisms, empirical manifestations, and practical implications.
1. Formal Definitions and Theoretical Framing
Single-modality visual hallucination in modern multimodal large models is formally defined in "What Color Is It? A Text-Interference Multimodal Hallucination Benchmark" as the event where a vision-LLM produces an incorrect output about a purely visual attribute (e.g., color of text) due to intra-visual interference, despite all cues (prompt, distractor, answer) being confined to the visual modality (Zhao et al., 17 Nov 2025). Let $I$ be an image containing text rendered in color $c_{\mathrm{gt}}$ and a distractor $d$ (another color name embedded in the same image, e.g., the string "red" printed in blue). A hallucination is flagged if the model's answer $\hat{c}(I)$ names the distractor's semantic color rather than $c_{\mathrm{gt}}$, satisfying:

$$\hat{c}(I) \in D(I), \qquad \hat{c}(I) \neq c_{\mathrm{gt}},$$

where $D(I)$ is the set of semantic colors evoked by the distractor text in $I$. This definition operationalizes single-modality hallucination as a conflict within a unitary sensory channel, rather than cross-modal confusion.
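A minimal sketch of this decision rule, assuming the model's answer and the ground-truth and distractor colors are available as normalized strings (the function and variable names are illustrative, not part of the benchmark's API):

```python
def is_type1_hallucination(answer: str, rendered_color: str, distractor_colors: set) -> bool:
    """Flag a text-based (Type I) single-modality hallucination: the model names a color
    evoked by the distractor text instead of the color the text is actually rendered in."""
    answer = answer.strip().lower()
    return answer != rendered_color.lower() and answer in {c.lower() for c in distractor_colors}

# Example: the string "red" printed in blue; the model answers "red".
assert is_type1_hallucination("Red", "blue", {"red"})
```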
In computational neuroscience, single-modality visual hallucination is modeled as emergent cortical activity patterns, generated by lateral interactions and symmetry-breaking inputs, with no requisite cross-modal contribution (Tamekue et al., 2022). These models, such as the Amari-type neural field

$$\partial_t a(x,t) = -a(x,t) + \int \omega(x-y)\, f\big(a(y,t)\big)\, dy + I(x),$$

with cortical activity $a$, lateral connectivity kernel $\omega$, firing-rate nonlinearity $f$, and sensory input $I$, replicate psychophysical phenomena where specific geometric visual inputs induce complementary or spurious perceptual experiences.
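A minimal numerical sketch of such a driven neural field on a one-dimensional periodic domain, using forward-Euler time stepping and a Mexican-hat kernel (kernel parameters, input profile, and discretization are illustrative choices, not those of the cited study):

```python
import numpy as np

# One-dimensional periodic cortical domain
N, L = 256, 2 * np.pi
x = np.linspace(0, L, N, endpoint=False)
dx = L / N

# Mexican-hat lateral connectivity: short-range excitation, longer-range inhibition
d = np.minimum(x, L - x)                                   # periodic distance to node 0
w = 2.0 * np.exp(-(d / 0.3) ** 2) - 1.0 * np.exp(-(d / 0.9) ** 2)

f = np.tanh                        # firing-rate nonlinearity
I = 0.5 * np.cos(4 * x)            # symmetry-breaking geometric input (e.g., a grating)

a, dt = np.zeros(N), 0.01
for _ in range(5000):
    # da/dt = -a + w * f(a) + I, with the lateral term computed as a circular FFT convolution
    conv = np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(f(a)))) * dx
    a += dt * (-a + conv + I)

print("pattern amplitude:", a.max() - a.min())
```

The stationary activity profile `a` plays the role of the hallucinated percept; whether a non-trivial pattern emerges depends on the excitability of the kernel and the symmetry of the input, in line with the symmetry-breaking requirement above.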
2. Experimental Methodologies and Benchmarks
Artificial Systems
Hallucination benchmarks are engineered to isolate visual interference. "What Color Is It?" introduces a dataset of 1,200 images systematically varying the alignment and masking of distractor color words and their visual (RGB) attributes. Models receive only the visual stream (the image); metrics such as accuracy and hallucination rates (Type I: text-based; Type II: question-based) are computed, with formal conditions:
- Accuracy: the fraction of images for which $\hat{c}(I) = c_{\mathrm{gt}}$.
- Hallucination-1 (text-based): the fraction for which $\hat{c}(I) \in D(I)$ with $\hat{c}(I) \neq c_{\mathrm{gt}}$.
- Hallucination-2 (question-based): the fraction for which $\hat{c}(I)$ echoes a color named in the question rather than $c_{\mathrm{gt}}$.
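A sketch of how these rates could be aggregated over a benchmark split, using the symbol conventions above (the record fields `answer`, `rendered`, `distractor_set`, and `question_set` are hypothetical placeholders, not the benchmark's schema):

```python
def benchmark_rates(records):
    """records: iterable of dicts with lowercase color names under the keys
    'answer', 'rendered', 'distractor_set', and 'question_set'."""
    n = acc = h1 = h2 = 0
    for r in records:
        n += 1
        if r["answer"] == r["rendered"]:
            acc += 1                                       # correct visual attribute
        if r["answer"] != r["rendered"] and r["answer"] in r["distractor_set"]:
            h1 += 1                                        # text-based hallucination (H1)
        if r["answer"] != r["rendered"] and r["answer"] in r["question_set"]:
            h2 += 1                                        # question-based hallucination (H2)
    return {"accuracy": acc / n, "H1": h1 / n, "H2": h2 / n}
```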
Prompt-in-Image frameworks further enforce single-modality processing by rasterizing the text question into the image, eliminating text encoders and ensuring all information is ingested through a visual encoder (Wang et al., 3 Aug 2025). Performance, attention allocation, and cross-modal alignment are measured using metrics such as POPE accuracy, CHAIR hallucination scores, and modality-gap evaluations.
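A minimal sketch of the rasterization step behind Prompt-in-Image, overlaying the question onto the image with Pillow (the band placement and font are illustrative, not the framework's exact procedure):

```python
from PIL import Image, ImageDraw, ImageFont

def prompt_in_image(image_path: str, question: str) -> Image.Image:
    """Render the text question directly into the image so that the VLM must
    ingest the prompt through its visual encoder alone."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    # Draw the question on a solid band at the top of the image (illustrative layout).
    draw.rectangle([0, 0, img.width, 24], fill="white")
    draw.text((4, 4), question, fill="black", font=ImageFont.load_default())
    return img
```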
In the context of sensor fusion and robustness, modality hallucination architectures learn mappings to reconstruct ‘lost’ modalities (e.g., RGB) from an available single modality (e.g., depth), typically optimized for loss functions combining RMSE and edge-aware smoothness terms (Gunasekar et al., 2020). Downstream classifiers and segmenters are used to quantify the task utility of hallucinated data.
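One plausible PyTorch rendering of such a combined objective is an RMSE reconstruction term plus an edge-aware smoothness term that relaxes the penalty where the guiding modality (e.g., depth) itself has edges; the weighting and exact smoothness form vary across works, so this is a common formulation rather than the one used in the cited paper.

```python
import torch
import torch.nn.functional as F

def hallucination_loss(pred_rgb, target_rgb, guide, smooth_weight=0.1):
    """RMSE reconstruction plus edge-aware smoothness for hallucinated RGB.
    pred_rgb, target_rgb: (B, 3, H, W); guide: (B, C, H, W) available modality."""
    rmse = torch.sqrt(F.mse_loss(pred_rgb, target_rgb) + 1e-8)

    # First-order spatial gradients of the prediction and of the guiding modality
    dx_p = (pred_rgb[..., :, 1:] - pred_rgb[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_p = (pred_rgb[..., 1:, :] - pred_rgb[..., :-1, :]).abs().mean(1, keepdim=True)
    dx_g = (guide[..., :, 1:] - guide[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_g = (guide[..., 1:, :] - guide[..., :-1, :]).abs().mean(1, keepdim=True)

    # Penalize prediction gradients only where the guiding modality is locally smooth
    smooth = (dx_p * torch.exp(-dx_g)).mean() + (dy_p * torch.exp(-dy_g)).mean()
    return rmse + smooth_weight * smooth
```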
Biological Systems
Psychophysical paradigms, such as Ganzflicker (rapid alternation of red/black fullscreen stimulation), induce visual hallucinations across the imagery vividness spectrum. Descriptions generated by thousands of participants are clustered and quantified to reveal the content structure of single-modality hallucinations, with further embedding via vision-language and text-only models to explore representational richness across subjective phenotypes (Chkhaidze et al., 11 Jul 2025).
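A sketch of the embedding-and-clustering step for such free-text hallucination reports, using a Hugging Face CLIP text encoder and k-means (the checkpoint, toy reports, and cluster count are illustrative, not the study's actual pipeline):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer
from sklearn.cluster import KMeans

reports = ["swirling grids of light", "a face emerging from static", "red and black checkerboards"]

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = tokenizer(reports, padding=True, truncation=True, return_tensors="pt")
    emb = model.get_text_features(**inputs)                 # one embedding per report
    emb = torch.nn.functional.normalize(emb, dim=-1)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(emb.numpy())
print(labels)                                               # cluster assignment per report
```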
V1 neural-field models replicate sensory-induced hallucinations (e.g., MacKay effect) by numerically integrating cortical activity with specific geometric inputs, observing the emergence or absence of complementary patterns contingent on input symmetry-breaking (Tamekue et al., 2022).
3. Empirical Manifestations and Mechanistic Insights
Artificial Models
Across SOTA vision-LLMs, empirical findings show:
- Small-parameter VLMs (e.g., InternVL, Qwen2.5-VL) score near 0% accuracy on visually interfering tasks, with text-based hallucination rates (H1) near 100%.
- Scale and reasoning ability (e.g., GPT-4o, GPT-5, chain-of-thought prompting) improve performance (accuracy around 55%, H1 of 15–29%) but do not eliminate single-modality hallucinations.
- Masking semantically salient text reduces hallucination but performance is non-monotonic; powerful models may recreate masked distractors through implicit infilling.
- Chain-of-thought models exhibit intra-modal oscillation: early visual reasoning steps are correct, but later stages erroneously latch onto distractor features (Zhao et al., 17 Nov 2025).
When the question is embedded visually (Prompt-in-Image), models pretrained on OCR-rich distributions (Qwen2.5-VL) improve (POPE accuracy +4.1%, with reduced CHAIR hallucination scores), while CLIP-based encoders (LLaVA, InstructBLIP) fail catastrophically due to attention collapse onto the overlaid text regions (Wang et al., 3 Aug 2025).
Biological Hallucinations
Human participants exposed to Ganzflicker report a spectrum of visual hallucinations:
- Weak imagers (aphantasia): geometric forms, color flashes, lines.
- Vivid imagers: complex, naturalistic content (faces, landscapes, structures).

Vision-language embeddings (CLIP, SigLIP) best discriminate group-wise differences, with representational dissimilarity correlated with vividness ratings (Chkhaidze et al., 11 Jul 2025).
Neural-field simulations confirm that pattern formation in V1 (spirals, grids, “tunnel” effects) emerges only with appropriate excitability and symmetry-breaking stimulus. Complementary hallucinations are modeled as unique stationary solutions to the driven Amari equation, subject to input symmetries (Tamekue et al., 2022).
4. Applications and Practical Interventions
Single-modality visual hallucination frameworks underpin various practical domains:
| Domain | Hallucination Role | Implementation/Metric |
|---|---|---|
| MLLM/VLM evaluation | Diagnostic benchmark for visual robustness | Accuracy, H1, H2 |
| Robotics/Perception | Synthesizing RGB from depth or thermal-only input | Reconstruction error (e.g., RMSE), task performance (Gunasekar et al., 2020, Saputra et al., 2019) |
| Human vision science | Probing pathway integration, imagery deficits | Content/topic analysis, psychometrics (Chkhaidze et al., 11 Jul 2025) |
For model training and inference, mitigation strategies include:
- Text-masking data augmentation during MLLM training, reducing semantic interference from distractor text (Zhao et al., 17 Nov 2025).
- Distractor-aware decoding, enforcing dynamic contrastive exclusion of distractor classes at the logit level (see the sketch after this list).
- Explicit hallucination-penalizing objectives during fine-tuning.
- In low-level visual perception, self-awareness augmentation and preference optimization (SAFEQA + ESA-PO), which train refusal as a valid answer and yield substantial reductions in low-level hallucination rates (Sun et al., 26 Mar 2025).
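The distractor-aware decoding idea can be sketched as a simple logit adjustment applied at each generation step; the penalty value and the way distractor token ids are obtained (e.g., from OCR of the image) are assumptions for illustration, not the method of the cited benchmark paper.

```python
import torch

def distractor_aware_logits(logits: torch.Tensor, distractor_token_ids, penalty: float = 5.0):
    """Contrastively down-weight next-token logits for distractor color tokens.
    logits: (vocab_size,) scores for the next token; distractor_token_ids: ids of
    tokens corresponding to distractor classes detected in the image."""
    adjusted = logits.clone()
    adjusted[distractor_token_ids] -= penalty   # soft exclusion rather than a hard mask
    return adjusted

# Usage inside a generation loop, before sampling or argmax:
# next_id = distractor_aware_logits(step_logits, distractor_ids).argmax()
```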
In robotics, hallucination networks are trained (e.g., with Huber loss) to synthesize visual features from thermal or depth modalities, with selective fusion used to gate unreliable features and maximize robust pose estimation (Gunasekar et al., 2020, Saputra et al., 2019).
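A hedged sketch of those two ingredients, a Huber regression target for the hallucinated features and a learned gate for selective fusion; the gating form here follows the general idea rather than the exact architectures of the cited works.

```python
import torch
import torch.nn as nn

huber = nn.HuberLoss(delta=1.0)   # robust regression target for hallucinated visual features

class SelectiveFusion(nn.Module):
    """Gate hallucinated visual features against features from the available
    modality (e.g., thermal), down-weighting unreliable channels before fusion."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, halluc_feat: torch.Tensor, real_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([halluc_feat, real_feat], dim=-1))   # per-channel reliability in [0, 1]
        return g * halluc_feat + (1 - g) * real_feat

# Training signal for the hallucination network (illustrative):
# loss = huber(hallucinated_features, teacher_rgb_features)
```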
5. Neural and Computational Mechanisms
The underlying mechanisms of single-modality hallucination are multifold:
- In vision-LLMs, spurious feature selection typically results from over-reliance on visually prominent but semantically distracting patches, exacerbated by architectural and pretraining biases (CLIP attentional collapse on text regions, lack of cross-modal grounding) (Wang et al., 3 Aug 2025).
- In biological systems, Mexican-hat lateral connectivity in V1 and retino-cortical mapping drive the emergence of canonical pattern “form constants” under certain conditions. Explicit symmetry breaking in input is required for “complementary” hallucinations (e.g., MacKay’s funnel-tunnel phenomena) (Tamekue et al., 2022).
- Individual differences in imagery vividness reflect the hierarchical recruitment of visual feedback: early V1 suffices for simple patterns, while top-down integration across fusiform, parahippocampal, and prefrontal networks is required for naturalistic hallucination (Chkhaidze et al., 11 Jul 2025).
6. Limitations, Open Problems, and Future Directions
Current computational models of single-modality visual hallucination remain limited to tightly controlled settings: color naming, low-level perceptual attribute judgment, or RGB synthesis from structured alternatives. Generalization to more unconstrained, out-of-distribution, or adversarially crafted visual distractors is not guaranteed, and existing mitigation (e.g., character masking, penalty loss shaping) can be subverted by sophisticated models capable of implicit infill (Zhao et al., 17 Nov 2025). In the biological domain, extensions to dynamic, multimodal, or contextually embedded hallucinations (e.g., audio–visual) remain unmodeled.
A plausible implication is that future architectures incorporating explicit intra-image modality gating, adversarial training on text-within-image distractors, and disentangled representation learning of visual features will exhibit increased robustness to intra-modal hallucinations. Similarly, hybrid "disjudgement" objectives, which teach models to withhold output in regions of knowledge uncertainty, appear empirically effective in narrowing the hallucination boundary (Sun et al., 26 Mar 2025).
7. Broader Implications
Single-modality visual hallucination benchmarks provide essential "unit tests" for elemental perceptual skills (color, shape, texture), and expose core vulnerabilities of both artificial and biological vision systems to intra-modal informational interference. Their study is central to future robust scene understanding, foundational neuroscience, and the deployment of multimodal AI in safety-critical scenarios (Zhao et al., 17 Nov 2025, Tamekue et al., 2022, Chkhaidze et al., 11 Jul 2025).