- The paper demonstrates that pre-trained transformer models impose coherent semantic structure even on pure noise, as revealed by sparse-autoencoder decompositions of their internal activations.
- The methodology uses sparse autoencoders to decompose internal activations into interpretable concepts, whose activation patterns predict hallucination risk with up to 73% accuracy.
- The findings imply that targeted suppression of specific concept activations can serve as an effective intervention to reduce model hallucinations.
Introduction
This work presents a systematic investigation into the mechanistic origins of hallucinations in pre-trained transformer models, focusing on both vision and language modalities. The authors employ sparse autoencoders (SAEs) to decompose the internal activations of transformers into interpretable, high-dimensional concept spaces. By probing these spaces under controlled input uncertainty—including pure noise and structured perturbations—they reveal that transformers impose coherent semantic structure even on semantically void inputs. The paper further demonstrates that the activation patterns of these internal concepts are predictive of hallucination risk in model outputs, and that targeted suppression of specific concepts can reduce hallucination rates. These findings have direct implications for model interpretability, alignment, and safety.
The core methodological innovation is the use of SAEs to map transformer residual stream activations into a sparse, overcomplete basis of semantic concepts. The SAE is trained to reconstruct the original activation vector $x \in \mathbb{R}^{d_{\text{model}}}$ as a sparse linear combination of learned direction vectors $d_i$:

$$x \approx \sum_i f_i(x)\, d_i + b$$

where $f_i(x)$ are non-negative, sparse activations and $b$ is a bias term. The loss function combines $L_2$ reconstruction error with $L_1$ sparsity regularization. SAEs are trained on activations from both natural and perturbed (including pure noise) inputs, across all layers of several transformer architectures (e.g., CLIP ViT-B/32, Pythia-160m, Gemma 2B-IT).
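A minimal sketch of this SAE objective, assuming a PyTorch implementation; the dictionary size and sparsity coefficient are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Decomposes residual-stream activations into a sparse, overcomplete concept basis."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # produces concept activations f_i(x)
        self.decoder = nn.Linear(d_dict, d_model)   # rows of weight.T are direction vectors d_i; bias is b

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))             # non-negative, sparse codes
        x_hat = self.decoder(f)                     # reconstruction: sum_i f_i(x) d_i + b
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """L2 reconstruction error plus L1 sparsity penalty (l1_coeff is an assumed value)."""
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```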
The workflow for evaluating hallucination risk and concept emergence is illustrated in Figure 1.
Figure 1: Workflow for examining hallucination risk in each transformer layer by training SAEs on residual stream activations from pure noise and probing with natural images.
Emergence of Semantic Concepts from Noise
A central empirical finding is that, when presented with pure Gaussian noise, pre-trained vision transformers (e.g., CLIP ViT-B/32) exhibit robust, highly interpretable concept activations in their intermediate layers. SAEs trained solely on noise-induced activations, when probed with natural images, identify features that correspond to coherent semantic categories, even though the SAEs themselves were never exposed to real data during training. The semantic purity of these features is quantified by how tightly the CLIP-derived text labels of each feature's maximally activating images cluster, with many features exceeding a strict threshold (cosine similarity > 0.75).
Figure 2: Top-activating ImageNet images for semantic concepts defined by a noise-trained SAE at layer 9 of a CLIP vision transformer, showing high semantic consistency.
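A hedged sketch of the purity check, assuming the labels of each feature's top-activating images have already been embedded with a CLIP text encoder; the function names are illustrative, and only the 0.75 threshold comes from the text above.

```python
import torch
import torch.nn.functional as F


def semantic_purity(label_embeddings: torch.Tensor) -> float:
    """Mean pairwise cosine similarity of CLIP text embeddings for one feature's top images.

    label_embeddings: (k, d) tensor, one CLIP text embedding per top-activating image.
    """
    e = F.normalize(label_embeddings, dim=-1)
    sims = e @ e.T                                     # (k, k) pairwise cosine similarities
    k = e.shape[0]
    off_diag = sims[~torch.eye(k, dtype=torch.bool)]   # drop the trivial self-similarities
    return off_diag.mean().item()


def is_semantically_coherent(label_embeddings: torch.Tensor, threshold: float = 0.75) -> bool:
    """Flags a feature as coherent if its labels cluster above the reported 0.75 threshold."""
    return semantic_purity(label_embeddings) > threshold
```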
Notably, a subset of these concepts is steerable: injecting the corresponding direction vector into the residual stream of a neutral input can causally induce the model to predict the associated semantic label. This demonstrates that the emergent concepts are not only interpretable but also functionally actionable within the model.
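A rough sketch of such a steering intervention using a PyTorch forward hook; the steering strength `alpha`, the hooked layer, and the module path in the usage comment are assumptions for illustration, not the paper's exact procedure.

```python
import torch


def make_steering_hook(direction: torch.Tensor, alpha: float = 5.0):
    """Returns a forward hook that adds alpha * d_i to the block's residual-stream output."""
    d_unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * d_unit.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + tuple(output[1:])
        return steered

    return hook


# Illustrative usage on a HuggingFace-style CLIP vision tower (module path is an assumption):
# handle = model.vision_model.encoder.layers[9].register_forward_hook(make_steering_hook(d_i))
# outputs = model(**neutral_inputs)   # the neutral input now carries the injected concept
# handle.remove()
```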
Layerwise Dynamics and Conceptual Wandering
Analysis across layers reveals a three-phase pattern in concept emergence and overlap:
- Early layers: High overlap in concepts across SAE initializations, reflecting low-level, input-driven features.
- Middle layers: Sharp drop in concept overlap and a dramatic increase in the number of active concepts (L0 norm), indicating stochastic exploration and "conceptual wandering" as the model imposes structure on ambiguous inputs.
- Late layers: Partial reconvergence to a stable set of high-level features.
This pattern holds across both vision and language modalities and is accentuated as input structure is degraded (e.g., by patch or n-gram shuffling). Contrary to the expectation that less structured inputs would engage fewer concepts, the number of active concepts peaks in the middle layers under high input uncertainty.
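A hedged sketch of the two layerwise diagnostics implied above, the per-sample count of active concepts (L0 norm) and the overlap between SAEs trained from different random initializations; the cosine-match cutoff is an assumed value, not one reported in the paper.

```python
import torch
import torch.nn.functional as F


def mean_l0(codes: torch.Tensor, eps: float = 1e-6) -> float:
    """Average number of active SAE concepts per activation vector (the L0 norm).

    codes: (n_samples, d_dict) SAE activations f(x) collected at one layer.
    """
    return (codes.abs() > eps).float().sum(dim=-1).mean().item()


def concept_overlap(dec_a: torch.Tensor, dec_b: torch.Tensor, match_thresh: float = 0.9) -> float:
    """Fraction of concepts in SAE A whose direction has a near-duplicate in SAE B.

    dec_a, dec_b: (d_dict, d_model) decoder direction matrices from two SAE initializations.
    """
    a = F.normalize(dec_a, dim=-1)
    b = F.normalize(dec_b, dim=-1)
    best_match = (a @ b.T).max(dim=-1).values   # best cosine match in B for each concept in A
    return (best_match > match_thresh).float().mean().item()
```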
Predicting and Suppressing Hallucinations via Concept Activations
The paper establishes a direct, quantitative link between the constellation of concept activations in the input and the hallucination propensity of the model's output. Using Gemma 2B-IT, the authors prompt the model to summarize long-form texts and score the outputs for hallucination using HHEM-2.1. Partial least squares (PLS) regression is then fit to predict hallucination scores from SAE concept activations at each layer.
The predictive power is maximized in intermediate layers (e.g., layer 13, $R^2 = 0.271 \pm 0.010$), with later layers showing no predictive capacity above chance. For binary classification (above/below median hallucination), mean accuracy reaches 73.0% ± 5.3%.
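A minimal sketch of the layerwise prediction step using scikit-learn's PLS regression, assuming a matrix `X` of per-example SAE concept activations at one layer and a vector `y` of HHEM-2.1 hallucination scores; the number of components and the train/test split are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split


def evaluate_layer(X: np.ndarray, y: np.ndarray, n_components: int = 10):
    """Fit PLS on one layer's concept activations; report R^2 and above/below-median accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    pls = PLSRegression(n_components=n_components)
    pls.fit(X_tr, y_tr)
    y_pred = pls.predict(X_te).ravel()

    r2 = r2_score(y_te, y_pred)
    median = np.median(y_tr)                              # median hallucination score on the train split
    acc = accuracy_score(y_te > median, y_pred > median)  # binary above/below-median classification
    return r2, acc
```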
Crucially, suppressing the top 10 hallucination-associated concepts (as identified by PLS variable importance) in the residual stream of layer 11 leads to a mean reduction in hallucination score of 0.19 in the most hallucinated quartile of examples. This demonstrates a primitive but effective mechanism for causal intervention on model behavior via internal concept manipulation.
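A hedged sketch of the suppression step: subtract each flagged concept's contribution $f_i(x)\, d_i$ from the residual stream at the intervention layer. The hook interface mirrors the steering sketch above, and the selection of `top_idx` (e.g., from PLS variable importance) is assumed to happen elsewhere.

```python
import torch


def make_suppression_hook(sae, top_idx: torch.Tensor):
    """Forward hook that removes the flagged concepts' contributions from the residual stream.

    sae: a trained SparseAutoencoder (as in the earlier sketch);
    top_idx: indices of the hallucination-associated concepts to suppress.
    """

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        _, f = sae(hidden)                               # concept activations f(x) for this layer
        directions = sae.decoder.weight.T[top_idx]       # (k, d_model) directions d_i of flagged concepts
        cleaned = hidden - f[..., top_idx] @ directions  # subtract sum over flagged i of f_i(x) d_i
        if isinstance(output, tuple):
            return (cleaned,) + tuple(output[1:])
        return cleaned

    return hook
```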
Implications and Theoretical Significance
The findings challenge the assumption that transformer representations are strictly input-driven, revealing a strong inductive bias toward imposing semantic structure even in the absence of meaningful input. This has several implications:
- Interpretability: The approach provides a scalable, dataset-independent framework for surfacing and quantifying internal model biases and failure modes.
- Alignment and Safety: The ability to predict and suppress hallucinations via concept activations opens new avenues for model alignment and risk monitoring.
- Adversarial Robustness: The expansion of concept usage under input uncertainty suggests a potential attack surface for adversarial manipulation.
- Generalization: The consistency of results across modalities and architectures suggests that these phenomena are fundamental to transformer-based models.
Limitations and Future Directions
While the paper covers a range of model sizes and modalities, it remains to be seen how these results extrapolate to the largest frontier models. The reliance on SAEs, while empirically validated, is not a guarantee of perfect interpretability. Further work is needed to refine concept extraction methods, extend the approach to more complex tasks, and develop real-time hallucination monitoring and mitigation strategies for deployed systems.
Conclusion
This work provides a rigorous, quantitative framework for tracing the origins of hallucinations in transformer models by linking internal concept activations to output faithfulness. The demonstration that transformers impose semantic structure on noise, and that this structure predicts and can be used to suppress hallucinations, represents a significant advance in mechanistic interpretability. These insights lay the groundwork for principled monitoring, diagnosis, and intervention in transformer-based AI systems, with broad implications for safety, alignment, and robust deployment.