Hallucination-Associated Neurons in Neural Networks
- Hallucination-associated neurons are specialized neural units whose activation patterns reliably signal and causally induce ungrounded outputs in both artificial and biological networks.
- They are identified using techniques such as mutual information probing, sparse logistic regression, and attribution methods that quantify their influence on hallucination risk.
- Modulating these neurons through gain manipulation, ablation, or training adjustments offers actionable strategies to mitigate hallucination incidence and improve model reliability.
Hallucination-associated neurons are those neural units within biological or artificial neural networks whose activation patterns are systematically linked to the emergence of ungrounded, spurious, or factually incorrect perceptual or generative phenomena. In both LLMs and biological systems such as the primary visual cortex (V1), such neurons exhibit quantifiable and often causal relationships to hallucination-like outputs or experiences. Recent advances have isolated these neurons—using formal mathematical, probing, and causal inference techniques—thereby offering new vistas for interpretability, intervention, and the fundamental study of representational uncertainty in complex networked systems (Ji et al., 2024, Gao et al., 1 Dec 2025, Faugeras et al., 2021, Pan et al., 2023).
1. Formal Definitions and Theoretical Frameworks
The term “hallucination-associated neuron” (alternatively, “H-Neuron”) designates an individual neuron or low-dimensional neuronal subspace within a network whose activation predicts, drives, or reflects the generation of hallucinatory content.
- In LLMs, hallucination risk is mapped via a learned estimator acting on internal activations, trained to distinguish queries likely to elicit hallucinated (ungrounded) outputs from those producing faithful responses. High mutual information between specific neuron activations and hallucination outcomes operationalizes the core definition (Ji et al., 2024).
- In the feed-forward blocks of transformers, H-Neurons are those which receive strictly positive weights in a sparse ℓ1-regularized logistic regression that classifies responses as faithful or hallucinatory. The contribution of a neuron at a given token is quantified by the CETT ratio, which compares the “unmasked” residual-stream output obtained with only that neuron active to the full layer output (Gao et al., 1 Dec 2025).
- In V1 neurodynamics, hallucination-associated neurons are mapped to local neural populations whose spatial–chromatic tuning aligns with emergent, self-organized patterns (stripes, spots, or localized planforms) in neural-field models, especially when control parameters (e.g., the nonlinearity gain) cross bifurcation thresholds (Faugeras et al., 2021).
2. Identification Methodologies and Empirical Protocols
LLM-based frameworks deploy probing techniques that select neurons by their informativeness about, or direct causal impact on, hallucination risk.
- Probing by mutual information: individual neurons within hidden token vectors are ranked by their mutual information with the hallucination label. The top-ranking neurons (often in deep layers) decisively discriminate queries likely to provoke hallucinations (Ji et al., 2024).
- Sparse logistic regression (CETT): neuron contributions, aggregated across answer and non-answer spans, are used as features in an ℓ1-penalized logistic regression. H-Neurons are those with strictly positive learned weights, with robust predictive power across in-domain and out-of-domain tasks (Gao et al., 1 Dec 2025).
- Gradient-free attribution in multi-modal models: a contribution score measures the linear effect of each neuron on hallucinated-token logits in generated captions. Hallucination-relevance scores aggregate these contributions over hallucinated versus ground-truth tokens (Pan et al., 2023).
| Identification Method | Core Metric (Feature) | Typical Model/Application |
|---|---|---|
| Mutual Information Probing | MI between neuron activation and hallucination label | LLM (text-gen) (Ji et al., 2024) |
| CETT + Sparse Logistic Reg | Per-neuron CETT contribution, ℓ1-sparse positive weights | LLM (QA, open-gen) (Gao et al., 1 Dec 2025) |
| Attribution-Score Ranking | Linear attribution of neuron to hallucinated-token logits | Multimodal LLM (Pan et al., 2023) |
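The mutual-information ranking step can be sketched with a simple histogram-based MI estimate on synthetic activations (the data, bin count, and the index of the informative neuron are illustrative; Ji et al. use a Kraskov estimator on real hidden states):

```python
import numpy as np

def binned_mi(z, y, bins=8):
    """Histogram estimate of mutual information between a scalar
    activation z and a binary hallucination label y (in nats)."""
    zq = np.digitize(z, np.histogram_bin_edges(z, bins)[1:-1])
    mi = 0.0
    for a in np.unique(zq):
        for b in (0, 1):
            pab = np.mean((zq == a) & (y == b))
            if pab > 0:
                pa, pb = np.mean(zq == a), np.mean(y == b)
                mi += pab * np.log(pab / (pa * pb))
    return mi

rng = np.random.default_rng(0)
n, d = 2000, 16
y = rng.integers(0, 2, n)            # hallucination labels per query
Z = rng.normal(size=(n, d))          # per-neuron activations (synthetic)
Z[:, 3] += 1.5 * y                   # neuron 3 carries the hallucination signal

scores = np.array([binned_mi(Z[:, i], y) for i in range(d)])
ranking = np.argsort(scores)[::-1]   # candidate H-Neurons first
```

On this toy data the planted neuron dominates the ranking; on real hidden states the same ranking is computed per layer and the top scorers are retained as probe features.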
Neural-field models in V1 leverage bifurcation analysis and symmetry techniques, explicitly tying the emergence of spatial–chromatic planforms to the collective activity of subsets of neurons—corresponding to theorized hallucination-associated populations (Faugeras et al., 2021).
3. Causal Characterization and Behavioral Impact
Hallucination-associated neurons have been shown to exert causal influence over output behavior, especially in LLMs.
- Gain manipulation experiments: scaling the pre-activations of identified H-Neurons by a gain factor during inference yields monotonic changes in “compliance” metrics, which quantify the model’s propensity for over-compliance, faithful-to-unfaithful transitions, or sycophancy. Suppressing H-Neurons (gain below 1) reduces hallucination rates by up to 25 percentage points; amplifying them (gain above 1) increases rates by up to 20 points (Gao et al., 1 Dec 2025).
- Editing/ablation in multimodal networks: zeroing or adjusting the weights associated with high-attribution neurons decreases hallucination rates in image captioning tasks from 22.8% to as low as 12.5%, without significant degradation of non-hallucinatory outputs (Pan et al., 2023).
- Causal validation in V1 models: Pharmacological or parameter-induced increase in network gain destabilizes the homogeneous state, recruiting specific pattern-selective populations (modeled as “hallucination-associated neurons”) whose joint activation underlies visually hallucinogenic percepts (Faugeras et al., 2021).
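The gain-manipulation protocol can be illustrated on a toy feed-forward layer: scale the pre-activations of a chosen H-Neuron subset and watch a compliance-style logit move monotonically with the gain (all weights and indices here are synthetic stand-ins, not the published setup):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(32, 8))       # toy hidden layer
w2 = rng.normal(size=32)            # toy readout ("compliance" logit)
h_idx = np.array([2, 7, 19])        # indices of putative H-Neurons
w2[h_idx] = np.abs(w2[h_idx])       # make their effect on the logit positive

def compliance_logit(x, gain):
    pre = W1 @ x                    # pre-activations
    pre = pre.copy()
    pre[h_idx] *= gain              # scale only the H-Neuron subset
    return w2 @ np.maximum(pre, 0)  # ReLU, then readout

x = rng.normal(size=8)
logits = [compliance_logit(x, g) for g in (0.0, 0.5, 1.0, 2.0)]
# suppressing (gain < 1) lowers the logit, amplifying (gain > 1) raises it
```

Because ReLU is positively homogeneous and the selected readout weights are positive, the logit is non-decreasing in the gain, mirroring the monotone compliance curves reported for real H-Neurons.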
4. Origin, Stability, and Generalization
H-Neurons emerge early in model development and exhibit broad cross-contextual generalization.
- Pre-training inheritance: H-Neurons are present (i.e., persist in their functional mapping) in the base pre-trained models, as demonstrated by direct probing and similarity of weight trajectories. Minimal parameter shifts are observed in these neurons between pre-trained and instruction-tuned (aligned) models (Gao et al., 1 Dec 2025).
- Robust cross-domain signal: Probes learned on TriviaQA generalize to NQ-Open, BioASQ, and non-existent entity (“fabricated”) domains. AUROCs for hallucination prediction remain in the 0.80–0.95 range across six model families.
- Task and model specificity: While neuron-level “self-assessment” generalizes within-task (e.g., QA to unseen-QA), it weakens across tasks (e.g., QA to translation), implying task-sensitivity of hallucination cues (Ji et al., 2024).
- Stability of pattern solutions: In the V1 model, bifurcated solutions derived using the Equivariant Branching Lemma map to persistent and stable hallucination-associated planforms (e.g., stripes and spots), with local stability depending on parameter values and network symmetry (Faugeras et al., 2021).
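Cross-domain generalization of a neuron-level probe can be checked with a plain AUROC computation: fit a linear probe on one synthetic "domain" and score a shifted one (the data, the shift, and the domain names in comments are illustrative):

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (probability a positive outranks a negative)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(5)

def make_domain(n, shift):
    y = rng.integers(0, 2, n)
    Z = rng.normal(size=(n, 16)) + shift   # domain-specific offset
    Z[:, 3] += 1.2 * y                     # stable H-Neuron signal
    return Z, y

Z_tr, y_tr = make_domain(2000, shift=0.0)  # training domain (e.g. TriviaQA-like)
Z_te, y_te = make_domain(2000, shift=0.5)  # shifted domain (e.g. NQ-Open-like)
# least-squares linear probe fit on the training domain only
w, *_ = np.linalg.lstsq(Z_tr, y_tr - y_tr.mean(), rcond=None)
score = auroc(Z_te @ w, y_te)              # stays high despite the shift
```

Because AUROC is rank-based, a constant domain offset leaves it unchanged, which is the sense in which a neuron-level signal can transfer across domains.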
5. Analytical Techniques and Measurement Criteria
A variety of analytical tools are used to study hallucination-associated neurons:
- Mutual information (Kraskov estimator): the mutual information between a neuron’s activation and the hallucination label, estimated with the Kraskov k-nearest-neighbor (KSG) method, $\hat I(X;Y) = \psi(k) + \psi(N) - \langle \psi(n_x + 1) + \psi(n_y + 1) \rangle$, is used to rank neurons for their predictive value (Ji et al., 2024).
- Layer-wise probing and token attribution: layerwise F1 scores and token-level gradients localize which internal representations and input tokens drive the network toward high hallucination risk (Ji et al., 2024).
- CETT metric and penalized regression: Quantifies marginal contributions of neurons and imposes sparsity for interpretable selection (Gao et al., 1 Dec 2025).
- Attribution and editing in multi-modal LLMs: a gradient-free linear attribution score over hallucinated-token logits allows targeted editing to abate hallucination without global retraining (Pan et al., 2023).
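The sparse-selection step can be sketched as an ℓ1-penalized logistic regression fit by proximal gradient descent (ISTA), keeping only neurons with strictly positive weights; the synthetic "contribution" features below stand in for real CETT values, and the hyperparameters are illustrative:

```python
import numpy as np

def fit_l1_logreg(X, y, lam=0.01, lr=0.5, steps=2000):
    """ISTA for L1-penalized logistic regression (no intercept)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid predictions
        grad = X.T @ (p - y) / n              # gradient of mean NLL
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0)  # soft-threshold
    return w

rng = np.random.default_rng(2)
n, d = 1000, 12
X = rng.normal(size=(n, d))                   # per-neuron contribution features
logits = 2.0 * X[:, 0] + 2.0 * X[:, 1]        # neurons 0 and 1 drive hallucination
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

w = fit_l1_logreg(X, y)
h_neurons = np.flatnonzero(w > 0)             # strictly positive weights = H-Neurons
```

The ℓ1 penalty shrinks uninformative neurons to exactly zero, so the positive-weight support is a small, interpretable candidate set.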
6. Broader Implications and Mitigation Strategies
The discovery, characterization, and intervention on hallucination-associated neurons enable targeted approaches to hallucination mitigation and interpretability.
- Detection and early warning: Lightweight neuron probes can provide real-time hallucination risk assessment pre-generation, facilitating proactive countermeasures such as retrieval augmentation or query refusal (Ji et al., 2024, Gao et al., 1 Dec 2025).
- Activation suppression and gating: Direct suppression of H-Neurons reduces hallucination and over-compliance, with only minor trade-offs in benign response helpfulness. Dynamic gating networks could offer adaptive modulation tied to real-time risk signals (Gao et al., 1 Dec 2025).
- Training objective modification: Injecting unanswerable questions and enforcing “I don’t know” outputs during pre-training may diminish the formation or influence of H-Neurons, addressing hallucination propensity at its origin (Gao et al., 1 Dec 2025).
- Interpretability and model auditing: Mapping and visualizing hallucination neurons aids in systematic auditing and understanding of internal uncertainty, moving toward architectures with robust self-assessment (Ji et al., 2024).
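The probe-then-gate idea above can be sketched as a two-stage inference step: a lightweight linear probe scores hallucination risk from the hidden state, and H-Neuron activations are damped only when risk crosses a threshold (the probe weights, neuron indices, threshold, and damping factor are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
probe_w = rng.normal(size=64)        # lightweight linear risk probe (illustrative)
h_idx = np.array([5, 21, 40])        # previously identified H-Neurons

def gated_hidden(h, tau=0.7, damp=0.2):
    """Damp H-Neuron activations only when probed risk exceeds tau."""
    risk = 1.0 / (1.0 + np.exp(-probe_w @ h))
    if risk > tau:
        h = h.copy()
        h[h_idx] *= damp             # suppress only the H-Neuron subset
    return h, risk

h = rng.normal(size=64)
h_out, risk = gated_hidden(h)        # all other coordinates pass through unchanged
```

Keeping the intervention conditional preserves benign-response behavior when the probe signals low risk, which is the trade-off the suppression experiments report.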
7. Hallucination-Associated Neurons Beyond LLMs: Visual Cortex Models
In primary visual cortex, the neural-field approach demonstrates that patterned spontaneous activity—mathematically characterized by bifurcated planforms—arises in subsets of "hallucination-associated neurons." Psychoactive modulation of gain or inhibition can drive the cortex from a homogeneous baseline to persistent, structured patterns, capturing phenomenological features of spatial and color hallucinations (e.g., entoptic stripes, spots) (Faugeras et al., 2021). The analytical tools—equivariant bifurcation theory, spectral analysis, and numerical continuation—offer mechanistic explanations for the emergence, stability, and diversity of neural correlates underlying hallucinatory states.
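The gain-induced pattern formation described above can be illustrated with a minimal one-dimensional neural field on a ring: below the bifurcation gain, small perturbations of the homogeneous state decay; above it, a striped planform self-organizes. The Mexican-hat kernel widths and the two gain values are illustrative, not fits to Faugeras et al.:

```python
import numpy as np

N = 128
x = np.arange(N)
dist = np.minimum(x, N - x)                    # distances on the ring
# Mexican-hat connectivity: local excitation, broader inhibition
w = np.exp(-(dist / 2.0) ** 2) - 0.5 * np.exp(-(dist / 6.0) ** 2)
w /= np.abs(np.fft.rfft(w)).max()              # peak Fourier gain normalized to 1
w_hat = np.fft.rfft(w)

def simulate(mu, steps=600, dt=0.1, seed=4):
    """Euler integration of u' = -u + mu * (w convolved with tanh(u))."""
    rng = np.random.default_rng(seed)
    u = 1e-3 * rng.normal(size=N)              # perturbed homogeneous state
    for _ in range(steps):
        conv = np.fft.irfft(w_hat * np.fft.rfft(np.tanh(u)), n=N)
        u = u + dt * (-u + mu * conv)
    return u

u_sub = simulate(mu=0.5)    # below threshold: perturbation decays to baseline
u_super = simulate(mu=2.0)  # above threshold: a striped pattern emerges
```

Linearizing around zero, a spatial mode grows when the gain times its Fourier gain exceeds one, so raising the gain past the normalized peak recruits the pattern-selective mode; this is the bifurcation mechanism the neural-field analysis formalizes.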
In summary, hallucination-associated neurons constitute a sparse, functionally crucial subset of units in both artificial and biological neural networks that reliably signal, and in some systems causally induce, hallucinatory outputs. Their mathematical identification, empirical mapping, and successful manipulation delineate a promising pathway for reducing hallucination incidence and architecting models with explicit uncertainty awareness (Ji et al., 2024, Gao et al., 1 Dec 2025, Faugeras et al., 2021, Pan et al., 2023).