Hallucination-Associated Neurons in LLMs
- H-Neurons are defined as a sparse subset of feed-forward units whose activations statistically predict hallucination events using an ℓ1-regularized logistic regression probe.
- They are identified by constructing balanced datasets and engineering CETT metrics, achieving up to 83% accuracy in detecting hallucinations across domains.
- Experimental activation scaling demonstrates that H-Neurons causally influence model compliance, informing strategies for precise hallucination mitigation.
Hallucination-Associated Neurons (H-Neurons) are a sparse subset of feed-forward network (FFN) units in LLMs whose activity is tightly predictive of hallucination events—outputs that are plausible but factually incorrect. Recent work has provided a formal and empirical foundation for identifying, quantifying, and causally intervening upon these neurons. The existence of H-Neurons offers a bridge between macroscopic hallucination phenomena and the microscopic mechanisms encoded in neural architectures, furnishing tools for more reliable detection and mitigation of factual errors in LLM outputs (Gao et al., 1 Dec 2025).
1. Formal Definition and Mathematical Properties
An H-Neuron is defined with reference to the activations of neurons in the FFN layers of an LLM. For $N$ total neurons, each neuron $i$ receives a weight $w_i$ from a sparse $\ell_1$-regularized logistic regression probe. Given per-neuron features $x \in \mathbb{R}^N$ for a single response, the probability $p(y = 1 \mid x) = \sigma(w^\top x + b)$ models hallucination presence ($y = 1$).
To quantify individual contribution, the CETT (Causal Effect on Token Trajectory) metric is used: $\mathrm{CETT}_i(t) = \lVert h_i(t) \rVert / \lVert \Delta h(t) \rVert$, where $\Delta h(t)$ is the total token-wise hidden update and $h_i(t)$ is the component due to neuron $i$. Averaging over answer tokens or non-answer tokens yields span-level features $\overline{\mathrm{CETT}}_i$. The probe's $\ell_1$ penalty ensures that only a minuscule fraction of neurons (typically under 0.1%) receive nonzero weights $w_i \neq 0$, forming the set of H-Neurons $\mathcal{H} = \{\, i : w_i \neq 0 \,\}$.
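The span-level aggregation above can be sketched numerically. The following is a minimal numpy illustration, not the paper's implementation: it assumes per-token, per-neuron contribution vectors are already available, and the function name `cett_scores` is hypothetical.

```python
import numpy as np

def cett_scores(neuron_contribs: np.ndarray, token_mask: np.ndarray) -> np.ndarray:
    """Per-neuron CETT features averaged over a token span.

    neuron_contribs: shape (T, N, d) -- hypothetical per-token, per-neuron
        contribution vectors h_i(t) to the FFN hidden update.
    token_mask: boolean shape (T,), selecting answer (or non-answer) tokens.
    Returns an (N,)-vector of mean CETT values over the selected span.
    """
    total_update = neuron_contribs.sum(axis=1)            # (T, d): total update Δh(t)
    denom = np.linalg.norm(total_update, axis=-1) + 1e-9  # (T,): ||Δh(t)||
    contrib = np.linalg.norm(neuron_contribs, axis=-1)    # (T, N): ||h_i(t)||
    cett = contrib / denom[:, None]                       # per-token, per-neuron ratio
    return cett[token_mask].mean(axis=0)                  # aggregate over the span

# toy example: 5 tokens, 8 neurons, hidden size 16
rng = np.random.default_rng(0)
contribs = rng.normal(size=(5, 8, 16))
mask = np.array([False, True, True, True, False])  # "answer" tokens
feats = cett_scores(contribs, mask)
print(feats.shape)  # (8,)
```

Each response then yields one such feature vector per span type, which is what the probe consumes.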
2. Identification Protocols and Predictive Generalization
The canonical methodology for identifying H-Neurons consists of:
- Dataset Construction: Balanced sets of faithfully factual and fully hallucinatory responses (e.g., 1,000-trial splits from TriviaQA) are collected with randomized sampling (temperature = 1.0, top_k = 50, top_p = 0.9) and filtered for consistency.
- Feature Engineering: Per-example vectors concatenate the per-neuron CETT metrics across all FFN neurons, computed separately for answer and non-answer spans.
- Label Assignment and Probe Training: Answer-span features from hallucinated outputs are labeled $y = 1$, and all others $y = 0$. The sparse logistic objective is solved for the weights $w$.
- H-Neuron Selection: Neurons with $w_i \neq 0$ are designated H-Neurons.
- Evaluation Metrics: Precision, recall, F1, and cross-domain generalization (on NQ-Open, BioASQ, NonExist) are reported.
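The probe-training and selection steps above can be sketched with scikit-learn. This is a toy illustration on synthetic data, assuming the CETT feature matrix is precomputed; the regularization strength `C=0.05` is an arbitrary choice for the sketch, not a value from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_examples, n_neurons = 200, 500

# hypothetical CETT feature matrix (answer-span features) and labels:
# y = 1 for hallucinated responses, y = 0 otherwise
X = rng.normal(size=(n_examples, n_neurons))
y = rng.integers(0, 2, size=n_examples)

# the l1 penalty drives most per-neuron weights to exactly zero
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(X, y)

# H-Neurons are the indices carrying nonzero probe weight
h_neurons = np.flatnonzero(probe.coef_[0])
print(f"{len(h_neurons)} of {n_neurons} neurons selected")
```

In practice `C` would be tuned so that the surviving set matches the reported sparsity level, and precision/recall/F1 would be evaluated on held-out and cross-domain splits.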
A critical empirical finding is that, using fewer than 0.1% of all FFN units, the probe substantially improves detection accuracy over baseline, reaching up to 83% on held-out and out-of-domain hallucination detection tasks.
3. Behavioral Causality and Intervention Studies
Direct manipulation of H-Neuron activations establishes their causal influence on model behavior:
- Activation Scaling: The pre-nonlinearity activation $a_i(t)$ of H-Neuron $i$ at token $t$ is re-scaled as $a_i(t) \to \alpha\, a_i(t)$ for a scaling factor $\alpha > 0$. The corresponding causal contribution scales linearly: $h_i(t) \to \alpha\, h_i(t)$.
- Compliance Benchmarks: Behavioral impact is measured across benchmarks targeting over-compliance, including FalseQA, FaithEval, Sycophancy, and Jailbreak.
- Quantitative Effects: Compliance rate increases monotonically for $\alpha > 1$ and decreases for $\alpha < 1$; average compliance slopes are roughly $2.4$–$3.0$ across model scales ($2.40$ reported for large models). One-sided $t$-tests confirm statistical significance for most values of $\alpha$.
This establishes H-Neurons as a direct cause of hallucinatory and over-compliant behaviors.
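The linearity claim in the scaling intervention can be checked directly if each neuron's contribution is modeled as its post-activation coefficient times a fixed value (down-projection) row, ignoring gating interactions. The sketch below is a simplified numpy model under that assumption, not the paper's intervention code.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 8, 16
acts = rng.normal(size=N)      # per-neuron activation coefficients at one token
V = rng.normal(size=(N, d))    # per-neuron value rows of the down-projection
h_idx = np.array([2, 5])       # hypothetical H-Neuron indices
alpha = 1.5                    # scaling factor applied only to H-Neurons

def ffn_update(a):
    # total hidden update is the sum of a_i * V_i over neurons
    return a @ V

scaled = acts.copy()
scaled[h_idx] *= alpha
delta = ffn_update(scaled) - ffn_update(acts)

# by linearity, the change equals (alpha - 1) times the H-Neurons' contribution
expected = (alpha - 1) * (acts[h_idx] @ V[h_idx])
print(np.allclose(delta, expected))  # True
```

This is why scaling $a_i(t)$ by $\alpha$ scales the neuron's contribution $h_i(t)$ by exactly $\alpha$ while leaving all other neurons' contributions untouched.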
4. Tracing Neural Origins: Pre-training versus Instruction Tuning
A core finding is that H-Neurons originate primarily during pre-training and are largely unaffected by downstream instruction tuning:
- Backward Transferability: A probe trained on instruction-tuned activations is directly applied to base (pre-SFT) model activations, achieving high AUROC (highest on TriviaQA), confirming predictive value in the absence of alignment data.
- Drift Quantification: For each neuron $i$, the drift of its projection weights between the base and instruction-tuned checkpoints is measured.
Aggregate drift is z-normalized and rank-normalized; H-Neurons cluster at high stability ranks with minimal drift, indicating minimal alteration post-alignment. This suggests the emergence of “compliance” circuits is largely a consequence of the foundational training, not alignment stages.
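The normalization steps in the drift analysis can be sketched as follows. This is a minimal numpy illustration on synthetic drift values; the drift distribution and the H-Neuron indices here are random placeholders, so no clustering is expected in this toy run.

```python
import numpy as np

rng = np.random.default_rng(3)
n_neurons = 1000
drift = rng.gamma(2.0, 1.0, size=n_neurons)            # hypothetical per-neuron weight drift
h_idx = rng.choice(n_neurons, size=10, replace=False)  # placeholder H-Neuron indices

# z-normalize: zero mean, unit variance across neurons
z = (drift - drift.mean()) / drift.std()

# rank-normalize to [0, 1], where 1.0 = most stable (lowest drift)
ranks = drift.argsort().argsort()
stability = 1.0 - ranks / (n_neurons - 1)

# in the reported results, H-Neurons cluster near stability 1.0;
# with random indices this mean is uninformative
print(stability[h_idx].mean())
```

The reported finding is that real H-Neuron indices concentrate at the high end of this stability ranking, i.e. their weights barely move during instruction tuning.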
5. Applied Implications and Mitigation Approaches
Key findings on H-Neurons directly inform strategies for detection and mitigation of hallucinations:
- Neuronal Probes: Lightweight, model-agnostic probes leveraging only the sparse H-Neuron set can serve as efficient hallucination detectors for LLM outputs.
- Activation Suppression: Real-time suppression ($\alpha < 1$) of these units reduces both hallucination and over-compliance across evaluation sets. However, uniform scaling impairs model helpfulness, indicating a need for selective or task-sensitive modulation strategies.
- Architectural Interventions: Recommendations include:
- Dynamic gating or mask layers to down-weight H-Neurons when accuracy is required,
- Regularization during pre-training to discourage over-dependence on compliance-encoding neurons,
- Modification of pre-training objectives through calibration losses or uncertainty penalties to disfavor formation of these “compliance” circuits.
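A task-sensitive gate of the kind recommended above can be sketched as follows. This is a hypothetical minimal design (the function name and the default suppression factor are assumptions, not from the paper): H-Neuron activations are down-weighted only when a factuality-critical flag is set, preserving full activations otherwise.

```python
import numpy as np

def gated_ffn_activations(acts, h_idx, suppress: bool, alpha: float = 0.5):
    """Down-weight H-Neuron activations only when factual accuracy is required.

    acts: (N,) activation vector for one token
    h_idx: indices of the identified H-Neurons
    suppress: task-sensitive flag (e.g., set for factual QA, clear for
        open-ended generation, to avoid the helpfulness cost of uniform scaling)
    alpha: suppression factor applied to H-Neurons when the gate is active
    """
    out = np.asarray(acts, dtype=float).copy()
    if suppress:
        out[h_idx] *= alpha
    return out

acts = np.array([1.0, -2.0, 0.5, 3.0])
h_idx = np.array([1, 3])
print(gated_ffn_activations(acts, h_idx, suppress=True))   # H-Neuron entries halved
print(gated_ffn_activations(acts, h_idx, suppress=False))  # unchanged
```

A production variant would replace the boolean flag with a learned or uncertainty-guided gate, which is exactly the modulation problem left open by uniform suppression.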
A summary of H-Neuron research axes and major results is presented below:
| Dimension | Finding | Quantitative Summary |
|---|---|---|
| Sparsity | Probe uses only a tiny fraction of FFN neurons | Under 0.1% of neurons; up to 83% detection accuracy |
| Generalization | High AUROC across domains | Highest on TriviaQA |
| Causal Impact | Monotonic compliance increase with activation scale $\alpha$ | Compliance slopes of $2.4$–$3.0$ |
| Origin | Pre-training phase | High stability rank with minimal drift |
6. Theoretical and Practical Significance
H-Neurons provide a mechanistic linkage between single-unit FFN dynamics and global LLM failure modes, reconciling macroscopic over-compliance behaviors with microscopic neural substrates. This substantiates a neuron-level “compliance bias” that is upstream of supervised alignment, challenging the assumption that hallucinations are solely byproducts of post-pretraining tuning or data quality.
A plausible implication is that robust factuality will not be achieved solely by alignment or prompting, but may require architectural and pre-training design changes that disrupt compliance-related neural circuits before their consolidation.
7. Open Directions and Future Work
Current suppression methods for H-Neuron activity, while effective at reducing hallucination and over-compliance, are blunt, sometimes degrading answer helpfulness. Future research directions include the development of finer-grained neuron editing or gating systems, neuron-level regularization during training, and a deeper exploration of the interaction between H-Neuron dynamics and model scaling laws. The potential for targeted architectural interventions, such as adaptive mask layers or uncertainty-guided gating, is currently under investigation (Gao et al., 1 Dec 2025).