
Hallucination-Associated Neurons in LLMs

Updated 2 December 2025
  • H-Neurons are defined as a sparse subset of feed-forward units whose activations, read out by an $\ell_1$-regularized logistic regression probe, statistically predict hallucination events.
  • They are identified by constructing balanced factual/hallucinated datasets and engineering CETT-based features, achieving up to 83% accuracy in detecting hallucinations across domains.
  • Experimental activation scaling demonstrates that H-Neurons causally influence model compliance, informing strategies for precise hallucination mitigation.

Hallucination-Associated Neurons (H-Neurons) are a sparse subset of feed-forward network (FFN) units in LLMs whose activity is tightly predictive of hallucination events—outputs that are plausible but factually incorrect. Recent work has provided a formal and empirical foundation for identifying, quantifying, and causally intervening upon these neurons. The existence of H-Neurons offers a bridge between macroscopic hallucination phenomena and the microscopic mechanisms encoded in neural architectures, furnishing tools for more reliable detection and mitigation of factual errors in LLM outputs (Gao et al., 1 Dec 2025).

1. Formal Definition and Mathematical Properties

An H-Neuron is defined with reference to the activations of all neurons in the FFN layers of an LLM. For $D$ total neurons, each neuron $j$ receives a weight $\theta_j$ from a sparse $\ell_1$-regularized logistic regression probe. Given per-neuron features $x \in \mathbb{R}^D$ for a single response, the probability $P(y=1 \mid x) = \sigma(\theta^\top x)$ models hallucination presence ($y=1$).
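As a concrete illustration, the following minimal sketch shows the probe's probability model and how $S_H$ falls out of the sparse weights. The values of $\theta$ and $x$ are toy stand-ins, not the paper's data; in practice $x$ would hold per-neuron CETT features.

```python
# Minimal sketch of the probe's probability model P(y=1|x) = sigma(theta^T x).
# theta and x are toy values; real x holds per-neuron CETT features.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D = 8                                                    # toy neuron count
theta = np.zeros(D)
theta[[2, 5]] = [1.5, 0.7]                               # sparse probe weights
x = np.array([0.1, 0.0, 0.9, 0.2, 0.0, 0.4, 0.1, 0.0])   # per-neuron features
p_halluc = sigmoid(theta @ x)                            # modeled hallucination probability
S_H = np.flatnonzero(theta > 0)                          # H-Neuron set: theta_j > 0
print(f"P(y=1|x) = {p_halluc:.3f}, S_H = {S_H.tolist()}")
```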

To quantify individual contribution, the CETT (Causal Effect on Token Trajectory) metric is used:

$$\mathrm{CETT}_{j,t} = \frac{\|h_t^{(j)}\|_2}{\|h_t\|_2},$$

where $h_t \in \mathbb{R}^d$ is the total token-wise hidden update and $h_t^{(j)}$ is the component due to neuron $j$. Aggregation over answer tokens $A$ or non-answer tokens yields

$$\overline{\mathrm{CETT}}_{j,\text{answer}} = \frac{1}{|A|} \sum_{t \in A} \mathrm{CETT}_{j,t}, \qquad \overline{\mathrm{CETT}}_{j,\text{other}} = \frac{1}{|T \setminus A|} \sum_{t \in T \setminus A} \mathrm{CETT}_{j,t}.$$

The probe's $\ell_1$ penalty ensures that only a minuscule fraction of neurons (typically $|S_H|/D < 0.1\%$) receive nonzero weights $\theta_j > 0$, forming the set of H-Neurons $S_H$.
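A minimal numerical sketch of the CETT ratio follows, assuming the standard FFN decomposition $h_t = \sum_j a_{j,t}\, W_{\mathrm{down}}[:,j]$ (so that $\|h_t^{(j)}\|_2 = |a_{j,t}|\,\|W_{\mathrm{down}}[:,j]\|_2$); array shapes and names are illustrative, not taken from the paper.

```python
# Minimal sketch of per-neuron, per-token CETT for one FFN layer.
import numpy as np

def cett(a, W_down, h):
    """a: (D_ff, T) post-nonlinearity activations; W_down: (d_model, D_ff);
    h: (d_model, T) total FFN update. Returns CETT[j, t]."""
    col_norms = np.linalg.norm(W_down, axis=0)      # ||W_down[:, j]||_2, shape (D_ff,)
    contrib = np.abs(a) * col_norms[:, None]        # ||h_t^(j)||_2 per neuron/token
    return contrib / np.linalg.norm(h, axis=0)      # divide by ||h_t||_2, broadcast over t

rng = np.random.default_rng(0)
D_ff, d_model, T = 64, 16, 8
W_down = rng.normal(size=(d_model, D_ff))
a = rng.normal(size=(D_ff, T))                      # toy activations
h = W_down @ a                                      # h_t = sum_j a[j, t] * W_down[:, j]
C = cett(a, W_down, h)
answer_tokens = [5, 6, 7]                           # toy answer-span indices A
cett_answer = C[:, answer_tokens].mean(axis=1)      # CETT-bar_{j, answer}
```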

2. Identification Protocols and Predictive Generalization

The canonical methodology for identifying H-Neurons consists of the following steps (a minimal end-to-end sketch follows the list):

  • Dataset Construction: Balanced sets of faithfully factual and fully hallucinatory responses, as in 1,000-trial splits from TriviaQA, are collected with randomized sampling (temperature=1.0, top_k=50, top_p=0.9) and filtered for consistency.
  • Feature Engineering: Per-example vectors $x^{(s)}$ concatenate the $\overline{\mathrm{CETT}}$ metrics across all FFN neurons for both answer and non-answer spans.
  • Label Assignment and Probe Training: Answer-span features from hallucinated outputs are labeled $y=1$, and all others $y=0$. The sparse logistic objective is solved for weights $\theta$.
  • H-Neuron Selection: Neurons with $\theta_j > 0$ are designated H-Neurons.
  • Evaluation Metrics: Precision, recall, F1, and cross-domain generalization (on NQ-Open, BioASQ, NonExist) are reported.
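Under toy stand-in features, a minimal end-to-end version of this protocol (balanced splits, sparse probe fit, selection, evaluation) might look as follows. Dataset construction and CETT extraction are stubbed out with random data, and all names and constants are assumptions.

```python
# Sketch of the identification pipeline on synthetic stand-in features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(1)
D = 2048                                   # FFN neurons; features = answer + other spans

def toy_features(n):
    """Stand-in for extracting CETT-bar features from factual/hallucinated responses."""
    X = rng.normal(size=(n, 2 * D))
    y = rng.integers(0, 2, size=n)
    X[y == 1, :8] += 0.5                   # plant a weak signal so the probe has a target
    return X, y

X_train, y_train = toy_features(1000)      # balanced in-domain split (e.g., TriviaQA)
X_test, y_test = toy_features(500)         # held-out / out-of-domain split

probe = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
probe.fit(X_train, y_train)
S_H = np.flatnonzero(probe.coef_.ravel() > 0)       # H-Neuron selection: theta_j > 0
p, r, f1, _ = precision_recall_fscore_support(
    y_test, probe.predict(X_test), average="binary")
print(f"|S_H| = {S_H.size}  precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```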

A critical empirical finding is that, using fewer than $0.1\%$ of all FFN units, the probe improves detection accuracy from $\sim 61$–$68\%$ (baseline) to $\sim 76$–$83\%$ on held-out and out-of-domain hallucination detection tasks.

3. Behavioral Causality and Intervention Studies

Direct manipulation of H-Neuron activations establishes their causal influence on model behavior:

  • Activation Scaling: The pre-nonlinearity activation of H-Neuron $j$ at token $t$ is rescaled as $z_{j,t} \leftarrow \alpha \cdot z_{j,t}$ with $\alpha \in [0,3]$. The corresponding causal contribution scales linearly: $\mathrm{CETT}_{j,t}(\alpha) \approx \alpha\,\mathrm{CETT}_{j,t}$ (see the hook sketch after this list).
  • Compliance Benchmarks: Behavioral impact is measured across benchmarks targeting over-compliance, including FalseQA, FaithEval, Sycophancy, and Jailbreak.
  • Quantitative Effects: Compliance rate increases monotonically with $\alpha > 1$ and decreases for $\alpha < 1$; the average compliance slope is $d\,\mathrm{ComplianceRate}/d\alpha \approx 3.03$ (small models) and $\approx 2.40$ (large models). One-sided $t$-tests confirm statistical significance ($p < 0.001$ for most $\alpha \neq 1$).

This establishes H-Neurons as a direct cause of hallucinatory and over-compliant behaviors.
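A minimal PyTorch sketch of such an intervention is shown below: a forward hook on a toy FFN's up-projection rescales the pre-nonlinearity activations of chosen units by $\alpha$. The module layout, indices, and hyperparameters are assumptions for illustration, not the paper's code.

```python
# Sketch: rescale selected pre-nonlinearity FFN activations via a forward hook.
import torch
import torch.nn as nn

class ToyFFN(nn.Module):
    """Generic two-layer FFN standing in for one transformer MLP block."""
    def __init__(self, d_model=16, d_ff=64):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def scale_h_neurons(h_neuron_idx, alpha):
    """Forward hook implementing z_{j,t} <- alpha * z_{j,t}. With ReLU,
    relu(alpha * z) = alpha * relu(z) for alpha >= 0, so the neuron's CETT
    contribution scales linearly in alpha."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., h_neuron_idx] *= alpha
        return output                       # returned value replaces the output
    return hook

ffn = ToyFFN()
handle = ffn.up.register_forward_hook(scale_h_neurons([3, 17], alpha=2.0))
out = ffn(torch.randn(2, 8, 16))            # (batch, tokens, d_model)
handle.remove()                             # restore unmodified behavior
```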

4. Tracing Neural Origins: Pre-training versus Instruction Tuning

A core finding is that H-Neurons originate primarily during pre-training and are largely unaffected by downstream instruction tuning:

  • Backward Transferability: A probe trained on instruction-tuned activations ($\theta$) is directly applied to base (pre-SFT) model activations, achieving AUROC $\gg 0.5$ (up to $\sim 0.86$ on TriviaQA), confirming predictive value in the absence of alignment data.
  • Drift Quantification: For each neuron $j$, projection weight drift is measured as

$$\Delta_j^{\mathrm{up}} = 1 - \cos\!\left(W_{\mathrm{up},j}^{\mathrm{base}},\, W_{\mathrm{up},j}^{\mathrm{chat}}\right), \qquad \Delta_j^{\mathrm{down}} = 1 - \cos\!\left(W_{\mathrm{down},j}^{\mathrm{base}},\, W_{\mathrm{down},j}^{\mathrm{chat}}\right)$$

Aggregate drift $\Delta_j$ is z-normalized and rank-normalized; H-Neurons cluster at high $r_j$ (mean $> 0.58$, $p < 0.001$), indicating minimal alteration post-alignment. This suggests the emergence of “compliance” circuits is largely a consequence of the foundational training, not alignment stages.
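A minimal sketch of the drift computation follows. The rank orientation (low drift $\to$ high $r_j$) is an assumption chosen to match the statement that H-Neurons cluster at high $r_j$ while drifting little; shapes and names are illustrative.

```python
# Sketch: cosine drift between base and chat projection weights, rank-normalized.
import numpy as np

def drift(w_base, w_chat):
    """Delta_j = 1 - cos(w_base, w_chat) for one neuron's projection weights."""
    cos = w_base @ w_chat / (np.linalg.norm(w_base) * np.linalg.norm(w_chat))
    return 1.0 - cos

rng = np.random.default_rng(0)
D_ff, d_model = 64, 16
W_up_base = rng.normal(size=(D_ff, d_model))
W_up_chat = W_up_base + 0.01 * rng.normal(size=(D_ff, d_model))  # small post-SFT drift

delta_up = np.array([drift(W_up_base[j], W_up_chat[j]) for j in range(D_ff)])
# Rank-normalize so that LOW drift maps to HIGH r_j (assumed orientation).
r = (-delta_up).argsort().argsort() / (D_ff - 1)
```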

5. Applied Implications and Mitigation Approaches

Key findings on H-Neurons directly inform strategies for detection and mitigation of hallucinations:

  • Neuronal Probes: Lightweight, model-agnostic probes leveraging only the sparse H-Neuron set can serve as efficient hallucination detectors for LLM outputs.
  • Activation Suppression: Real-time suppression ($\alpha < 1$) of these units reduces both hallucination and over-compliance across evaluation sets. However, uniform scaling impairs model helpfulness, indicating a need for selective or task-sensitive modulation strategies.
  • Architectural Interventions: Recommendations include:
    • Dynamic gating or mask layers to down-weight H-Neurons when accuracy is required (a toy gate sketch follows this list),
    • Regularization during pre-training to discourage over-dependence on compliance-encoding neurons,
    • Modification of pre-training objectives through calibration losses or uncertainty penalties to disfavor formation of these “compliance” circuits.
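As a toy illustration of the first recommendation, the sketch below gates H-Neuron activations toward a floor value when a caller-supplied factuality flag is set. It is purely illustrative under assumed module names and indices, not a mechanism from the paper.

```python
# Toy FFN with a static mask that down-weights H-Neurons on demand.
import torch
import torch.nn as nn

class GatedFFN(nn.Module):
    def __init__(self, d_model=16, d_ff=64, h_neuron_idx=(3, 17), floor=0.25):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        gate = torch.ones(d_ff)
        gate[list(h_neuron_idx)] = floor        # suppress H-Neurons toward `floor`
        self.register_buffer("h_gate", gate)

    def forward(self, x, strict_factual=False):
        z = torch.relu(self.up(x))
        if strict_factual:                      # apply the mask only when accuracy matters
            z = z * self.h_gate
        return self.down(z)

ffn = GatedFFN()
y = ffn(torch.randn(2, 8, 16), strict_factual=True)
```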

A summary of H-Neuron research axes and major results is presented below:

| Dimension | Finding | Quantitative Summary |
|---|---|---|
| Sparsity | $\lvert S_H \rvert / D < 0.1\%$ | Probe uses $\sim 0.1\%$ of neurons |
| Generalization | AUROC $\gg 0.5$ across domains | Up to $\sim 0.86$ (TriviaQA) |
| Causal Impact | Monotonic compliance increase with $\alpha$ | $d\,\mathrm{Compliance}/d\alpha \approx 2.4$–$3.0$ |
| Origin | Pre-training phase | High $r_j$ with minimal drift |

6. Theoretical and Practical Significance

H-Neurons provide a mechanistic linkage between single-unit FFN dynamics and global LLM failure modes, reconciling macroscopic over-compliance behaviors with microscopic neural substrates. This substantiates a neuron-level “compliance bias” that is upstream of supervised alignment, challenging the assumption that hallucinations are solely byproducts of post-pretraining tuning or data quality.

A plausible implication is that robust factuality will not be achieved solely by alignment or prompting, but may require architectural and pre-training design changes to disrupt compliance-related neural circuits before their consolidation.

7. Open Directions and Future Work

Current suppression methods for H-Neuron activity, while effective at reducing hallucination and over-compliance, are blunt, sometimes degrading answer helpfulness. Future research directions include the development of finer-grained neuron editing or gating systems, neuron-level regularization during training, and a deeper exploration of the interaction between H-Neuron dynamics and model scaling laws. The potential for targeted architectural interventions, such as adaptive mask layers or uncertainty-guided gating, is currently under investigation (Gao et al., 1 Dec 2025).
