
H-Neurons: Neuron-Level Drivers of Hallucinations

Updated 11 January 2026
  • H-Neurons are a sparse subset of feed-forward neurons identified via high CETT scores that causally drive hallucinations in LLM outputs.
  • The neurons are isolated via ℓ1-regularized logistic regression and causally validated via pre-activation scaling, achieving notably higher detection accuracy than random neuron subsets (e.g., 78.4% vs. 61.7% on Mistral-7B).
  • Targeted manipulation of H-Neurons offers promising interventions to mitigate hallucinations while preserving generation fluency and reliability.

Hallucination-associated neurons (“H-Neurons”) are a distinct, sparse subset of feed-forward network (FFN) neurons within LLMs whose activation patterns directly predict and causally drive the occurrence of hallucinations—outputs that are plausible but factually incorrect. Unlike analysis at the whole-model or dataset level, H-Neurons provide a microscopic substrate connecting neuron-level dynamics to macroscopic unreliability phenomena, and enable advances in hallucination detection, mechanistic understanding, and system intervention (Gao et al., 1 Dec 2025).

1. Formal Definition and Identification

H-Neurons are formally defined as FFN neurons whose partial output contributions, aggregated over the answer-span tokens, are highly predictive of hallucination occurrence. For neuron $j$ in layer $\ell$ at token $t$, the individual down-projected contribution is $h_t^{(j)} = W_{\mathrm{down}} z_t^{(j)}$, and the CETT (Causal-Effect Token-level Transfer) ratio is

$$\mathrm{CETT}_{j,t} = \frac{\| h_t^{(j)} \|_2}{\| h_t \|_2}, \qquad h_t = W_{\mathrm{down}} z_t.$$

For each sample, mean CETT scores are computed over the answer tokens $A$ (and, separately, over non-answer tokens):

$$\overline{\mathrm{CETT}}_{j,\mathrm{answer}} = \frac{1}{|A|} \sum_{t \in A} \mathrm{CETT}_{j,t}.$$

The vector of mean CETT scores across neurons forms the input to an $\ell_1$-regularized logistic regression:

$$\Pr(y = 1 \mid x) = \sigma(\theta^{\top} x).$$

Strong regularization enforces sparsity; neurons with $\theta_j > 0$ after training are designated H-Neurons (Gao et al., 1 Dec 2025).
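The per-neuron CETT ratio can be sketched in a few lines of numpy. Since $h_t^{(j)} = z_{j,t} \cdot W_{\mathrm{down}}[:, j]$, its norm factorizes as $|z_{j,t}| \cdot \|W_{\mathrm{down}}[:, j]\|_2$, avoiding any per-neuron matrix multiply. This is an illustrative sketch under the definitions above; the function name and array shapes are assumptions, not the authors' code.

```python
import numpy as np

def cett_scores(W_down, Z, answer_mask):
    """Mean per-neuron CETT over the answer-span tokens.

    W_down:      (d_model, d_ff) down-projection matrix.
    Z:           (T, d_ff) post-activation neuron values z_{j,t}.
    answer_mask: (T,) boolean selecting answer-span tokens A.
    Returns a (d_ff,) vector of mean CETT_{j,answer} scores.
    """
    H = Z @ W_down.T                                 # full FFN outputs h_t, (T, d_model)
    h_norm = np.linalg.norm(H, axis=1)               # ||h_t||_2 per token
    col_norm = np.linalg.norm(W_down, axis=0)        # ||W_down[:, j]||_2 per neuron
    # ||h_t^{(j)}||_2 = |z_{j,t}| * ||W_down[:, j]||_2, so:
    cett = np.abs(Z) * col_norm[None, :] / h_norm[:, None]   # (T, d_ff)
    return cett[answer_mask].mean(axis=0)            # average over answer tokens
```

The resulting vector is the per-sample feature $x$ fed to the logistic regression.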

The selection process yields an extremely sparse set, typically $0.01$–$0.35$ per mille (‰) of all neurons, with thresholding and a grid search over the regularization strength to maximize held-out hallucination-detection accuracy and functional safety.
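A minimal numpy sketch of the sparse selection step: $\ell_1$-regularized logistic regression solved by proximal gradient descent (soft-thresholding), with H-Neurons read off as the positive surviving weights. The optimizer, hyperparameters, and function names here are illustrative assumptions; the paper does not specify its solver.

```python
import numpy as np

def l1_logreg(X, y, lam=0.1, lr=0.1, steps=2000):
    """Sparse logistic regression Pr(y=1|x) = sigmoid(theta^T x).

    X: (n, p) mean-CETT feature vectors; y: (n,) labels in {0, 1}.
    lam is the l1 strength; larger lam -> sparser theta.
    Solved by proximal gradient: gradient step, then soft-threshold.
    """
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ theta)))          # sigmoid(theta^T x)
        grad = X.T @ (probs - y) / n                        # logistic-loss gradient
        theta -= lr * grad
        # proximal (soft-threshold) step enforcing sparsity
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)
    return theta

# H-Neurons are the neurons with positive weight after sparse training:
#   h_neurons = np.flatnonzero(theta > 0)
```

In the paper's pipeline, `lam` would be chosen by grid search on held-out detection accuracy; here it is a fixed placeholder.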

2. Quantitative Characterization

Empirical analysis across six LLM architectures (e.g. Mistral-7B-v0.3, Gemma-3-4B, Llama-3.3-70B) demonstrates the following:

  • Sparsity: H-Neurons constitute less than $0.1\%$ of model neurons.
  • Predictive Power: Hallucination detection accuracy is substantially higher when using H-Neurons versus random neuron subsets. For instance, Mistral-7B-v0.3 achieves 78.4% accuracy vs. 61.7% for random selection (TriviaQA benchmark).
  • Robustness: The elevated predictive accuracy holds across in-domain (TriviaQA, NQ-Open), cross-domain (BioASQ), and fabricated (NonExist) settings.
Model           | Ratio (‰) | TriviaQA (H-Neurons) | TriviaQA (Random)
Mistral-7B-v0.3 | 0.35      | 78.4                 | 61.7
Llama-3.3-70B   | 0.01      | 82.7                 | 68.4

The rank distribution of selected $\theta_j$ parameters is sharply concentrated, indicating that only a minuscule "tail" of neurons encodes the hallucination signal (Gao et al., 1 Dec 2025).

3. Causal Manipulation and Behavioral Impact

Causality is established via pre-activation scaling. For each H-Neuron, the update $z_{j,t} \leftarrow \alpha \cdot z_{j,t}$ with $\alpha \in [0, 3]$ linearly modulates its CETT contribution:

$$\mathrm{CETT}_{j,t}(\alpha) = \alpha \cdot \mathrm{CETT}_{j,t}.$$

Increasing $\alpha$ increases the rate of hallucination and over-compliance behaviors; decreasing $\alpha$ suppresses both. Benchmarks across four compliance tasks (FalseQA, FaithEval, Sycophancy, Jailbreak) show a consistent monotonic relationship:
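The intervention itself is a one-line rescaling of the selected pre-activations, and its linear effect on each scaled neuron's contribution norm $\|h_t^{(j)}\|_2$ follows directly from the factorization $\|h_t^{(j)}\|_2 = |z_{j,t}| \cdot \|W_{\mathrm{down}}[:, j]\|_2$. A minimal numpy sketch (function names are assumptions):

```python
import numpy as np

def neuron_contrib_norms(W_down, Z):
    """||h_t^{(j)}||_2 for every token t and neuron j, shape (T, d_ff)."""
    return np.abs(Z) * np.linalg.norm(W_down, axis=0)[None, :]

def scale_h_neurons(Z, h_idx, alpha):
    """Pre-activation scaling z_{j,t} <- alpha * z_{j,t} on H-Neurons h_idx."""
    Z = Z.copy()
    Z[:, h_idx] *= alpha
    return Z
```

Scaling with $\alpha > 1$ amplifies only the selected neurons' contributions; $\alpha < 1$ damps them, leaving all other neurons untouched.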

  • Mistral-7B-v0.3: compliance rises from 45% ($\alpha = 1$) to 75% ($\alpha = 3$) on FalseQA.
  • Larger models exhibit lower slopes than smaller ones, indicating divergent robustness profiles.

This causal relationship affirms that H-Neurons encode a generic over-compliance mechanism, not restricted to isolated contextual hallucination (Gao et al., 1 Dec 2025). Excessive suppression impairs fluency, confirming functional entanglement with general generation circuitry.

4. Origins and Transferability

Origin analysis reveals that H-Neurons are already present, with nearly identical parameterizations, in base (pre-trained, unaligned) models. Applying $\theta$ vectors learned on instruction-tuned models to their pre-trained counterparts yields high AUROC for hallucination prediction (e.g., AUROC ≈ 0.86 on TriviaQA for Mistral-7B).

Cosine distance analysis of up/down-projection weights shows H-Neurons undergo minimal drift during alignment, clustering at high normalized rank for parameter preservation (≈ 0.97). This pattern is robust across all examined families ($p < 0.001$), indicating that H-Neurons emerge during pre-training, not as a product of supervised fine-tuning or RLHF, and that the neural substrate predisposing LLMs toward hallucination is fundamental to the next-token prediction regime (Gao et al., 1 Dec 2025).
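The drift measurement reduces to a per-neuron cosine distance between corresponding weight columns in the base and aligned checkpoints. A minimal sketch, assuming weight matrices with one column per neuron (function name is an assumption):

```python
import numpy as np

def neuron_drift(W_base, W_tuned):
    """Per-neuron cosine distance 1 - cos(w_base_j, w_tuned_j).

    W_base, W_tuned: (d_model, d_ff) matching projection matrices from the
    pre-trained and alignment-tuned checkpoints; column j is neuron j.
    Returns a (d_ff,) vector; 0 means the neuron's weights are unchanged
    in direction, 2 means fully reversed.
    """
    num = (W_base * W_tuned).sum(axis=0)
    den = np.linalg.norm(W_base, axis=0) * np.linalg.norm(W_tuned, axis=0)
    return 1.0 - num / den
```

Comparing the drift of H-Neurons against the full neuron population is what yields the reported high normalized preservation rank.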

5. Comparative Approaches: Hallucination Estimation via Internal States

Complementary work explores the use of neuron-level activations for pre-response hallucination estimation. Probing estimators leveraging the internal-state embedding of the last query token in deep layers (≥ 25) predict hallucination risk with 84.32% accuracy across 15 diverse NLG tasks (Ji et al., 2024). Neurons ranked by mutual information with hallucination labels, with the top-$k$ retained (typically $k = 8$), enable calibratable, low-overhead, real-time risk predictors. These neurons function analogously to metacognitive uncertainty signals, furnishing mechanisms for self-monitoring and automated retrieval intervention. The probe architecture is a small gated MLP of the form

$$H(x_q) = \mathrm{down}\big(\mathrm{up}(x_q) \odot \mathrm{SiLU}(\mathrm{gate}(x_q))\big)$$

with standard cross-entropy training and evaluation by per-task F1/accuracy (Ji et al., 2024).
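The forward pass of such a gated-MLP probe is compact enough to write out directly. A minimal numpy sketch of the formula above, with weight shapes and function names chosen for illustration:

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def probe_logits(x_q, W_up, W_gate, W_down):
    """Gated-MLP hallucination probe: down(up(x) * SiLU(gate(x))).

    x_q:    (d,) internal-state embedding of the last query token.
    W_up, W_gate: (h, d) projections to the probe's hidden width h.
    W_down: (c, h) projection to c output logits (e.g., c=2 classes).
    """
    return W_down @ (W_up @ x_q * silu(W_gate @ x_q))
```

In practice the logits would be trained with cross-entropy against hallucination labels; here only the architecture is shown.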

6. Implications for Mitigation, Theory, and Future Systems

  • Detection: H-Neurons provide compact, high-SNR signals, generalizing across domains and models.
  • Intervention: Direct neuron suppression (downscaling $\alpha$) reliably reduces hallucinations and over-compliance; dynamic, context-aware gating is needed to avoid loss of fluency.
  • Theoretical Insight: The emergence of H-Neurons during pre-training, and their linkage to over-compliance, supports the hypothesis that hallucination is a fundamental side-effect of the next-token training objective and not merely a misalignment artifact (Gao et al., 1 Dec 2025).
  • Practical Systems: Neural substrate monitoring enables token-level alarms, early warning of hallucination risk before output generation, and automatic triggers for retrieval augmentation or human-in-the-loop escalation (Ji et al., 2024).

A plausible implication is that neuron-level interventions—such as conditional gating, learnable masks, or editing outgoing weights—may become practical routes to robustly mitigate undesired generation behaviors, including hallucinations, bias, and sycophancy.
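One such route, combining the pieces above, would be a conditional gate that damps H-Neurons only when a risk estimator flags the current context, leaving low-risk generations untouched and so limiting fluency loss. This is a hypothetical sketch, not a method from either paper; all names and thresholds are assumptions:

```python
import numpy as np

def conditional_gate(Z, h_idx, risk, alpha=0.5, threshold=0.5):
    """Hypothetical context-aware intervention.

    Z:      (T, d_ff) pre-activations for the current layer.
    h_idx:  indices of the identified H-Neurons.
    risk:   scalar hallucination-risk estimate (e.g., from a probe).
    Damps the H-Neurons by alpha only when risk exceeds the threshold.
    """
    if risk > threshold:
        Z = Z.copy()
        Z[:, h_idx] *= alpha
    return Z
```

A deployed version would need the risk signal calibrated per task and the damping factor tuned against fluency metrics.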

7. Relation to Multi-Modal and Concept Neurons

Analysis of multi-modal transformers further highlights neuron subsets with sensitivity, specificity, and causal effect in driving semantic or concept-specific outputs (e.g., controlling nouns in image captions by editing a small set of neurons) (Pan et al., 2023). Although direct hallucination ablation is unstudied in these works, the identification and targeted manipulation of concept neurons reinforce the principle that microscopic neuronal control suffices for precise output modulation, generalizes across inputs and modalities, and preserves global model parameters.

The convergence of findings on H-Neurons and multimodal concept neurons suggests a broader landscape of interpretable, actionable neuron-level mechanisms underpinning reliability and controllability in deep model architectures.
