Shared Safety Neurons in LLM Safety
- Shared Safety Neurons (SS-Neurons) are defined by distinct activation patterns that differentiate benign from harmful inputs, acting as a safety filter in LLMs.
- They are identified through systematic activation contrasting and causal ablation using metrics such as difference-of-means, logistic regression weights, and outlier z-scores.
- Empirical studies show that manipulating less than 0.3% of these neurons can dramatically impact safety, offering both targeted realignment opportunities and exposing vulnerabilities.
Shared Safety Neurons (SS-Neurons) are a sparse, causally significant subset of neurons within LLMs that mediate safety-aligned behaviors, specifically the detection and suppression of harmful or unsafe outputs. Notably, SS-Neurons transfer across models, languages, and tasks, marking them as a critical locus of safety alignment; this transferability enables new interpretability-based safety mechanisms while also exposing new vulnerabilities in neural architectures.
1. Definitions and Formal Properties
Safety neurons are defined by distinct activation patterns that differentiate benign from malicious inputs. For a feed-forward sublayer $\ell$ in a transformer, neuron $i$ is a safety neuron if its expected activation under a set of malicious prompts exceeds that under benign prompts by a defined threshold $\tau$:

$$\mathbb{E}_{x \sim \mathcal{D}_{\text{mal}}}\big[a_i^{(\ell)}(x)\big] \;-\; \mathbb{E}_{x \sim \mathcal{D}_{\text{ben}}}\big[a_i^{(\ell)}(x)\big] \;>\; \tau$$
Shared Safety Neurons (SS-Neurons) are defined as the intersection of safety neuron sets identified in two models (or languages) with shared architecture and alignment protocols, i.e., they are those safety neurons whose indices are preserved across variants:

$$\mathcal{S}_{\text{SS}} = \mathcal{S}_{A} \cap \mathcal{S}_{B},$$

where $\mathcal{S}_{A}$ and $\mathcal{S}_{B}$ denote the safety-neuron sets of models $A$ and $B$;
or, in the multilingual case, as the intersection of monolingual safety neurons between a high-resource (HR) and a non-high-resource (NHR) language:

$$\mathcal{S}_{\text{SS}} = \mathcal{S}_{\text{HR}} \cap \mathcal{S}_{\text{NHR}},$$

where $\mathcal{S}_{L}$ denotes the monolingual safety neurons for language $L$ (Wu et al., 15 Sep 2025; Zhang et al., 1 Feb 2026; Chen et al., 2024).
2. Identification Methodologies for SS-Neurons
Identification of SS-Neurons proceeds via systematic activation contrasting and causal ablation. White-box frameworks assemble balanced sets of benign and malicious prompts, extract neuron activations at specific layers, and train linear probes or logistic regressors to separate the two classes. Key metrics for selection are:
- Difference-of-means ($\Delta\mu_i$): quantifies the activation shift between malicious and benign prompt distributions.
- Logistic regression weight ($w_i$): a direct measure of neuron $i$'s contribution to safety discrimination.
- Outlier z-score ($z_i$): the statistical prominence of a neuron's weight relative to the layer-wide distribution.
SS-Neurons are thus identified as those whose indices (or causal importance) persist when the above methodology is applied to multiple aligned models or languages. Cross-lingual SS-Neurons are specifically those neurons whose removal impairs safety refusals in both high-resource (e.g., English) and non-high-resource languages (Zhang et al., 1 Feb 2026), and whose activations can be traced to a core subset (<0.3% of total neurons).
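The activation-contrasting selection step can be sketched in a few lines. This is a minimal illustration on synthetic activations (the planted neuron indices, sample sizes, and z-score threshold are all hypothetical), not the papers' actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_neurons = 200, 1000

# Synthetic activations for one MLP sublayer (hypothetical data):
# neurons 10, 42, and 77 are planted to respond more strongly to malicious prompts.
benign = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
malicious = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
malicious[:, [10, 42, 77]] += 3.0

# Difference-of-means: activation shift per neuron between the two distributions.
delta_mu = malicious.mean(axis=0) - benign.mean(axis=0)

# Outlier z-score: statistical prominence of each neuron's shift.
z = (delta_mu - delta_mu.mean()) / delta_mu.std()

# Candidate safety neurons: shifts more than 4 standard deviations out.
safety_neurons = np.where(z > 4.0)[0]
print(safety_neurons.tolist())  # the planted neurons dominate the selection
```

In practice the same contrastive sets would also feed the logistic-probe selection, and candidates surviving across models or languages form the SS-Neuron set.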
Generation-time activation contrasting, as introduced in mechanistic interpretability, tracks per-token activations of candidate neurons between SFT and safety-aligned models to define a "change score," enabling ranking and sparse selection of the top-K neurons. Dynamic activation patching then empirically verifies causality on open-ended generations (Chen et al., 2024).
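The change-score ranking can be sketched as follows, using hypothetical per-token activations for an SFT model and its aligned counterpart (the shifted indices and K are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_neurons = 50, 512

# Hypothetical per-token activations of the same neurons in an SFT model
# and its safety-aligned counterpart on identical generations.
act_sft = rng.normal(size=(n_tokens, n_neurons))
act_aligned = act_sft.copy()
act_aligned[:, [3, 99, 200]] += 2.0  # alignment shifts only a sparse subset

# Change score: mean absolute per-token activation difference per neuron.
change_score = np.abs(act_aligned - act_sft).mean(axis=0)

# Sparse selection: keep the top-K neurons by change score.
K = 3
top_k = np.argsort(change_score)[::-1][:K]
print(sorted(top_k.tolist()))  # → [3, 99, 200]
```

Dynamic activation patching on the selected top-K set then serves as the causal check on open-ended generations.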
3. Causal Role, Sparsity Patterns, and Empirical Validation
The causal role of SS-Neurons is confirmed via masking and activation patching experiments. Deactivating only the SS-Neurons during inference leads to drastic safety drops, as measured by Attack Success Rate (ASR) on red-teaming tasks; for example, masking <0.6% of MLP neurons raises ASR to 74.4% in certain LLaMA variants (Wu et al., 15 Sep 2025). Crucially, masking randomly selected neurons of equal number causes negligible effect (see Table 1).
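A masking experiment of this kind can be illustrated on a toy MLP sublayer. Everything here is synthetic (random weights, a tanh nonlinearity, and a hypothetical 10-neuron safety set); the point is only the mechanic of zeroing a sparse hidden subset at inference:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_mlp = 64, 2048

# Toy MLP sublayer (illustrative, not a real LLM): projection, tanh, projection.
W_in = rng.normal(scale=d_model ** -0.5, size=(d_model, d_mlp))
W_out = rng.normal(scale=d_mlp ** -0.5, size=(d_mlp, d_model))

def mlp(x, masked_neurons=None):
    h = np.tanh(x @ W_in)
    if masked_neurons is not None:
        h = h.copy()
        h[:, masked_neurons] = 0.0   # ablate the selected neurons at inference
    return h @ W_out

x = rng.normal(size=(1, d_model))
safety_idx = list(range(10))         # hypothetical safety set: 10/2048 ≈ 0.5%
y_full = mlp(x)
y_ablated = mlp(x, masked_neurons=safety_idx)

# Ablating ~0.5% of hidden units measurably shifts the sublayer's output;
# a same-size random mask would serve as the control condition in practice.
print(np.abs(y_full - y_ablated).max() > 0.0)  # → True
```

The real experiments measure the downstream effect as ASR on red-teaming prompts rather than raw output deltas.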
Sparsity patterns are highly consistent: only 15–60 neurons per MLP layer (≈0.3–0.6% of the layer's width) are labeled as safety neurons. For cross-lingual SS-Neurons, the core set per language is typically under 0.3% of all neurons, yet their suppression is sufficient to significantly compromise safety across multiple languages (Zhang et al., 1 Feb 2026).
This extremely low-rank structure enables targeted safety realignment, as patching only the identified safety neurons can recover over 90% of the safety gap between SFT and DPO models, as demonstrated across Llama, Mistral, and Gemma architectures (Chen et al., 2024).
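The patching operation itself is simple to sketch. Below, hypothetical hidden activations for an SFT model and a DPO-aligned sibling differ only on a sparse (assumed) safety-neuron set, and the sparse copy closes the gap entirely in this toy setting:

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d_mlp = 8, 512

# Hypothetical hidden activations on the same prompt: an SFT model and a
# safety-aligned (DPO) sibling that differ only on a sparse neuron set.
safety_idx = [7, 101, 300]
h_sft = rng.normal(size=(n_tokens, d_mlp))
h_dpo = h_sft.copy()
h_dpo[:, safety_idx] += 1.5

# Activation patching: copy the aligned model's activations into the
# SFT model's forward pass, but only at the identified safety neurons.
h_patched = h_sft.copy()
h_patched[:, safety_idx] = h_dpo[:, safety_idx]

print(np.allclose(h_patched, h_dpo))  # → True: the sparse patch closes the gap
```

In real models the two activation sets differ everywhere, which is why patching only the identified neurons recovering >90% of the safety gap is the nontrivial empirical finding.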
4. Transferability Across Models, Languages, and Tasks
Empirical studies demonstrate that SS-Neurons exhibit robust transferability across model variants, languages, and even tasks. Pruning (or patching) SS-Neurons identified in a base model reliably induces safety degradation (or restoration) in fine-tuned and distilled sibling models, with increases in ASR from <5% to >75% in multiple families (Wu et al., 15 Sep 2025).
In the multilingual setting, ablation of SS-Neurons in one non-high-resource language (NHR) propagates safety degradation across all other NHR languages, indicating that the shared neurons act as cross-lingual safety bridges (Zhang et al., 1 Feb 2026). Neuron patches identified for one downstream task transfer with minimal safety loss to other tasks, supporting the universality of these pathways (Yi et al., 2024).
This transferability is underpinned by the one-to-one index mapping of neurons in shared MLP blocks and is further supported by overlap and ranking correlations: ∼75% overlap between safety and helpfulness neuron sets, with Spearman ρ ≈ 0.6 for their change-score rankings (Chen et al., 2024).
5. Applications: Attacks, Defenses, and Realignment
The identification and transfer properties of SS-Neurons underpin both powerful attack and defense strategies:
Attack Vectors: NeuroStrike exploits the concentration of safety mechanisms in SS-Neurons to enable reliable white-box and black-box jailbreaks. By deactivating these neurons, ASR against aligned LLMs is increased up to 76.9%. Transferability allows adversaries to pre-compute SS-Neurons on public models and attack proprietary or multilingual systems, with black-box profiling achieving 63.7% ASR even without direct model access (Wu et al., 15 Sep 2025).
Defensive Interventions: SS-Neurons facilitate parameter-efficient realignment. Targeted expansion (gradient-masked fine-tuning) of SS-Neurons, updating only the weights associated with the shared subset, can reduce ASR across NHR languages by >60% relative to prior methods while maintaining general language capabilities (MGSM, MMLU) (Zhang et al., 1 Feb 2026). Analogously, NLSR leverages SS-Neurons for non-gradient, patch-based recovery of safety in fine-tuning-corrupted LLMs, updating only about 0.5–1% of parameters; it restores safety with negligible or even improved downstream task performance and outperforms both SafeLoRA and fine-tuning defenses (Yi et al., 2024).
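Gradient-masked fine-tuning reduces to masking the gradient before the optimizer step. A minimal sketch, assuming a single weight matrix, a stand-in gradient, and hypothetical SS-Neuron row indices:

```python
import numpy as np

rng = np.random.default_rng(4)
d_mlp, d_model = 512, 64

# Gradient-masked update: only weight rows belonging to the shared safety
# neuron subset (hypothetical indices) receive gradient; all others stay frozen.
W = rng.normal(size=(d_mlp, d_model))
grad = rng.normal(size=(d_mlp, d_model))   # stand-in for a computed gradient

ss_neurons = [3, 17, 42]                   # ≈0.6% of rows
mask = np.zeros((d_mlp, 1))
mask[ss_neurons] = 1.0

lr = 1e-2
W_new = W - lr * (mask * grad)             # masked SGD step

# Rows outside the safety subset are bit-for-bit unchanged.
frozen = np.delete(np.arange(d_mlp), ss_neurons)
print(np.array_equal(W_new[frozen], W[frozen]))  # → True
```

In a real training loop the same boolean mask would be applied to every parameter tensor on each step, which is what keeps the update at ~0.5–1% of parameters.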
Interpretability and Monitoring: Pre-generation unsafe-output detection using SS-Neuron activations enables logistic classifier accuracy of ≈79.4% on held-out safety benchmarks, exceeding baseline rates and enabling statistical safeguards pre-inference (Chen et al., 2024).
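A pre-generation detector of this shape is just a logistic classifier over the monitored activations. The sketch below trains one from scratch on synthetic, linearly separable features (the feature dimension, class shift, and learning schedule are all assumptions, not the benchmark setup):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 400, 12   # d = number of monitored SS-Neuron activations (hypothetical)

# Synthetic pre-generation features: SS-Neuron activations shift on unsafe prompts.
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 0.8   # unsafe prompts raise safety-neuron activations on average

# Plain logistic regression trained with full-batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
    w -= 0.1 * X.T @ (p - y) / n             # gradient of the log-loss
    b -= 0.1 * (p - y).mean()

acc = (((X @ w + b) > 0).astype(float) == y).mean()
print(round(acc, 3))   # well above the 50% chance baseline
```

Because the classifier reads activations before any tokens are generated, it can gate generation rather than filter outputs after the fact.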
6. Theoretical and Practical Implications
The concentration of safety-aligned behavior in a minimal subset of shared neurons constitutes both a centralization risk—creating a single point of failure exploitable by adversaries—and an opportunity for efficient, high-precision alignment. Overlap with helpfulness neurons introduces an "alignment tax," as the same neuron set must mediate conflicting activation patterns for safety and assistive behaviors, and forced sharing of patterns can degrade one dimension when enhancing the other (Chen et al., 2024).
Defense strategies emphasize diffusion of safety signals (multi-objective alignment), architectural diversification (Mixture-of-Experts, random sublayer gating), and runtime or attestation-based monitoring of the SS-Neuron set to prevent targeted manipulation (Wu et al., 15 Sep 2025).
Practical guidelines for SS-Neuron-based safety alignment are reproducible: identify MS- and SS-Neurons via activation contrasting and causal scoring, construct parallel multilingual supervision, apply gradient masks (or patch-based copying), and monitor ASR and benchmark scores to validate effectiveness (Zhang et al., 1 Feb 2026).
7. Limitations and Open Questions
Current methodologies are well-established for "fundamental safety" (violence, illegal content), but extension to culturally nuanced or reasoning-aligned domains remains an active research direction (Zhang et al., 1 Feb 2026). Translation-based parallel supervision can introduce artifacts; direct multilingual or unsupervised methodologies may provide more robust alignment for diverse languages and tasks.
A plausible implication is that the presence of robust, transferable SS-Neurons could serve as a diagnostic for the mechanistic interpretability and alignment potential of contemporary LLMs, but also poses regulatory and operational challenges as model weights and architecture converge across deployment environments.
Key References:
- "NeuroStrike: Neuron-Level Attacks on Aligned LLMs" (Wu et al., 15 Sep 2025)
- "Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons" (Zhang et al., 1 Feb 2026)
- "NLSR: Neuron-Level Safety Realignment of LLMs Against Harmful Fine-Tuning" (Yi et al., 2024)
- "Finding Safety Neurons in LLMs" (Chen et al., 2024)