Value Neurons in Neural Networks
- Value neurons are neural units that quantitatively encode value-related information by exhibiting distinct activation patterns in response to value-eliciting prompts.
- Contrastive activation analysis, game-theoretic valuation (e.g., Shapley values), and variational-inference criteria identify and rank these critical units.
- Their targeted manipulation underpins applications in ethical alignment, model compression, and interpretability, with emerging roles in quantum neural computation.
A value neuron is a neural unit whose activity is quantitatively linked to the encoding, computation, or control of value-related information. In neuroscience and artificial neural network research, the term can denote two related, but distinct, concepts: (1) neurons encoding explicit value signals—such as in interpretable machine learning or in the context of value-laden concepts (e.g., ethics, reward, or attribution) in LLMs; and (2) neurons whose mathematical contribution or “value” to a model’s output can be quantified, such as via game-theoretic, information-theoretic, or variational criteria. Recent advances have expanded this notion to include value-critical units identifiable via contrastive mechanistic analysis, game-theoretic valuation (e.g., Shapley values), and task-aligned selection. Value neurons thus play central roles in mechanistic interpretability, model compression, ethical alignment, and quantum neural computation.
1. Psychological and Conceptual Foundations
Recent work anchors LLM value neurons in universal psychological constructs. The ValueLocate framework leverages Schwartz’s theory, which posits four meta-dimensions of human values: Openness to Change, Self-Transcendence, Conservation, and Self-Enhancement. Each dimension decomposes into sub-values and atomic constituents (e.g., Self-Direction → Creativity, Freedom) (Su et al., 23 May 2025). These value taxonomies can be programmatically operationalized in AI benchmarks and prompt engineering, enabling systematic study of how such human-aligned values are encoded or can be steered in large models.
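Such a taxonomy can be represented directly in code for benchmark and prompt construction. The sketch below is illustrative: the dictionary entries follow Schwartz's published sub-values, but the `value_prompt` template is hypothetical and not ValueLocate's actual wording.

```python
# Illustrative encoding of Schwartz's four meta-dimensions and a few
# of their sub-values (lists are not exhaustive).
SCHWARTZ_VALUES = {
    "Openness to Change": ["Self-Direction", "Stimulation"],
    "Self-Transcendence": ["Benevolence", "Universalism"],
    "Conservation": ["Security", "Tradition", "Conformity"],
    "Self-Enhancement": ["Achievement", "Power"],
}

def value_prompt(sub_value: str, affirm: bool = True) -> str:
    """Build one element of a contrastive prompt pair for a sub-value
    (hypothetical template; actual benchmark prompts differ)."""
    stance = "endorses" if affirm else "rejects"
    return f"Write a short statement that {stance} the value of {sub_value}."

pair = (value_prompt("Security", affirm=True),
        value_prompt("Security", affirm=False))
```

Generating such affirm/negate pairs per sub-value is what makes the contrastive activation analysis of the next section systematic rather than ad hoc.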
2. Quantifying Value Neurons in LLMs
The identification and control of value neurons within LLMs is addressed via the ValueLocate interpretability framework (Su et al., 23 May 2025). Value neurons are operationally defined as feedforward network (FFN) units within transformer layers whose activation statistics differ significantly in response to prompts structured to elicit (or negate) a target value.
The activation $a_i^{(l)}(x)$ of neuron $i$ in layer $l$ is computed for inputs $x$ drawn from two prompt sets, one reflecting a given value and one reflecting its antonym. Writing $P_i^{+}$ and $P_i^{-}$ for the probability that the neuron is active on the value-eliciting and antonym prompt sets respectively, the difference in activation probability,

$$\Delta P_i = P_i^{+} - P_i^{-},$$

permits mechanistic localization: neurons with $\Delta P_i > \tau$ (positive-value neurons) or $\Delta P_i < -\tau$ (negative-value neurons), for a suitable threshold $\tau$, are functionally value-critical. This contrastive activation analysis avoids the need for heavy attribution analysis and is scalable to modern LLMs.
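The contrastive criterion can be sketched as follows; the activation matrices, the firing rule (activation > 0), and the cutoff `tau` are illustrative assumptions rather than ValueLocate's exact settings.

```python
import numpy as np

def activation_probability(acts: np.ndarray) -> np.ndarray:
    """Fraction of prompts on which each neuron fires (activation > 0).
    acts has shape (n_prompts, n_neurons)."""
    return (acts > 0).mean(axis=0)

def find_value_neurons(acts_value, acts_antonym, tau=0.3):
    """Contrastive identification: neurons whose activation probability
    differs by more than tau between value and antonym prompt sets."""
    delta = activation_probability(acts_value) - activation_probability(acts_antonym)
    pos = np.where(delta > tau)[0]   # fire more on value-eliciting prompts
    neg = np.where(delta < -tau)[0]  # fire more on antonym prompts
    return pos, neg, delta

# Toy demo: 100 prompts x 8 neurons; neuron 0 is made value-selective.
rng = np.random.default_rng(0)
acts_v = rng.normal(0.0, 1.0, (100, 8)); acts_v[:, 0] += 2.0
acts_a = rng.normal(0.0, 1.0, (100, 8)); acts_a[:, 0] -= 2.0
pos, neg, delta = find_value_neurons(acts_v, acts_a)
```

The key point is that only two forward passes per prompt set are needed, which is what makes the method cheap relative to gradient-based attribution.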
Causal editing of identified value neurons, by scaling their activations during inference, enables precise control of the values expressed in generated output (e.g., amplifying or inverting “Conservation” or “Self-Enhancement”), with empirical evidence supporting causal influence. Only a small fraction of neurons per layer is typically tagged as value-related, and these cluster in the middle layers (e.g., around layer 15 in Llama-3.1-8B).
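A minimal sketch of such inference-time steering: `steer_activations` is a hypothetical helper that rescales a hidden-state vector at the identified indices (in a real LLM this would run inside a forward hook on the FFN), and the scaling scheme is illustrative.

```python
import numpy as np

def steer_activations(h, pos_idx, neg_idx, alpha=2.0):
    """Inference-time value steering (illustrative): amplify the
    activations of positive-value neurons and dampen negative-value
    neurons; alpha > 1 strengthens the target value."""
    h = np.asarray(h, dtype=float).copy()
    h[..., pos_idx] *= alpha  # amplify value-aligned units
    h[..., neg_idx] /= alpha  # suppress value-opposed units
    return h

h = np.ones(6)  # toy hidden state
steered = steer_activations(h, pos_idx=[1], neg_idx=[4], alpha=2.0)
```

Inverting a value would correspond to swapping the roles of `pos_idx` and `neg_idx`; varying `alpha` is what produces the dose-response behavior discussed below.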
3. Game-Theoretic and Probabilistic Valuation in Deep Networks
A complementary perspective defines “value” in terms of quantitative importance to model performance. Two principal approaches arise:
- Shapley Value Framework: Each neuron’s contribution is measured as its average marginal impact on a chosen performance metric (e.g., accuracy) across all possible coalitions (subsets) of neurons. The Shapley value of neuron $i$,

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ v(S \cup \{i\}) - v(S) \right],$$

where $N$ is the set of all neurons and $v(\cdot)$ the performance metric, captures not just individual activation but higher-order dependencies (Adamczewski et al., 2019, Ghorbani et al., 2020).
- Variational Importance-Switch: A probabilistic “switch” $\mathbf{s}$ (a sparse, Dirichlet-distributed vector) scales each neuron in a layer, and variational inference maximizes the evidence lower bound (ELBO) to learn the most important neurons (Adamczewski et al., 2019). The resulting posterior expectation $\mathbb{E}[s_i]$ can be taken as a soft value of neuron $i$.
Both methods yield highly correlated neuron importance rankings; empirical studies show that the top-k neurons selected by Shapley value or by the importance switch largely overlap in both low-level and high-level CNN layers, cross-validating the two frameworks (Adamczewski et al., 2019). Removing the highest-value neurons identified by either method degrades the model rapidly (e.g., removing 30 critical filters from Inception-v3 reduces accuracy to near chance) (Ghorbani et al., 2020).
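Since exact Shapley computation sums over exponentially many coalitions, practical work relies on sampling. The sketch below is a generic permutation-sampling estimator (not the papers' exact algorithms), shown on a toy additive performance metric where the true Shapley values are known.

```python
import numpy as np

def shapley_estimate(value_fn, n_players, n_perms=200, seed=0):
    """Monte Carlo Shapley estimate: average each player's marginal
    contribution to value_fn over random orderings of the players."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_players)
    for _ in range(n_perms):
        coalition = set()
        prev = value_fn(coalition)          # v(empty set)
        for i in rng.permutation(n_players):
            coalition.add(int(i))
            cur = value_fn(coalition)
            phi[i] += cur - prev            # marginal contribution of i
            prev = cur
    return phi / n_perms

# Toy "performance metric": neuron 0 contributes 0.5, neuron 1 contributes 0.2.
perf = lambda S: 0.5 * (0 in S) + 0.2 * (1 in S)
phi = shapley_estimate(perf, n_players=3)
```

For this additive game every permutation yields identical marginals, so the estimate recovers the exact values (0.5, 0.2, 0.0); for real networks, `value_fn` would be a model-evaluation call, which is the dominant cost.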
4. Mechanistic and Causal Analysis
Targeted manipulation of value neurons provides direct evidence for their causal role in encoding and controlling model behavior. In LLMs, amplifying positive-value neurons and suppressing negative-value neurons—using dynamic scaling or zeroing based on activation difference thresholds—not only shifts model outputs along the intended value dimension but does so with fine granularity, as demonstrated by near-linear dose–response relationships (Su et al., 23 May 2025).
In vision models, the Neuron Shapley approach enables fine-grained intervention: ablating class-specific value neurons selectively impairs the associated predictions, while ablating bias-inducing neurons mitigates unfairness or adversarial vulnerability, with overall accuracy preserved when pruning is done judiciously (Ghorbani et al., 2020).
5. Applications: Alignment, Compression, and Interpretability
Value neurons are central to several applied domains:
- Ethical and Value Alignment: By localizing value neurons and editing activations at inference, one can steer LLMs to express, amplify, suppress, or invert specific value-laden attitudes, measured via scenario-based benchmarks (e.g., the ValueInsight dataset) (Su et al., 23 May 2025).
- Model Compression: Value quantification enables principled pruning. Structured removal of low-importance neurons—by Shapley or variational criteria—yields compressed models with negligible or no drop in accuracy (70%+ reduction in parameters for LeNet-5 on MNIST, 50%+ in VGG-16 on CIFAR-10) (Adamczewski et al., 2019).
- Robustness and Fairness Repair: Targeted ablation of neurons with negative impact on fairness or robustness rapidly improves group accuracy or adversarial resistance, with minimal accuracy tradeoff (Ghorbani et al., 2020).
- Interpretability: Visualizations of top-value neurons reveal that high-importance units encode class-discriminative patterns, texture primitives, or value-aligned semantic distinctions.
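A compression step driven by such importance scores can be sketched as follows; `prune_low_value`, the scores, and the keep fraction are illustrative, not the papers' pipeline.

```python
import numpy as np

def prune_low_value(W, scores, keep_frac=0.5):
    """Structured pruning sketch: keep only the top-ranked output units
    (rows of weight matrix W) according to an importance score, e.g. a
    Shapley value or a variational switch expectation."""
    k = max(1, int(round(keep_frac * len(scores))))
    keep = np.argsort(scores)[::-1][:k]        # indices of top-k units
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return W[mask], mask

scores = np.array([0.9, 0.1, 0.5, 0.2])        # hypothetical neuron values
W = np.eye(4)                                  # toy layer weights
W_pruned, mask = prune_low_value(W, scores, keep_frac=0.5)
```

In practice the following layer's input dimension must be reduced with the same mask; the returned boolean `mask` is what propagates that bookkeeping.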
6. Specialized Constructions: Multi-Valued Quantum Neurons
In quantum neural networks, a distinct class of value neurons arises. Multi-Valued Quantum Neurons (MVQNs) represent and compute with truth values mapped to the $k$-th roots of unity on the complex unit circle (AlMasri, 2023). Each neuron performs threshold logic over this set, encoding inputs, weights, and outputs as complex phases. Activation is implemented via phase normalization, mapping the weighted sum $z$ to the root of unity of its angular sector:

$$P(z) = e^{2\pi i j / k} \quad \text{if} \quad \frac{2\pi j}{k} \le \arg z < \frac{2\pi (j+1)}{k}.$$
Weight updates correspond to motion along the unit circle; MVQNs demonstrate fast convergence and high functional expressiveness (e.g., solving XOR and other multi-level logic problems in a single layer), and physical realizations are proposed via the orbital angular momentum (OAM) of light or multi-level spin qudits.
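A minimal sketch of such a sector-quantized activation, assuming the standard multi-valued-neuron convention that $\arg z$ is mapped to the root of unity of its sector (the paper's exact activation and zero-input convention may differ):

```python
import cmath

def mvqn_activation(z, k):
    """Map a complex weighted sum z to the k-th root of unity whose
    angular sector [2*pi*j/k, 2*pi*(j+1)/k) contains arg(z)."""
    if z == 0:
        return 1 + 0j  # assumed convention: zero input maps to root 1
    theta = cmath.phase(z) % (2 * cmath.pi)    # phase in [0, 2*pi)
    sector = int(theta // (2 * cmath.pi / k)) % k
    return cmath.exp(2j * cmath.pi * sector / k)

# k = 4: outputs lie in {1, i, -1, -i}
y1 = mvqn_activation(1 + 1j, k=4)   # arg = pi/4, first sector
y2 = mvqn_activation(-1 + 0j, k=4)  # arg = pi, third sector
```

Because inputs, weights, and outputs all live on the unit circle, "learning" reduces to rotating weight phases, which is the geometric picture behind the fast convergence claims.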
7. Limitations and Future Directions
The computational burden for rigorous value assignment remains significant—especially for Shapley-based evaluation, which may require hours of evaluation time on large models (Ghorbani et al., 2020). Most current frameworks address global or dataset-level neuron valuations rather than per-example or context-dependent explanations. There is scope for research into more efficient estimation, integration of per-example analysis, and extension to other model primitives (e.g., attention heads).
A plausible implication is that as models and applications increase in complexity, automated identification and control of functionally critical value neurons will be essential for scalable interpretability, ethical alignment, and robust deployment. Continued development in mechanistic and game-theoretic neuron quantification will likely drive both theoretical advances and practical tools in responsible machine learning.