TaylorDecomp-FFN: Interpreting FFN Contributions
- The paper introduces TaylorDecomp-FFN, a method that decomposes Transformer FFN layers into a sum of interpretable sub-updates, clarifying each neuron's impact on token predictions.
- It applies a principled Taylor approximation to attribute changes in output distributions to individual FFN neurons, demonstrating improved zero-shot toxicity reduction and inference efficiency.
- The approach enables direct neuron-level interventions and early exit strategies, offering actionable insights to control semantic and syntactic behavior in language models.
TaylorDecomp-FFN is a principled method for decomposing the effect of Transformer feed-forward network (FFN) layers into a sum of interpretable sub-updates in vocabulary space, enabling a fine-grained, mechanistic understanding of how each FFN neuron contributes to LLM output. This approach formalizes the additive impact of FFN layers on token prediction distributions and allows for semantic and syntactic interpretation of model internals as well as direct interventions in model behavior, including zero-shot toxicity reduction and inference-time efficiency improvements (Geva et al., 2022).
1. Mathematical Formulation of FFN Decomposition
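The TaylorDecomp-FFN methodology operates on the standard Transformer FFN sub-layer for a given layer $\ell$. Let $x^\ell \in \mathbb{R}^d$ denote the hidden state of a token, and $E \in \mathbb{R}^{|V| \times d}$ the embedding/output projection (with $|V|$ the vocabulary size and $e_w$ the row of $E$ for token $w$). The model’s output distribution is $p^\ell = \mathrm{softmax}(E x^\ell)$, with $p^\ell_w \propto \exp(e_w \cdot x^\ell)$ for each token $w$.
The FFN computation is:
- $m^\ell = f(W_K^\ell x^\ell)$, where $f$ is a nonlinear activation (e.g., ReLU, GeLU)
- $o^\ell = W_V^\ell m^\ell$ (post-layernorm convention)
The next-token logits are $E x^\ell$ and $E \tilde{x}^\ell$ with $\tilde{x}^\ell = x^\ell + o^\ell$. The resultant update in output distribution is $p^\ell \to \tilde{p}^\ell = \mathrm{softmax}(E \tilde{x}^\ell)$. The key insight is that $o^\ell$ is a linear combination of value vectors $v_i^\ell$ (columns of $W_V^\ell \in \mathbb{R}^{d \times d_m}$), each weighted by a coordinate $m_i^\ell$:
$$o^\ell = \sum_{i=1}^{d_m} m_i^\ell \, v_i^\ell$$
Projecting this into vocabulary space:
$$E \tilde{x}^\ell = E x^\ell + \sum_{i=1}^{d_m} m_i^\ell \, E v_i^\ell$$
where $m_i^\ell \, E v_i^\ell$ is defined as the $i$th sub-update.
If one insists on a local (first-order) Taylor approximation of the distributional update:
$$\tilde{p}^\ell_w \approx p^\ell_w \Big( 1 + \Delta_w - \sum_{w'} p^\ell_{w'} \Delta_{w'} \Big), \qquad \Delta = \sum_{i=1}^{d_m} m_i^\ell \, E v_i^\ell$$
but exact, non-linear softmax updates are preferred in practice.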
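As a numerical check of the formulation above, the following minimal Python sketch builds a toy FFN, verifies that the post-FFN logits equal the pre-FFN logits plus the sum of per-neuron sub-updates, and compares the first-order Taylor estimate with the exact softmax. All sizes, weights, and helper names (`matvec`, `softmax`, `relu`) are illustrative assumptions; ReLU is picked as the activation and layer normalization is left out for simplicity.

```python
import math

def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def relu(z):
    return [max(0.0, v) for v in z]

# Toy sizes (hypothetical): |V| = 4 tokens, d = 3, d_m = 2 FFN neurons.
E = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]           # rows e_w, shape |V| x d
W_K = [[0.5, -0.2, 0.1],
       [0.3, 0.4, -0.6]]        # key rows, shape d_m x d
values = [[0.2, 0.0, -0.1],
          [-0.3, 0.5, 0.0]]     # value vectors v_i (columns of W_V), each in R^d
x = [1.0, 0.5, -0.5]            # hidden state before the FFN

m = relu(matvec(W_K, x))        # coefficients m_i
o = [sum(m[i] * values[i][j] for i in range(len(m))) for j in range(len(x))]
x_new = [a + b for a, b in zip(x, o)]   # residual update, no layer norm here

# Exact decomposition in vocabulary space: E x_new = E x + sum_i m_i * (E v_i).
r = [matvec(E, v) for v in values]      # projections E v_i
logits = matvec(E, x)
delta = [sum(m[i] * r[i][w] for i in range(len(m))) for w in range(len(E))]
logits_new = matvec(E, x_new)

# First-order Taylor estimate of the updated distribution versus the exact one.
p = softmax(logits)
mean_delta = sum(pw * dw for pw, dw in zip(p, delta))
p_taylor = [pw * (1.0 + dw - mean_delta) for pw, dw in zip(p, delta)]
p_exact = softmax(logits_new)
```

The logit decomposition holds exactly (up to floating-point error) because the projection is linear. The Taylor estimate is only approximate: on this toy input its largest gap to the exact probabilities is below 0.01, and the gap grows as the sub-updates get larger.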
2. Concept Neurons and Interpretability
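Each FFN neuron (column vector $v_i^\ell$ in $W_V^\ell$) induces a concept vector $r_i^\ell = E v_i^\ell \in \mathbb{R}^{|V|}$, effectively ranking vocabulary elements. The magnitude $m_i^\ell$ dynamically modulates the contribution of this concept for the given input $x^\ell$, resulting in a per-token additive update to the logits:
$$(E \tilde{x}^\ell)_w = (E x^\ell)_w + \sum_i m_i^\ell \, r_{i,w}^\ell$$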
Inspection of the top elements in $r_i^\ell$ shows that many neurons specialize in promoting coherent sets of vocabulary tokens corresponding to semantic (“breakfast foods,” “animal names,” “pronouns”) or syntactic (“-ed” verb forms, determiners) concepts. Thus, TaylorDecomp-FFN enables per-neuron analysis and “naming” of latent concepts within model parameters.
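The inspection step can be illustrated with a short Python sketch that projects one value vector onto the vocabulary and lists its highest-scoring tokens. The vocabulary, embeddings, value vector, and the helper name `top_tokens` are hypothetical and chosen only to make the ranking easy to follow.

```python
def top_tokens(E, v, vocab, k=3):
    """Return the k (token, score) pairs with the highest scores in E v."""
    scores = [sum(a * b for a, b in zip(row, v)) for row in E]
    ranked = sorted(range(len(vocab)), key=lambda w: scores[w], reverse=True)
    return [(vocab[w], scores[w]) for w in ranked[:k]]

# Hypothetical toy vocabulary with 2-dimensional embeddings.
vocab = ["pancake", "waffle", "toast", "dog", "cat"]
E = [[0.9, 0.1],
     [0.8, 0.0],
     [0.7, 0.2],
     [0.0, 0.9],
     [0.1, 0.8]]

# A made-up value vector aligned with the "breakfast food" direction.
v_food = [1.0, -0.5]

for token, score in top_tokens(E, v_food, vocab):
    print(f"{token}: {score:.2f}")
```

Here the three top-ranked tokens are the breakfast items, so an annotator might label this neuron "breakfast foods". In a real model the same ranking would be computed from the trained embedding matrix and FFN value vectors.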
3. Practical Algorithm and Computational Efficiency
The decomposition enables a computationally efficient, interpretable workflow. The process involves:
- Precomputation: For each layer $\ell$ and neuron $i$, compute $r_i^\ell = E v_i^\ell$ and store the top tokens.
- Inference per-token:
  - Compute $m^\ell = f(W_K^\ell x^\ell)$.
  - Select the indices $i$ of the dominant sub-updates, e.g. those with largest $|m_i^\ell|$.
  - Accumulate $m_i^\ell r_i^\ell$ for chosen $i$.
  - Update logits and evaluate the new output distribution via softmax.
Efficiency notes:
- Only dominant (top-$k$) sub-updates need be materialized, reducing computation to $O(k\,|V|)$ per layer (versus $O(d_m\,|V|)$ for all $d_m$ sub-updates).
- Storage of full vectors can be avoided for many applications by caching only top token IDs and scores, facilitating concept “naming” without full vector materialization.
Pseudocode is provided in (Geva et al., 2022), enabling straightforward replication.
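The per-token steps above can be sketched in Python as follows. This is not the pseudocode from the cited paper but a minimal illustration under stated assumptions: sub-updates are ranked by the absolute coefficient, the vocabulary-space vectors are assumed to be precomputed, and the function name `dominant_update` and all numbers are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dominant_update(logits, m, concept_vectors, k):
    """Add only the k sub-updates with the largest |m_i| to the logits.

    Returns the updated logits and the chosen neuron indices. The cost is
    proportional to k * |V| because only k vocabulary-size vectors are added.
    """
    chosen = sorted(range(len(m)), key=lambda i: abs(m[i]), reverse=True)[:k]
    updated = list(logits)
    for i in chosen:
        for w, score in enumerate(concept_vectors[i]):
            updated[w] += m[i] * score
    return updated, chosen

# Toy example: 3 neurons, 4 vocabulary entries (all values hypothetical).
logits = [1.0, 0.5, -0.5, 1.5]
m = [0.35, 0.8, 0.01]                       # activation coefficients
R = [[0.2, 0.0, -0.1, 0.2],                 # precomputed projections E v_i
     [-0.3, 0.5, 0.0, 0.2],
     [0.4, -0.4, 0.3, 0.0]]

approx_logits, chosen = dominant_update(logits, m, R, k=2)
full_logits, _ = dominant_update(logits, m, R, k=len(m))

p_approx = softmax(approx_logits)
p_full = softmax(full_logits)
```

In this example the neuron with coefficient 0.01 is dropped, and the truncated logits differ from the full ones by at most 0.004 per token. How well truncation works on a trained model depends on how concentrated the coefficients are, which this sketch does not measure.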
4. Empirical Results: Control and Efficiency
Two main experimental results substantiate the utility of TaylorDecomp-FFN:
a) Zero-shot Toxicity Reduction in GPT-2:
By identifying 10 “harmless” concept neurons (those with non-toxic top tokens in $r_i^\ell$) and forcing their coefficients $m_i^\ell$ to a fixed value at every FFN layer, toxic outputs from GPT-2 are reduced by approximately 50%, as measured over 1,225 “hard” prompts (REALTOXICPROMPTS, Perspective API). Baseline self-debiasing approaches achieve only 13% reduction with the same parameter budget, and post-hoc word filters provide no reduction (0%). The associated perplexity increase is mild (≈ 15% relative). Performance metrics are averaged across 6 toxicity assessment criteria (Table 5 in Geva et al., 2022).
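The mechanics of such an intervention can be illustrated with a small Python sketch that clamps the coefficient of one selected neuron and compares the resulting distributions. The three-token vocabulary, the concept vectors, the clamp value of 3.0, and the helper names `intervene` and `apply_sub_updates` are all hypothetical; the sketch shows the logit arithmetic only and does not reproduce the reported experiment.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def intervene(m, promoted, value):
    """Return a copy of the coefficients with the promoted neurons clamped."""
    forced = list(m)
    for i in promoted:
        forced[i] = value
    return forced

def apply_sub_updates(logits, m, concept_vectors):
    """Add every sub-update m_i * r_i to the logits."""
    updated = list(logits)
    for coeff, r in zip(m, concept_vectors):
        for w, score in enumerate(r):
            updated[w] += coeff * score
    return updated

vocab = ["kind", "thanks", "awful"]          # hypothetical toy vocabulary
logits = [0.0, 0.0, 1.0]                     # the model initially favors "awful"
R = [[0.1, 0.0, 0.5],                        # neuron 0: promotes "awful"
     [1.0, 0.8, -0.5]]                       # neuron 1: a "harmless" concept
m = [0.4, 0.0]                               # neuron 1 is inactive on this input

p_before = softmax(apply_sub_updates(logits, m, R))
m_forced = intervene(m, promoted=[1], value=3.0)   # clamp value is an assumption
p_after = softmax(apply_sub_updates(logits, m_forced, R))
```

Clamping the harmless neuron shifts probability mass toward the tokens it promotes and away from "awful". The original coefficients are left untouched, so the intervention can be switched on and off at inference time without changing any weights.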
b) Early Exit via Dominant Sub-Updates (WIKILM):
A token is designated as “saturated” in layer $\ell$ if its final prediction is already rank 1 at that stage. Employing a nearest-neighbor rule to match dominant neuron clusters leads to “early-exit” with 94.1% exit accuracy and 20% average layer savings. This approach matches or outperforms logistic regression classifiers on hidden state or FFN outputs, but requires no further training (Table 6 in Geva et al., 2022).
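The saturation criterion can be made concrete with a short Python sketch that, given the logits read out after each layer, finds the earliest layer whose top-ranked token already equals the final prediction. This computes only the oracle label that an early-exit rule tries to predict; the nearest-neighbor matching itself is not implemented here, and the function names and per-layer logits are hypothetical.

```python
def argmax(z):
    """Index of the largest entry in a list."""
    return max(range(len(z)), key=lambda w: z[w])

def first_saturated_layer(per_layer_logits):
    """Earliest layer index whose top-1 token equals the final top-1 token.

    The last layer always qualifies, so a result is always found.
    """
    final_top = argmax(per_layer_logits[-1])
    return next(layer for layer, logits in enumerate(per_layer_logits)
                if argmax(logits) == final_top)

# Hypothetical logits over a 3-token vocabulary after each of 4 layers.
per_layer_logits = [
    [0.2, 0.1, 0.0],   # layer 0: token 0 on top
    [0.1, 0.5, 0.2],   # layer 1: token 1 takes the lead
    [0.0, 0.9, 0.3],   # layer 2
    [0.1, 1.4, 0.2],   # layer 3 (final): token 1 is predicted
]

exit_layer = first_saturated_layer(per_layer_logits)
layers_skipped = len(per_layer_logits) - 1 - exit_layer
```

In this toy case the token saturates at layer 1, so an ideal early exit would skip the remaining two layers. A stricter variant could additionally require that the token stays top-ranked in every later layer.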
5. Limitations and Directions for Extension
Several limitations and open avenues are identified:
- Application is restricted to decoder-only (autoregressive) LMs, namely GPT-2 and WIKILM. Encoder-only and masked-LM architectures (BERT/RoBERTa) remain unexplored.
- The decomposition treats value vectors individually; compositional interactions (pairwise or higher-order) may play a role, but the number of combinations to analyze grows combinatorially with the interaction order.
- Human annotation of concepts is nontrivial due to rare words, subwords, and world knowledge; current coverage may be underestimated.
- Interventions alter only the exposure of information (logit weighting), not the underlying model knowledge. For example, reducing toxic outputs suppresses, but does not erase, such content.
- Prospective work includes “editing” value vectors to remove undesirable concepts, or integrating adaptive re-scaling of sub-updates (curriculum schemes) during fine-tuning.
6. Conceptual and Methodological Significance
TaylorDecomp-FFN provides a granular, per-neuron analysis of prediction construction in Transformer models, converting opaque FFN parameterizations into additive, interpretable contributions in vocabulary space. This approach elucidates the architectural mechanisms behind concept promotion or demotion at each layer, enabling real-time, zero-shot interventions for both safety (e.g., toxicity reduction) and computational efficiency (e.g., early exit). The method offers a unifying framework for understanding and manipulating FFN-driven semantic and syntactic abstraction in neural LLMs (Geva et al., 2022).