TaylorDecomp-FFN: Interpreting FFN Contributions

Updated 26 January 2026

The paper introduces TaylorDecomp-FFN, a method that decomposes Transformer FFN layers into a sum of interpretable sub-updates, clarifying each neuron's impact on token predictions.
It applies a principled Taylor approximation to attribute changes in output distributions to individual FFN neurons, demonstrating improved zero-shot toxicity reduction and inference efficiency.
The approach enables direct neuron-level interventions and early exit strategies, offering actionable insights to control semantic and syntactic behavior in language models.

TaylorDecomp-FFN is a principled method for decomposing the effect of Transformer feed-forward network (FFN) layers into a sum of interpretable sub-updates in vocabulary space, enabling a fine-grained, mechanistic understanding of how each FFN neuron contributes to LLM output. This approach formalizes the additive impact of FFN layers on token prediction distributions and allows for semantic and syntactic interpretation of model internals as well as direct interventions in model behavior, including zero-shot toxicity reduction and inference-time efficiency improvements (Geva et al., 2022).

1. Mathematical Formulation of FFN Decomposition

The TaylorDecomp-FFN methodology operates on the standard Transformer FFN sub-layer for a given layer $\ell$ . Let $x^{\ell} \in \mathbb{R}^d$ denote the hidden state of a token, and $E \in \mathbb{R}^{|V| \times d}$ the embedding/output projection (with $|V|$ the vocabulary size). The model’s output distribution is $p^{\ell} = \operatorname{softmax}(E x^{\ell}) \in \mathbb{R}^{|V|}$ , with $p^\ell[w] \propto \exp(e_w^\top x^\ell)$ for each token $w$ .

The FFN computation is:

$m^\ell = \phi(W_1^\ell x^\ell + b_1^\ell) \in \mathbb{R}^{d_m}$ , where $\phi$ is a nonlinear activation (e.g., ReLU, GeLU)
$o^\ell = W_2^\ell m^\ell + b_2^\ell \in \mathbb{R}^d$
$x^{\ell+1} = x^\ell + \operatorname{LayerNorm}(o^\ell)$ (post-layernorm convention)

The next-token logits are $u^\ell = E x^\ell$ . Setting aside layer normalization and the bias $b_2^\ell$ , so that the update is linear in $o^\ell$ , the logits after the FFN are $u^{\ell+1} = u^\ell + \Delta u^\ell$ with $\Delta u^\ell = E o^\ell$ . The resulting update in the output distribution is $\Delta p^\ell = p^{\ell+1} - p^\ell$ . The key insight is that $o^\ell$ is then a linear combination of $d_m$ value vectors $v_i^\ell$ (columns of $W_2^\ell$ ), each weighted by a coordinate $m_i^\ell$ :

$o^\ell = \sum_{i=1}^{d_m} m_i^\ell v_i^\ell$

Projecting this into vocabulary space:

$\Delta u^\ell = E o^\ell = \sum_{i=1}^{d_m} m_i^\ell (E v_i^\ell) = \sum_{i=1}^{d_m} \Delta u_i^\ell$

where $\Delta u_i^\ell := m_i^\ell (E v_i^\ell)$ is defined as the $i$ th sub-update.
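The additivity of the sub-updates can be checked numerically. The following is a minimal NumPy sketch, assuming small random matrices in place of real model weights and omitting the bias $b_2^\ell$ and layer normalization; the function name `sub_updates` and all dimensions are illustrative assumptions, not part of the cited work.

```python
import numpy as np

def sub_updates(E, W2, m):
    """Return a (d_m, |V|) matrix whose i-th row is the sub-update m_i * (E v_i).

    E  : (|V|, d) output embedding matrix
    W2 : (d, d_m) second FFN matrix; column i is the value vector v_i
    m  : (d_m,)   activation coefficients
    """
    concept = (E @ W2).T          # row i is E v_i, shape (d_m, |V|)
    return m[:, None] * concept   # scale each projected value vector by m_i

rng = np.random.default_rng(0)
V, d, d_m = 50, 8, 32             # toy sizes, chosen only for illustration
E = rng.normal(size=(V, d))
W2 = rng.normal(size=(d, d_m))
m = np.maximum(rng.normal(size=d_m), 0.0)   # ReLU-like nonnegative coefficients

delta_u_parts = sub_updates(E, W2, m)       # one row per neuron
delta_u = E @ (W2 @ m)                      # full logit update E o, with o = W2 m
```

Summing `delta_u_parts` over its first axis should reproduce `delta_u` up to floating-point error, which is exactly the statement $\Delta u^\ell = \sum_i \Delta u_i^\ell$ .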

A local, first-order Taylor approximation of the distributional update is also available:

$\Delta p^\ell \approx J_{\operatorname{softmax}}(u^\ell) \Delta u^\ell = \sum_i J_{\operatorname{softmax}}(u^\ell) \Delta u_i^\ell$

where $J_{\operatorname{softmax}}(u^\ell)$ is the Jacobian of the softmax evaluated at $u^\ell$ . In practice, however, the exact nonlinear softmax update is preferred.
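To see why the exact update is usually preferred, the first-order approximation can be compared with the exact change in the softmax output. This is a hedged sketch on random toy logits; it uses the standard softmax Jacobian $\operatorname{diag}(p) - p p^\top$ , and the helper names and step sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.exp(u - np.max(u))
    return z / z.sum()

def softmax_jacobian(u):
    """Jacobian of softmax at u: diag(p) - p p^T."""
    p = softmax(u)
    return np.diag(p) - np.outer(p, p)

def approximation_error(u, delta_u):
    """Largest absolute gap between the exact and first-order updates."""
    exact = softmax(u + delta_u) - softmax(u)
    linear = softmax_jacobian(u) @ delta_u
    return float(np.max(np.abs(exact - linear)))

rng = np.random.default_rng(1)
u = rng.normal(size=20)            # toy logits over a 20-token vocabulary
direction = rng.normal(size=20)    # toy logit update direction

err_small = approximation_error(u, 1e-3 * direction)  # tiny logit update
err_large = approximation_error(u, 1.0 * direction)   # update of realistic size
```

The linearization is accurate only for small logit updates; for larger updates the error grows, which is consistent with computing the softmax exactly when the cost is acceptable.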

2. Concept Neurons and Interpretability

Each FFN neuron (column vector $v_i^\ell$ in $W_2^\ell$ ) induces a concept vector $r_i^\ell = E v_i^\ell \in \mathbb{R}^{|V|}$ , effectively ranking vocabulary elements. The magnitude $m_i^\ell$ dynamically modulates the contribution of this concept for the given input $x^\ell$ , resulting in a per-token additive update to the logits:

$\Delta u_i^\ell = m_i^\ell r_i^\ell$

Inspection of the top-scoring tokens in $r_i^\ell$ shows that many neurons specialize in promoting coherent sets of vocabulary tokens corresponding to semantic (“breakfast foods,” “animal names,” “pronouns”) or syntactic (past-tense “-ed” forms, determiners) concepts. Thus, TaylorDecomp-FFN enables per-neuron analysis and “naming” of latent concepts within model parameters.
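Concept inspection amounts to sorting the vocabulary by the entries of $r_i^\ell$ . Below is a minimal sketch on a hand-built toy vocabulary; the embedding values, the value vector, and the helper name `top_tokens` are all hypothetical and chosen only so the ranking is easy to follow.

```python
import numpy as np

def top_tokens(E, v, vocab, k=3):
    """Rank vocabulary tokens by the concept vector r = E v and return the top k."""
    r = E @ v                      # one score per vocabulary token
    top = np.argsort(-r)[:k]       # indices of the k largest scores
    return [(vocab[j], float(r[j])) for j in top]

# Toy vocabulary and 2-D embeddings (made up for this example).
vocab = ["eggs", "toast", "cereal", "he", "she", "the"]
E = np.array([
    [1.0, 0.0],    # eggs
    [0.9, 0.1],    # toast
    [0.8, 0.0],    # cereal
    [0.0, 1.0],    # he
    [0.0, 0.9],    # she
    [-0.5, -0.5],  # the
])
v = np.array([1.0, 0.0])           # hypothetical value vector

names = [tok for tok, _ in top_tokens(E, v, vocab)]
```

Here `names` lists the three food words, so a human annotator might label this toy neuron as a "breakfast foods" concept.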

3. Practical Algorithm and Computational Efficiency

The decomposition enables a computationally efficient, interpretable workflow. The process involves:

Precomputation: For each layer and neuron, compute $r_i^\ell = E v_i^\ell$ and store the top $K$ tokens.
Inference per-token:
- Compute $m^\ell = \phi(W_1^\ell x^\ell + b_1^\ell)$ .
- Select the $k$ indices with largest $|m_i^\ell|$ for dominant sub-updates.
- Accumulate $\Delta u^\ell = \sum_{i \in S} m_i^\ell r_i^\ell$ for chosen $S$ .
- Update logits and evaluate the new output distribution via softmax.

Efficiency notes:

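Only dominant (top- $k$ ) sub-updates need to be materialized, reducing the cost of forming the logit update to $\mathcal{O}(k|V|)$ per layer and token, rather than the $\mathcal{O}(d_m|V|)$ needed to materialize every sub-update.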
Storage of full $r_i^\ell$ vectors can be avoided for many applications by caching only top token IDs and scores, facilitating concept “naming” without full vector materialization.

Pseudocode is provided in (Geva et al., 2022), enabling straightforward replication.
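The per-token inference steps can be sketched as follows. This is an illustrative NumPy version, not the pseudocode from the cited paper: it assumes a ReLU activation, omits the bias $b_2^\ell$ and layer normalization, and stores the full concept matrix `R`; the function name `topk_logit_update` and the toy dimensions are assumptions.

```python
import numpy as np

def topk_logit_update(x, W1, b1, R, k):
    """Approximate the FFN logit update using only the k dominant sub-updates.

    x  : (d,)        hidden state
    W1 : (d_m, d)    first FFN matrix
    b1 : (d_m,)      first FFN bias
    R  : (d_m, |V|)  precomputed concept vectors, row i is r_i = E v_i
    k  : number of neurons to keep
    """
    m = np.maximum(W1 @ x + b1, 0.0)     # activation coefficients (ReLU assumed)
    S = np.argsort(-np.abs(m))[:k]       # indices with the largest |m_i|
    delta_u = m[S] @ R[S]                # sum over i in S of m_i * r_i
    return delta_u, S

rng = np.random.default_rng(2)
V, d, d_m = 40, 8, 16                    # toy sizes for illustration only
E = rng.normal(size=(V, d))
W1 = rng.normal(size=(d_m, d))
b1 = rng.normal(size=d_m)
W2 = rng.normal(size=(d, d_m))
x = rng.normal(size=d)

R = (E @ W2).T                           # precomputation step
delta_k, S = topk_logit_update(x, W1, b1, R, k=4)
delta_full, _ = topk_logit_update(x, W1, b1, R, k=d_m)
```

With `k = d_m` the result coincides with the full update $E W_2^\ell m^\ell$ ; smaller `k` trades fidelity for the $\mathcal{O}(k|V|)$ cost noted above.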

4. Empirical Results: Control and Efficiency

Two main experimental results substantiate the utility of TaylorDecomp-FFN:

a) Zero-shot Toxicity Reduction in GPT-2:

By identifying 10 “harmless” concept neurons (those with non-toxic top tokens in $r_i^\ell$ ) and forcing $m_i^\ell \leftarrow +3$ at every FFN layer, toxic outputs from GPT-2 are reduced by approximately 50%, as measured over 1,225 “hard” prompts (REALTOXICPROMPTS, Perspective API). Baseline self-debiasing approaches achieve only 13% reduction with the same parameter budget, and post-hoc word filters provide 0%. The associated perplexity increase is mild (≈ 15% relative). Performance metrics are averaged across 6 toxicity assessment criteria (Table 5 in (Geva et al., 2022)).
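The intervention itself is a one-line override of selected coefficients before the second FFN matrix is applied. The sketch below shows the mechanics for a single layer on random toy weights, assuming a ReLU activation; the neuron indices in `forced_ids` are arbitrary placeholders, not the neurons selected in the cited experiments, and the function name is hypothetical.

```python
import numpy as np

def ffn_with_intervention(x, W1, b1, W2, b2, forced_ids=(), value=3.0):
    """Run one FFN layer, overriding m_i for the selected neurons.

    Returns the FFN output o and the coefficients m actually used.
    """
    m = np.maximum(W1 @ x + b1, 0.0)     # ReLU assumed for illustration
    m[list(forced_ids)] = value          # force m_i <- value for chosen neurons
    return W2 @ m + b2, m

rng = np.random.default_rng(3)
d, d_m = 8, 16                           # toy sizes
W1 = rng.normal(size=(d_m, d))
b1 = rng.normal(size=d_m)
W2 = rng.normal(size=(d, d_m))
b2 = rng.normal(size=d)
x = rng.normal(size=d)

forced_ids = [2, 5, 11]                  # placeholder indices, not from the paper
o_base, m_base = ffn_with_intervention(x, W1, b1, W2, b2)
o_forced, m_forced = ffn_with_intervention(x, W1, b1, W2, b2, forced_ids, 3.0)
```

The difference `o_forced - o_base` is exactly $\sum_{i \in S} (3 - m_i^\ell) v_i^\ell$ over the forced set $S$ , so the intervention adds a fixed multiple of the chosen concept vectors to the logits without changing any weights.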

b) Early Exit via Dominant Sub-Updates (WIKILM):

A token is designated as “saturated” at layer $\ell$ if the model's final prediction is already ranked first in the output distribution at that layer. Employing a nearest-neighbor rule to match dominant neuron clusters leads to “early-exit” with 94.1% exit accuracy and 20% average layer savings. This approach matches or outperforms logistic regression classifiers on hidden state or FFN outputs, but requires no further training (Table 6 in (Geva et al., 2022)).
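The saturation criterion can be illustrated with a small helper that finds the earliest layer from which the top-ranked token agrees with the final prediction. This sketch covers only the labeling of saturation, not the nearest-neighbor exit rule, and it additionally assumes the token must remain top-ranked through all later layers; the function name and the toy logits are hypothetical.

```python
import numpy as np

def saturation_layer(layer_logits):
    """Return the earliest layer index from which the top-1 token equals the
    final top-1 token and stays so through the last layer.

    layer_logits : (num_layers, |V|) logits read out after each layer
    """
    top1 = layer_logits.argmax(axis=1)
    matches = top1 == top1[-1]           # True where a layer agrees with the final prediction
    layer = len(matches) - 1
    while layer > 0 and matches[layer - 1]:
        layer -= 1                       # extend the agreeing suffix backwards
    return layer

# Toy logits for 5 layers over a 3-token vocabulary (made up for this example).
layer_logits = np.array([
    [2.0, 0.0, 1.0],   # top-1 is token 0
    [0.0, 1.0, 3.0],   # top-1 is token 2 (agrees, but not yet stable)
    [0.0, 5.0, 1.0],   # top-1 is token 1
    [1.0, 0.0, 4.0],   # top-1 is token 2
    [0.0, 1.0, 6.0],   # final prediction: token 2
])
sat = saturation_layer(layer_logits)
```

In this example the prediction is stable from layer index 3 onward, so one of the five layers could in principle be skipped for this token.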

5. Limitations and Directions for Extension

Several limitations and open avenues are identified:

Application is restricted to autoregressive, decoder-only LMs (GPT-2 and WIKILM). Encoder-only and masked-LM architectures (BERT/RoBERTa) remain unexplored.
The decomposition treats value vectors $v_i^\ell$ individually; compositional interactions (pairwise or higher-order) may play a role but their analysis is exponentially more complex.
Human annotation of concepts is nontrivial due to rare words, subwords, and world knowledge; current coverage may be underestimated.
Interventions alter only the exposure of information (logit weighting), not the underlying model knowledge. For example, reducing toxic outputs suppresses, but does not erase, such content.
Prospective work includes “editing” value vectors $v_i^\ell$ to remove undesirable concepts, or integrating adaptive re-scaling of sub-updates (curriculum schemes) during fine-tuning.

6. Conceptual and Methodological Significance

TaylorDecomp-FFN provides a granular, per-neuron analysis of prediction construction in Transformer models, converting opaque FFN parameterizations into additive, interpretable contributions in vocabulary space. This approach elucidates the architectural mechanisms behind concept promotion or demotion at each layer, enabling real-time, zero-shot interventions for both safety (e.g., toxicity reduction) and computational efficiency (e.g., early exit). The method offers a unifying framework for understanding and manipulating FFN-driven semantic and syntactic abstraction in neural LLMs (Geva et al., 2022).


References (1)

Geva et al. (2022). Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space.
