
TaylorDecomp-FFN: Interpreting FFN Contributions

Updated 26 January 2026
  • The paper introduces TaylorDecomp-FFN, a method that decomposes Transformer FFN layers into a sum of interpretable sub-updates, clarifying each neuron's impact on token predictions.
  • It applies a principled Taylor approximation to attribute changes in output distributions to individual FFN neurons, demonstrating improved zero-shot toxicity reduction and inference efficiency.
  • The approach enables direct neuron-level interventions and early exit strategies, offering actionable insights to control semantic and syntactic behavior in language models.

TaylorDecomp-FFN is a principled method for decomposing the effect of Transformer feed-forward network (FFN) layers into a sum of interpretable sub-updates in vocabulary space, enabling a fine-grained, mechanistic understanding of how each FFN neuron contributes to LLM output. This approach formalizes the additive impact of FFN layers on token prediction distributions and allows for semantic and syntactic interpretation of model internals as well as direct interventions in model behavior, including zero-shot toxicity reduction and inference-time efficiency improvements (Geva et al., 2022).

1. Mathematical Formulation of FFN Decomposition

The TaylorDecomp-FFN methodology operates on the standard Transformer FFN sub-layer for a given layer $\ell$. Let $x^{\ell} \in \mathbb{R}^d$ denote the hidden state of a token, and $E \in \mathbb{R}^{|V| \times d}$ the embedding/output projection (with $|V|$ the vocabulary size). The model's output distribution is $p^{\ell} = \operatorname{softmax}(E x^{\ell}) \in \mathbb{R}^{|V|}$, with $p^\ell[w] \propto \exp(e_w^\top x^\ell)$ for each token $w$, where $e_w$ is the row of $E$ for $w$.

The FFN computation is:

  • $m^\ell = \phi(W_1^\ell x^\ell + b_1^\ell) \in \mathbb{R}^{d_m}$, where $\phi$ is a nonlinear activation (e.g., ReLU, GeLU)
  • $o^\ell = W_2^\ell m^\ell + b_2^\ell \in \mathbb{R}^d$
  • $x^{\ell+1} = x^\ell + \operatorname{LayerNorm}(o^\ell)$ (post-layernorm convention)

The next-token logits are $u^\ell = E x^\ell$. Neglecting the LayerNorm rescaling and the bias $b_2^\ell$, so that the FFN acts additively on the logits, $u^{\ell+1} = u^\ell + \Delta u^\ell$ with $\Delta u^\ell = E o^\ell$. The resulting update in the output distribution is $\Delta p^\ell = p^{\ell+1} - p^\ell$. The key insight is that $o^\ell$ is a linear combination of $d_m$ value vectors $v_i^\ell$ (columns of $W_2^\ell$), each weighted by a coordinate $m_i^\ell$:

$$o^\ell = \sum_{i=1}^{d_m} m_i^\ell v_i^\ell$$

Projecting this into vocabulary space:

$$\Delta u^\ell = E o^\ell = \sum_{i=1}^{d_m} m_i^\ell (E v_i^\ell) = \sum_{i=1}^{d_m} \Delta u_i^\ell$$

where $\Delta u_i^\ell := m_i^\ell (E v_i^\ell)$ is defined as the $i$th sub-update.
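The additivity of the sub-updates can be checked numerically. The following is a minimal NumPy sketch with random toy weights, assuming a ReLU activation and omitting the bias $b_2^\ell$ and the LayerNorm; the sizes and variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_m, vocab = 8, 32, 50           # toy sizes, chosen for illustration only

E = rng.normal(size=(vocab, d))     # output embedding, shape |V| x d
W1 = rng.normal(size=(d_m, d))
b1 = rng.normal(size=d_m)
W2 = rng.normal(size=(d, d_m))      # columns are the value vectors v_i
x = rng.normal(size=d)

m = np.maximum(W1 @ x + b1, 0.0)    # coefficients m_i (ReLU assumed)
o = W2 @ m                          # FFN output with the bias b_2 left out

# Row i holds the sub-update delta_u_i = m_i * (E v_i).
sub_updates = m[:, None] * (E @ W2).T     # shape (d_m, |V|)

delta_u = E @ o                           # full logit update
recombined = sub_updates.sum(axis=0)      # sum of all sub-updates
print(np.allclose(recombined, delta_u))
```

Because the projection by $E$ is linear, the recombined sub-updates agree with the full logit update up to floating-point error.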

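A local (first-order) Taylor approximation of the distributional update is obtained by linearizing the softmax around $u^\ell$:

$$\Delta p^\ell \approx J_{\operatorname{softmax}}(u^\ell)\, \Delta u^\ell = \sum_i J_{\operatorname{softmax}}(u^\ell)\, \Delta u_i^\ell$$

where $J_{\operatorname{softmax}}(u) = \operatorname{diag}(p) - p p^\top$ with $p = \operatorname{softmax}(u)$. The linearization is accurate only for small $\Delta u^\ell$, so exact, non-linear softmax updates are preferred in practice.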
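To illustrate why the exact update is preferred, the sketch below compares the first-order estimate with the exact change in the softmax output for a small and a large logit update. It uses the standard softmax Jacobian $\operatorname{diag}(p) - p p^\top$; the logits are random toy values and the scale factors are arbitrary choices for illustration.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())          # subtract the max for numerical stability
    return z / z.sum()

def softmax_jacobian(u):
    p = softmax(u)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(1)
u = rng.normal(size=20)                      # toy logits
delta_small = 1e-3 * rng.normal(size=20)     # small logit update
delta_large = 1e3 * delta_small              # same direction, much larger

def linearization_error(delta_u):
    exact = softmax(u + delta_u) - softmax(u)
    linear = softmax_jacobian(u) @ delta_u
    return np.abs(exact - linear).max()

small_err = linearization_error(delta_small)
large_err = linearization_error(delta_large)
print(small_err, large_err)
```

The error grows with the size of the update, which is consistent with using the linearization only as a local attribution tool.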

2. Concept Neurons and Interpretability

Each FFN neuron (column vector $v_i^\ell$ in $W_2^\ell$) induces a concept vector $r_i^\ell = E v_i^\ell \in \mathbb{R}^{|V|}$, effectively ranking vocabulary elements. The coefficient $m_i^\ell$ dynamically modulates the contribution of this concept for the given input $x^\ell$, resulting in a per-token additive update to the logits:

$$\Delta u_i^\ell = m_i^\ell r_i^\ell$$

Inspection of the top elements in $r_i^\ell$ shows that many neurons specialize in promoting coherent sets of vocabulary tokens corresponding to semantic (“breakfast foods,” “animal names,” “pronouns”) or syntactic (past-tense “-ed” forms, determiners) concepts. Thus, TaylorDecomp-FFN enables per-neuron analysis and “naming” of latent concepts within model parameters.
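As a toy illustration of concept naming, the sketch below ranks a seven-word vocabulary by the scores in $r = E v$. The vocabulary, the hand-built 3-dimensional embeddings, and the two value vectors are all hypothetical and were constructed so that the ranking is easy to follow.

```python
import numpy as np

vocab = ["eggs", "toast", "bacon", "dog", "cat", "the", "a"]

# Hypothetical embeddings: axis 0 ~ food, axis 1 ~ animal, axis 2 ~ determiner.
E = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.9],
])

def top_tokens(v, k=3):
    """Return the k tokens with the highest scores in r = E v."""
    r = E @ v
    order = np.argsort(-r)[:k]
    return [vocab[j] for j in order]

v_food = np.array([1.0, 0.0, 0.0])    # hypothetical "breakfast food" neuron
v_det = np.array([0.0, -0.2, 1.0])    # hypothetical "determiner" neuron

print(top_tokens(v_food))         # ['eggs', 'toast', 'bacon']
print(top_tokens(v_det, k=2))     # ['the', 'a']
```

In a real model the top tokens would be read off the learned $E$ and $W_2^\ell$, and a human annotator would judge whether they form a coherent concept.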

3. Practical Algorithm and Computational Efficiency

The decomposition enables a computationally efficient, interpretable workflow. The process involves:

  1. Precomputation: For each layer and neuron, compute $r_i^\ell = E v_i^\ell$ and store the top $K$ tokens.
  2. Inference per-token:
    • Compute $m^\ell = \phi(W_1^\ell x^\ell + b_1^\ell)$.
    • Select the $k$ indices with largest $|m_i^\ell|$ for dominant sub-updates.
    • Accumulate $\Delta u^\ell = \sum_{i \in S} m_i^\ell r_i^\ell$ for the chosen index set $S$.
    • Update logits and evaluate the new output distribution via softmax.
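The steps above can be sketched as follows. This is a minimal NumPy version under stated assumptions: a ReLU activation, no bias $b_2^\ell$ or LayerNorm, random toy weights, and a hypothetical helper name `dominant_logit_update`. It is not the pseudocode from the paper.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - u.max())
    return z / z.sum()

def dominant_logit_update(x, W1, b1, R, k):
    """R has shape (d_m, |V|); row i is the precomputed r_i = E v_i."""
    m = np.maximum(W1 @ x + b1, 0.0)     # activations (ReLU assumed)
    S = np.argsort(-np.abs(m))[:k]       # indices of the k largest |m_i|
    delta_u = m[S] @ R[S]                # sum over i in S of m_i * r_i
    return S, delta_u

rng = np.random.default_rng(0)
d, d_m, vocab = 8, 32, 50                # toy sizes
E = rng.normal(size=(vocab, d))
W1 = rng.normal(size=(d_m, d))
b1 = rng.normal(size=d_m)
W2 = rng.normal(size=(d, d_m))
x = rng.normal(size=d)

R = (E @ W2).T                           # precomputation step
S, delta_u = dominant_logit_update(x, W1, b1, R, k=5)
p_next = softmax(E @ x + delta_u)        # updated output distribution

# With k = d_m the truncated update equals the full update E o.
_, full_update = dominant_logit_update(x, W1, b1, R, k=d_m)
```

Choosing a smaller `k` trades fidelity to the full update for less work per token.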

Efficiency notes:

  • Only dominant (top-$k$) sub-updates need be materialized, reducing computation to $\mathcal{O}(k|V|)$.
  • Storage of full $r_i^\ell$ vectors can be avoided for many applications by caching only top token IDs and scores, facilitating concept “naming” without full vector materialization.
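The caching idea in the last note might look like the sketch below, which keeps only the top-$K$ token IDs and scores per neuron. The function name `build_topk_cache` and the toy sizes are assumptions for illustration; a memory-conscious version would compute the projection in chunks rather than all at once.

```python
import numpy as np

def build_topk_cache(E, W2, K):
    """Return (ids, scores), each of shape (d_m, K), sorted by descending score."""
    R = (E @ W2).T                                     # (d_m, |V|), computed once
    ids = np.argpartition(-R, K - 1, axis=1)[:, :K]    # unordered top-K per neuron
    scores = np.take_along_axis(R, ids, axis=1)
    order = np.argsort(-scores, axis=1)                # order the K kept entries
    ids = np.take_along_axis(ids, order, axis=1)
    scores = np.take_along_axis(scores, order, axis=1)
    return ids, scores

rng = np.random.default_rng(2)
d, d_m, vocab, K = 8, 16, 100, 5                       # toy sizes
E = rng.normal(size=(vocab, d))
W2 = rng.normal(size=(d, d_m))

ids, scores = build_topk_cache(E, W2, K)
print(ids.shape, scores.shape)
```

The cache holds $d_m \times K$ entries per layer instead of $d_m \times |V|$, which is enough for inspecting and naming concepts but not for reconstructing full logit updates.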

Pseudocode is provided in (Geva et al., 2022), enabling straightforward replication.

4. Empirical Results: Control and Efficiency

Two main experimental results substantiate the utility of TaylorDecomp-FFN:

a) Zero-shot Toxicity Reduction in GPT-2:

Identifying 10 “harmless” concept neurons (those with non-toxic top tokens in $r_i^\ell$) and forcing $m_i^\ell \leftarrow +3$ at every FFN layer reduces toxic outputs from GPT-2 by approximately 50%, as measured over 1,225 “hard” prompts from REALTOXICPROMPTS scored with the Perspective API. Baseline self-debiasing approaches achieve only a 13% reduction with the same parameter budget, and post-hoc word filters provide 0%. The associated perplexity increase is mild (≈ 15% relative). Performance metrics are averaged across 6 toxicity assessment criteria (Table 5 in (Geva et al., 2022)).
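The intervention amounts to overriding selected coefficients before the second FFN projection. The sketch below shows this on a single toy layer; the neuron indices in `forced`, the ReLU activation, and the random weights are assumptions for illustration and do not correspond to the actual GPT-2 neurons used in the paper.

```python
import numpy as np

def ffn_with_intervention(x, W1, b1, W2, b2, forced=(), value=3.0):
    """Single FFN layer; coefficients listed in `forced` are set to `value`."""
    m = np.maximum(W1 @ x + b1, 0.0)      # ReLU assumed for this sketch
    m[list(forced)] = value               # override the chosen coefficients
    return W2 @ m + b2, m

rng = np.random.default_rng(3)
d, d_m, vocab = 8, 32, 50                 # toy sizes
E = rng.normal(size=(vocab, d))
W1 = rng.normal(size=(d_m, d))
b1 = rng.normal(size=d_m)
W2 = rng.normal(size=(d, d_m))
b2 = rng.normal(size=d)
x = rng.normal(size=d)

forced = [3, 7, 11]                       # hypothetical "harmless" neuron indices
o_base, m_base = ffn_with_intervention(x, W1, b1, W2, b2)
o_int, m_int = ffn_with_intervention(x, W1, b1, W2, b2, forced=forced)

# Logit shift caused by the intervention, in vocabulary space.
shift = E @ (o_int - o_base)
expected = sum((3.0 - m_base[i]) * (E @ W2[:, i]) for i in forced)
print(np.allclose(shift, expected))
```

The logit shift is exactly the sum of the forced neurons' concept vectors, each scaled by the change in its coefficient, which is why the tokens those neurons promote become more likely.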

b) Early Exit via Dominant Sub-Updates (WIKILM):

A token is designated as “saturated” at layer $\ell$ if its final prediction is already ranked first at that layer. A nearest-neighbor rule that matches dominant neuron clusters yields early exit with 94.1% exit accuracy and 20% average layer savings. This approach matches or outperforms logistic regression classifiers trained on hidden states or FFN outputs, and requires no further training (Table 6 in (Geva et al., 2022)).
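The saturation criterion can be sketched as below. The code assumes access to the logits after every layer and uses a slightly stricter reading in which the final top-1 token must also remain top-1 through all later layers; that reading, the function name, and the toy data are assumptions. The nearest-neighbor exit rule itself is not reproduced here.

```python
import numpy as np

def saturation_layer(per_layer_logits):
    """Earliest layer from which the top-1 token equals the final top-1 token.

    per_layer_logits has shape (num_layers, |V|).
    """
    top1 = per_layer_logits.argmax(axis=1)
    final = top1[-1]
    layer = len(top1) - 1
    # Walk backwards while the preceding layer already predicts the final token.
    while layer > 0 and top1[layer - 1] == final:
        layer -= 1
    return layer

# Toy example: top-1 token per layer is 2, 5, 5, 1, 5, 5 (final prediction: 5).
top1_per_layer = [2, 5, 5, 1, 5, 5]
logits = np.zeros((6, 6))
logits[np.arange(6), top1_per_layer] = 1.0

print(saturation_layer(logits))   # 4
```

In this toy case the token is stable from layer 4 onward, so an ideal early-exit rule could skip the last layer.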

5. Limitations and Directions for Extension

Several limitations and open avenues are identified:

  • Application is restricted to decoder-only (autoregressive) LMs: GPT-style models and WIKILM. Encoder-only and masked-LM architectures (BERT/RoBERTa) remain unexplored.
  • The decomposition treats value vectors $v_i^\ell$ individually; compositional interactions (pairwise or higher-order) may play a role but their analysis is exponentially more complex.
  • Human annotation of concepts is nontrivial due to rare words, subwords, and world knowledge; current coverage may be underestimated.
  • Interventions alter only the exposure of information (logit weighting), not the underlying model knowledge. For example, reducing toxic outputs suppresses, but does not erase, such content.
  • Prospective work includes “editing” value vectors $v_i^\ell$ to remove undesirable concepts, or integrating adaptive re-scaling of sub-updates (curriculum schemes) during fine-tuning.

6. Conceptual and Methodological Significance

TaylorDecomp-FFN provides a granular, per-neuron analysis of prediction construction in Transformer models, converting opaque FFN parameterizations into additive, interpretable contributions in vocabulary space. This approach elucidates the architectural mechanisms behind concept promotion or demotion at each layer, enabling real-time, zero-shot interventions for both safety (e.g., toxicity reduction) and computational efficiency (e.g., early exit). The method offers a unifying framework for understanding and manipulating FFN-driven semantic and syntactic abstraction in neural LLMs (Geva et al., 2022).

References (1)
