Feed-Forward Layers in Transformers

Updated 22 March 2026
  • Feed-forward layers in Transformers are multilayer, position-wise networks that act as high-capacity key–value memory systems for pattern detection and semantic refinement.
  • They maintain isotropy, prevent embedding collapse, and incrementally refine token predictions through non-linear transformations across model depth.
  • Architectural variants such as sparse, MoE-based, and parameter-sharing FFNs optimize performance and efficiency while enabling targeted knowledge adaptation.

Feed-forward layers in Transformer architectures are multilayer, position-wise neural networks that follow each self-attention sub-block. While often described as simple multi-layer perceptrons (MLPs), recent research has reframed these sublayers as high-capacity key–value memory systems responsible both for parametric expressiveness and for the semantic refinement of representations across layers. Feed-forward networks (FFNs) account for the majority of model parameters and computational budget, driving ongoing research into their function, efficiency, architectural variants, and implications for multilinguality, knowledge encoding, and parameter adaptation.

1. Mathematical Characterization and Key–Value Memory Interpretation

A standard Transformer FFN at layer $\ell$ takes the output $h^{(\ell)} \in \mathbb{R}^{d_{model}}$ from the attention sublayer and computes

$$k^{(\ell)} = W_1^{(\ell)} h^{(\ell)} + b_1^{(\ell)}$$

$$a^{(\ell)} = \phi\left(k^{(\ell)}\right)$$

$$v^{(\ell)} = W_2^{(\ell)} a^{(\ell)} + b_2^{(\ell)}$$

where $W_1^{(\ell)} \in \mathbb{R}^{d_{ff} \times d_{model}}$, $W_2^{(\ell)} \in \mathbb{R}^{d_{model} \times d_{ff}}$, and $\phi$ is typically the Gaussian Error Linear Unit (GeLU). Here, $W_1$ maps the input into a "key" space, and $W_2$ projects activated "detectors" back to the feature space. The sequence of operations is interpreted as a collection of key–value memories: each row of $W_1$ is a key (pattern detector), and each column of $W_2$ is a value used to compose the output. The activation $\phi$ gates the selection, making the FFN a soft pattern-matching lookup table with $d_{ff}$ entries per layer (Geva et al., 2020, Bhattacharya et al., 2023).
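This key–value reading is easy to make concrete. The sketch below is a minimal illustration only: the dimensions, random weights, and GeLU choice are assumptions rather than values from any specific model. It computes $k$, $a$, and $v$ for one position and verifies that the output is a gated sum of the value columns of $W_2$:

```python
# Minimal sketch of a Transformer FFN read as a key-value memory.
# Dimensions and random weights are illustrative assumptions only.
import torch
import torch.nn.functional as F

d_model, d_ff = 768, 3072                 # typical "base"-size dimensions
W1 = torch.randn(d_ff, d_model) * 0.02    # rows of W1 act as keys (pattern detectors)
b1 = torch.zeros(d_ff)
W2 = torch.randn(d_model, d_ff) * 0.02    # columns of W2 act as values
b2 = torch.zeros(d_model)

h = torch.randn(d_model)                  # attention sub-layer output at one position

k = W1 @ h + b1                           # key scores: how strongly each pattern fires
a = F.gelu(k)                             # soft gating over the d_ff memory slots
v = W2 @ a + b2                           # FFN output for this position

# Equivalent "memory" view: a gated sum of the value columns of W2.
v_as_memory = (W2 * a).sum(dim=1) + b2
assert torch.allclose(v, v_as_memory, atol=1e-4)
```

Because $a$ comes from a soft non-linearity rather than a hard top-1 lookup, many memory slots contribute to every output, which is what makes the lookup table "soft".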

Empirical evidence shows that in lower layers, keys respond to lexical or shallow patterns, while in upper layers, they capture higher-level, semantic patterns. Corresponding values increasingly predict the next-token distribution, with sharper and more specific predictions at higher layers, justifying the key–value memory analogy (Geva et al., 2020, Geva et al., 2022, Bhattacharya et al., 2023).

2. Functional Role and Interaction with Attention

The FFN's principal function is a non-linear, feature-wise transformation that re-mixes and expands representations at each position, injecting capacity independent of the sequence context handled by attention. Its functional roles include:

  • Isotropy maintenance: FFNs prevent the collapse of token embeddings, maintaining representation isotropy and mitigating rank degeneration. Experimental removal of FFNs leads to singular embedding collapse even when deep attention is preserved (Sonkar et al., 2023).
  • Incremental output refinement: Each FFN layer produces an additive update to the token-prediction distribution in the vocabulary space, with outputs acting as incremental, concept-promoting modifications to the logit space. This process sharpens the model's prediction across depth, as evidenced by “tuned-lens” analyses (Geva et al., 2022, Bhattacharya et al., 2023).
  • Parameter dominance: FFNs typically consume two-thirds of non-embedding parameters in standard Transformer designs, making their design and efficiency a primary determinant of overall model capacity and speed (Geva et al., 2020, Zhang et al., 2021).

Analysis further shows that, compared to attention modules, FFNs are less essential for introducing non-linearity, but they cannot be omitted or trivially replaced by attention without a loss of performance on language modeling and GLUE tasks (Zhao et al., 2021).
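The incremental-refinement view from the list above can be illustrated with a lens-style projection: multiplying the FFN's additive update by the unembedding matrix yields a vector of logit changes whose largest entries are the tokens the layer promotes. The sketch below uses random stand-ins for the residual stream, the FFN update, and the unembedding matrix $E$; all of these are assumptions for illustration, not weights from a real model:

```python
# Lens-style sketch of incremental output refinement by one FFN layer.
# E, h, and ffn_update are random stand-ins, not weights from a real model.
import torch

d_model, vocab_size = 768, 50257
E = torch.randn(vocab_size, d_model) * 0.02   # unembedding (output projection) matrix

h = torch.randn(d_model)                      # residual stream entering the FFN
ffn_update = 0.1 * torch.randn(d_model)       # additive FFN output at this layer

logits_before = E @ h
logits_after = E @ (h + ffn_update)

# The layer's contribution to the prediction is a vector of logit changes;
# its largest entries are the tokens this FFN update promotes.
delta = logits_after - logits_before          # equals E @ ffn_update
promoted = torch.topk(delta, k=5).indices
print("token ids most promoted by this FFN update:", promoted.tolist())
```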

3. Layerwise Roles, Specialization, and Multilinguality

FFN importance and “specialization” vary non-trivially with layer depth:

  • Layerwise importance: Empirical studies demonstrate that middle FFN layers are most critical for model performance, particularly in language modeling and factual recall, while first and last layers play a lesser role (Ikeda et al., 25 Aug 2025). Concentrating FFN capacity in roughly 70% of consecutive middle layers yields consistently superior performance compared to uniform capacity allocation or allocation to input/output layers alone (Ikeda et al., 25 Aug 2025, Table 1).
  • Language specificity in multilingual models: In autoregressive multilingual models, early and late FFN layers are more “language-specialized,” activating neurons specific to the input language, while middle layers are relatively language-neutral, encoding semantic (language-agnostic) features. Classifying language from FFN activations is easiest in boundary layers, with accuracy peaking at the beginning and end of the stack and dropping in the middle (Bhattacharya et al., 2023).

This layerwise specialization suggests that architectural modifications for domain/multilingual adaptation should target early and late FFNs for increased specialization (e.g., through language-specific adapters), while promoting sharing in the middle.
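A minimal sketch of the non-uniform allocation suggested above: keep boundary FFNs narrow and spread most of the FFN budget over a consecutive middle band of layers. The band fraction, boundary width, and budget used here are illustrative assumptions, not values taken from the cited study:

```python
# Sketch of non-uniform FFN capacity allocation: concentrate the FFN budget in a
# consecutive middle band of layers. The policy and numbers are assumptions.
def allocate_ffn_widths(n_layers: int, total_budget: int, middle_frac: float = 0.7,
                        boundary_width: int = 256) -> list[int]:
    """Return a per-layer d_ff: boundary layers get a small fixed width, and the
    remaining budget is spread evenly over the middle band."""
    n_middle = max(1, round(n_layers * middle_frac))
    start = (n_layers - n_middle) // 2
    middle = set(range(start, start + n_middle))
    remaining = total_budget - boundary_width * (n_layers - n_middle)
    per_middle = remaining // n_middle
    return [per_middle if i in middle else boundary_width for i in range(n_layers)]

widths = allocate_ffn_widths(n_layers=24, total_budget=24 * 3072)
print(widths)  # wide middle layers, narrow first/last layers; total close to the uniform budget
```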

4. Architectural Variants and Parameter Efficiency

A number of FFN variants have been explored for efficiency and expressiveness:

  • Sparse/Mixture-of-Experts FFN (S-FFN, MoE): Instead of always activating all neurons, S-FFN selectively activates blocks (“experts”) based on routing criteria. "MoEfication" partitions FFN parameters into experts and uses learned or deterministic routing, recovering ≥95% performance while using only 10–30% of FFN parameters per token. This reduces inference cost by up to 2× and enables functional partitioning (Zhang et al., 2021, Liu et al., 2023).
  • FFN merging and sharing: Adjacent or similar FFN sublayers display high activation similarity (measured by CKA). Post-training merging or tying of these sublayers achieves 20–30% parameter savings with modest or no performance loss, outperforming structured drop-layer pruning. Further, sharing one wide FFN across all encoder layers (and optionally dropping decoder FFNs) enables dramatic parameter and speed gains with negligible or even positive accuracy changes, provided the shared module is widened sufficiently to compensate (Verma et al., 10 Jan 2025, Pires et al., 2023).
  • Depth and width: Increasing FFN depth (three layers rather than two, with narrow or wide intermediate widths) allows trading the number of Transformer blocks for FFN depth. This configuration yields lower loss and more parameter-efficient learning than simply widening two-layer FFNs (Gerber, 10 May 2025).
  • FL-tuning (layerwise adaptation): In the context of parameter-efficient adaptation, expanding only the FFN hidden dimension with newly trained units (while freezing most weights) matches or exceeds full-model fine-tuning on many tasks, using only ~3% of the parameters (Liu et al., 2022).
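As a rough illustration of the sparse/MoE direction, the sketch below partitions the FFN's hidden neurons into equal expert blocks and evaluates only the top-k blocks per token. The routing rule (mean key score per block) is a simplified stand-in for the learned or Avg-K routers discussed above, and for clarity the sketch still computes all key scores, which a real implementation would avoid; biases are omitted:

```python
# Sketch of a MoEfication-style sparse FFN: d_ff neurons are split into equal
# expert blocks and only the top-k blocks are evaluated per token.
# Routing rule, dimensions, and random weights are illustrative assumptions.
import torch
import torch.nn.functional as F

d_model, d_ff, n_experts, top_k = 768, 3072, 16, 4
block = d_ff // n_experts
W1 = torch.randn(d_ff, d_model) * 0.02
W2 = torch.randn(d_model, d_ff) * 0.02

def sparse_ffn(h: torch.Tensor) -> torch.Tensor:
    k = W1 @ h                                       # key scores for all neurons
    scores = k.view(n_experts, block).mean(dim=1)    # one routing score per expert block
    chosen = torch.topk(scores, top_k).indices       # activate only the top-k experts
    out = torch.zeros(d_model)
    for e in chosen.tolist():
        sl = slice(e * block, (e + 1) * block)
        out += W2[:, sl] @ F.gelu(k[sl])             # evaluate only the chosen blocks
    return out

print(sparse_ffn(torch.randn(d_model)).shape)        # torch.Size([768])
```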

5. Knowledge Storage, Editing, and Interpretability

FFNs not only implement pattern-detection and vocabulary-promotion mechanisms but also serve as the locus of factual and linguistic knowledge in LLMs:

  • Knowledge neurons: Intermediate FFN neurons are identifiable as “knowledge neurons,” with high mutual information with specific factual content and strong impact on prediction when ablated. Knowledge injection and retrieval at these sites improves performance on knowledge-intensive tasks (Yao et al., 2022).
  • Knowledge editing and fine-tuning: Ablations comparing updates to the FFN's "keys" (W1W_1) and "values" (W2W_2) reveal that key updates allow faster, more localized, and more generalizable modification of stored knowledge. Key tuning alters routing without global interference, while value tuning is more disruptive and slower to optimize. LoRA-style adaptation further confirms the advantage of key-side updates for instruction adaptation and continual learning (Qiu et al., 2024).
  • Concept promotion in vocabulary space: FFN value vectors ($v_i$) project to “concept-vectors” in the vocabulary space that selectively promote interpretable sets of tokens (e.g., semantic classes, syntactic groups). Most FFN action is to promote candidate continuations rather than demote incorrect ones. Manipulating these activations enables targeted debiasing and controllable generation, as well as computation savings via early exit (Geva et al., 2022).

Interpretability methods—probing, attribution, and visualization in the activation space—are increasingly focused on understanding these internal FFN mechanisms and their role in prediction composition (Vijayakumar, 2023, Kobayashi et al., 2023).
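Interpretability analyses of this kind often start from a very simple operation: projecting a single FFN value vector into vocabulary space and listing the tokens it most strongly promotes. The sketch below uses random matrices in place of real model weights, purely as an assumption for illustration:

```python
# Sketch of inspecting an FFN "concept vector": project one value vector
# (a column of W2) through the unembedding matrix and list the tokens it
# promotes most strongly. W2 and E are random stand-ins for illustration.
import torch

d_model, d_ff, vocab_size = 768, 3072, 50257
W2 = torch.randn(d_model, d_ff) * 0.02       # value vectors are the columns of W2
E = torch.randn(vocab_size, d_model) * 0.02  # unembedding matrix

value_idx = 123                              # pick one memory slot to inspect
v = W2[:, value_idx]                         # its value vector
token_scores = E @ v                         # projection into vocabulary space
top_tokens = torch.topk(token_scores, k=10).indices
print(f"tokens most promoted by value #{value_idx}:", top_tokens.tolist())
```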

6. Practical Implications and Design Considerations

Converging evidence leads to several practical conclusions for Transformer design:

  • Avoid uniform FFN allocation: Concentrate FFN capacity in the middle layers for general-purpose models; prune or reallocate resources in the boundary layers for increased efficiency (Ikeda et al., 25 Aug 2025).
  • Use parameter-sharing or merging: Where high cross-layer similarity is observed, aggressively tie or merge FFN parameters, recovering performance with a smaller model (Verma et al., 10 Jan 2025, Pires et al., 2023).
  • Adopt sparse and MoE-based FFN variants: For scaling under FLOPs or memory constraints, sparse FFN activation and key-based routing (e.g., Avg-K) improve language modeling perplexity and efficiency compared to dense or statically partitioned blocks (Liu et al., 2023, Zhang et al., 2021).
  • Exploit FFN as a locus for knowledge injection and adaptation: Techniques such as Kformer or FL-tuning directly use the key–value structure for external knowledge fusion and task-specific adaptation with minimal compute (Yao et al., 2022, Liu et al., 2022).
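As a final illustration, an FL-tuning-style adaptation can be sketched as freezing the pretrained FFN and adding a small number of new, trainable hidden units (extra key rows and value columns). The module below is a minimal sketch under assumed sizes, not the reference implementation:

```python
# Sketch of an FL-tuning-style adaptation: freeze the pretrained FFN weights and
# add a few newly trainable hidden units. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandedFFN(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, n_new=128):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)                    # pretrained keys (frozen)
        self.w2 = nn.Linear(d_ff, d_model)                    # pretrained values (frozen)
        self.w1_new = nn.Linear(d_model, n_new)               # new trainable key rows
        self.w2_new = nn.Linear(n_new, d_model, bias=False)   # new trainable value columns
        for p in list(self.w1.parameters()) + list(self.w2.parameters()):
            p.requires_grad = False                           # only the new units are updated

    def forward(self, h):
        frozen = self.w2(F.gelu(self.w1(h)))
        added = self.w2_new(F.gelu(self.w1_new(h)))
        return frozen + added                                 # original behaviour plus a learned correction

ffn = ExpandedFFN()
trainable = sum(p.numel() for p in ffn.parameters() if p.requires_grad)
total = sum(p.numel() for p in ffn.parameters())
print(f"trainable fraction: {trainable / total:.2%}")         # only a few percent of the FFN parameters
```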

In sum, the FFN layers in Transformers are not mere MLP stages but the foundation of key–value memory, pattern detection, concept promotion, and knowledge storage. Their layerwise specialization, architectural flexibility, and growing interpretability make them central targets for both practical and theoretical improvements in Transformer-based language and vision models.
