
Merging Feed-Forward Sublayers in Transformers

Updated 10 December 2025
  • Merging feed-forward sublayers is a technique to combine redundant MLP modules in Transformers for model compression, multi-task fusion, and architectural simplification.
  • It utilizes methods such as closed-form task arithmetic, permutation-based parameter averaging, and chain-of-merges based on activation and structural similarities.
  • Empirical results show up to 40% parameter reduction and improved inference speed while maintaining near-baseline or even enhanced downstream performance.

Merging feed-forward sublayers refers to a family of techniques aimed at combining multiple feed-forward (FF) or multilayer perceptron (MLP) submodules within deep neural architectures, primarily Transformers, into a single or reduced set of submodules. This process leverages structural, functional, or statistical similarities among FF sublayers to enable model compression, efficient multi-task model fusion, or architectural simplification, often with minimal degradation in downstream performance. The recent literature has established a diverse set of approaches, underlining practical and theoretical distinctions between global model merging and submodule-wise operations, and providing insight into the roles of redundancy, linearity, and activation statistics in the success of feed-forward sublayer merging (Dai et al., 15 Apr 2025, Verma et al., 10 Jan 2025, Pires et al., 2023, Buzzega et al., 29 Aug 2025).

1. Definitions and Motivation

Within standard Transformer blocks, each layer comprises two principal submodules: a multi-head self-attention mechanism and a feed-forward (MLP) unit consisting of two linear transformations separated by a nonlinearity, such as GeLU or ReLU:

$$ h = W^{\text{in}} x^{\text{in}} + b^{\text{in}}, \qquad z = \phi(h), \qquad x^{\text{out}} = W^{\text{out}} z + b^{\text{out}} $$

where $W^{\text{in}} \in \mathbb{R}^{f \times d}$, $W^{\text{out}} \in \mathbb{R}^{d \times f}$, and $d$ and $f$ are the model and hidden dimensions, respectively.
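
For concreteness, the sublayer above corresponds to the following minimal PyTorch sketch; the class and argument names (FeedForwardSublayer, d_model, d_ff) are illustrative rather than taken from any cited implementation.

```python
# A minimal sketch of the feed-forward (MLP) sublayer defined above.
# d_model and d_ff correspond to the text's d (model dim) and f (hidden dim).
import torch
import torch.nn as nn

class FeedForwardSublayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)    # W^in, b^in
        self.w_out = nn.Linear(d_ff, d_model)   # W^out, b^out
        self.act = nn.GELU()                    # phi

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.w_in(x)        # h = W^in x^in + b^in
        z = self.act(h)         # z = phi(h)
        return self.w_out(z)    # x^out = W^out z + b^out
```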

Feed-forward sublayer merging is motivated by several empirical findings:

  • MLP/FF submodules, especially in Transformers, frequently learn similar or redundant representations across layers, suggesting a high degree of intrinsic mergeability (Verma et al., 10 Jan 2025).
  • MLP sublayers exhibit higher approximate linearity under fine-tuning and merging, in contrast to the full model or attention submodules (Dai et al., 15 Apr 2025).
  • Architectural simplification by merging, sharing, or removing FF sublayers yields favorable parameter-efficiency and can maintain or even enhance downstream accuracy (Pires et al., 2023).
  • Large-scale model fusion for multi-task settings benefits from submodule-level merging due to more accurate functional interpolation compared to global arithmetic (Dai et al., 15 Apr 2025).

2. Methodologies for Feed-Forward Sublayer Merging

Multiple strategies for FF sublayer merging have been developed, each suited to different applications (multi-model fusion, compression, architectural redesign):

2.1. Submodule-wise Task Arithmetic

The task arithmetic framework exploits the approximate linearity of MLP submodules:

  • For submodule $i$, merging coefficients $\alpha_t^i$ are computed in closed form by minimizing the expected mean-square output discrepancy under a linear approximation:

$$ f_i\left(x;\ \theta_0^i + \sum_t \alpha_t^i \tau_t^i\right) \approx f_i(x; \theta_0^i) + \sum_t \alpha_t^i\, \Delta f_i(x; \theta_0^i + \tau_t^i) $$

  • The optimal scalar merging coefficient for two tasks is

$$ \alpha^* = \frac{E_x\,\langle \Delta f_1(x) - \Delta f_2(x),\ \Delta f_1(x) \rangle}{E_x\, \|\Delta f_1(x) - \Delta f_2(x)\|^2} $$

  • Merged weights are $W_m = \alpha^* W_1 + (1-\alpha^*) W_2$, with analogous expressions for biases (Dai et al., 15 Apr 2025); a brief implementation sketch follows below.
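
The sketch below illustrates this two-task closed form, assuming access to the pretrained FF submodule and two fine-tuned copies as callables plus a small batch of exemplar inputs; function names (closed_form_alpha, merge_linear) are illustrative, not taken from the cited work.

```python
# A sketch of the closed-form two-task coefficient and merge from Section 2.1.
# `base`, `task1`, `task2` are assumed to be the pretrained FF sublayer and two
# fine-tuned copies (callable modules); `x` is a small batch of exemplar inputs.
import torch

@torch.no_grad()
def closed_form_alpha(base, task1, task2, x):
    d1 = task1(x) - base(x)                  # Delta f_1(x)
    d2 = task2(x) - base(x)                  # Delta f_2(x)
    diff = d1 - d2
    num = (diff * d1).sum(dim=-1).mean()     # E_x <d1 - d2, d1>
    den = diff.pow(2).sum(dim=-1).mean()     # E_x ||d1 - d2||^2
    return (num / den).item()

@torch.no_grad()
def merge_linear(alpha, lin1, lin2, lin_merged):
    # W_m = alpha * W_1 + (1 - alpha) * W_2, and likewise for the biases.
    lin_merged.weight.copy_(alpha * lin1.weight + (1.0 - alpha) * lin2.weight)
    lin_merged.bias.copy_(alpha * lin1.bias + (1.0 - alpha) * lin2.bias)
```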

2.2. Permutation-Based Parameter Averaging

In structural compression, similar FF sublayers are located (often via a sliding window) and paired via neuron alignment:

  • Pre-activation features are gathered; cross-correlation matrices between feature sets of candidate sublayers are used to find optimal one-to-one correspondences via assignment solvers (Jonker–Volgenant algorithm).
  • Permuted parameters are averaged component-wise across the selected sublayers, and the merged FFN replaces the originals at all $k$ locations; a brief recovery fine-tuning is then performed (Verma et al., 10 Jan 2025). A sketch of the alignment-and-averaging step is given below.
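
The following sketch covers the alignment and averaging of two FF sublayers. It assumes pre-activation feature matrices Z1, Z2 of shape (n_samples, d_ff) collected on the same exemplar inputs and uses SciPy's linear_sum_assignment (a Jonker–Volgenant-style assignment solver); all other names are illustrative.

```python
# A sketch of permutation-based averaging of two FF sublayers. Z1, Z2 are pre-activation
# features of shape (n_samples, d_ff); W_in* is (d_ff, d_model), b_in* is (d_ff,),
# W_out* is (d_model, d_ff).
import numpy as np
from scipy.optimize import linear_sum_assignment  # Jonker-Volgenant-style assignment solver

def align_and_average(Z1, Z2, W_in1, b_in1, W_out1, W_in2, b_in2, W_out2):
    # Cross-correlation between the hidden neurons of the two sublayers.
    Z1c = (Z1 - Z1.mean(0)) / (Z1.std(0) + 1e-8)
    Z2c = (Z2 - Z2.mean(0)) / (Z2.std(0) + 1e-8)
    corr = Z1c.T @ Z2c / Z1.shape[0]              # (d_ff, d_ff)

    # One-to-one matching that maximizes total correlation:
    # perm[i] is the sublayer-2 neuron matched to sublayer-1 neuron i.
    _, perm = linear_sum_assignment(-corr)

    # Permute sublayer 2 into sublayer 1's neuron ordering, then average component-wise.
    W_in_m = 0.5 * (W_in1 + W_in2[perm])          # rows of W_in are indexed by hidden neuron
    b_in_m = 0.5 * (b_in1 + b_in2[perm])
    W_out_m = 0.5 * (W_out1 + W_out2[:, perm])    # columns of W_out are indexed by hidden neuron
    return W_in_m, b_in_m, W_out_m
```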

2.3. Chain of Merges (CoM)

CoM merges sublayers layer-wise, using an auto-regressive update of activation statistics to address internal covariate shift (Buzzega et al., 29 Aug 2025):

  • For layer $\ell$, the merged weights $(W_M^\ell, b_M^\ell)$ minimize the aggregate squared Frobenius distance between the outputs of all tasks/models on their respective (possibly shifted) input distributions, via least-squares regression (a code sketch follows the equations below):

$$ W_M^\ell = \left[\sum_{i=1}^N \left(W_i^\ell \hat X_i^\ell\right) \hat X_i^{\ell\top} \right] \left[\sum_{i=1}^N \hat X_i^\ell \hat X_i^{\ell\top}\right]^{-1} $$

$$ b_M^\ell = \frac{1}{N}\sum_{i=1}^N \left(\mu_i^\ell - W_M^\ell\, \hat{\mu}_i^\ell\right) $$
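
A minimal sketch of this layer-wise least-squares merge, assuming per-model input activations $\hat X_i^\ell$ stored as (d_in, n_samples) matrices along with per-model output/input mean vectors; the small ridge term is an added numerical safeguard, not part of the closed form above.

```python
# A sketch of the layer-wise least-squares merge in Section 2.3. X_hat[i] holds the
# input activations of model i at this layer as a (d_in, n_samples) matrix, W[i] is its
# (d_out, d_in) weight, and mu[i] / mu_hat[i] are its output / input mean vectors.
import numpy as np

def com_merge_layer(W, X_hat, mu, mu_hat, ridge=1e-6):
    d_in = X_hat[0].shape[0]
    num = sum(Wi @ Xi @ Xi.T for Wi, Xi in zip(W, X_hat))   # sum_i (W_i X_i) X_i^T
    den = sum(Xi @ Xi.T for Xi in X_hat)                    # sum_i X_i X_i^T
    # The ridge term is a numerical safeguard, not part of the closed form.
    W_M = num @ np.linalg.inv(den + ridge * np.eye(d_in))
    b_M = np.mean([mi - W_M @ mhi for mi, mhi in zip(mu, mu_hat)], axis=0)
    return W_M, b_M
```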

2.4. Architectural Merging and Parameter Sharing

Empirical redundancy within FFNs has motivated architectural modifications:

  • Entire FFN sublayers in decoders may be dropped.
  • All encoder-side FFNs can be replaced by a single large, shared FFN (tied across layers). When widened appropriately (increased $d_{\text{ff}}$), this "One Wide FFN" variant can match or surpass the baseline in both performance and efficiency (Pires et al., 2023); a structural sketch is given after this list.
  • Layer removal and FFN sharing constitute a basic form of sublayer merging, with the choice of hidden size determining the accuracy-compression trade-off.
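
The sketch below shows the structural idea of a "One Wide FFN"-style encoder: a single widened FFN instance is reused by every layer, so its parameters are tied across layers. The class name and hyperparameters are illustrative and do not reproduce the cited architecture exactly.

```python
# A structural sketch of parameter sharing: one wide FFN tied across all encoder layers.
import torch
import torch.nn as nn

class SharedFFNEncoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6, d_ff_wide=8192):
        super().__init__()
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        # A single wide FFN shared by every layer (parameters tied across layers).
        self.shared_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff_wide), nn.ReLU(), nn.Linear(d_ff_wide, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for attn, n1, n2 in zip(self.attn, self.norm1, self.norm2):
            a, _ = attn(x, x, x)
            x = n1(x + a)
            x = n2(x + self.shared_ffn(x))   # same FFN parameters at every layer
        return x
```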

3. Linearity, Similarity, and Mergeability Criteria

The effectiveness of sublayer merging depends critically on both linearity and activation similarity:

  • MLP submodules display near-zero projection distances under merging, validating the local linearity required for closed-form interpolation (Dai et al., 15 Apr 2025).
  • Cross-layer Centered Kernel Alignment (CKA) analyses reveal high similarity (0.8–0.9) in the activation spaces of many FFNs, both for adjacent and non-adjacent pairs; attention submodules do not exhibit this redundancy (Verma et al., 10 Jan 2025).
  • Mergeability correlates with these similarity metrics; blocks with high CKA respond better to parameter merging and sharing, often requiring only minimal fine-tuning for downstream recovery. A linear-CKA sketch is given below.
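
For reference, linear CKA between two activation matrices gathered on the same inputs can be computed as in the sketch below; this is the standard linear-CKA formulation, not code taken from the cited work, and the shapes of X and Y are assumptions.

```python
# Linear CKA between activation matrices X, Y of shape (n_samples, features),
# collected on the same inputs from two FF sublayers.
import numpy as np

def linear_cka(X, Y):
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    hsic = np.linalg.norm(Yc.T @ Xc, "fro") ** 2      # ||Y_c^T X_c||_F^2
    return hsic / (np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro"))
```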

4. Empirical Outcomes and Trade-Offs

Merging feed-forward sublayers offers favorable empirical trade-offs:

  • In multi-task merging for LLMs, per-sublayer closed-form merging (at the attention/MLP level) improved multi-task accuracy by 2–3 percentage points relative to global task arithmetic on strong backbones (e.g., Llama-2) (Dai et al., 15 Apr 2025).
  • For structural compression, removing 21–22% of total model parameters (e.g., in ViT-Base or GPT-2 Large) via FFN permutation-merging retained 99% of the baseline performance (BLEU, accuracy, PPL), outperforming “drop-layer” baselines (Verma et al., 10 Jan 2025).
  • In the “One Wide FFN” paradigm, replacing all encoder FFNs with a widened shared unit allows up to 40% parameter reduction with negligible or positive change in BLEU, COMET, and chrF scores; inference speed improves proportionally (Pires et al., 2023).

The table below summarizes characteristic empirical impacts:

| Approach | Param Reduction | Accuracy Drop (Typical) | Speed Gain |
|---|---|---|---|
| Permute-merge of 1/3 of FFNs (Verma et al., 10 Jan 2025) | 21–22% | <1% (ViT, OPUS-MT) | n/a |
| One Wide FFN (Pires et al., 2023) | 40% | ≈0 to +1 BLEU | +23% |
| Sublayer-wise task arithmetic (Dai et al., 15 Apr 2025) | n/a | +2–3% multi-task accuracy (gain) | n/a |

5. Practical Procedures and Implementation Details

Practical implementation of FFN merging involves several procedural steps, including data selection, neuron alignment, and closed-form parameter averaging or regression:

  • Neuron permutation alignment is essential before averaging to maximize feature-wise correspondence; assignment problems are efficiently solved by combinatorial solvers.
  • Only a small held-out or exemplar dataset (approximately 30–100 samples) is required to compute feature statistics (e.g., for closed-form coefficient estimation or activation matching).
  • For merging by averaging, both input and output weights and their biases are transformed under the learned permutations before arithmetic averaging.
  • Brief task-specific fine-tuning (“recovery fine-tune”) further closes any performance gap, typically with conservative learning rates and early stopping based on validation.
  • In chain-of-merges approaches, activations are updated recursively at each layer to mitigate covariate shift and ensure consistency across the network (Buzzega et al., 29 Aug 2025).

Pseudocode and full algorithmic details are provided in the relevant sources (Dai et al., 15 Apr 2025, Verma et al., 10 Jan 2025, Buzzega et al., 29 Aug 2025).
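
As a complement, the sketch below outlines a generic recovery fine-tuning loop with a conservative learning rate and validation-based early stopping; all names (model, train_loader, val_loader, evaluate) are placeholders for a user's own pipeline, and the `.loss` access assumes a HuggingFace-style forward pass.

```python
# A generic sketch of the "recovery fine-tune" step after merging.
import torch

def recovery_finetune(model, train_loader, val_loader, evaluate,
                      lr=1e-5, max_epochs=3, patience=1):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)   # conservative learning rate
    best_score, stale = float("-inf"), 0
    for _ in range(max_epochs):
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            loss = model(**batch).loss   # assumes a forward pass returning an object with .loss
            loss.backward()
            opt.step()
        score = evaluate(model, val_loader)              # higher is assumed to be better
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
            if stale > patience:                         # early stopping on validation
                break
    return model
```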

6. Limitations, Assumptions, and Architectural Implications

Feed-forward sublayer merging rests upon several theoretical and empirical assumptions:

  • Linearity Property: Effective merging requires that each FFN satisfies approximate linearity under weight interpolation ("Property 3" in the source analysis). Violations at sub-neuronal granularity or under highly nonlinear responses can destabilize the closed-form solution (Dai et al., 15 Apr 2025).
  • Activation Similarity: Merging is most effective for FFNs with high inter-activation similarity; disparate feature spaces may induce sharp performance drops.
  • Data Dependency: Estimation of merging coefficients and activation statistics depends on access to a small but representative dataset.
  • First-Order Approximation: Merging approaches typically ignore higher-order (curvature) effects; strongly nonlinear regions may erode merge quality.
  • No Retraining: Some merging procedures are training-free, precluding adaptation beyond the linear approximation.
  • Architectural Redundancy: The feasibility of merging or sharing FFNs is rooted in observed representational redundancy, particularly in large-scale, deep Transformer settings.

A plausible implication is an architectural shift from many small, layer-specific FFNs toward a smaller number of wider, shared modules, reallocating the parameter budget for improved efficiency and, at times, downstream accuracy (Pires et al., 2023).

7. Outlook and Broader Impact

Feed-forward sublayer merging has reshaped several aspects of deep model design and deployment:

  • As a model compression strategy, it enables substantial reductions in parameter count and inference cost without requiring major sacrifices in accuracy, supporting broader hardware deployment scenarios (Verma et al., 10 Jan 2025, Pires et al., 2023).
  • In multi-task model merging, sublayer-level approaches produce more reliable and interpretable model fusions, crucial for scaling foundation models to diverse application domains (Dai et al., 15 Apr 2025).
  • The development of recursive, activation-matching-based merging (e.g., CoM) provides robustness to inter-layer distributional shift, potentially informing future directions in modular network optimization (Buzzega et al., 29 Aug 2025).

These developments underscore the interplay between architectural redundancy, linearity, and representation similarity as the foundation for effective FFN sublayer merging, with ongoing research expanding applicability across domains and model classes.
