Compact-FFN: Efficient Transformer Design
- Compact-FFN (cFFN) is an approach that refines transformer feed-forward networks by reducing redundancy and computational overhead with techniques like low-rank factorization and parameter sharing.
- It employs methods such as multi-branch re-parameterization, NAS-guided structure search, and chunk-wise partitioning to balance efficiency with minimal accuracy loss.
- Empirical results across vision, language, and speech tasks show that cFFN designs can significantly cut parameters and FLOPs while maintaining competitive performance.
A compact-FFN (cFFN) is a family of architectural and algorithmic strategies for the feed-forward network component in transformers, aiming to reduce computational complexity, parameter count, and memory footprint with minimal loss in predictive accuracy. The core motivation is the empirical observation that transformer FFNs—despite occupying a large fraction of total parameters and FLOPs—contain substantial redundancy, particularly in large models or under extreme parameter sharing. Multiple designs for cFFN have emerged, including low-rank factorization, parameter sharing, chunk-wise partitioning, structured operation search, and integration with conditional experts, each with distinct performance and implementation trade-offs across vision, language, and speech domains.
1. Motivation: FFN Redundancy and Computational Bottlenecks
The transformer FFN typically consists of two position-wise linear maps with a non-linearity, accounting for the majority of parameters and floating point operations (FLOPs) in standard transformer layers. Analysis of Vision Transformers (ViT), BERT-derived models, and modern encoder-decoder architectures reveals highly redundant FFN representations across layers or within the weight matrices themselves (Xu et al., 2023, Pires et al., 2023). In ViT and BERT, FFN FLOPs can be 2–3 times larger than those of multi-head self-attention, driving the need for compact alternatives (Dong et al., 2021). Empirical studies indicate that reducing or re-structuring the FFN often results in only modest or negligible drops in accuracy, especially when appropriate compensatory mechanisms or search strategies are applied (Pires et al., 2023, Dong et al., 2021).
2. Low-Rank and Factorization Approaches
A primary line of cFFN design employs low-rank factorization of the larger projection matrix in the FFN. In ViT-based cFFN (Xu et al., 2023), the hidden-to-output matrix $W_2 \in \mathbb{R}^{d \times d_h}$ is replaced by a product $W_2 \approx W_a W_b$, where $W_a \in \mathbb{R}^{d \times r}$, $W_b \in \mathbb{R}^{r \times d_h}$, and $r \ll \min(d, d_h)$. The forward computation becomes:

$$\mathrm{FFN}(x) = W_a\big(W_b\,\sigma(W_1 x + b_1)\big) + b_2$$
This factorization reduces the parameter count for $W_2$ from $d\,d_h$ to $r(d + d_h)$. To avoid the representational collapse inherent in pure low-rank factorization, a multi-branch re-parameterization is used during training: each of $W_a$ and $W_b$ is replaced with a sum of parallel small linear branches (typically convolutions plus batch normalization), all merged post-training. The result is a single low-rank projection and bias at inference, maintaining capacity during optimization (Xu et al., 2023).
Letting $N = d\,d_h$ denote the original projection parameters and $N' = r(d + d_h)$ the factorized count, with $r \ll d_h$, the reduction ratio in both FLOPs and parameters for this projection is $N'/N = r(d + d_h)/(d\,d_h)$. Tuning $r$ allows precise control over the accuracy-efficiency trade-off.
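The factorized forward pass and the resulting parameter reduction can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the dimensions ($d = 192$, $d_h = 768$, $r = 96$) are hypothetical DeiT-T-like values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h, r = 192, 768, 96  # hypothetical model dim, hidden dim, chosen rank

# Full FFN would use W2 in R^{d x d_h}; here the low-rank replacement W2 ~ Wa @ Wb.
W1 = rng.standard_normal((d_h, d)) * 0.02
b1 = np.zeros(d_h)
Wa = rng.standard_normal((d, r)) * 0.02
Wb = rng.standard_normal((r, d_h)) * 0.02
b2 = np.zeros(d)

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn_lowrank(x):
    h = gelu(W1 @ x + b1)
    return Wa @ (Wb @ h) + b2  # two thin matmuls replace one wide projection

x = rng.standard_normal(d)
y = ffn_lowrank(x)

# Parameter reduction for the factorized projection: d*d_h -> r*(d + d_h)
full_params = d * d_h
lowrank_params = r * (d + d_h)
print(full_params, lowrank_params, lowrank_params / full_params)
```

With these example dimensions the factorized projection keeps $r(d + d_h)/(d\,d_h) = 0.625$ of the original projection's parameters.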
3. Parameter Sharing, Expert Routing, and Mixture-of-LoRAs
Strict parameter sharing in recursive or ALBERT-like transformers collapses expressivity, especially when both attention and FFN parameters are shared across groups of layers. The cFFN in ModernALBERT (Nouriborji et al., 14 Dec 2025) addresses this by embedding a Mixture-of-LoRAs (MoL) within the shared FFN. Here, each token's FFN transformation is modulated by a convex combination of a base shared weight $W$ and several low-rank (rank $r \ll d$) expert deltas $\Delta W_i = B_i A_i$. Expert selection and weighting are performed by a router (two-layer MLP with top-2 sparse softmax) that computes gating weights $g_i(x)$. The effective FFN for token $x$ is:

$$\mathrm{FFN}(x) = \Big(W + \sum_i g_i(x)\, B_i A_i\Big)\, x$$
At inference, all expert adapters can be merged (by uniform or EMA-weighted averaging) into a single static adapter, eliminating conditional computation overhead while preserving accuracy (Nouriborji et al., 14 Dec 2025).
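A compact NumPy sketch of the routing and merging logic follows. It is an assumption-laden toy: the router is simplified to a single linear layer (the source describes a two-layer MLP), dimensions are arbitrary, and uniform averaging is used for the inference-time merge.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_experts = 64, 8, 4  # hypothetical: model dim, LoRA rank, expert count

W = rng.standard_normal((d, d)) * 0.05             # shared base FFN weight
A = rng.standard_normal((n_experts, r, d)) * 0.05  # LoRA "down" factors
B = rng.standard_normal((n_experts, d, r)) * 0.05  # LoRA "up" factors
Wr = rng.standard_normal((n_experts, d)) * 0.05    # simplified single-layer router

def top2_softmax(logits):
    # Keep only the two largest logits; renormalize their softmax weights.
    top = np.argsort(logits)[-2:]
    w = np.exp(logits[top] - logits[top].max())
    g = np.zeros_like(logits)
    g[top] = w / w.sum()
    return g

def mol_ffn(x):
    g = top2_softmax(Wr @ x)
    delta = sum(g[i] * (B[i] @ (A[i] @ x)) for i in range(n_experts) if g[i] > 0)
    return W @ x + delta

# Inference-time merge: average all expert deltas into one static adapter,
# removing the conditional routing entirely.
delta_merged = sum(B[i] @ A[i] for i in range(n_experts)) / n_experts
def merged_ffn(x):
    return (W + delta_merged) @ x

x = rng.standard_normal(d)
print(mol_ffn(x).shape, merged_ffn(x).shape)
```

Note that because each delta $B_i A_i$ is rank-$r$, the per-token overhead during training is only $2rd$ multiply-adds per active expert, while the merged model pays no routing cost at all.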
4. Structure Search and Nonlinear Primitive Composition
EfficientBERT (Dong et al., 2021) generalizes cFFN design by employing a progressive neural architecture search (NAS) over a directed acyclic graph (DAG) of MLP primitives. The search space spans:
- Expansion ratios
- Stack depth of MLP layers
- Choice and arrangement of unary/binary nonlinear primitives (e.g., GeLU, Swish, Add, Mul, Max)
Each sampled cFFN cell is evaluated in situ, with parameters “sliced” from a pre-trained supernet and distilled from a BERT base teacher at each stage. The search yields highly nonlinear, highly compact FFN cells tailored for downstream tasks and resource budgets. Final EfficientBERT models with 6.9× fewer parameters and 4.4× faster inference than BERT are obtained by stacking layer-wise optimized cFFNs (Dong et al., 2021).
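The structure of such a search space can be illustrated with a toy sampler that draws one cFFN cell description and instantiates it. This is a hedged sketch only: the candidate values, activation set, and random sampling stand in for the actual progressive NAS, supernet slicing, and distillation described in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy search space, loosely following the axes searched in EfficientBERT:
# expansion ratio, MLP stack depth, and a nonlinear primitive per layer.
SPACE = {
    "expansion": [0.5, 1.0, 2.0, 4.0],
    "depth": [1, 2, 3],
    "act": ["gelu", "swish", "relu"],
}

ACTS = {
    "gelu": lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3))),
    "swish": lambda x: x / (1 + np.exp(-x)),
    "relu": lambda x: np.maximum(x, 0),
}

def sample_cell(d):
    """Sample one cFFN cell description and instantiate its weights."""
    depth = int(rng.choice(SPACE["depth"]))
    layers = []
    for _ in range(depth):
        hidden = int(d * rng.choice(SPACE["expansion"]))
        layers.append({
            "W_in": rng.standard_normal((hidden, d)) * 0.02,
            "W_out": rng.standard_normal((d, hidden)) * 0.02,
            "act": str(rng.choice(SPACE["act"])),
        })
    return layers

def run_cell(layers, x):
    for layer in layers:
        x = layer["W_out"] @ ACTS[layer["act"]](layer["W_in"] @ x)
    return x

d = 32
cell = sample_cell(d)
y = run_cell(cell, rng.standard_normal(d))
print(len(cell), y.shape)
```

In the real search, each sampled cell would be scored (after distillation) rather than used directly, and binary primitives such as Add/Mul/Max would let the DAG combine branches, not just stack them.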
5. Chunk-wise Partitioning and Spatial Factorization
Chunk-Level Feedforward Networks (CFFN), introduced in EfficientASR for ASR tasks, propose dividing the model dimension $d$ into $N$ chunks and applying independent two-layer FFNs to each chunk. For input $x \in \mathbb{R}^{d}$, the data is split as $x = [x_1; \dots; x_N]$ with $x_i \in \mathbb{R}^{d/N}$, so that each $x_i$ is processed by a separate $\mathrm{FFN}_i$, producing $y_i = \mathrm{FFN}_i(x_i)$. The final output is concatenated:

$$y = [y_1; y_2; \dots; y_N]$$
This results in a parameter and FLOPs reduction by a factor of $N$ in the leading quadratic term. Empirical results report a 36% reduction in model parameters and maintained or improved character error rate (CER) on Aishell-1 and HKUST when using $N = 2$ (Wang et al., 2024).
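The chunking scheme is simple enough to sketch directly; the following NumPy toy (with hypothetical dimensions $d = 256$, $N = 2$, expansion 4) shows the split-process-concatenate flow and the factor-of-$N$ shrinkage of the quadratic term.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, expansion = 256, 2, 4  # model dim, chunk count, FFN expansion ratio

# One independent two-layer FFN per chunk of size d/N.
chunk = d // N
W1 = [rng.standard_normal((expansion * chunk, chunk)) * 0.02 for _ in range(N)]
W2 = [rng.standard_normal((chunk, expansion * chunk)) * 0.02 for _ in range(N)]

def relu(x):
    return np.maximum(x, 0)

def chunk_ffn(x):
    parts = np.split(x, N)                              # x -> [x_1, ..., x_N]
    outs = [W2[i] @ relu(W1[i] @ parts[i]) for i in range(N)]
    return np.concatenate(outs)                         # y = [y_1; ...; y_N]

x = rng.standard_normal(d)
y = chunk_ffn(x)

# The quadratic parameter term shrinks by a factor of N:
full = 2 * expansion * d * d                 # one full-width two-layer FFN
chunked = N * 2 * expansion * chunk * chunk  # N narrow FFNs
print(y.shape, chunked / full)
```

Since each chunk's FFN sees only $d/N$ of the dimensions, chunking also removes all cross-chunk mixing from the FFN, which is why overly aggressive chunking eventually hurts accuracy.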
6. Weight Sharing and Wide-MLP Substitution
A direct cFFN strategy is to replace the per-layer FFN matrix in the transformer encoder with a single, shared wide MLP, while entirely removing FFN computations in the decoder. This approach is detailed in “One Wide Feedforward is All You Need” (Pires et al., 2023):
- Standard (per-layer): $L$ separate FFNs, each with its own two-layer MLP of width $d_{\mathrm{ff}}$
- cFFN: one wide FFN (of width $d_{\mathrm{wide}} \ge d_{\mathrm{ff}}$) shared across all layers. For the encoder, every layer applies the same shared transformation $\mathrm{FFN}_{\mathrm{shared}}$ (optionally with no FFN at all in the decoder).
For Transformer-Big, sharing a single standard-width FFN achieves a substantial reduction in FFN parameters, while widening the one shared FFN to match the original parameter count improves accuracy (+0.9 BLEU on WMT22 En→De) and latency (+24%). This demonstrates that internal layerwise diversity in FFN can be repurposed as width in a shared module with minimal impact on overall model capacity (Pires et al., 2023).
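The parameter-matching arithmetic can be verified in a short NumPy sketch. This is a minimal illustration under the assumption that the shared width is chosen as $L \cdot d_{\mathrm{ff}}$ to exactly match the per-layer baseline; all dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
d, d_ff, L = 128, 512, 6  # model dim, per-layer FFN width, encoder layer count

# Per-layer baseline: L separate two-layer FFNs.
per_layer_params = L * 2 * d * d_ff

# Shared alternative: one FFN of width L * d_ff used by every layer,
# matching the baseline parameter count with a single wider module.
wide = L * d_ff
W1 = rng.standard_normal((wide, d)) * 0.02
W2 = rng.standard_normal((d, wide)) * 0.02
shared_params = 2 * d * wide
assert shared_params == per_layer_params  # same total FFN parameters

def shared_ffn(x):
    return W2 @ np.maximum(W1 @ x, 0)

# Every encoder layer calls the same module (residual connection around it):
x = rng.standard_normal(d)
for _ in range(L):
    x = x + shared_ffn(x)
print(x.shape)
```

The design choice here is that the memory footprint stays flat while each layer now touches the full $L \cdot d_{\mathrm{ff}}$ hidden width, trading per-layer FFN diversity for shared width.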
7. Empirical Results, Trade-offs, and Design Guidelines
Empirical evaluations across modalities consistently highlight that cFFN approaches yield substantial savings in compute and memory with marginal (and sometimes positive) effects on task performance:
| Model | Params Reduction | FLOPs Reduction | Performance Impact | Reference |
|---|---|---|---|---|
| DeiT-T | –18.2% | –19.0% | +0.7% acc | (Xu et al., 2023) |
| EfficientBERT | –85% | ≈–77% | +0.7 GLUE avg | (Dong et al., 2021) |
| EfficientASR | –36% | ~–30% mem | –0.2/–0.3% CER | (Wang et al., 2024) |
| Transformer-Big | ~–40% | +20–25% speed | ~–0.3 BLEU (max. param savings) | (Pires et al., 2023) |
| ModernALBERT | ≈0.33× | ≈0.33× | +1–2 pts GLUE/BEIR/SQuAD2 | (Nouriborji et al., 14 Dec 2025) |
Principal guidelines for deploying cFFN solutions include:
- Tuning the rank parameter $r$ (or equivalent), with moderate ranks typically optimal in ViT cFFN (Xu et al., 2023).
- Using two re-param branches in multi-branch cFFN for stability/performance balance (Xu et al., 2023).
- In NAS-derived cFFNs, joint coarse-to-fine search over stack depth, expansion ratio, and primitive set yields better compactness and accuracy than single-stage search (Dong et al., 2021).
- For shared/wide cFFN, wider is only beneficial up to a point: beyond the parameter-matched width, further increases yield no gains (Pires et al., 2023).
- In chunked cFFN, $N = 2$ (halving FFN size) provides maximal parameter savings with minimal to no loss; more aggressive chunking eventually degrades accuracy (Wang et al., 2024).
- Conditional MoL-based cFFN is especially effective where strict parameter sharing would otherwise degrade performance, with expert-merging crucial for deployment efficiency (Nouriborji et al., 14 Dec 2025).
8. Concluding Perspectives
Compact-FFN modules, built from varied principles—low-rank factorization, parameter sharing, chunking, NAS-guided structure, and conditional expert routing—constitute a central direction in transformer compression and deployment. Across vision, language, and speech, cFFNs demonstrate that substantial redundancies in transformer FFNs can be excised, restructured, or refactored without sacrificing accuracy, supporting both efficient pre-training and real-time inference on resource-constrained devices (Xu et al., 2023, Nouriborji et al., 14 Dec 2025, Dong et al., 2021, Wang et al., 2024, Pires et al., 2023). The field continues to explore how cFFN principles interact with other architectural elements, with trade-offs shaped by application-specific constraints on accuracy, latency, and memory.