
Compact-FFN: Efficient Transformer Design

Updated 21 March 2026
  • Compact-FFN (cFFN) is an approach that refines transformer feed-forward networks by reducing redundancy and computational overhead with techniques like low-rank factorization and parameter sharing.
  • It employs methods such as multi-branch re-parameterization, NAS-guided structure search, and chunk-wise partitioning to balance efficiency with minimal accuracy loss.
  • Empirical results across vision, language, and speech tasks show that cFFN designs can significantly cut parameters and FLOPs while maintaining competitive performance.

A compact-FFN (cFFN) is a family of architectural and algorithmic strategies for the feed-forward network component in transformers, aiming to reduce computational complexity, parameter count, and memory footprint with minimal loss in predictive accuracy. The core motivation is the empirical observation that transformer FFNs—despite occupying a large fraction of total parameters and FLOPs—contain substantial redundancy, particularly in large models or under extreme parameter sharing. Multiple designs for cFFN have emerged, including low-rank factorization, parameter sharing, chunk-wise partitioning, structured operation search, and integration with conditional experts, each with distinct performance and implementation trade-offs across vision, language, and speech domains.

1. Motivation: FFN Redundancy and Computational Bottlenecks

The transformer FFN typically consists of two position-wise linear maps with a non-linearity, accounting for the majority of parameters and floating point operations (FLOPs) in standard transformer layers. Analysis of Vision Transformers (ViT), BERT-derived models, and modern encoder-decoder architectures reveals highly redundant FFN representations across layers or within the weight matrices themselves (Xu et al., 2023, Pires et al., 2023). In ViT and BERT, FFN FLOPs can be 2–3 times larger than those of multi-head self-attention, driving the need for compact alternatives (Dong et al., 2021). Empirical studies indicate that reducing or re-structuring the FFN often results in only modest or negligible drops in accuracy, especially when appropriate compensatory mechanisms or search strategies are applied (Pires et al., 2023, Dong et al., 2021).

2. Low-Rank and Factorization Approaches

A primary line of cFFN design employs low-rank factorization of the larger projection matrix in the FFN. In ViT-based cFFN (Xu et al., 2023), the hidden-to-output matrix $W_2 \in \mathbb{R}^{d_h \times d}$ is replaced by a product $W_2 \approx B A$, where $B \in \mathbb{R}^{d_h \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll \min(d, d_h)$. The forward computation becomes:

  • $z = X W_1 + b_1$
  • $h = \sigma(z)$
  • $Y = (h B) A + b_2$

This factorization reduces the parameter count for $W_2$ from $d_h d$ to $r(d + d_h)$. To avoid the representational collapse inherent in pure low-rank factorization, a multi-branch re-parameterization is used during training: each of $A$ and $B$ is replaced with a sum of $r$ parallel small linear branches (typically $1 \times 1$ convolutions plus batch normalization), all merged post-training. The result is a single low-rank projection and bias at inference, maintaining capacity during optimization (Xu et al., 2023).

Letting $d_h = m d$ and $r = t \cdot d_h/(m+1)$, with $t \in (0,1)$, the reduction ratio in FLOPs is $(1+t)/2$ and in parameters is $t$. This allows precise control over the accuracy-efficiency trade-off.
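As a rough illustration (not the authors' code), the factored forward pass and its parameter arithmetic can be sketched in numpy; the sizes below ($d = 192$, $m = 4$, $t = 2/3$) are assumptions chosen for the example, not configurations from the paper:

```python
import numpy as np

# Illustrative sizes (assumed, DeiT-T-like widths): d=192, d_h = m*d.
d, m = 192, 4
d_h = m * d                   # hidden width
t = 2.0 / 3.0                 # target parameter reduction ratio for W2
r = int(t * d_h / (m + 1))    # low rank r = t * d_h / (m + 1)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, d))                    # 10 tokens
W1 = rng.standard_normal((d, d_h)) * 0.02
b1 = np.zeros(d_h)
B = rng.standard_normal((d_h, r)) * 0.02            # W2 ~ B @ A
A = rng.standard_normal((r, d)) * 0.02
b2 = np.zeros(d)

def cffn(X):
    h = np.maximum(X @ W1 + b1, 0.0)                # z = X W1 + b1, h = sigma(z)
    return (h @ B) @ A + b2                         # low-rank replacement for h W2

Y = cffn(X)
params_full = d_h * d                               # dense W2
params_lowrank = r * (d + d_h)                      # factored B and A
print(Y.shape, params_lowrank / params_full)        # ratio close to t
```

Multiplying $h$ by $B$ first keeps every intermediate at most rank $r$, which is where both the FLOPs and parameter savings come from.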

3. Parameter Sharing, Expert Routing, and Mixture-of-LoRAs

Strict parameter sharing in recursive or ALBERT-like transformers collapses expressivity, especially when both attention and FFN parameters are shared across groups of layers. The cFFN in ModernALBERT (Nouriborji et al., 14 Dec 2025) addresses this by embedding a Mixture-of-LoRAs (MoL) within the shared FFN. Here, each token's FFN transformation is modulated by a convex combination of a shared base weight $W$ and several low-rank (rank $\ll d$) expert deltas $B_i A_i$. Expert selection and weighting are performed by a router (a two-layer MLP with top-2 sparse softmax) that computes $\alpha(x) \in \Delta^E$. The effective FFN for token $x$ is:

  • $\text{cFFN}(x) = \sum_{i=1}^{E} \alpha_i(x)\,(W + B_i A_i)\,x + b$

At inference, all expert adapters can be merged (by uniform or EMA-weighted averaging) into a single static adapter, eliminating conditional computation overhead while preserving accuracy (Nouriborji et al., 14 Dec 2025).
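A minimal numpy sketch of this routing-then-merging pattern, under simplifying assumptions: the router here is a single linear map rather than the paper's two-layer MLP, sizes are hypothetical, and a row-vector convention ($x(W + B_iA_i)$) is used:

```python
import numpy as np

# Hypothetical sizes: model dim d, rank r << d, E experts.
d, r, E = 64, 8, 4
rng = np.random.default_rng(1)
W = rng.standard_normal((d, d)) * 0.05            # shared base weight
experts = [(rng.standard_normal((d, r)) * 0.05,   # B_i
            rng.standard_normal((r, d)) * 0.05)   # A_i
           for _ in range(E)]
Wr = rng.standard_normal((d, E)) * 0.05           # simplified linear router

def top2_softmax(logits):
    # Sparse softmax: mass only on the two highest-scoring experts.
    top2 = np.argsort(logits)[-2:]
    alpha = np.zeros_like(logits)
    ex = np.exp(logits[top2] - logits[top2].max())
    alpha[top2] = ex / ex.sum()
    return alpha

def mol_ffn(x):
    # Training-time path: token-conditional convex combination of experts.
    alpha = top2_softmax(x @ Wr)
    delta = sum(a * (B @ A) for a, (B, A) in zip(alpha, experts))
    return x @ (W + delta)

def merged_ffn(x):
    # Inference-time path: uniform merge of expert deltas into one adapter.
    delta = sum(B @ A for B, A in experts) / E
    return x @ (W + delta)

x = rng.standard_normal(d)
y_routed, y_merged = mol_ffn(x), merged_ffn(x)
```

The merged variant trades token-conditional routing for a single static weight, which is what removes the conditional-computation overhead at deployment.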

4. Structure Search and Nonlinear Primitive Composition

EfficientBERT (Dong et al., 2021) generalizes cFFN design by employing a progressive neural architecture search (NAS) over a directed acyclic graph (DAG) of MLP primitives. The search space spans:

  • Expansion ratios $r \in \{1, 1/2, 1/3, 1/4\}$
  • Stack depth $S$ of MLP layers
  • Choice and arrangement of unary/binary nonlinear primitives (e.g., GeLU, Swish, Add, Mul, Max)

Each sampled cFFN cell is evaluated in situ, with parameters "sliced" from a pre-trained supernet and distilled from a BERT-base teacher at each stage. The search yields highly nonlinear, compact FFN cells tailored to downstream tasks and resource budgets. Final EfficientBERT models, obtained by stacking layer-wise optimized cFFNs, have $6.9\times$ fewer parameters and $4.4\times$ faster inference than BERT$_\text{BASE}$ (Dong et al., 2021).
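A drastically simplified sketch of what such a search space looks like: plain enumeration over expansion ratio, stack depth, and activation primitive, filtered by a parameter budget. This omits the paper's progressive search, supernet slicing, and distillation entirely; sizes and the budget are assumptions for illustration:

```python
import itertools

# Toy cFFN search space (a simplification; no supernet or distillation here).
d = 312                                  # assumed hidden size
ratios = [1.0, 1/2, 1/3, 1/4]            # expansion ratios
depths = [1, 2, 3]                       # stack depth S
acts = ["gelu", "swish", "relu"]         # unary primitive choices

def cell_params(ratio, depth):
    # depth stacked linear maps: d -> d_ff, (depth-1) x (d_ff -> d_ff), d_ff -> d
    d_ff = int(d * ratio)
    return d * d_ff + (depth - 1) * d_ff * d_ff + d_ff * d

budget = 150_000
candidates = [(r_, s_, a_)
              for r_, s_, a_ in itertools.product(ratios, depths, acts)
              if cell_params(r_, s_) <= budget]
best = min(candidates, key=lambda c: cell_params(c[0], c[1]))
```

In the actual method, candidates are ranked by distilled task accuracy rather than raw parameter count, and the search proceeds coarse-to-fine rather than exhaustively.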

5. Chunk-wise Partitioning and Spatial Factorization

Chunk-Level Feedforward Networks (CFFN), introduced in EfficientASR for ASR tasks, divide the model dimension $d_\text{model}$ into $n$ chunks and apply independent two-layer FFNs to each chunk. For $X \in \mathbb{R}^{B \times T \times d_\text{model}}$, the input is split so that each $X_i \in \mathbb{R}^{B \times T \times (d_\text{model}/n)}$ is processed by a separate FFN, producing $Z_i$. The final output is the concatenation:

  • $\text{CFFN}(X) = \text{Concat}(\text{FFN}_1(X_1), \dots, \text{FFN}_n(X_n))$

This results in a parameter and FLOPs reduction by a factor of $n$ in the leading quadratic term. Empirical results report a 36% reduction in model parameters and maintained or improved character error rate (CER) on Aishell-1 and HKUST when using $n = 2$ (Wang et al., 2024).
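The chunked forward pass and the factor-of-$n$ saving can be sketched as follows (illustrative shapes, not EfficientASR's implementation; the per-chunk expansion factor of 4 is an assumption):

```python
import numpy as np

# Assumed shapes: batch Bz, time T, model dim d_model split into n chunks.
Bz, T, d_model, n, expand = 2, 5, 256, 2, 4
dc = d_model // n                                  # per-chunk width
rng = np.random.default_rng(2)
chunk_ffns = [(rng.standard_normal((dc, expand * dc)) * 0.02,
               rng.standard_normal((expand * dc, dc)) * 0.02)
              for _ in range(n)]                   # n independent two-layer FFNs

def cffn_chunked(X):
    outs = []
    for Xi, (W1, W2) in zip(np.split(X, n, axis=-1), chunk_ffns):
        outs.append(np.maximum(Xi @ W1, 0.0) @ W2)  # FFN_i(X_i)
    return np.concatenate(outs, axis=-1)            # Concat over chunks

X = rng.standard_normal((Bz, T, d_model))
Y = cffn_chunked(X)
params_chunked = n * 2 * dc * expand * dc
params_full = 2 * d_model * expand * d_model
print(Y.shape, params_chunked / params_full)        # ratio is exactly 1/n
```

Because each chunk FFN acts on $d_\text{model}/n$ features, both weight matrices shrink quadratically per chunk, and summing over the $n$ chunks leaves a net $1/n$ factor.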

6. Weight Sharing and Wide-MLP Substitution

A direct cFFN strategy is to replace the per-layer FFN matrix in the transformer encoder with a single, shared wide MLP, while entirely removing FFN computations in the decoder. This approach is detailed in “One Wide Feedforward is All You Need” (Pires et al., 2023):

  • Standard (per-layer): $N$ separate FFNs, each its own two-layer MLP $[\text{ReLU}(x W_1 + b_1)]\, W_2 + b_2$
  • cFFN: one wide FFN (of width $d'_{ff}$) shared across all $N$ layers. For the encoder, the transformation is $\text{cFFN}(x) = W_2^{\text{shared}}\,\text{gelu}(W_1^{\text{shared}} x + b_1^{\text{shared}}) + b_2^{\text{shared}}$ (optionally with no FFN at all in the decoder).

For $N = 12$ (Transformer-Big), setting $d'_{ff} = d_{ff}$ achieves a $12\times$ reduction in FFN parameters; alternatively, setting $d'_{ff} = N d_{ff}$ preserves the parameter count while improving accuracy (+0.9 BLEU on WMT22 En→De) and latency (+24%). This demonstrates that layerwise diversity in the FFNs can be repurposed as width in a single shared module with minimal impact on overall model capacity (Pires et al., 2023).
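The parameter arithmetic behind these two settings is straightforward; the sketch below counts only encoder FFN weights and ignores biases, with Transformer-Big-like sizes assumed for illustration:

```python
# Per-layer vs one shared wide FFN: encoder FFN weight counts only (no biases).
N, d_model, d_ff = 12, 1024, 4096        # assumed Transformer-Big-like sizes

per_layer = N * 2 * d_model * d_ff       # N separate FFNs, two matrices each
shared_same = 2 * d_model * d_ff         # one shared FFN with d'_ff = d_ff
shared_wide = 2 * d_model * (N * d_ff)   # one shared FFN with d'_ff = N * d_ff

print(per_layer // shared_same)          # 12: the N-fold reduction
print(shared_wide == per_layer)          # True: widened variant matches count
```

The widened variant thus spends exactly the same budget, just reorganized from depth-wise diversity into width.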

7. Empirical Results, Trade-offs, and Design Guidelines

Empirical evaluations across modalities consistently highlight that cFFN approaches yield substantial savings in compute and memory with marginal (and sometimes positive) effects on task performance:

| Model | Params Reduction | FLOPs Reduction | Performance Impact | Reference |
|---|---|---|---|---|
| DeiT-T | −18.2% | −19.0% | +0.7% acc | (Xu et al., 2023) |
| EfficientBERT | −85% | ≈−77% | +0.7 GLUE avg | (Dong et al., 2021) |
| EfficientASR | −36% | ~−30% mem | −0.2/−0.3% CER | (Wang et al., 2024) |
| Transformer-Big | ~−40% | +20–25% speed | ~−0.3 BLEU (max param savings) | (Pires et al., 2023) |
| ModernALBERT | 0.33× | 0.33× | +1–2 pts GLUE/BEIR/SQuAD2 | (Nouriborji et al., 14 Dec 2025) |

Principal guidelines for deploying cFFN solutions include:

  • Tuning the rank parameter $r$ (or its equivalent), with $r \approx (2/3)\,d_h/(m+1)$ typically optimal in ViT cFFN (Xu et al., 2023).
  • Using two re-parameterization branches in multi-branch cFFN for a stability/performance balance (Xu et al., 2023).
  • In NAS-derived cFFNs, joint coarse-to-fine search over stack depth, expansion ratio, and primitive set yields better compactness and accuracy than single-stage search (Dong et al., 2021).
  • For shared/wide cFFN, wider is only beneficial up to $N \cdot d_{ff}$; further increases yield no gains (Pires et al., 2023).
  • In chunked cFFN, $n = 2$ (halving FFN size) provides maximal parameter savings with minimal to no loss; more aggressive chunking eventually degrades accuracy (Wang et al., 2024).
  • Conditional MoL-based cFFN is especially effective where strict parameter sharing would otherwise degrade performance, with expert merging crucial for deployment efficiency (Nouriborji et al., 14 Dec 2025).

8. Concluding Perspectives

Compact-FFN modules, built from varied principles—low-rank factorization, parameter sharing, chunking, NAS-guided structure, and conditional expert routing—constitute a central direction in transformer compression and deployment. Across vision, language, and speech, cFFNs demonstrate that substantial redundancies in transformer FFNs can be excised, restructured, or refactored without sacrificing accuracy, supporting both efficient pre-training and real-time inference on resource-constrained devices (Xu et al., 2023, Nouriborji et al., 14 Dec 2025, Dong et al., 2021, Wang et al., 2024, Pires et al., 2023). The field continues to explore how cFFN principles interact with other architectural elements, with trade-offs shaped by application-specific constraints on accuracy, latency, and memory.
