Efficient Feed-Forward Networks

Updated 2 April 2026
  • Efficient Feed-Forward Networks are a family of innovations designed to reduce redundancy in deep FF modules, thereby enhancing computational, memory, and deployment efficiency.
  • They leverage techniques such as parameter sharing, conditional activation, and explicit algebraic optimization to achieve significant speedups and model compression.
  • Empirical results demonstrate up to 220× speedup, >40% parameter reduction, and maintained or improved accuracy across language, vision, and other domains.

An Efficient Feed-Forward Network (EFFN) is a family of architectural and algorithmic innovations that dramatically improve the computational, memory, and deployment efficiency of feed-forward neural networks, with a particular emphasis on modern deep architectures such as Transformers, Mixture-of-Experts, and dense multi-layer perceptrons. EFFN designs exploit architectural redundancy, conditional computation, explicit algebraic optimization, and adaptive parameter compression to either substitute or enhance conventional dense feed-forward modules. Empirical evidence demonstrates that EFFNs can achieve substantial speedups, aggressive model compression, and even accuracy improvements relative to standard baselines, across domains including language modeling, vision, and general supervised learning.

1. Motivations and Core Principles

Feed-forward networks constitute the bulk of learnable parameters and floating-point operations in deep neural architectures; for instance, Transformer-style FFNs account for ∼2/3 of parameters and latency in state-of-the-art LLMs (Liu et al., 2024). Historically, these modules exhibit high degrees of redundancy—either across depth (multiple FFNs per layer with shared or similar functions), within width (many neurons remain idle or underutilized per input), or in parameterization (over-parameterized networks yielding compressible solutions). EFFN frameworks seek to systematically eliminate this redundancy through one or more of:

  • Parameter sharing across layers and/or tokens (shared wide FFNs)
  • Conditional activation and routing (tree-based FFF/EFFN, expert selection)
  • Heavy-hitter neuron identification and selective resource allocation
  • Model pruning with transfer learning for width/depth reduction
  • Closed-form or explicit algebraic solutions for weights in supervised training
  • Mixture-of-Experts analogues with load-balancing and master expert enhancements

These approaches are motivated by the need to minimize both the wall-clock and memory requirements for deployment on commodity or low-resource hardware, without sacrificing task accuracy or model robustness.

2. Parameter Sharing, Wide Sharing, and Redundancy Removal

The standard approach in Transformers is a stack of $L$ independently parameterized 2-layer FFNs, each mapping $d_{\text{model}} \to d_{\text{ff}} \to d_{\text{model}}$. Empirically, a large portion of these FFNs learn overlapping “key-value” memories, leading to redundancy (Pires et al., 2023). EFFN strategies address this by:

  • Removing decoder-side FFNs entirely, replacing them with identity mappings or bypasses.
  • Tying all encoder-side FFNs to a single shared module: for $N_{\text{enc}}$ encoder layers, a single FFN is used, re-applied identically at each depth.

To compensate, the shared FFN’s hidden size $D_{\text{shared}}$ is increased: $D_{\text{shared}} = d_{\text{ff}}$ (the standard per-layer width), $D_{\text{shared}} \approx N_{\text{enc}}\, d_{\text{ff}}$ (recovers baseline accuracy with fewer parameters), or $D_{\text{shared}} = L\, d_{\text{ff}}$ (equal to the total original FFN capacity). This technique yields significant parameter savings (>40%), with speedups of 20–24% in decoding, and, when sufficiently widened, can even surpass the baseline BLEU in machine translation (+0.9 BLEU for WMT22 En→De at identical parameter count) (Pires et al., 2023).
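
A minimal PyTorch-style sketch of this sharing scheme is given below; the class names (SharedFFN, SharedFFNEncoder), the post-norm layout, the ReLU activation, and the choice $D_{\text{shared}} = N_{\text{enc}}\, d_{\text{ff}}$ are illustrative assumptions rather than details taken from Pires et al. (2023):

```python
import torch
import torch.nn as nn

class SharedFFN(nn.Module):
    """A single widened FFN whose parameters are reused at every encoder depth."""
    def __init__(self, d_model: int, d_shared: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_shared)
        self.down = nn.Linear(d_shared, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))

class SharedFFNEncoder(nn.Module):
    """Encoder stack in which all layers tie their FFN to one shared module."""
    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_layers: int = 6, d_ff: int = 2048):
        super().__init__()
        # Widen the single shared FFN toward N_enc * d_ff to recover the
        # capacity of the N_enc independent FFNs it replaces.
        self.shared_ffn = SharedFFN(d_model, d_shared=n_layers * d_ff)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for attn, n1, n2 in zip(self.attn, self.norm1, self.norm2):
            a, _ = attn(x, x, x)
            x = n1(x + a)
            x = n2(x + self.shared_ffn(x))  # same FFN parameters at every depth
        return x
```

The FFN parameter count is then that of one wide module rather than $N_{\text{enc}}$ independent ones, which is where the reported savings come from when $D_{\text{shared}}$ is set below $N_{\text{enc}}\, d_{\text{ff}}$.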

These results show that the bulk of Transformer FFN parameters are redundant across depth, and that wide, shared EFFNs are both more efficient and potentially more accurate.

3. Conditional, Split, and Fast Feed-Forward Computation

Several EFFN implementations use conditional computation, evaluating only a small subnetwork per input based on a learned routing structure. The FFF (Fast Feed-Forward Network) (Belcak et al., 2023), and its enhanced variant eFFF (Charalampopoulos et al., 2024), use a binary routing tree of depth $D$ built from differentiable sigmoid gates atop $2^D$ small “leaf” FF blocks (experts). During training, the entire tree is softly traversed to form a differentiable mixture output, $y = \sum_{\ell} p_\ell(x)\, f_\ell(x)$, where $p_\ell(x)$ is the product of the binary gate probabilities along the path to leaf $\ell$ and $f_\ell(x)$ is that leaf’s small expert output. At inference, each gate is hard-thresholded and only one expert is activated: $y = f_{\ell^*(x)}(x)$, where $\ell^*(x)$ is the single leaf reached by following the thresholded gates.

In terms of computational complexity, a standard dense layer of width $d_{\text{ff}}$ costs $O(d_{\text{model}}\, d_{\text{ff}})$ per input, while FFF/EFFN variants cost only $O(d_{\text{model}}\,(D + d_{\text{leaf}}))$, with $d_{\text{leaf}}$ the leaf width. Empirically, this achieves up to 220× speedup over dense FFNs and preserves ≥94% of the original accuracy in ViT-like architectures (Belcak et al., 2023). Enhanced FFF introduces two further elements:

  • Load balancing penalty, pushing routing probabilities to use all leaves equally (reduces run-to-run performance variance from ±29% to ±1–3% on MNIST).
  • Addition of a small Master Leaf “global expert” whose output is always included in the prediction, i.e., added on top of the selected leaf’s output. This improves accuracy by up to 3% on test sets and substantially reduces output variance compared to vanilla FFF (Charalampopoulos et al., 2024). A combined sketch of hard tree routing with a master leaf follows this list.
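
A hedged sketch of hard, inference-time tree routing with an always-on master leaf is shown below; the class name, gate parameterization, initialization, and leaf structure are illustrative assumptions, and the soft training mixture and load-balancing penalty described above are omitted:

```python
import torch
import torch.nn as nn

class FastFFN(nn.Module):
    """Depth-D binary routing tree over 2**D small leaf FFNs, plus a master leaf."""
    def __init__(self, d_model: int, d_leaf: int, depth: int):
        super().__init__()
        self.depth = depth
        # One sigmoid gate per internal tree node (2**depth - 1 of them).
        self.gates = nn.Linear(d_model, 2 ** depth - 1)
        n_leaves = 2 ** depth
        self.leaf_in = nn.Parameter(torch.randn(n_leaves, d_model, d_leaf) * 0.02)
        self.leaf_out = nn.Parameter(torch.randn(n_leaves, d_leaf, d_model) * 0.02)
        # Small always-evaluated "master leaf" global expert (eFFF-style enhancement).
        self.master = nn.Sequential(
            nn.Linear(d_model, d_leaf), nn.GELU(), nn.Linear(d_leaf, d_model))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inference path only: hard-threshold each gate and walk the tree, so an
        # input touches D gates and a single leaf FFN. (Training instead mixes
        # all leaves softly via the product of gate probabilities.)
        probs = torch.sigmoid(self.gates(x))              # (batch, 2**D - 1)
        batch = torch.arange(x.size(0), device=x.device)
        node = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            go_right = (probs[batch, node] > 0.5).long()
            node = 2 * node + 1 + go_right                # descend to a child node
        leaf = node - (2 ** self.depth - 1)               # heap index -> leaf index
        h = torch.relu(torch.einsum("bd,bdk->bk", x, self.leaf_in[leaf]))
        y = torch.einsum("bk,bkd->bd", h, self.leaf_out[leaf])
        return y + self.master(x)                         # master leaf always contributes
```

For example, FastFFN(d_model=256, d_leaf=16, depth=4)(torch.randn(8, 256)) evaluates 4 gates and one 16-wide leaf per input (plus the small master leaf) instead of a full dense hidden layer.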

4. Explicit Algebraic Solutions and Layerwise Optimization

A further avenue for EFFN efficiency is the derivation of explicit, closed-form solutions for FFN weights, avoiding gradient-based iterative training. The SAFFU approach (Williams et al., 2023) derives a “column-translation” solution for single-layer softmax-FFNs, in which each weight is computed directly from co-occurrence statistics $C_{ij}$, the co-occurrence of input component $i$ and output category $j$, rather than learned by backpropagation.
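
The exact SAFFU derivation is not reproduced here; the sketch below is only a generic, hypothetical illustration of the broader idea, namely setting softmax-layer weights in closed form from co-occurrence counts instead of by gradient descent (the function name and the log-count normalization are assumptions, not the paper’s formula):

```python
import numpy as np

def closed_form_softmax_weights(X: np.ndarray, y: np.ndarray,
                                n_classes: int, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical closed-form weight construction: W[i, j] is derived from how
    strongly input component i co-occurs with output category j.
    X: (n_samples, n_features) nonnegative features; y: (n_samples,) integer labels."""
    n_features = X.shape[1]
    C = np.zeros((n_features, n_classes))
    for j in range(n_classes):
        C[:, j] = X[y == j].sum(axis=0)   # co-occurrence mass of feature i with class j
    W = np.log(C + eps)                   # log-count weights
    W -= W.mean(axis=1, keepdims=True)    # center per feature: only relative preference matters
    return W

# Usage: logits = X @ closed_form_softmax_weights(X, y, n_classes); preds = logits.argmax(axis=1)
```

Weights built in closed form this way can then serve as a warm start for backprop fine-tuning, as discussed below.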

For multi-layer networks, weights are sequentially solved given explicit targets at each layer, possibly in concert with neural attention (e.g., Self-Attentive Feed-Forward Units). This approach enables rapid prototype training, allowing ablation of hundreds of architectural variants on modest datasets in orders-of-magnitude less time than backprop alone. Empirically, explicit initialization followed by backprop-based fine-tuning yields lower perplexity (test PPL ≈ 23.8 vs. 64) and enables parameter-efficient, well-generalized models for low-resource or embedded deployment (Williams et al., 2023).

5. Compression, Heavy-Hitter Subnetworks, and Pruning

Highly efficient EFFN variants split the FFN into two subnetworks based on neuron “activity” or importance. The FFSplit strategy (Liu et al., 2024) identifies a small heavy-hitter set of neurons that account for most of the norm of the FFN’s output activations across inputs, and executes the following process (sketched in code after this list):

  • For each FFN neuron, compute average per-neuron output norm over a calibration set.
  • Select the top-$k$ neurons (heavy hitters) as one subset; the remaining neurons form the other.
  • Partition the FFN weights accordingly, defining a heavy-hitter sub-FFN and a lightweight remainder sub-FFN.
  • Assign high capacity to the heavy-hitter sub-FFN (full precision); apply strong compression (low-rank SVD, quantization) to the remainder.
  • Fine-tune to recover accuracy.
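
A simplified sketch of the split step follows; it assumes a ReLU FFN, uses truncated SVD as a stand-in for the paper’s compression of the non-heavy part, and all function and variable names are illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def split_ffn_by_heavy_hitters(ffn_up: nn.Linear, ffn_down: nn.Linear,
                               calib: torch.Tensor, k: int, rank: int):
    """Split a 2-layer FFN into a full-precision heavy-hitter part and a
    low-rank-compressed remainder, based on average per-neuron output norm."""
    # 1. Score each hidden neuron on a calibration batch.
    h = torch.relu(ffn_up(calib))                       # (n_calib, d_ff)
    scores = h.norm(dim=0)                              # per-neuron activation norm
    heavy = scores.topk(k).indices                      # heavy-hitter neurons
    rest = torch.tensor([i for i in range(h.size(1))
                         if i not in set(heavy.tolist())], dtype=torch.long)

    # 2. Heavy part keeps its rows/columns at full precision.
    heavy_part = (ffn_up.weight[heavy], ffn_up.bias[heavy], ffn_down.weight[:, heavy])

    # 3. Remainder is aggressively compressed; here, truncated SVD of its
    #    down-projection (quantization could be applied instead or as well).
    W2_rest = ffn_down.weight[:, rest]
    U, S, Vh = torch.linalg.svd(W2_rest, full_matrices=False)
    rest_part = (ffn_up.weight[rest], ffn_up.bias[rest],
                 (U[:, :rank] * S[:rank], Vh[:rank]))   # stored as two thin factors

    return heavy_part, rest_part                        # fine-tune afterwards to recover accuracy
```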

In BERT-Base and BERT-Large models, FFSplit achieves a 43% parameter reduction with accuracy drops of ≤1% and 1.25–1.56× speedup on LLM inference (Liu et al., 2024). When integrated with extreme quantization (e.g., 3-bit weight quantization), FFSplit recovers much of the perplexity loss, showing robust generalization under aggressive compression.

Another orthogonal approach prunes entire neurons based on low output variance, transferring pruned neuron means into biases and retraining. This achieves 70–99% parameter reduction with accuracy preservation or improvement across datasets, outperforming classical layer-wise pruning pipelines (Balderas et al., 2023).
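
A hedged sketch of this mean-into-bias transfer is shown below, assuming a ReLU hidden layer and an illustrative variance threshold; the retraining step described above is omitted:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_low_variance_neurons(layer: nn.Linear, next_layer: nn.Linear,
                               calib: torch.Tensor, var_threshold: float = 1e-3):
    """Drop hidden neurons whose post-activation output barely varies over a
    calibration set, folding their mean contribution into the next layer's bias."""
    h = torch.relu(layer(calib))                        # (n_calib, hidden)
    var, mean = h.var(dim=0), h.mean(dim=0)
    keep, drop = var > var_threshold, var <= var_threshold

    # A near-constant neuron contributes roughly next_layer.weight[:, j] * mean[j]
    # for every input, so move that constant into the next layer's bias.
    next_layer.bias += next_layer.weight[:, drop] @ mean[drop]

    # Rebuild both layers without the dropped neurons (retrain afterwards).
    n_keep = int(keep.sum())
    new_layer = nn.Linear(layer.in_features, n_keep)
    new_layer.weight.copy_(layer.weight[keep])
    new_layer.bias.copy_(layer.bias[keep])
    new_next = nn.Linear(n_keep, next_layer.out_features)
    new_next.weight.copy_(next_layer.weight[:, keep])
    new_next.bias.copy_(next_layer.bias)
    return new_layer, new_next
```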

6. Applications, Key Results, and Limitations

EFFNs have been demonstrated in multiple domains:

  • Efficient approximations of dynamic recurrent nets for attractor computations, enabling single-forward-pass inference (constant time, rather than iterating the recurrent dynamics to convergence) while maintaining noise rejection and winner-take-all (WTA) behaviors (Muir, 2017).
  • Transformer LLMs with wide-shared or split FFNs, supporting translation, BERT, and OPT-class models (Pires et al., 2023, Liu et al., 2024).
  • Vision Transformers where fast (conditional) FFNs replace dense layers, attaining large speedups with minimal accuracy loss (Belcak et al., 2023).

Empirical performance across techniques is summarized below:

Paper | Main Approach | Compression/Speedup | Accuracy/Task Impact
(Pires et al., 2023) | Shared + wide FFN | 20–40% param. reduction, +24% decoding speed | +0.9 BLEU on WMT22, at or above baseline
(Liu et al., 2024) | Split heavy-hitter FFN | 43.1% param. reduction, 1.25–1.56× speedup | ≤1% drop, 5× better PPL under quantization
(Belcak et al., 2023) | Tree-based FFF | up to 220× vs. dense FFN, 6× vs. MoE | 94.2% of ViT accuracy retained
(Williams et al., 2023) | Explicit algebraic init | ~100× training speedup, 3–10× less data | test PPL ≈ 24 vs. 64
(Balderas et al., 2023) | Pruning + retraining | 70–99% param. reduction, 2–40× smaller models | accuracy preserved or improved
(Charalampopoulos et al., 2024) | Load balancing + master leaf | sublinear inference cost | +3% accuracy, ~10× lower variance (MNIST/FMNIST)

Limitations and caveats include:

  • Extreme compression can degrade performance in out-of-distribution settings.
  • Conditional computation architectures may introduce overhead if not efficiently implemented in target hardware.
  • For autoregressive or decoder-only models, wide sharing or highly conditional FFNs create latency bottlenecks.
  • Explicit algebraic solutions assume restricted activation and loss forms, possibly limited for large vocabulary or regression tasks.
  • Most EFFNs focus solely on FFN modules, leaving attention and embedding submodules as potential further bottlenecks.

7. Outlook and Integration into Modern Deep Learning

EFFN methodologies provide a comprehensive toolkit for scaling, compressing, and accelerating deep neural networks without compromising learning capacity. Key practical guidelines include:

  • For Transformer encoder-decoders: share FFN layers, widen as needed, and drop decoder-side FFNs for speed.
  • For resource-constrained deployment: partition FFNs into heavy-hitter and lightly used components, compress the latter accordingly, and fine-tune for recovery.
  • For hardware-critical inference: tree-based EFFNs achieve maximal reduction in runtime at minimal expressivity loss.
  • For training efficiency and architecture search: closed-form layerwise EFFNs provide rapid ablation and warm-start points for fine-tuning.
  • Add load balancing and global expert terms to stabilize conditional EFFN training.

EFFNs thus serve not as a single architecture, but as a paradigm for advancing deep learning efficiency, pushing the limits of compression, speed, and task generalization—both in high-resource and embedded domains (Muir, 2017, Pires et al., 2023, Williams et al., 2023, Belcak et al., 2023, Liu et al., 2024, Balderas et al., 2023, Charalampopoulos et al., 2024).
