Efficient Feed-Forward Networks

Updated 2 April 2026
  • Efficient Feed-Forward Networks are a family of innovations designed to reduce redundancy in deep FF modules, thereby enhancing computational, memory, and deployment efficiency.
  • They leverage techniques such as parameter sharing, conditional activation, and explicit algebraic optimization to achieve significant speedups and model compression.
  • Empirical results demonstrate up to 220× speedup, >40% parameter reduction, and maintained or improved accuracy across language, vision, and other domains.

An Efficient Feed-Forward Network (EFFN) is a family of architectural and algorithmic innovations that dramatically improve the computational, memory, and deployment efficiency of feed-forward neural networks, with a particular emphasis on modern deep architectures such as Transformers, Mixture-of-Experts, and dense multi-layer perceptrons. EFFN designs exploit architectural redundancy, conditional computation, explicit algebraic optimization, and adaptive parameter compression to either substitute or enhance conventional dense feed-forward modules. Empirical evidence demonstrates that EFFNs can achieve substantial speedups, aggressive model compression, and even accuracy improvements relative to standard baselines, across domains including language modeling, vision, and general supervised learning.

1. Motivations and Core Principles

Feed-forward networks constitute the bulk of learnable parameters and floating-point operations in deep neural architectures; for instance, Transformer-style FFNs account for ∼2/3 of parameters and latency in state-of-the-art LLMs (Liu et al., 2024). Historically, these modules exhibit high degrees of redundancy—either across depth (multiple FFNs per layer with shared or similar functions), within width (many neurons remain idle or underutilized per input), or in parameterization (over-parameterized networks yielding compressible solutions). EFFN frameworks seek to systematically eliminate this redundancy through one or more of:

  • Parameter sharing across layers and/or tokens (shared wide FFNs)
  • Conditional activation and routing (tree-based FFF/EFFN, expert selection)
  • Heavy-hitter neuron identification and selective resource allocation
  • Model pruning with transfer learning for width/depth reduction
  • Closed-form or explicit algebraic solutions for weights in supervised training
  • Mixture-of-Experts analogues with load-balancing and master expert enhancements

These approaches are motivated by the need to minimize both the wall-clock and memory requirements for deployment on commodity or low-resource hardware, without sacrificing task accuracy or model robustness.

2. Parameter Sharing, Wide Sharing, and Redundancy Removal

The standard approach in Transformers is a stack of $L$ independently parameterized 2-layer FFNs, each mapping $d_{\text{model}} \to d_{\text{ff}} \to d_{\text{model}}$. Empirically, a large portion of these FFNs learn overlapping “key-value” memories, leading to redundancy (Pires et al., 2023). EFFN strategies address this by:

  • Removing decoder-side FFNs entirely, replacing them with identity mappings or bypasses.
  • Tying all encoder-side FFNs to a single shared module: for $N_{\text{enc}}$ encoder layers, a single FFN is used, re-applied identically at each depth.

To compensate, the shared FFN’s hidden size $D_{\text{shared}}$ is increased: $D_{\text{shared}} = d_{\text{ff}}$ (the standard per-layer width), $D_{\text{shared}} \approx N_{\text{enc}}\, d_{\text{ff}}$ (recovers baseline accuracy with fewer parameters), or $D_{\text{shared}} = L\, d_{\text{ff}}$ (equal to the total original FFN capacity). This technique yields significant parameter savings (>40%), with speedups of 20–24% in decoding, and, when sufficiently widened, can even surpass the baseline BLEU in machine translation (+0.9 BLEU for WMT22 En→De at identical parameter count) (Pires et al., 2023).
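
A minimal PyTorch-style sketch of this sharing scheme is given below; the class names (SharedFFN, SharedFFNEncoder), the post-norm layout, the ReLU activation, and the choice $D_{\text{shared}} = N_{\text{enc}}\, d_{\text{ff}}$ are illustrative assumptions rather than details taken from Pires et al. (2023):

```python
import torch
import torch.nn as nn

class SharedFFN(nn.Module):
    """A single widened FFN whose parameters are reused at every encoder depth."""
    def __init__(self, d_model: int, d_shared: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_shared)
        self.down = nn.Linear(d_shared, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)))

class SharedFFNEncoder(nn.Module):
    """Encoder stack in which all layers tie their FFN to one shared module."""
    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_layers: int = 6, d_ff: int = 2048):
        super().__init__()
        # Widen the single shared FFN toward N_enc * d_ff to recover the
        # capacity of the N_enc independent FFNs it replaces.
        self.shared_ffn = SharedFFN(d_model, d_shared=n_layers * d_ff)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for attn, n1, n2 in zip(self.attn, self.norm1, self.norm2):
            a, _ = attn(x, x, x)
            x = n1(x + a)
            x = n2(x + self.shared_ffn(x))  # same FFN parameters at every depth
        return x
```

The FFN parameter count is then that of one wide module rather than $N_{\text{enc}}$ independent ones, which is where the reported savings come from when $D_{\text{shared}}$ is set below $N_{\text{enc}}\, d_{\text{ff}}$.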

These results show that the bulk of Transformer FFN parameters are redundant across depth, and that wide, shared EFFNs are both more efficient and potentially more accurate.

3. Conditional, Split, and Fast Feed-Forward Computation

Several EFFN implementations use conditional computation, evaluating only a small subnetwork per input based on a learned routing structure. The FFF (Fast Feed-Forward Network) (Belcak et al., 2023), and its enhanced variant eFFF (Charalampopoulos et al., 2024), use a binary routing tree of depth $D$ built from differentiable sigmoid gates atop $2^D$ small “leaf” FF blocks (experts). During training, the entire tree is softly traversed to form a differentiable mixture output, $y = \sum_{\ell} p_\ell(x)\, f_\ell(x)$, where $p_\ell(x)$ is the product of the binary gate probabilities along the path to leaf $\ell$ and $f_\ell(x)$ is that leaf’s small expert output. At inference, each gate is hard-thresholded and only one expert is activated: $y = f_{\ell^*(x)}(x)$, where $\ell^*(x)$ is the single leaf reached by following the thresholded gates.

In terms of computational complexity, a standard dense layer of width $d_{\text{ff}}$ costs $O(d_{\text{model}}\, d_{\text{ff}})$ per input, while FFF/EFFN variants cost only $O(d_{\text{model}}\,(D + d_{\text{leaf}}))$, with $d_{\text{leaf}}$ the leaf width. Empirically, this achieves up to 220× speedup over dense FFNs and preserves ≥94% of the original accuracy in ViT-like architectures (Belcak et al., 2023). Enhanced FFF introduces two further elements:

  • Load balancing penalty, pushing routing probabilities to use all leaves equally (reduces run-to-run performance variance from ±29% to ±1–3% on MNIST).
  • Addition of a small Master Leaf “global expert” whose output is always included in the prediction, i.e., added on top of the selected leaf’s output. This improves accuracy by up to 3% on test sets and substantially reduces output variance compared to vanilla FFF (Charalampopoulos et al., 2024). A combined sketch of hard tree routing with a master leaf follows this list.
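
A hedged sketch of hard, inference-time tree routing with an always-on master leaf is shown below; the class name, gate parameterization, initialization, and leaf structure are illustrative assumptions, and the soft training mixture and load-balancing penalty described above are omitted:

```python
import torch
import torch.nn as nn

class FastFFN(nn.Module):
    """Depth-D binary routing tree over 2**D small leaf FFNs, plus a master leaf."""
    def __init__(self, d_model: int, d_leaf: int, depth: int):
        super().__init__()
        self.depth = depth
        # One sigmoid gate per internal tree node (2**depth - 1 of them).
        self.gates = nn.Linear(d_model, 2 ** depth - 1)
        n_leaves = 2 ** depth
        self.leaf_in = nn.Parameter(torch.randn(n_leaves, d_model, d_leaf) * 0.02)
        self.leaf_out = nn.Parameter(torch.randn(n_leaves, d_leaf, d_model) * 0.02)
        # Small always-evaluated "master leaf" global expert (eFFF-style enhancement).
        self.master = nn.Sequential(
            nn.Linear(d_model, d_leaf), nn.GELU(), nn.Linear(d_leaf, d_model))

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inference path only: hard-threshold each gate and walk the tree, so an
        # input touches D gates and a single leaf FFN. (Training instead mixes
        # all leaves softly via the product of gate probabilities.)
        probs = torch.sigmoid(self.gates(x))              # (batch, 2**D - 1)
        batch = torch.arange(x.size(0), device=x.device)
        node = torch.zeros(x.size(0), dtype=torch.long, device=x.device)
        for _ in range(self.depth):
            go_right = (probs[batch, node] > 0.5).long()
            node = 2 * node + 1 + go_right                # descend to a child node
        leaf = node - (2 ** self.depth - 1)               # heap index -> leaf index
        h = torch.relu(torch.einsum("bd,bdk->bk", x, self.leaf_in[leaf]))
        y = torch.einsum("bk,bkd->bd", h, self.leaf_out[leaf])
        return y + self.master(x)                         # master leaf always contributes
```

For example, FastFFN(d_model=256, d_leaf=16, depth=4)(torch.randn(8, 256)) evaluates 4 gates and one 16-wide leaf per input (plus the small master leaf) instead of a full dense hidden layer.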

4. Explicit Algebraic Solutions and Layerwise Optimization

A further avenue for EFFN efficiency is the derivation of explicit, closed-form solutions for FFN weights, avoiding gradient-based iterative training. The SAFFU approach (Williams et al., 2023) derives a “column-translation” solution for single-layer softmax-FFNs, in which each weight is computed directly from co-occurrence statistics $C_{ij}$, the co-occurrence of input component $i$ and output category $j$, rather than learned by backpropagation.
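
The exact SAFFU derivation is not reproduced here; the sketch below is only a generic, hypothetical illustration of the broader idea, namely setting softmax-layer weights in closed form from co-occurrence counts instead of by gradient descent (the function name and the log-count normalization are assumptions, not the paper’s formula):

```python
import numpy as np

def closed_form_softmax_weights(X: np.ndarray, y: np.ndarray,
                                n_classes: int, eps: float = 1e-6) -> np.ndarray:
    """Hypothetical closed-form weight construction: W[i, j] is derived from how
    strongly input component i co-occurs with output category j.
    X: (n_samples, n_features) nonnegative features; y: (n_samples,) integer labels."""
    n_features = X.shape[1]
    C = np.zeros((n_features, n_classes))
    for j in range(n_classes):
        C[:, j] = X[y == j].sum(axis=0)   # co-occurrence mass of feature i with class j
    W = np.log(C + eps)                   # log-count weights
    W -= W.mean(axis=1, keepdims=True)    # center per feature: only relative preference matters
    return W

# Usage: logits = X @ closed_form_softmax_weights(X, y, n_classes); preds = logits.argmax(axis=1)
```

Weights built in closed form this way can then serve as a warm start for backprop fine-tuning, as discussed below.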

For multi-layer networks, weights are sequentially solved given explicit targets at each layer, possibly in concert with neural attention (e.g., Self-Attentive Feed-Forward Units). This approach enables rapid prototype training, allowing ablation of hundreds of architectural variants on modest datasets in orders-of-magnitude less time than backprop alone. Empirically, explicit initialization followed by backprop-based fine-tuning yields lower perplexity (test PPL ≈ 23.8 vs. 64) and enables parameter-efficient, well-generalized models for low-resource or embedded deployment (Williams et al., 2023).

5. Compression, Heavy-Hitter Subnetworks, and Pruning

Highly efficient EFFN variants split the FFN into two subnetworks based on neuron “activity” or importance. The FFSplit strategy (Liu et al., 2024) identifies a small heavy-hitter set of neurons that account for most of the norm of the FFN’s output activations across inputs, and executes the following process (sketched in code after this list):

  • For each FFN neuron, compute average per-neuron output norm over a calibration set.
  • Select the top-$k$ neurons (heavy hitters) as one subset; the remaining neurons form the other.
  • Partition the FFN weights accordingly, defining a heavy-hitter sub-FFN and a lightweight remainder sub-FFN.
  • Assign high capacity to the heavy-hitter sub-FFN (full precision); apply strong compression (low-rank SVD, quantization) to the remainder.
  • Fine-tune to recover accuracy.
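
A simplified sketch of the split step follows; it assumes a ReLU FFN, uses truncated SVD as a stand-in for the paper’s compression of the non-heavy part, and all function and variable names are illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def split_ffn_by_heavy_hitters(ffn_up: nn.Linear, ffn_down: nn.Linear,
                               calib: torch.Tensor, k: int, rank: int):
    """Split a 2-layer FFN into a full-precision heavy-hitter part and a
    low-rank-compressed remainder, based on average per-neuron output norm."""
    # 1. Score each hidden neuron on a calibration batch.
    h = torch.relu(ffn_up(calib))                       # (n_calib, d_ff)
    scores = h.norm(dim=0)                              # per-neuron activation norm
    heavy = scores.topk(k).indices                      # heavy-hitter neurons
    rest = torch.tensor([i for i in range(h.size(1))
                         if i not in set(heavy.tolist())], dtype=torch.long)

    # 2. Heavy part keeps its rows/columns at full precision.
    heavy_part = (ffn_up.weight[heavy], ffn_up.bias[heavy], ffn_down.weight[:, heavy])

    # 3. Remainder is aggressively compressed; here, truncated SVD of its
    #    down-projection (quantization could be applied instead or as well).
    W2_rest = ffn_down.weight[:, rest]
    U, S, Vh = torch.linalg.svd(W2_rest, full_matrices=False)
    rest_part = (ffn_up.weight[rest], ffn_up.bias[rest],
                 (U[:, :rank] * S[:rank], Vh[:rank]))   # stored as two thin factors

    return heavy_part, rest_part                        # fine-tune afterwards to recover accuracy
```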

In BERT-Base and BERT-Large models, FFSplit achieves a 43% parameter reduction with accuracy drops of ≤1% and 1.25–1.56× speedup on LLM inference (Liu et al., 2024). When integrated with extreme quantization (e.g., 3-bit weight quantization), FFSplit recovers much of the perplexity loss, showing robust generalization under aggressive compression.

Another orthogonal approach prunes entire neurons based on low output variance, transferring pruned neuron means into biases and retraining. This achieves 70–99% parameter reduction with accuracy preservation or improvement across datasets, outperforming classical layer-wise pruning pipelines (Balderas et al., 2023).
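
A hedged sketch of this mean-into-bias transfer is shown below, assuming a ReLU hidden layer and an illustrative variance threshold; the retraining step described above is omitted:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_low_variance_neurons(layer: nn.Linear, next_layer: nn.Linear,
                               calib: torch.Tensor, var_threshold: float = 1e-3):
    """Drop hidden neurons whose post-activation output barely varies over a
    calibration set, folding their mean contribution into the next layer's bias."""
    h = torch.relu(layer(calib))                        # (n_calib, hidden)
    var, mean = h.var(dim=0), h.mean(dim=0)
    keep, drop = var > var_threshold, var <= var_threshold

    # A near-constant neuron contributes roughly next_layer.weight[:, j] * mean[j]
    # for every input, so move that constant into the next layer's bias.
    next_layer.bias += next_layer.weight[:, drop] @ mean[drop]

    # Rebuild both layers without the dropped neurons (retrain afterwards).
    n_keep = int(keep.sum())
    new_layer = nn.Linear(layer.in_features, n_keep)
    new_layer.weight.copy_(layer.weight[keep])
    new_layer.bias.copy_(layer.bias[keep])
    new_next = nn.Linear(n_keep, next_layer.out_features)
    new_next.weight.copy_(next_layer.weight[:, keep])
    new_next.bias.copy_(next_layer.bias)
    return new_layer, new_next
```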

6. Applications, Key Results, and Limitations

EFFNs have been demonstrated in multiple domains:

  • Efficient approximations of dynamic recurrent nets for attractor computations, enabling single-forward-pass inference (constant time, rather than iterating the recurrent dynamics to convergence) while maintaining noise rejection and winner-take-all (WTA) behaviors (Muir, 2017).
  • Transformer LLMs with wide-shared or split FFNs, supporting translation, BERT, and OPT-class models (Pires et al., 2023, Liu et al., 2024).
  • Vision Transformers where fast (conditional) FFNs replace dense layers, attaining large speedups with minimal accuracy loss (Belcak et al., 2023).

Empirical performance across techniques is summarized below:

Paper | Main Approach | Compression/Speedup | Accuracy/Task Impact
(Pires et al., 2023) | Shared + wide FFN | 20–40% param. reduction, +24% decoding speed | +0.9 BLEU on WMT22, at or above baseline
(Liu et al., 2024) | Split heavy-hitter FFN | 43.1% param. reduction, 1.25–1.56× speedup | ≤1% drop, 5× better PPL under quantization
(Belcak et al., 2023) | Tree-based FFF | up to 220× vs. dense FFN, 6× vs. MoE | 94.2% of ViT accuracy retained
(Williams et al., 2023) | Explicit algebraic init | ~100× training speedup, 3–10× less data | test PPL ≈ 24 vs. 64
(Balderas et al., 2023) | Pruning + retraining | 70–99% param. reduction, 2–40× smaller models | accuracy preserved or improved
(Charalampopoulos et al., 2024) | Load balancing + master leaf | sublinear inference cost | +3% accuracy, ~10× lower variance (MNIST/FMNIST)

Limitations and caveats include:

  • Extreme compression can degrade performance in out-of-distribution settings.
  • Conditional computation architectures may introduce overhead if not efficiently implemented in target hardware.
  • For autoregressive or decoder-only models, wide sharing or highly conditional FFNs create latency bottlenecks.
  • Explicit algebraic solutions assume restricted activation and loss forms, possibly limited for large vocabulary or regression tasks.
  • Most EFFNs focus solely on FFN modules, leaving attention and embedding submodules as potential further bottlenecks.

7. Outlook and Integration into Modern Deep Learning

EFFN methodologies provide a comprehensive toolkit for scaling, compressing, and accelerating deep neural networks without compromising learning capacity. Key practical guidelines include:

  • For Transformer encoder-decoders: share FFN layers, widen as needed, and drop decoder-side FFNs for speed.
  • For resource-constrained deployment: partition FFNs into heavy-hitter and lightly used components, compress the latter accordingly, and fine-tune for recovery.
  • For hardware-critical inference: tree-based EFFNs achieve maximal reduction in runtime at minimal expressivity loss.
  • For training efficiency and architecture search: closed-form layerwise EFFNs provide rapid ablation and warm-start points for fine-tuning.
  • Add load balancing and global expert terms to stabilize conditional EFFN training.

EFFNs thus serve not as a single architecture, but as a paradigm for advancing deep learning efficiency, pushing the limits of compression, speed, and task generalization—both in high-resource and embedded domains (Muir, 2017, Pires et al., 2023, Williams et al., 2023, Belcak et al., 2023, Liu et al., 2024, Balderas et al., 2023, Charalampopoulos et al., 2024).
