Fast Feedforward Networks (FFFs)

Updated 11 June 2026

Fast Feedforward Networks (FFFs) are neural architectures that replace dense layers with conditional computation schemes, reducing inference cost to O(log w) per input.
They employ balanced binary-tree routing and auxiliary losses like hardening and load balancing to ensure efficient and interpretable training and inference.
FFF designs realize significant speedups (up to 220×) in models such as vision transformers and language models by sparsifying active neurons while maintaining high accuracy.

Fast Feedforward Networks (FFFs) are a family of neural architectures that achieve substantial inference-time acceleration and, in some formulations, increased model interpretability by replacing or modifying standard feedforward computations with structured, conditional, or algorithmically efficient alternatives. These approaches depart from classical dense layers, which entail $O(w)$ or $O(NH)$ computation per layer, either by learning to sparsify computation conditionally (tree-based routing, mixture-of-experts, orientation vectors), approximating fixed-point responses of slower recurrent architectures, or constructing each layer’s parameters in closed form based on statistical properties of the input.

1. Conditional Computation and Binary-Tree Routing Architectures

The core innovation behind modern FFFs is the decoupling of layer width from inference cost by arranging neurons into a balanced binary tree and activating only a logarithmic-depth path per input. For an input $\mathbf{x}$, a series of internal node decisions, each parameterized by a single-neuron network with sigmoid activation $p_{N}(\mathbf{x}) = \sigma(\mathbf{w}_N^\top \mathbf{x} + b_N)$, selects a unique leaf subnetwork. Only this small “leaf” feedforward net (width $\ell \ll w$) processes the input, yielding total per-input cost $O(\log w + \ell)$ for width $w = 2^d\ell$ versus $O(w)$ for dense layers. During training, the output is a soft mixture of all leaves weighted by the product of gating probabilities along each path; in inference, the tree is “hardened,” activating exactly one leaf (Belcak et al., 2023).
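The soft training-time mixture described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the heap-style node layout, the convention that $p_N$ is the probability of the right branch, and the two-matrix ReLU leaves without biases are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_fff(dim_in, dim_out, depth, leaf_width, rng):
    """Random parameters for a tree with 2**depth - 1 nodes and 2**depth leaves (hypothetical helper)."""
    n_nodes, n_leaves = 2**depth - 1, 2**depth
    return {
        "depth": depth,
        "node_w": rng.normal(size=(n_nodes, dim_in)),
        "node_b": np.zeros(n_nodes),
        "leaf_w1": rng.normal(size=(n_leaves, dim_in, leaf_width)) / np.sqrt(dim_in),
        "leaf_w2": rng.normal(size=(n_leaves, leaf_width, dim_out)) / np.sqrt(leaf_width),
    }

def leaf_mixture_weights(params, x):
    """Weight of each leaf = product of branch probabilities along its root-to-leaf path."""
    p = sigmoid(params["node_w"] @ x + params["node_b"])   # p_N(x) for every internal node
    weights = np.ones(1)                                    # a single path (the root) to start
    for level in range(params["depth"]):
        start = 2**level - 1                                # first node of this level (heap layout)
        p_level = p[start:start + 2**level]
        # Assumed convention: left branch gets 1 - p, right branch gets p.
        weights = np.stack([weights * (1 - p_level), weights * p_level], axis=1).reshape(-1)
    return weights

def fff_forward_soft(params, x):
    """Training-time output: every leaf is evaluated and mixed by its path weight."""
    c = leaf_mixture_weights(params, x)
    hidden = np.maximum(0.0, np.einsum("i,lih->lh", x, params["leaf_w1"]))
    leaf_out = np.einsum("lh,lho->lo", hidden, params["leaf_w2"])
    return c @ leaf_out

rng = np.random.default_rng(0)
params = init_fff(dim_in=5, dim_out=3, depth=3, leaf_width=4, rng=rng)
x = rng.normal(size=5)
print(leaf_mixture_weights(params, x).sum())   # the leaf weights form a probability distribution
```

Because each node splits its incoming weight into $p$ and $1 - p$, the leaf weights always sum to one, which is what makes the soft output a proper mixture.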

This design accommodates auxiliary “hardening” losses, load balancing, and “master leaf” global expert modules. For example, a regularizer $L_{\text{harden}} = \sum_{x, N} H(p_N(x))$ (where $H$ is binary entropy) encourages routing decisions to saturate at $0$ or $1$, so that the softly trained network transfers to hard inference. Load balancing adds a penalty computed from the fraction of examples routed to each leaf over a minibatch, preventing “leaf starvation.” A “master leaf” is a global fallback network whose output is combined with that of the selected leaf at inference through a learned scalar mixture weight (Charalampopoulos et al., 2024).
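The hardening regularizer and the per-leaf routing fractions can be computed as in the sketch below. The function names are illustrative, and the block deliberately stops at the routing fractions: the exact form of the load-balancing penalty built from them is not assumed here.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """H(p) = -p log p - (1 - p) log(1 - p), clipped for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def hardening_loss(node_probs):
    """node_probs: (batch, n_nodes) array of p_N(x). Returns the sum over x and N of H(p_N(x))."""
    return float(binary_entropy(node_probs).sum())

def leaf_fractions(leaf_ids, n_leaves):
    """Fraction of minibatch examples routed to each leaf."""
    leaf_ids = np.asarray(leaf_ids)
    return np.bincount(leaf_ids, minlength=n_leaves) / len(leaf_ids)

undecided = np.array([[0.5, 0.5]])      # maximally uncertain routing
saturated = np.array([[0.0, 1.0]])      # fully hardened routing
print(hardening_loss(undecided), hardening_loss(saturated))
print(leaf_fractions([0, 0, 1, 3], n_leaves=4))   # leaf 2 receives no examples
```

The entropy term is largest when a node outputs 0.5 and vanishes as the output saturates, so minimizing it pushes the soft tree toward the hard decisions used at inference. A leaf whose fraction stays at zero, as leaf 2 does in the example, is the "starved" case that load balancing is meant to prevent.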

This structure admits theoretical analysis: reaching a leaf costs $d$ sigmoid evaluations, one per tree level, and only one leaf network is evaluated. Under ideal conditions, the inference cost is $O(\log w + \ell)$, and, when deployed at scale (e.g., in all feedforward/MLP layers of a Transformer), FFFs deliver speedups of up to $220\times$ for large $w$, with only minor drops in accuracy provided the leaves are sufficiently wide (e.g., a ViT retains most of its baseline accuracy with only a small fraction of neurons active per layer) (Belcak et al., 2023).
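The scaling argument can be made concrete by counting neurons evaluated per input. The sketch below uses the relation $w = 2^d\ell$ from the text; the specific depths and leaf widths are arbitrary illustrative choices, and the resulting ratios are idealized neuron-count ratios, not measured speedups.

```python
def layer_width(depth, leaf_width):
    """Total width w = 2^d * l covered by the leaves of the tree."""
    return (2 ** depth) * leaf_width

def fff_neurons_per_input(depth, leaf_width):
    """One routing neuron per tree level plus the neurons of the single selected leaf."""
    return depth + leaf_width

for depth, leaf_width in [(4, 8), (8, 8), (12, 4)]:
    w = layer_width(depth, leaf_width)
    used = fff_neurons_per_input(depth, leaf_width)
    print(f"d={depth:2d} l={leaf_width} w={w:6d} neurons used={used:3d} ratio={w / used:.1f}")
```

Doubling the width by adding one tree level adds a single routing neuron to the per-input cost, which is the sense in which width and inference cost are decoupled.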

2. Implementation, Training Dynamics, and Performance

During training, FFFs use soft routing. Each input’s final representation is the mixture $\mathbf{y}(\mathbf{x}) = \sum_{j} c_j(\mathbf{x})\, L_j(\mathbf{x})$, where $L_j$ is the $j$-th leaf network and the weight $c_j(\mathbf{x})$ is given by a product of “left” and “right” gating probabilities ($1 - p_N(\mathbf{x})$ for one branch, $p_N(\mathbf{x})$ for the other) over the internal nodes $N$ on the path to leaf $j$. The loss function is typically a standard prediction loss (e.g., cross-entropy) plus the hardening entropy and load-balance components, with Adam as the optimizer. To facilitate convergence, training proceeds in stages: an initial phase with a low weight on the hardening (entropy) term, optionally followed by a phase with a higher weight to produce discrete routing (Belcak et al., 2023, Charalampopoulos et al., 2024).

At inference, each internal node makes a hard left/right decision (thresholding $p_N(\mathbf{x})$ at $0.5$), selecting one path to a single leaf, resulting in a significant reduction in floating-point operations and memory accesses.
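Hard routing at inference can be sketched as a single root-to-leaf walk. The example assumes a threshold of 0.5 on the sigmoid output (equivalently, a sign test on the logit), a heap-style node layout in which the children of node $n$ are $2n+1$ and $2n+2$, and bias-free ReLU leaves; all names are illustrative.

```python
import numpy as np

def route_hard(node_w, node_b, x, depth):
    """Follow one root-to-leaf path. Returns (leaf index, number of nodes evaluated)."""
    node = 0
    for _ in range(depth):
        logit = node_w[node] @ x + node_b[node]
        go_right = logit >= 0.0                 # sigmoid(logit) >= 0.5 exactly when logit >= 0
        node = 2 * node + 1 + int(go_right)     # heap layout: children of n are 2n+1 and 2n+2
    return node - (2**depth - 1), depth         # shift past the internal nodes to get a leaf index

def fff_forward_hard(node_w, node_b, leaf_w1, leaf_w2, x, depth):
    """Evaluate only the selected leaf: depth routing neurons plus one small network."""
    leaf, _ = route_hard(node_w, node_b, x, depth)
    hidden = np.maximum(0.0, x @ leaf_w1[leaf])
    return hidden @ leaf_w2[leaf], leaf

# Tiny hand-built tree of depth 2 (3 internal nodes, 4 leaves) on 2-D inputs.
node_w = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
node_b = np.zeros(3)
rng = np.random.default_rng(0)
leaf_w1 = rng.normal(size=(4, 2, 3))
leaf_w2 = rng.normal(size=(4, 3, 2))
y, leaf = fff_forward_hard(node_w, node_b, leaf_w1, leaf_w2, np.array([1.0, -1.0]), depth=2)
print(leaf, y.shape)
```

Only the parameters of the visited nodes and the chosen leaf are read, which is where the savings in both arithmetic and memory traffic come from.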

Empirically, FFF results are reported per model and layer configuration (for example, ViT with FFF layers at a given leaf width) in terms of three quantities: the percentage of active neurons, accuracy relative to the dense baseline, and speedup versus a standard feedforward (FF) layer.

Adding load balancing improves both training and test accuracy and reduces variance across runs. The master leaf further increases reliability and accuracy, especially for data not adequately captured by partitioned leaves (Charalampopoulos et al., 2024).

3. Connections to Mixture-of-Experts and Alternative Fast FFN Designs

FFF tree architectures are structurally analogous to Mixture-of-Experts (MoE), but with key differences:

MoE uses a dense gating network to score all experts per input, activating the top-$k$; gating cost therefore grows linearly with the number of experts, on top of evaluating the $k$ selected experts (typically $k$ is a small constant), and noise injection is required for balanced expert utilization.
FFFs use $O(\log w)$ gating (hierarchical binary decisions) and deterministic (noiseless) routing during inference. Balance and hardening are achieved with lightweight auxiliary losses.
Empirical convergence is fast and stable, with no need for stochastic gating or pretraining (Belcak et al., 2023).
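The difference in gating cost can be illustrated by counting how many gating units each scheme evaluates to pick among the same number of experts or leaves. This is a simplified sketch under stated assumptions: the MoE gate is a plain linear scorer without the noise injection mentioned above, the tree uses bias-free nodes in a heap layout, and all names are hypothetical.

```python
import numpy as np

def moe_top_k(gate_w, x, k):
    """Dense gating: score every expert, keep the k best."""
    scores = gate_w @ x                        # one score per expert
    top = np.argsort(scores)[-k:][::-1]        # indices of the k highest scores
    return top, len(scores)                    # (chosen experts, gating units evaluated)

def tree_route(node_w, x, depth):
    """Hierarchical gating: one binary decision per level."""
    node = 0
    for _ in range(depth):
        node = 2 * node + 1 + int(node_w[node] @ x >= 0.0)
    return node - (2**depth - 1), depth        # (chosen leaf, gating units evaluated)

rng = np.random.default_rng(0)
depth, dim = 8, 16
n_experts = 2**depth                           # same number of choices for both schemes
x = rng.normal(size=dim)

_, moe_units = moe_top_k(rng.normal(size=(n_experts, dim)), x, k=2)
_, tree_units = tree_route(rng.normal(size=(n_experts - 1, dim)), x, depth)
print(moe_units, tree_units)
```

With 256 choices the dense gate evaluates 256 scoring units while the tree evaluates 8, and the gap widens exponentially with depth.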

Alternative FFF variants include flattened convolutions and analytical/statistics-driven FFN parameterizations:

Flattened Convolutional Nets: Decompose each 3D convolution into a sequence of separate 1D factorized kernels along the channel, vertical, and horizontal dimensions, reducing parameter count and compute and speeding up the feedforward pass with no accuracy loss (Jin et al., 2014).
Feedforward via Statistical Construction: Parameterize each convolutional layer by the eigenbasis of local statistics (e.g., via the Saab transform, a bias-adjusted PCA variant) and construct the classification FC layers via closed-form linear regressors. This enables one-pass training with interpretability and competitive robustness, but lags backpropagation-trained networks by a few percent in accuracy (Kuo et al., 2018, Eswaran et al., 2015).
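The idea behind flattening can be checked numerically: a 3D filter that factors into three 1D kernels gives the same response whether it is applied as one full 3D correlation or as three successive 1D passes. The sketch below verifies this identity for a single filter with "valid" boundaries; it illustrates the factorization only and is not the training procedure of Jin et al., whose networks learn the 1D kernels directly. Function names are illustrative.

```python
import numpy as np

def conv3d_valid(image, kernel):
    """Direct correlation of a (C, H, W) image with a (C, kh, kw) kernel."""
    _, H, W = image.shape
    _, kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[:, i:i + kh, j:j + kw] * kernel)
    return out

def flattened_conv(image, k_channel, k_vertical, k_horizontal):
    """Three 1D passes: across channels, then down columns, then along rows."""
    lateral = np.tensordot(k_channel, image, axes=(0, 0))                  # (H, W)
    kh, kw = len(k_vertical), len(k_horizontal)
    H, W = lateral.shape
    vert = np.stack([k_vertical @ lateral[i:i + kh, :] for i in range(H - kh + 1)])
    return np.stack([vert[:, j:j + kw] @ k_horizontal for j in range(W - kw + 1)], axis=1)

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 6, 7))
a, b, d = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
full_kernel = np.einsum("c,y,x->cyx", a, b, d)      # separable (rank-1) 3D kernel

print(np.allclose(conv3d_valid(image, full_kernel), flattened_conv(image, a, b, d)))
print(full_kernel.size, a.size + b.size + d.size)    # parameters: C*kh*kw versus C+kh+kw
```

For a filter spanning $C$ channels and a $k_h \times k_w$ window, the factorized form stores $C + k_h + k_w$ parameters instead of $C \cdot k_h \cdot k_w$, which is the source of the parameter and compute reduction.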

4. Fast Feedforward Approximations to Iterative or Recurrent Inference

FFF paradigms also include architectures that “cache” expensive iterative or recurrent computational processes in a fast, single-pass proxy. For example, given a recurrent network whose dynamics settle, for constant input, to a stable fixed point that defines the steady-state response, one can train a two-layer ReLU feedforward network to map the input directly to that fixed point by minimizing mean-squared error over numerically derived fixed points (Muir, 2017). In such a paradigm, the mapping cost collapses from a (potentially unbounded) number of integration steps to the deterministic cost of a single forward pass, with substantial empirical speedups and preservation of recurrent computation motifs (competition, sharpening, noise rejection).
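The fixed-point approximation can be illustrated end to end on a toy system. Everything specific in this sketch is an assumption made for the example: the recurrent update $\mathbf{x} \leftarrow W\,\mathrm{ReLU}(\mathbf{x}) + \mathbf{u}$, the rescaling of $W$ to guarantee convergence, and the fitting method. In particular, instead of gradient-based training, the two-layer ReLU network here uses a random hidden layer with an output layer fit by least squares, which still minimizes mean-squared error over the numerically derived fixed points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))
W *= 0.5 / np.linalg.norm(W, 2)        # spectral norm 0.5: the update is a contraction

def fixed_point(u, iters=200):
    """Iterate the assumed recurrent update until it settles."""
    x = np.zeros_like(u)
    for _ in range(iters):
        x = W @ np.maximum(0.0, x) + u
    return x

# Training set: constant inputs and their numerically derived fixed points.
U = rng.normal(size=(500, n))
X_star = np.array([fixed_point(u) for u in U])

# Two-layer ReLU network: random hidden layer, least-squares output layer.
hidden = 64
A = rng.normal(size=(n, hidden))
b = rng.normal(size=hidden)

def features(inputs):
    H = np.maximum(0.0, inputs @ A + b)
    return np.hstack([H, np.ones((len(inputs), 1))])    # constant column acts as output bias

B, *_ = np.linalg.lstsq(features(U), X_star, rcond=None)

def feedforward(u):
    """Single-pass proxy for the iterative fixed-point computation."""
    return features(u[None, :])[0] @ B

mse = np.mean((features(U) @ B - X_star) ** 2)
baseline = np.mean((X_star - X_star.mean(axis=0)) ** 2)  # error of predicting the mean
print(mse, baseline)
```

The iterative solver runs 200 updates per input, while the proxy costs one hidden-layer evaluation and one matrix product regardless of how slowly the original dynamics would converge.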

Similarly, in deep energy-based architectures (Hopfield, Boltzmann machines), layerwise auto-encoder consistency ensures that a single bottom-up feedforward pass lands near the energy fixed point, requiring little to no iterative relaxation. This is contingent upon every layer pair being trained to form a good autoencoder, such that the reconstruction error controls the deviation from fixed-point activations (Bengio et al., 2016). This property grants the FFF approach both computational speed and a plausible biological parallel (as mutually predictive dendritic branches in cortical pyramidal neurons).

5. Theoretical Properties, Complexity, and Scaling

The efficiency of FFFs is underpinned by their scaling laws:

In binary-tree FFFs, for $w$ total neurons and depth $d$ with $\ell$-width leaves, $w = 2^d\ell$ and inference cost is $O(\log w + \ell)$.
Orientation Vector methods (Kolmogorov-inspired) demonstrate that for classification tasks with $N$ separable clusters in $\mathbb{R}^n$, the number of hyperplanes (and therefore hidden nodes) required scales as $O(\log_2 N)$, not polynomially in $N$ or $n$. Cluster assignment is realized in closed form by encoding cluster identities as $\lceil \log_2 N \rceil$-bit orientation codes determined by affine partitioning, with the mapping being invertible on discrete codebooks (Eswaran et al., 2015).
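The orientation-code idea can be sketched with a toy example in which each hyperplane contributes one bit recording the side on which a point falls. The logarithmic code length $\lceil \log_2 N \rceil$ used here is an assumption about the intended scaling, and the cluster centers, hyperplanes, and function names are chosen purely for illustration.

```python
import math
import numpy as np

def bits_needed(n_clusters):
    """Assumed code length: enough bits to give every cluster a distinct binary code."""
    return max(1, math.ceil(math.log2(n_clusters)))

def orientation_code(x, normals, offsets):
    """One bit per hyperplane: which side of normals[i] . x + offsets[i] = 0 the point lies on."""
    return tuple(int(v) for v in (normals @ x + offsets > 0))

# Four clusters in the plane, separated by two hyperplanes (the coordinate axes).
centers = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
normals = np.eye(2)
offsets = np.zeros(2)

# Codebook mapping each distinct code back to its cluster index.
codebook = {orientation_code(c, normals, offsets): i for i, c in enumerate(centers)}

query = np.array([0.9, -1.2])                    # a point near the second cluster
print(bits_needed(len(centers)), codebook[orientation_code(query, normals, offsets)])
```

Because every cluster receives a distinct code, the dictionary lookup inverts the encoding, which mirrors the invertibility on discrete codebooks noted above.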

In mixture-based and fast feedforward architectures, balancing utilization across leaves or experts and driving routing entropy down through regularization are critical for preventing degenerate or underutilized subnetworks. These architectures empirically exhibit quasi-linear or better scaling as the number of clusters or expert regions increases, with training and test-time classification cost remaining sublinear in the total number of experts or regions.

6. Applications, Limitations, and Future Directions

FFFs have been adopted in vision transformers, language models (UltraFastBERT), and compact convolutional networks. In UltraFastBERT, replacing dense FFN layers with tree-structured FFFs reduced active neurons per inference to about $0.3\%$ of the total (12 of 4095 per layer), while retaining most of the GLUE benchmark score and delivering large speedups relative to highly optimized baseline implementations (Belcak et al., 2023). CNN flattening directly yields a feedforward speedup with negligible accuracy impact (Jin et al., 2014). Integrating FFFs with Transformer architectures and MoE variants further demonstrates application to large-scale attention-based models (Belcak et al., 2023, Charalampopoulos et al., 2024).
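The figure of 12 active neurons out of 4095 can be checked with a short calculation. One reading consistent with these numbers, stated here as an assumption since the layout is not spelled out above, is a complete binary tree with 12 levels in which every node is a neuron and a single root-to-bottom path is evaluated per input.

```python
def tree_neurons(levels):
    """Number of nodes in a complete binary tree with the given number of levels."""
    return 2 ** levels - 1

active, total = 12, 4095
fraction = active / total

print(tree_neurons(12) == total)          # a 12-level tree has exactly 4095 nodes
print(f"{100 * fraction:.2f}% of neurons active per layer")
```

Under this reading, the active fraction shrinks roughly by half with every additional level, since the path grows by one neuron while the tree doubles in size.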

Current limitations include the need for robust batch-level routing (to avoid undertrained leaves), non-support for sequential or negative-phase dynamics in energy models, lack of full support in mainstream BLAS libraries (for ideal conditional execution), and possible trade-offs when shrinking leaf size to extreme values (accuracy degradation, increased sensitivity to initialization and batch regime). Research continues on adaptive balancing, hierarchical master-leaf structures, hybrid architectures combining hard routing with expert averaging, and end-to-end hardware support for conditional execution to achieve the asymptotic speedup.

FFFs thus encompass an algorithmic spectrum spanning conditional computation (binary-tree routing), fast statistical or explicit parameter construction, and learned approximations of fixed-point or iterative solutions. These methodologies are central to ongoing efforts to decouple capacity from computational cost and to endow feedforward models with flexible, interpretable, and efficient forms of inference (Belcak et al., 2023, Charalampopoulos et al., 2024, Muir, 2017, Belcak et al., 2023, Jin et al., 2014, Kuo et al., 2018, Bengio et al., 2016, Eswaran et al., 2015).