VersatileFFN: Dual-Path Efficient FFN
- VersatileFFN is a dual-path feed-forward network that dynamically combines width and depth expansion to increase effective LLM capacity without increasing parameter count.
- Its architecture employs a width-versatile path for mixture-of-experts behavior and a depth-versatile path for recursive refinement using shared weights.
- Empirical evaluations show VersatileFFN improves performance and efficiency, achieving lower perplexity and higher accuracy than dense FFNs and MoE baselines.
VersatileFFN is a parameter-efficient feed-forward network (FFN) architecture for LLMs that achieves increased effective capacity in both width and depth dimensions without a material increase in parameter count. Designed as a drop-in replacement for conventional FFNs in Transformer blocks, VersatileFFN leverages a dual-path structure inspired by the dual-process theory of cognition, dynamically allocating computation across tokens to maximize efficiency and capacity. Both adaptive pathways reuse the same underlying weights, thus shifting the capacity-computation trade-off from memory-bound parameter scaling to compute-bound iterative and expert routing mechanisms (Nie et al., 16 Dec 2025).
1. Motivation and Theoretical Foundation
Modern LLMs exhibit improved performance as model size increases, but scaling to hundreds of billions of parameters incurs prohibitive memory costs. Conventional parameter-efficient methods such as pruning, quantization, and low-rank adapters act exclusively on pretrained weights, compressing the original architecture without increasing its representational capacity and thus encountering a “representational ceiling.”
VersatileFFN is explicitly constructed to improve parameter efficiency while expanding the effective capacity of the FFN. Drawing motivation from Kahneman’s dual-process theory—where System 1 denotes fast, shallow reasoning, and System 2 denotes slow, iterative cognition—VersatileFFN splits the feed-forward computation into two distinct yet parameter-shared paths. This architecture advances the expressivity of LLMs by adaptively providing either sparse, expert-driven width expansion or recursive, depth-driven refinement per token in a given batch (Nie et al., 16 Dec 2025).
2. Architecture and Pathways
2.1 Base FFN Formulation
The underlying backbone of VersatileFFN is a standard two-layer MLP with nonlinearity $\sigma$, formulated as
$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x), \qquad W_1 \in \mathbb{R}^{d_{\mathrm{ff}} \times d},\; W_2 \in \mathbb{R}^{d \times d_{\mathrm{ff}}},$$
or, within a Transformer block:
$$x \leftarrow x + \mathrm{FFN}(\mathrm{LN}(x)).$$
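As a reference point, here is a minimal PyTorch sketch of this backbone; the module name, dimension names, and the choice of GELU for $\sigma$ are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class BaseFFN(nn.Module):
    """Standard two-layer FFN: FFN(x) = W2 · sigma(W1 · x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # W1: up-projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # W2: down-projection
        self.act = nn.GELU()                            # nonlinearity sigma (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.act(self.w1(x)))
```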
2.2 Width-Versatile Path
The width-versatile pathway achieves mixture-of-experts (MoE)-like behavior without introducing new parameters by partitioning a single FFN into virtual sub-experts. Given a hidden dimension $d_{\mathrm{ff}}$ and expert width $d_e$, a stride $s$ defines subspace indices for each expert. For expert $i \in \{0, \dots, E-1\}$:
$$E_i(x) = W_2[:,\, is : is + d_e]\;\sigma\big(W_1[is : is + d_e,\, :]\,x\big).$$
Gating logits are computed by $g = W_g x$ ($W_g \in \mathbb{R}^{E \times d}$), followed by Top-$K$ activation and softmax normalization to select sparse expert mixtures per token. The output is:
$$y_{\mathrm{width}} = \sum_{i \in \mathrm{TopK}(g)} \mathrm{softmax}(g)_i\, E_i(x).$$
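A sketch of how the virtual-expert slicing and Top-$K$ gating could be implemented in PyTorch; the expert count, expert width, slice layout, and GELU nonlinearity are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def width_path(x, w1, w2, w_g, num_experts=8, d_e=1024, top_k=2):
    """Width-versatile path: slice one shared FFN into overlapping virtual experts.

    x:   (batch, d_model)        token activations
    w1:  (d_ff, d_model)         shared up-projection
    w2:  (d_model, d_ff)         shared down-projection
    w_g: (num_experts, d_model)  gating matrix (the only extra parameters)
    """
    d_ff = w1.shape[0]
    stride = (d_ff - d_e) // (num_experts - 1)     # overlapping windows over d_ff

    logits = x @ w_g.t()                           # (batch, E) gating logits
    top_val, top_idx = logits.topk(top_k, dim=-1)  # sparse Top-K selection
    weights = F.softmax(top_val, dim=-1)           # renormalize over chosen experts

    out = torch.zeros_like(x)
    for k in range(top_k):
        for e in range(num_experts):               # loops for clarity; batch-gather in practice
            mask = (top_idx[:, k] == e)
            if not mask.any():
                continue
            lo, hi = e * stride, e * stride + d_e  # expert e's slice of the hidden dim
            h = F.gelu(x[mask] @ w1[lo:hi].t())    # sigma(W1[lo:hi, :] x)
            out[mask] += weights[mask, k, None] * (h @ w2[:, lo:hi].t())  # W2[:, lo:hi]
    return out
```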
2.3 Depth-Versatile Path
The depth-versatile pathway recursively applies the entire FFN up to $T$ times. For each token:
$$h_t = f(h_{t-1}), \qquad h_0 = x,$$
where $f$ is the full shared FFN. A loop-predictor produces logits $z \in \mathbb{R}^{T}$; the iteration count is sampled via Gumbel-Softmax:
$$p = \mathrm{GumbelSoftmax}(z, \tau).$$
Output is a weighted sum over iterations:
$$y_{\mathrm{depth}} = \sum_{t=1}^{T} p_t\, h_t$$
(discrete at inference via $\arg\max$).
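A corresponding PyTorch sketch of the recursive path; the function and parameter names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def depth_path(x, ffn, w_p, max_loops=4, tau=1.0, training=True):
    """Depth-versatile path: apply the shared FFN recursively up to T times.

    ffn: the full shared FFN f (same W1/W2 as the width path)
    w_p: (T, d_model) loop-predictor matrix producing per-token loop logits z
    """
    logits = x @ w_p.t()                           # (batch, T) loop logits
    if training:
        p = F.gumbel_softmax(logits, tau=tau)      # soft, differentiable loop choice
    else:
        # discrete at inference: one-hot argmax over loop counts
        p = F.one_hot(logits.argmax(-1), max_loops).float()

    h, out = x, torch.zeros_like(x)
    for t in range(max_loops):
        h = ffn(h)                                 # h_t = f(h_{t-1}), shared weights
        out = out + p[:, t, None] * h              # weighted sum over iterations
    return out
```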
3. Dynamic Gating and Routing
Token-level difficulty is estimated as the expected loop count,
$$d = \frac{1}{T}\sum_{t=1}^{T} t\, p_t,$$
yielding a fusion gate $\alpha = d \in [0, 1]$, such that
$$y = (1 - \alpha)\, y_{\mathrm{width}} + \alpha\, y_{\mathrm{depth}}.$$
Easy tokens ($\alpha \to 0$) rely on the width path, while hard tokens ($\alpha \to 1$) traverse the depth path. This difficulty-aware gating mechanism allocates computation per token adaptively (Nie et al., 16 Dec 2025).
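Combining the two path sketches above, the difficulty-aware fusion could look like the following; `width_path` and `depth_path` are the illustrative functions defined earlier, and the normalization of $\alpha$ is an assumption:

```python
import torch
import torch.nn.functional as F

def versatile_ffn(x, w1, w2, w_g, w_p, max_loops=4, tau=1.0, training=True):
    """Fuse width and depth outputs with the difficulty gate alpha."""
    logits = x @ w_p.t()                           # (batch, T) loop logits
    p = (F.gumbel_softmax(logits, tau=tau) if training
         else F.one_hot(logits.argmax(-1), max_loops).float())
    # note: a real implementation would share this sample with depth_path

    # difficulty = expected loop count, normalized toward [0, 1]
    t_idx = torch.arange(1, max_loops + 1, device=x.device, dtype=x.dtype)
    alpha = (p * t_idx).sum(-1, keepdim=True) / max_loops

    ffn = lambda h: F.gelu(h @ w1.t()) @ w2.t()    # the full shared FFN f
    y_width = width_path(x, w1, w2, w_g)           # easy-token path
    y_depth = depth_path(x, ffn, w_p, max_loops, tau, training)
    return (1 - alpha) * y_width + alpha * y_depth
```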
4. Efficiency, Parameter Reuse, and Memory-Compute Trade-Offs
VersatileFFN reuses the same $W_1$, $W_2$ for both width and depth pathways. Additional overhead is limited to small matrices for routing and loop prediction ($W_g \in \mathbb{R}^{E \times d}$, $W_p \in \mathbb{R}^{T \times d}$), representing less than 0.1% extra parameters.
| Model | Params (M) | FFN FLOPs/Token |
|---|---|---|
| Base | 354.71 | 377.49 M |
| VersatileFFN | 354.90 | 1 236.08 M |
| MoE (8 experts) | 543.59 | 471.86 M |
VersatileFFN trades roughly 3× the base FFN FLOPs for increased effective capacity, while using ~45% fewer FLOPs than a 6-loop depth-only model and achieving higher accuracy. Its parameter footprint remains nearly identical to the base model (Nie et al., 16 Dec 2025).
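The table's figures make the trade-off concrete:
$$\frac{1236.08\ \mathrm{M}}{377.49\ \mathrm{M}} \approx 3.27\times \text{ base FFN FLOPs per token}, \qquad \frac{354.90 - 354.71}{354.71} \approx 0.054\%\ \text{extra parameters}.$$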
5. Empirical Evaluation
Evaluation was performed via zero-shot prompting on diverse benchmarks (PIQA, HellaSwag, OpenBookQA, SciQ, ARC-easy, ARC-challenge, CommonsenseQA, Winogrande) using the OLMES protocol.
| Model | 354M Param Avg. Acc. | 720M Param Avg. Acc. |
|---|---|---|
| Base | 47.98% | 53.83% |
| MoE (8, Top-2) | 51.48% | 55.87% |
| 2/4/6-Loop | 51.47/51.98/51.94% | 55.83/56.33/56.55% |
| VersatileFFN | 52.33% | 57.03% |
VersatileFFN outperforms both base and MoE models of similar or larger parameter count, and achieves lower perplexity on held-out data. Ablation studies indicate each adaptive branch alone improves over the base, while the combined dual-path with difficulty-aware gating attains maximum performance gains. Performance improves even without continued pretraining, with a +3.16% absolute gain over baseline (Nie et al., 16 Dec 2025).
6. Practical Considerations and Hyperparameters
The architecture scales effective capacity through increased computation rather than parameter growth:
- Per-token adaptive routing ensures computational effort targets hard tokens.
- Overhead in routing and prediction is minimal, but inference requires ≈3× base FLOPs.
- Potential latency increases may arise for very large loop counts.
Recommended settings from the empirical study (collected into a config sketch after this list):
- Virtual experts with Top-K=2.
- Maximum recursion count $T$, with Gumbel-Softmax temperature annealed from 5.0 to 0.1.
- Auxiliary load-balancing loss weight $\lambda$.
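A sketch of these settings as a single config object; field names are illustrative, and values the source leaves unstated are left unset:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VersatileFFNConfig:
    top_k: int = 2                           # Top-K active virtual experts (stated above)
    tau_start: float = 5.0                   # initial Gumbel-Softmax temperature (stated)
    tau_end: float = 0.1                     # final annealed temperature (stated)
    max_loops: Optional[int] = None          # T: recommended value elided in the source
    aux_loss_weight: Optional[float] = None  # lambda: value elided in the source

def anneal_tau(step: int, total_steps: int, cfg: VersatileFFNConfig) -> float:
    """Temperature annealing from tau_start to tau_end (linear schedule assumed)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return cfg.tau_start + frac * (cfg.tau_end - cfg.tau_start)
```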
The approach is “compute-heavy, memory-light,” outperforming both dense and MoE baselines with a comparable or reduced parameter budget (Nie et al., 16 Dec 2025).
7. Extensions, Limitations, and Open Problems
Limitations include increased FLOPs, potential inference latency, and minor overhead for routing mechanisms. Possible extensions and open questions involve:
- Application of dual-path adaptive reuse to self-attention modules.
- Integration of VersatileFFN with quantization or pruning techniques for further memory efficiency.
- Exploration of alternative gating or expert slicing strategies.
- Theoretical study of convergence and expressivity trade-off between recursive (depth-wise) and expert (width-wise) mechanisms within parameter reuse frameworks.
VersatileFFN constitutes a novel architecture for LLMs, demonstrating that adaptive, parameter-reused computation in both width and depth dimensions can exceed the representational capacity and practical performance of both conventional dense and MoE-based FFNs under constant parameter budgets (Nie et al., 16 Dec 2025).