Sparse ReGLU-Based FFNs
- The paper introduces a novel sparse ReGLU-FFN layer that combines ReLU gating with dynamic neuron selection to cut compute by up to 7× while retaining competitive accuracy.
- It employs an adaptive thresholding mechanism (CETT) to activate only significant neurons, achieving around 88% sparsity with less than 1% accuracy drop.
- Hardware-aware optimizations such as sliding-window caching and block-sparse computation significantly enhance memory efficiency and inference speed.
Sparse ReGLU-based Feed-Forward Networks (FFNs) designate a class of FFN layers for large language models (LLMs) in which the ReLU-Gated Linear Unit (ReGLU) activation is combined with explicit dynamic sparsity for computational and memory efficiency. This approach leverages the high activation sparsity inherent in ReGLU—where many neuron outputs are zero or near-zero—and introduces runtime selection of active neurons per token via thresholding mechanisms. As such, sparse ReGLU-FFNs often feature an order-of-magnitude reduction in compute and memory requirements while maintaining accuracy competitive with dense baselines, particularly in the context of LLMs (Zhang et al., 2024).
1. ReGLU Activation: Structure and Properties
In the ReGLU activation, the standard gated linear unit (GLU) paradigm is instantiated using the rectified linear unit (ReLU) as the gating nonlinear function. Given an input $x \in \mathbb{R}^{d}$ and learned weights $W_g, W_v \in \mathbb{R}^{d_{\mathrm{ff}} \times d}$ and $W_o \in \mathbb{R}^{d \times d_{\mathrm{ff}}}$, the ReGLU FFN computes, for each hidden neuron $i$:
- Gating: $g_i = \mathrm{ReLU}\big((W_g x)_i\big)$
- Value: $v_i = (W_v x)_i$
The FFN output is then $\mathrm{FFN}(x) = W_o\,(g \odot v) = \sum_i g_i v_i\, W_{o,:,i}$.
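A minimal NumPy sketch of the dense ReGLU FFN under this convention (the single-token formulation and dimension names are illustrative):

```python
import numpy as np

def reglu_ffn_dense(x, W_g, W_v, W_o):
    """Dense ReGLU FFN: FFN(x) = W_o @ (ReLU(W_g @ x) * (W_v @ x)).

    x   : (d,)        token representation
    W_g : (d_ff, d)   gate projection
    W_v : (d_ff, d)   value projection
    W_o : (d, d_ff)   output projection
    """
    g = np.maximum(W_g @ x, 0.0)   # ReLU gate; many entries are exactly zero
    v = W_v @ x                    # value path
    return W_o @ (g * v)           # multiplicative gating, then down-projection

# Illustrative usage with random weights.
rng = np.random.default_rng(0)
d, d_ff = 8, 32
x = rng.standard_normal(d)
W_g = rng.standard_normal((d_ff, d))
W_v = rng.standard_normal((d_ff, d))
W_o = rng.standard_normal((d, d_ff))
y = reglu_ffn_dense(x, W_g, W_v, W_o)
```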
Key attributes:
- Piecewise linearity yields exact zeros over large regions of the input space, enabling efficient sparsity.
- The multiplicative gating produces an “information highway” effect, allowing expressive adaptation per-token.
- Empirically delivers pretraining loss comparable to SwiGLU and ReLU², with higher intrinsic activation sparsity (Zhang et al., 2024).
2. Sparse-Activation Framework in ReGLU FFNs
Sparse ReGLU-FFN models depart from classical sparsity notions predicated solely on ReLU outputs being exactly zero. Instead, neuron “inactivity” is determined by the output magnitude relative to an adaptively chosen threshold $\epsilon_\ell$ for layer $\ell$: a neuron $i$ is considered skipped for a given token if $|g_i v_i| \le \epsilon_\ell$.
To select $\epsilon_\ell$, the cumulative error of tail truncation (CETT) criterion is used. For a given threshold $\epsilon$,
$$\mathrm{CETT}(\epsilon) = \frac{\lVert \mathrm{FFN}(x) - \mathrm{FFN}_{\epsilon}(x) \rVert_2}{\lVert \mathrm{FFN}(x) \rVert_2},$$
where $\mathrm{FFN}_{\epsilon}$ omits all neurons with $|g_i v_i| \le \epsilon$; the per-layer threshold is set to the largest value whose CETT stays within a target budget. Empirical results show CETT-based thresholding induces minimal (circa 1%) accuracy loss even at high average sparsity for ReGLU (Zhang et al., 2024).
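A hedged sketch of this selection rule, assuming a simple binary search over $\epsilon$ against a CETT budget (the search procedure and tolerances are illustrative, not taken from the paper; in practice $\epsilon_\ell$ would be calibrated over many tokens rather than a single one):

```python
import numpy as np

def cett(g, v, W_o, eps):
    """Relative FFN-output error from truncating neurons whose gated
    output magnitude |g_i * v_i| falls below eps."""
    h = g * v
    h_trunc = np.where(np.abs(h) > eps, h, 0.0)
    full = W_o @ h
    return np.linalg.norm(full - W_o @ h_trunc) / (np.linalg.norm(full) + 1e-12)

def select_threshold(g, v, W_o, budget=0.2, iters=30):
    """Binary-search the largest eps whose CETT stays within the budget."""
    lo, hi = 0.0, float(np.abs(g * v).max())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cett(g, v, W_o, mid) <= budget:
            lo = mid   # within budget: prune more aggressively
        else:
            hi = mid   # too much error: back off
    return lo
```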
3. Implementation: Dynamic Sparse Inference Pipeline
Sparse ReGLU-FFN inference proceeds as follows:
- Gating computation: $g = \mathrm{ReLU}(W_g x)$
- Mask determination: $\mathcal{A} = \{\, i : g_i > \epsilon_\ell \,\}$, yielding the active neuron indices.
- Block gather: rows of $W_v$, columns of $W_o$, and entries of $g$ corresponding to active indices are retrieved.
- Value projection: $v_{\mathcal{A}} = W_{v,\mathcal{A}}\, x$
- Gated outputs: $h_{\mathcal{A}} = g_{\mathcal{A}} \odot v_{\mathcal{A}}$
- Output accumulation: $y = W_{o,:,\mathcal{A}}\, h_{\mathcal{A}}$ via dense or blockwise matrix multiplication (using only the columns of $W_o$ for active neurons)
Batch processing and windowed reuse of active indices across tokens further improve cache efficiency and hardware throughput. Block sizes are aligned to the accelerator’s architectural granularity, e.g., warp size (Zhang et al., 2024).
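A gather-based NumPy sketch of this pipeline for a single token follows (a simplification of a real block-sparse kernel; names are illustrative):

```python
import numpy as np

def reglu_ffn_sparse(x, W_g, W_v, W_o, eps):
    """Dynamic sparse ReGLU FFN forward pass for one token.

    Mirrors the steps above: dense gate -> threshold mask -> gather of
    active rows/columns -> value projection -> gating -> accumulation.
    """
    g = np.maximum(W_g @ x, 0.0)        # 1. gating computation (dense)
    active = np.nonzero(g > eps)[0]     # 2. mask / active neuron indices
    W_v_act = W_v[active]               # 3. gather rows of W_v ...
    W_o_act = W_o[:, active]            #    ... and matching columns of W_o
    v = W_v_act @ x                     # 4. value projection (active only)
    h = g[active] * v                   # 5. gated outputs
    return W_o_act @ h                  # 6. output accumulation
```

The result matches the dense computation up to the CETT-bounded truncation error; production kernels replace the per-index gather with contiguous, block-aligned neuron tiles as noted above.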
4. Sparsity–Accuracy Trade-offs and Empirical Metrics
Empirical investigation across several activation functions yields that, at CETT $= 0.2$:
- ReGLU achieves roughly 88% sparsity with only a 0.9% performance drop, outperforming SwiGLU (75% sparsity, 0.8% drop) and ReLU (82% sparsity, 0.7% drop).
- FLOP reduction is directly proportional to the sparsity level. For ReGLU, up to roughly $7\times$ hidden-layer computation reduction is observed, and 90% reduction in I/O (weight movement) can be achieved by combining reuse and block co-activation locality strategies.
- End-to-end single-token inference speedups starting around $5\times$ are reported on modern accelerators for batch size $1$ (Zhang et al., 2024).
A representative summary table of activation characteristics is as follows:
| Activation | Dense Perf. | Sparsity at CETT=0.2 | End-to-End Accuracy Drop |
|---|---|---|---|
| SwiGLU | 100% | 75% | 0.8% |
| ReLU | 99.3% | 82% | 0.7% |
| ReGLU | 99.1% | 88% | 0.9% |
| ReLU² | 99.4% | 92% | 0.6% |
All listed metrics are for 1.3B-parameter models (Zhang et al., 2024).
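As a rough sanity check of the proportionality between sparsity and FLOP reduction (our back-of-the-envelope arithmetic, not a figure from the paper), the idealized reduction factor for the value and output projections at sparsity $s$ is $1/(1-s)$; gather overhead and the dense gate computation make realized end-to-end gains smaller:

```python
# Idealized FLOP reduction implied by activation sparsity s in the
# value and output projections (upper bound; overheads reduce realized gains).
for s in (0.75, 0.82, 0.88, 0.92):
    print(f"sparsity {s:.0%}: up to ~{1.0 / (1.0 - s):.1f}x fewer projection FLOPs")
```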
5. Hardware Optimization and Memory Considerations
Sparse ReGLU-FFNs benefit from both algorithmic and hardware-aware optimizations:
- Parameter reuse: Sliding-window caching across tokens achieves a reuse ratio of $0.38$ in ReGLU layers, which is higher than SwiGLU ($0.25$); the reuse ratio more than doubles as the caching window is widened.
- Co-activation block layout: By arranging neurons with high joint-activation probability in contiguous storage, blockwise loads minimize random-access overhead, yielding up to 90% reduction in weight movement.
- Block size alignment: Optimal neuron-block size matches accelerator thread unit, maximizing tensor-core utilization (Zhang et al., 2024).
The index-gather and block-sparse computations are tuned so that their overhead remains less than the compute saved.
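A brief sketch of two of these optimizations, rounding the active set up to hardware-aligned neuron blocks and measuring cross-token index reuse for a sliding-window cache (block size, window handling, and function names are assumptions for illustration):

```python
import numpy as np

def blockify(active_idx, block_size=32, n_neurons=4096):
    """Expand active indices to whole neuron blocks (e.g. warp-sized),
    so weight gathers and matmuls operate on contiguous, aligned tiles."""
    blocks = np.unique(np.asarray(active_idx) // block_size)
    idx = (blocks[:, None] * block_size + np.arange(block_size)).ravel()
    return idx[idx < n_neurons]

def reuse_ratio(cached_idx, current_idx):
    """Fraction of the current token's active neurons already resident in
    the sliding-window cache; higher means less weight movement."""
    cached, current = set(map(int, cached_idx)), set(map(int, current_idx))
    return len(cached & current) / max(len(current), 1)
```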
6. Production Guidelines and Best Practices
Deploying sparse ReGLU-FFNs involves the following practices:
- Per-layer threshold selection using CETT capped at about $0.2$.
- Predictor-based pruning: A small MLP predicts likely-inactive neurons, further improving efficiency by roughly another 50% while maintaining high recall. This reduces unnecessary compute at inference with a negligible error increase (a sketch follows this list).
- Windowed index reuse and block-locality enforcement to further minimize memory traffic and improve hardware throughput.
- Avoid overtuning: raising CETT well above about $0.2$ sharply degrades accuracy, while lowering it well below $0.1$ yields diminishing efficiency returns.
- Verify recall: Ensure that the predictor’s false-negative rate stays low enough that accuracy penalties remain minimal (Zhang et al., 2024).
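A minimal sketch of predictor-based pruning with the recall check described above (the two-layer predictor, its sigmoid cutoff, and the parameter names are assumptions, not the paper's architecture):

```python
import numpy as np

def predict_active(x, P1, P2, cutoff=0.5):
    """Tiny MLP predictor: scores each FFN neuron's probability of being
    active for this token, so likely-inactive rows need not be computed."""
    scores = 1.0 / (1.0 + np.exp(-(P2 @ np.maximum(P1 @ x, 0.0))))
    return np.nonzero(scores > cutoff)[0]

def predictor_recall(predicted_idx, true_active_idx):
    """Fraction of truly active neurons the predictor retained; a high
    false-negative rate here translates directly into accuracy loss."""
    true_set = set(map(int, true_active_idx))
    if not true_set:
        return 1.0
    return len(true_set & set(map(int, predicted_idx))) / len(true_set)
```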
7. Comparative Context and Extensions
ReGLU-based sparse FFNs represent one point along the spectrum of activation-sparsity techniques, with SwiGLU-based MoC FFNs (Wu et al., 12 Nov 2025) serving as a notable alternative. While MoC leverages the intrinsic sparsity of SwiGLU via per-token top-$k$ gating and achieves substantial reductions in FFN activation memory together with corresponding inference speedups, ReGLU-based methods yield even higher effective sparsity (roughly 88% and above) but require threshold-based pruning and prediction infrastructure. Both approaches are compatible with standard optimizers (AdamW with cosine decay), mixed precision, and gradient checkpointing, making them amenable to production-scale LLM pretraining and inference (Wu et al., 12 Nov 2025, Zhang et al., 2024).