
Efficient Activation Functions for Sparse LLMs

Updated 15 December 2025
  • Sparse activation functions such as ReLU and dReLU enable up to 90–95% neuron inactivation, significantly reducing inference computation and memory footprint.
  • Methods such as ReLUfication, progressive regularization, thresholding, and Top-K selection maintain accuracy while inducing high activation sparsity.
  • Reported speedups reach up to 4.5×, alongside hardware optimizations that exploit predictable zero activations for runtime efficiency in LLMs.

Efficient activation functions for sparse LLMs enable substantial reductions in inference computation and memory footprint by facilitating the runtime skipping of weakly-contributing neurons. Activation sparsity emerges either intrinsically—when the activation function itself outputs exact zeros (e.g. ReLU, dReLU)—or functionally, when activations are near zero and can be thresholded or pruned post hoc. The field pursues methods for both maximal sparsity and minimal downstream performance degradation, across a landscape of gated-FFN architectures, regularization schedules, thresholding procedures, and predictor-driven or data-free pruning rules.

1. Activation Functions and Intrinsic Sparsity Properties

Most LLMs employ activations such as GELU ($x \cdot \Phi(x)$) and SiLU/Swish ($x \cdot \sigma(x)$) that yield almost everywhere non-zero outputs, resulting in dense intermediate representations. These functions are smooth, but their output distributions are suboptimal for hardware-accelerated sparse inference, as zero-skipping kernels require exact zeros to prune matrix-vector operations at runtime (Song et al., 21 Feb 2024, Mirzadeh et al., 2023). In contrast, ReLU ($\max(0, x)$) clips all negative pre-activations to zero, producing structural elementwise sparsity in FFN layers. Empirical studies show that post-fine-tuning, ReLU-based LLMs exhibit $s \approx 0.90$–$0.95$ sparsity, meaning 90–95% of neurons are inactive per token for typical parameterizations (Mirzadeh et al., 2023, Song et al., 21 Feb 2024). Shifted ReLU ($\mathrm{ReLU}(x-b)$) and thresholded variants further increase the region of zero outputs, while squaring in ReLU$^2$ ($[\max(0, x)]^2$) accentuates the separation between high- and low-magnitude activations, driving even higher skip ratios without excessive signal loss (Zhang et al., 6 Feb 2024). The recently proposed dReLU applies ReLU on both the gate and up FFN projections, ensuring that a negative pre-activation in either silences the neuron, yielding $\sim$90% or greater inherent sparsity in modern Mistral and Mixtral models (Song et al., 10 Jun 2024).
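
For concreteness, the following NumPy sketch (not drawn from any of the cited implementations) defines these activation variants and measures the fraction of exact zeros they produce; the synthetic Gaussian pre-activations and the shift value `b` are illustrative assumptions.

```python
import numpy as np

def relu(x): return np.maximum(0.0, x)
def shifted_relu(x, b=0.5): return np.maximum(0.0, x - b)   # b is an illustrative shift
def relu_squared(x): return np.maximum(0.0, x) ** 2
def silu(x): return x / (1.0 + np.exp(-x))                   # dense: almost never exactly zero

def drelu(gate, up):
    # dReLU-style gating: ReLU on both the gate and up projections,
    # so a negative pre-activation in either branch zeroes the neuron.
    return relu(gate) * relu(up)

def sparsity(a):
    """Fraction of exactly-zero activations (the quantity zero-skipping kernels exploit)."""
    return float(np.mean(a == 0.0))

# Synthetic pre-activations standing in for one token's FFN hidden states.
rng = np.random.default_rng(0)
pre = rng.normal(size=11008)                 # e.g. a LLaMA-2-7B-sized intermediate width
gate, up = rng.normal(size=(2, 11008))

for name, act in [("ReLU", relu(pre)), ("shifted ReLU", shifted_relu(pre)),
                  ("ReLU^2", relu_squared(pre)), ("SiLU", silu(pre)),
                  ("dReLU", drelu(gate, up))]:
    print(f"{name:>13}: sparsity = {sparsity(act):.2f}")
```

On random Gaussian inputs ReLU zeroes roughly half the entries and dReLU roughly three quarters, while SiLU produces essentially none; trained LLMs reach the much higher 90%+ figures quoted above because training shifts most pre-activations below zero.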

2. Methods for Inducing and Controlling Sparsity

Several frameworks exist for pushing LLMs toward high activation sparsity while preserving accuracy:

  • ReLUfication and Fine-Tuning: Directly substituting GELU/SiLU/Swish with ReLU and lightly fine-tuning restores performance with minimal accuracy loss, reaching up to 67% sparsity in baseline models and upwards of 89% with progressive regularization (Song et al., 21 Feb 2024).
  • Progressive Regularization (ProSparse): Adds an $L_1$ sparsity regularizer to FFN outputs, with a regularization coefficient $\lambda(t)$ increased along a multi-stage sine schedule, avoiding abrupt distribution shifts and supporting adaptation (Song et al., 21 Feb 2024); a schematic of such a schedule is sketched after this list. Threshold shifting after ProSparse further prunes small activations with negligible impact, allowing up to 89.32% sparsity and a 4.52$\times$ speedup on LLaMA2-7B.
  • Thresholding and Performance-Aware Control (PPL-$p\%$ Sparsity): Identifies per-layer magnitude thresholds $\epsilon^*$ that maximize sparsity while bounding the perplexity increase to a fixed percentage (e.g., $p = 1\%$), enabling safe post hoc pruning in both ReLU and non-ReLU LLMs (Luo et al., 4 Nov 2024).
  • Top-$K$ Selection (MoC, LaRoSA): Selects the $K$ highest-scoring channels per token via native gating signals (MoC) or principal-component rotations (LaRoSA), achieving token-wise dynamic sparsity that is robust across model scales and domains (Wu et al., 12 Nov 2025, Liu et al., 2 Jul 2025).
  • Rank-Aware Decomposition (R-Sparse): Decomposes input activations into sparse and residual low-rank components post-activation; the latter is addressed via Singular Value Decomposition of weights, enabling training-free sparsification in non-ReLU functions for Llama/Mistral FFNs (Zhang et al., 28 Apr 2025).
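
As a rough illustration of the progressive regularization idea referenced above, the sketch below ramps an $L_1$ coefficient along a piecewise sine schedule and adds the resulting penalty to the training loss; the stage boundaries, peak coefficients, and PyTorch wiring are assumptions for illustration, not ProSparse's published schedule.

```python
import math
import torch

def l1_coefficient(step, stages):
    """Piecewise sine-ramped L1 coefficient lambda(t).

    `stages` is a list of (start_step, end_step, peak_lambda) tuples; within each
    stage the coefficient rises smoothly from the previous peak along a sine curve.
    Stage boundaries and peaks here are illustrative, not ProSparse's actual values.
    """
    lam_prev = 0.0
    for start, end, peak in stages:
        if step < start:
            return lam_prev
        if step <= end:
            frac = (step - start) / max(1, end - start)
            return lam_prev + (peak - lam_prev) * math.sin(0.5 * math.pi * frac)
        lam_prev = peak
    return lam_prev

def sparsity_regularizer(ffn_activations, step, stages):
    # L1 penalty on FFN activations, to be added to the language-modeling loss.
    lam = l1_coefficient(step, stages)
    return lam * ffn_activations.abs().mean()

# Example: three stages ramping lambda from 0 to 5e-3 over 10k steps (hypothetical values).
stages = [(0, 3000, 1e-3), (3000, 6000, 3e-3), (6000, 10000, 5e-3)]
acts = torch.relu(torch.randn(4, 128, 11008))     # (batch, tokens, ffn_width)
loss_reg = sparsity_regularizer(acts, step=4500, stages=stages)
```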

3. Sparsity–Performance Trade-Offs and Empirical Scaling Laws

Empirical investigations consistently report that ReLU-based and dReLU models allow substantially higher activation sparsity for a given accuracy budget than GELU, SiLU, or SwiGLU (Song et al., 21 Feb 2024, Song et al., 10 Jun 2024, Luo et al., 4 Nov 2024). In performance-aware metrics, the activation ratio (fraction of nonzero channels) in ReLU networks decreases as the training corpus size $D$ increases, following a fitted log-space power law $A_\mathrm{ReLU}(D) = \exp(-c D^{\alpha} + b) + A_0$, where $A_0$ is the limiting ratio (Luo et al., 4 Nov 2024). SiLU models show the opposite trend, with increasing activation ratios under more data. Downstream accuracy losses are negligible at moderate sparsity thresholds (PPL-1%) but become pronounced for more aggressive pruning (PPL-5%).
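
To make the functional form concrete, the short sketch below evaluates this power law with made-up coefficients (the fitted values are reported by Luo et al., 4 Nov 2024); the point is only that the activation ratio decays toward its floor $A_0$ as the data budget $D$ grows.

```python
import numpy as np

def activation_ratio_relu(D, c=0.5, alpha=0.35, b=-0.5, A0=0.065):
    """Log-space power law A_ReLU(D) = exp(-c * D**alpha + b) + A0.

    c, alpha, b, A0 are placeholder values chosen so the curve decays toward a
    ~6.5% floor; the actual fitted coefficients come from Luo et al. (2024).
    """
    D = np.asarray(D, dtype=float)
    return np.exp(-c * np.power(D, alpha) + b) + A0

# Activation ratio versus training-data budget (arbitrary units), approaching A0.
for D in [1, 10, 100, 1000]:
    print(f"D = {D:>5}: activation ratio ≈ {activation_ratio_relu(D):.3f}")
```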

Activation sparsity is also scale-robust: the limiting activation ratio $A_0$ in ReLU models varies only weakly, from 6.1% to 7.8%, across 0.1B–1.2B parameter scales. Functional and architectural factors—such as the depth-to-width ratio—modulate per-layer sparsity, with deeper models yielding lower activation ratios and hence higher sparsity (Luo et al., 4 Nov 2024). In the MoE setting, combining expert routing with neuron-level dReLU sparsity restricts per-token active parameters to $<10\%$ of the total, with benchmark accuracy matched or exceeded (Song et al., 10 Jun 2024).

| Activation Function | Typical Sparsity (%) | Downstream Quality Impact |
|---|---|---|
| ReLU (fine-tuned) | 66–89 | <1 point (ProSparse) |
| Shifted ReLU | 69–71 | <1 point |
| Fixed L₁ (aggressive) | >91 | severe quality degradation |
| dReLU (TurboSparse) | ≈90 | none, or measurable improvement |
| ReLU² | 90–95 | <0.1% accuracy drop |
| SwiGLU, SiLU, GELU | 1–40 | baseline (dense inference) |
| MoC Top-K (SwiGLU) | K/d_ffn (configurable) | ≪1% PPL increase |

4. Hardware Affinity and Inference Acceleration

Sparse activation functions directly enable runtime skipping in matrix-vector multiplications via elementwise masking. ReLU/dReLU yield predictable zeros, allowing zero-skipping kernels and fused GEMV implementations (Shin et al., 19 Nov 2024, Song et al., 21 Feb 2024). ReLU$^2$ accentuates token-to-token mask stability, improving temporal reuse (reuse ratios up to 0.45 at high sparsity) and spatial neuron co-locality (top-average coactivation gap $\sim$0.25) (Zhang et al., 6 Feb 2024). Practical kernels leverage sign-bit packing (SparseInfer), column-major weight storage, fused Top-$K$ selection, and system primitives for selective loading (Wu et al., 12 Nov 2025, Song et al., 10 Jun 2024, Liu et al., 2 Jul 2025).
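
To make the zero-skipping idea concrete, here is a deliberately simplified NumPy sketch of a down-projection GEMV that only touches the weight columns whose corresponding activations are nonzero; production kernels such as those cited above operate on packed masks and custom memory layouts rather than Python-level indexing.

```python
import numpy as np

def sparse_down_proj(W_down, h):
    """Compute y = W_down @ h, skipping columns where the activation is exactly zero.

    W_down: (d_model, d_ffn) down-projection weights.
    h:      (d_ffn,) post-activation hidden vector with many exact zeros (e.g. from ReLU/dReLU).
    """
    active = np.nonzero(h)[0]                  # indices of nonzero activations
    # Only the active columns of W_down are read and multiplied.
    return W_down[:, active] @ h[active]

# Tiny demo: with ~90% sparsity, ~90% of the weight columns are never touched.
rng = np.random.default_rng(0)
d_model, d_ffn = 4096, 11008
W_down = rng.normal(size=(d_model, d_ffn)).astype(np.float32)
h = np.maximum(0.0, rng.normal(loc=-1.5, size=d_ffn)).astype(np.float32)   # ReLU-like, mostly zero

y_sparse = sparse_down_proj(W_down, h)
y_dense = W_down @ h
assert np.allclose(y_sparse, y_dense, atol=1e-3)
print("active fraction:", len(np.nonzero(h)[0]) / d_ffn)
```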

Measured speedups scale with achieved sparsity: ProSparse, dReLU, and ReLU$^2$ methods report $1.3\times$–$4.5\times$ decoding gains on A100 GPUs or consumer CPUs, and up to $22\times$ acceleration on mobile devices for MoE models (Song et al., 10 Jun 2024). These gains are robust across batch sizes and persist when combined with quantization (W4A16, INT4) (Liu et al., 2 Jul 2025).

5. Training-Free and Predictor-Based Sparsification Techniques

For architectures where retraining or fine-tuning is impractical, several training-free schemes efficiently induce activation sparsity:

  • SparseInfer: Predicts per-row zero activations using an XOR-popcount of the sign bits of inputs and weights; an adaptive parameter $\alpha$ tunes the precision-recall trade-off. Accuracy loss is limited to $<1\%$ while yielding up to $1.8\times$ speedup over dense inference (Shin et al., 19 Nov 2024). A simplified sketch of this sign-bit prediction appears after this list.
  • LaRoSA: Applies layerwise PCA-style rotations to input activations, performing Top-$K$ selection in the rotated basis and merging the rotation into adjacent weights. Provides stable speedup and negligible accuracy loss up to 50% sparsity (Liu et al., 2 Jul 2025).
  • MoC: Uses SwiGLU's native gating scores for per-token Top-$K$ selection; only the $K$ selected channels are activated, reducing both forward-pass activation storage and memory transfers. MoC integrates seamlessly with gradient checkpointing and hardware semi-sparse GEMM kernels for further efficiency (Wu et al., 12 Nov 2025).
  • R-Sparse: Decomposes activations into thresholded sparse channels and low-rank residuals using SVD; functions on arbitrary activation types with no retraining and matches dense accuracy at 50% sparsity (Zhang et al., 28 Apr 2025).
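
The sketch below illustrates the sign-bit idea behind SparseInfer in plain NumPy: it counts, per output neuron, how many input-weight products are negative by XOR-ing sign bits, and predicts that ReLU will zero the neuron when negative terms dominate. The majority-vote cutoff and the margin `alpha` are illustrative simplifications; the actual predictor packs sign bits and uses popcount instructions.

```python
import numpy as np

def predict_active_neurons(x, W_gate, alpha=0.0):
    """Training-free sign-bit prediction of which ReLU outputs will be nonzero.

    x:      (d_model,) input activations for one token.
    W_gate: (d_ffn, d_model) FFN projection whose output feeds ReLU.
    alpha:  margin that trades recall (keep more neurons) against precision.

    A product x[j] * W_gate[i, j] is negative exactly when the sign bits differ,
    so XOR-ing sign bits and counting gives a cheap proxy for the sign of the
    full dot product. The majority-vote rule is a simplification of the
    packed-popcount predictor described by Shin et al. (2024).
    """
    x_neg = x < 0
    W_neg = W_gate < 0                               # precomputable once per model
    neg_terms = np.logical_xor(W_neg, x_neg[None, :])
    neg_count = neg_terms.sum(axis=1)
    cutoff = (1.0 + alpha) * W_gate.shape[1] / 2.0
    return neg_count < cutoff                        # True => predicted active

# Usage: compute only the predicted-active rows; skipped rows stay exactly zero.
rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
W_gate = rng.normal(size=(11008, 4096)).astype(np.float32)
mask = predict_active_neurons(x, W_gate, alpha=0.05)
h = np.zeros(W_gate.shape[0], dtype=np.float32)
h[mask] = np.maximum(0.0, W_gate[mask] @ x)
```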

6. Design Recommendations and Future Directions

Research indicates that optimizing both architectural (activation function choice, depth-to-width ratio) and procedural (progressive regularization, data mixture for ReLUfication) elements is crucial for efficient sparse LLMs. Recommendations include favoring ReLU- or dReLU-based FFN blocks for maximal intrinsic sparsity; supplementing non-ReLU networks with activation thresholding and Top-$K$ selection; using performance-aware metrics (PPL-1% sparsity) for safe deployment; and targeting deeper (smaller width-to-depth ratio) models within fixed parameter budgets (Luo et al., 4 Nov 2024).
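
One simple way to operationalize the PPL-$p\%$ recommendation is a bisection over a per-layer magnitude threshold, accepting the largest threshold whose measured perplexity stays within $p\%$ of the dense baseline. The bisection below, and the `eval_perplexity` / `set_threshold` hooks it relies on, are assumptions for illustration rather than the procedure specified by Luo et al.

```python
def find_ppl_p_threshold(eval_perplexity, set_threshold, p=0.01,
                         lo=0.0, hi=1.0, iters=20):
    """Largest magnitude threshold whose perplexity increase stays within p (e.g. 1%).

    eval_perplexity(): returns validation perplexity under the current threshold.
    set_threshold(t):  zeroes activations with |a| < t in the target layer.
    Both hooks are hypothetical callbacks the caller must provide.
    """
    set_threshold(0.0)
    ppl_dense = eval_perplexity()
    budget = ppl_dense * (1.0 + p)
    best = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        set_threshold(mid)
        if eval_perplexity() <= budget:
            best, lo = mid, mid      # still within budget: try a larger threshold
        else:
            hi = mid                 # too aggressive: back off
    set_threshold(best)
    return best
```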

Open directions include the extension of progressive sparsification and thresholding to attention blocks, automatic tuning of sparsity regularization schedules, joint optimization of activation and weight sparsity, and further hardware-software co-design to expose sparsity knobs to runtime kernels (Song et al., 21 Feb 2024, Szatkowski et al., 30 Aug 2025). The rising prevalence of large, quantized, and expert-based models further motivates research into compositional and adaptive sparse activation strategies.

7. Comparative Summary

Recent work attests to the centrality of activation function design for unlocking efficient sparse LLM inference. The table below summarizes the leading approaches:

| Method | Activation Type | Sparsity Achieved | Training Required | Inference Speedup | Reference |
|---|---|---|---|---|---|
| ProSparse | ReLU + L₁/sine | 89% (LLaMA2-7B) | Short fine-tuning | 4.5× (PowerInfer) | (Song et al., 21 Feb 2024) |
| SparseInfer | ReLU (predictor) | ≈90% | None (training-free) | 1.8× | (Shin et al., 19 Nov 2024) |
| TurboSparse/dReLU | dReLU | 90–97% (MoE) | Data-mixture SFT | 2–5× | (Song et al., 10 Jun 2024) |
| ReLU² | Squared ReLU | 95% | Train from scratch | Hardware-aligned | (Zhang et al., 6 Feb 2024) |
| LaRoSA | Rotation + Top-K | 40–50% | Calibration data | 1.3× (A100) | (Liu et al., 2 Jul 2025) |
| MoC | SwiGLU + Top-K | 13–50% | None | 1.13–1.52× | (Wu et al., 12 Nov 2025) |
| R-Sparse | Generic, rank-sparse | 50% | None | 1.43× | (Zhang et al., 28 Apr 2025) |
| PPL-p% Metric | Any | 1–90%* | N/A | Theory-guided | (Luo et al., 4 Nov 2024) |

All methods above report matching or only minimally degraded performance (typically <1 point on MMLU/GSM8K/BBH/perplexity) at mainline sparsity settings.


*PPL-$p\%$ sparsity (with $p = 1\%$) yields safe, performance-aware thresholds across arbitrary activations (Luo et al., 4 Nov 2024).
