Exploiting Activation Sparsity in LLMs: A Case for ReLU
Large language models (LLMs) have transformed artificial intelligence applications, but their computational demands during inference make deployment in resource-constrained environments challenging. This paper revisits the choice of activation function and re-evaluates the Rectified Linear Unit (ReLU) for LLMs, showing how the activation sparsity it induces can improve inference efficiency without significantly sacrificing accuracy, and making the case for ReLU over smoother alternatives such as GELU and SiLU.
Activation Functions and Computational Load
The paper first challenges the prevailing preference for smooth activation functions in modern LLMs. GELU and SiLU have largely displaced ReLU because they offer marginal improvements in convergence and accuracy. Comparing these functions against ReLU experimentally, the authors find that the performance differences become negligible once models are trained on sufficiently large datasets, and they argue that whatever slight accuracy gains the smoother functions provide are outweighed by their higher inference cost when efficiency is the priority.
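For readers unfamiliar with the functions being compared, the short PyTorch snippet below (an illustration, not code from the paper) evaluates all three on the same inputs; the key difference is that ReLU outputs exact zeros for negative inputs, while GELU and SiLU do not.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)

print("x:   ", x)
print("ReLU:", F.relu(x))   # max(0, x): exact zeros for every negative input
print("GELU:", F.gelu(x))   # x * Phi(x): smooth, negative inputs map to small non-zero values
print("SiLU:", F.silu(x))   # x * sigmoid(x): also smooth and non-zero for negative inputs
```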
Activation Sparsity: Theoretical Insights and Empirical Results
A key element of this research is activation sparsity: a large fraction of neurons produce exactly zero during a forward pass, so their downstream computation can be skipped. The paper shows that ReLU induces substantial activation sparsity, reducing the number of floating-point operations (FLOPs) required for inference. For example, in the ReLU-based OPT models, sparsity exceeds 90% in some layers, which translates into roughly 32% fewer inference FLOPs than baseline models using GELU or SiLU.
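A minimal PyTorch sketch of how per-layer activation sparsity can be measured in a toy feed-forward block (the dimensions and data here are made up for illustration; real LLM layers and inputs would be needed to reproduce the 90% figure):

```python
import torch
import torch.nn as nn

# Toy transformer feed-forward block with illustrative dimensions.
d_model, d_ff = 512, 2048
ffn_in = nn.Linear(d_model, d_ff)
ffn_out = nn.Linear(d_ff, d_model)

tokens = torch.randn(32, d_model)        # 32 synthetic token embeddings
hidden = torch.relu(ffn_in(tokens))      # ReLU zeroes out inactive neurons

# Sparsity = fraction of post-ReLU activations that are exactly zero.
sparsity = (hidden == 0).float().mean().item()
print(f"activation sparsity: {sparsity:.1%}")

# Zero entries of `hidden` contribute nothing to this matmul, so if a fraction
# s of activations is zero, its effective FLOPs shrink by roughly s.
output = ffn_out(hidden)
```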
Practical Efficiency Gains Through "Relufication"
The authors introduce the concept of "relufication," which involves replacing existing activation functions with ReLU in pretrained LLMs and further optimizing the network structure. The paper describes two stages of this process:
- Replacement of activation functions: pretrained models that originally use GELU or SiLU have those activations swapped for ReLU and are then fine-tuned, which substantially increases activation sparsity.
- Insertion of additional ReLU layers: placing extra ReLU layers after the normalization layers in both the attention and feed-forward components sparsifies their inputs as well, further reducing FLOPs without notable accuracy loss. (A minimal sketch of both stages appears after this list.)
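A minimal PyTorch sketch of the two stages, assuming the pretrained model exposes its activations as nn.GELU/nn.SiLU modules (the names and structure here are illustrative, not the paper's code); in practice the modified model is fine-tuned afterwards to recover accuracy:

```python
import torch.nn as nn

def relufy(module: nn.Module) -> nn.Module:
    """Stage 1: recursively replace GELU/SiLU activations with ReLU."""
    for name, child in module.named_children():
        if isinstance(child, (nn.GELU, nn.SiLU)):
            setattr(module, name, nn.ReLU())
        else:
            relufy(child)
    return module

class NormThenReLU(nn.Module):
    """Stage 2: wrap a normalization layer so an extra ReLU follows it,
    sparsifying the inputs to the attention and feed-forward blocks."""
    def __init__(self, norm: nn.Module):
        super().__init__()
        self.norm = norm
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.norm(x))
```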
Models subjected to relufication show substantial efficiency improvements: for large models, the process cuts inference FLOPs by up to a factor of three, reducing both compute and memory requirements while maintaining competitive performance on standard NLP benchmarks.
Leveraging Aggregated Sparsity and Future Directions
The paper also introduces aggregated sparsity, a measure of how many distinct neurons are used across a span of consecutive tokens. Neurons that fire while generating one token tend to fire again for subsequent tokens, so the set of active neurons grows slowly. This overlap can be exploited by inference-time optimizations such as speculative decoding, which gains additional speedup by reusing the weights of neurons that are already activated.
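The metric itself is simple to state; the sketch below (using synthetic activations for illustration) computes aggregated sparsity as the fraction of neurons that never fire over a window of tokens:

```python
import torch

def aggregated_sparsity(hidden: torch.Tensor) -> float:
    """hidden: (num_tokens, d_ff) post-ReLU activations for consecutive tokens.
    A neuron counts as used if it fires for at least one token in the window."""
    used = (hidden > 0).any(dim=0)
    return 1.0 - used.float().mean().item()

# Toy usage with random activations. On real LLM activations, aggregated
# sparsity decays slowly with window size because later tokens largely reuse
# the neurons already activated by earlier tokens; random data shows no such reuse.
acts = torch.relu(torch.randn(16, 2048) - 1.0)
print(f"per-token sparsity:  {(acts == 0).float().mean().item():.1%}")
print(f"aggregated sparsity: {aggregated_sparsity(acts):.1%}")
```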
The authors also explore modified ReLU variants, such as shifted ReLU, that raise the activation threshold to increase sparsity further without compromising model quality, suggesting that additional efficiency can be gained by strategically manipulating activation thresholds.
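A one-line sketch of shifted ReLU (the shift value here is an arbitrary illustration, not a value from the paper):

```python
import torch

def shifted_relu(x: torch.Tensor, shift: float = 0.5) -> torch.Tensor:
    # max(0, x - shift): raising the threshold above zero zeroes out more
    # activations than plain ReLU, trading a little signal for extra sparsity.
    return torch.relu(x - shift)

x = torch.randn(4, 2048)
print("ReLU sparsity:        ", (torch.relu(x) == 0).float().mean().item())
print("shifted ReLU sparsity:", (shifted_relu(x) == 0).float().mean().item())
```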
Conclusion
The research advocates for a reassessment of activation function preferences in LLMs, emphasizing activation sparsity as a means to reconcile robust performance with computational efficiency. By reviving ReLU, the paper provides a practical pathway to more resource-efficient LLMs, potentially broadening deployment across various hardware environments. The insights into activation patterns and strategies to exploit them pave the way for future research aimed at enhancing the efficiency of AI systems through architectural innovations.