ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models (2310.04564v1)

Published 6 Oct 2023 in cs.LG and cs.AI

Abstract: LLMs with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, known for increased computation, this study strongly advocates for reinstating ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer. This reduction is particularly valuable during the memory-bound inference step, where efficiency is paramount. Exploring sparsity patterns in ReLU-based LLMs, we unveil the reutilization of activated neurons for generating new tokens and leveraging these insights, we propose practical strategies to substantially reduce LLM inference computation up to three times, using ReLU activations with minimal performance trade-offs.

Exploiting Activation Sparsity in LLMs: A Case for ReLU

LLMs have transformed artificial intelligence applications, but the computational demands during inference create challenges for deployment in resource-constrained environments. This paper investigates the role of activation functions and re-evaluates the potential use of the Rectified Linear Unit (ReLU) in LLMs. The paper explores activation sparsity to enhance model efficiency without significantly sacrificing performance, making the case for leveraging ReLU activations over alternatives like GELU and SiLU.

Activation Functions and Computational Load

The paper first challenges the trend favoring smoother activation functions in modern LLMs. Historically, alternatives such as GELU and SiLU have been preferred due to their marginal improvements in convergence and accuracy. However, through an experimental setup comparing these to ReLU, the paper finds that the performance differences are negligible when models are trained on substantial datasets. The authors argue that while smoother activation functions may offer slight performance gains, the increased computational cost during inference outweighs these benefits when efficiency is prioritized.
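
To make the distinction concrete, the short sketch below (not from the paper; it assumes PyTorch and random Gaussian pre-activations) shows why only ReLU produces exact zeros that an inference engine can skip, while GELU and SiLU merely produce small nonzero values.

```python
# Sketch: fraction of exact zeros produced by each activation on random inputs.
import torch
import torch.nn.functional as F

x = torch.randn(100_000)  # hypothetical pre-activation values

for name, act in [("relu", F.relu), ("gelu", F.gelu), ("silu", F.silu)]:
    y = act(x)
    zero_frac = (y == 0).float().mean().item()
    print(f"{name}: exact-zero fraction = {zero_frac:.2f}")
# relu gives ~0.50 here; gelu and silu give ~0.00, so only ReLU yields
# activation sparsity that can be exploited to skip computation.
```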

Activation Sparsity: Theoretical Insights and Empirical Results

A key element of this research is the discussion of activation sparsity, a phenomenon in which a substantial portion of neurons remains inactive (zeroed out) during the forward pass of the network. The paper illustrates that ReLU induces significant activation sparsity, thereby reducing the number of floating-point operations (FLOPs) required during inference. For example, in an OPT model using ReLU, the sparsity in some layers can exceed 90%, translating into roughly a 32% reduction in inference computation compared to baseline models using GELU or SiLU.
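
As a rough illustration of where the savings come from, the sketch below is a toy example rather than the paper's implementation (assumptions: PyTorch, a plain two-layer FFN, random weights). Once the hidden activation is sparse, the down-projection only needs the columns corresponding to active neurons.

```python
# Sketch: skipping inactive neurons in a ReLU FFN (toy sizes, random weights).
import torch

d_model, d_ff = 1024, 4096
W_up = torch.randn(d_ff, d_model) / d_model**0.5
W_down = torch.randn(d_model, d_ff) / d_ff**0.5
x = torch.randn(d_model)                      # one token's hidden state

h = torch.relu(W_up @ x)                      # exact zeros after ReLU
active = h.nonzero(as_tuple=True)[0]          # indices of active neurons
print(f"activation sparsity: {1 - active.numel() / d_ff:.2%}")
# ~50% with random weights; the paper reports >90% in some trained OPT layers.

# The dense and the sparsity-aware down-projection give the same output,
# but the latter touches only the active columns of W_down.
y_dense = W_down @ h
y_sparse = W_down[:, active] @ h[active]
assert torch.allclose(y_dense, y_sparse, atol=1e-4)
```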

Practical Efficiency Gains Through "Relufication"

The authors introduce the concept of "relufication," which involves replacing existing activation functions with ReLU in pretrained LLMs and further optimizing the network structure. The paper describes two stages of this process:

  1. Replacement of Activation Functions: the non-ReLU activations of a pretrained model are swapped for ReLU and the model is fine-tuned, which substantially increases activation sparsity.
  2. Insertion of Additional ReLU Layers: extra ReLU layers are placed after the normalization layers in both the attention and feed-forward components, further increasing sparsity and reducing FLOPs without notable accuracy loss (see the sketch after this list).
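
The toy block below is a minimal sketch of both stages (assumptions: PyTorch; the module and attribute names such as `ffn_act` and `post_norm_act` are hypothetical and not taken from the paper's code or any particular LLM implementation).

```python
# Sketch: "relufication" of a toy transformer block in two stages.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn_up = nn.Linear(d_model, d_ff)
        self.ffn_act = nn.GELU()            # stage 1 replaces this
        self.ffn_down = nn.Linear(d_ff, d_model)
        self.post_norm_act = nn.Identity()  # stage 2 turns this into a ReLU

    def forward(self, x):
        h = self.post_norm_act(self.attn_norm(x))
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.post_norm_act(self.ffn_norm(x))
        return x + self.ffn_down(self.ffn_act(self.ffn_up(h)))

block = ToyBlock()
block.ffn_act = nn.ReLU()        # stage 1: swap the FFN activation for ReLU
block.post_norm_act = nn.ReLU()  # stage 2: add ReLU after the normalization layers
out = block(torch.randn(2, 8, 256))  # sanity check on a (batch, seq, d_model) input
# The modified model is then fine-tuned so accuracy recovers while sparsity rises.
```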

Models subjected to this relufication process showed substantial efficiency improvements. For large models, relufication reduced inference FLOPs by up to a factor of three, lowering computational and memory requirements while maintaining competitive performance on standard NLP benchmarks.

Leveraging Aggregated Sparsity and Future Directions

The paper introduces the notion of aggregated sparsity, a measure of neuron utilization across a window of several tokens. It reveals that neurons activated while generating one token tend to be reactivated for subsequent tokens, so the corresponding weights can be reused rather than reloaded, and inference optimizations such as speculative decoding benefit from this overlap, yielding additional speedup.
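
The sketch below (assumptions: PyTorch; random activation masks standing in for a real model's ReLU outputs) illustrates how aggregated sparsity can be measured over a window of tokens and why reuse matters: with independently random activations the union of active neurons grows quickly, whereas the paper observes that trained ReLU models keep reusing largely the same neurons, so aggregated sparsity decays far more slowly.

```python
# Sketch: per-token vs. aggregated sparsity over a window of generated tokens.
import torch

n_tokens, d_ff = 16, 4096
# Hypothetical binary activation masks (1 = neuron active for that token),
# drawn independently at ~90% per-token sparsity.
acts = (torch.rand(n_tokens, d_ff) > 0.9).float()

per_token = 1.0 - acts.mean().item()
# Aggregated sparsity: fraction of neurons never activated across the window.
aggregated = 1.0 - acts.any(dim=0).float().mean().item()
print(f"per-token: {per_token:.2%}, aggregated over {n_tokens} tokens: {aggregated:.2%}")
# With independent masks, aggregated sparsity collapses toward 0.9**16 (about 19%);
# the paper's point is that real ReLU LLMs reuse neurons, so it stays much higher,
# which lets weights loaded for one token be reused for the following ones.
```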

The authors also explore modified ReLU activations, such as a shifted ReLU, to further increase sparsity without compromising model quality, suggesting that additional efficiency gains can be obtained by strategically raising the activation threshold.
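
As an illustration of that idea, the snippet below sketches one common form of shifted ReLU, max(0, x - b), with an arbitrary shift b chosen only to show the effect on sparsity (the value is not from the paper).

```python
# Sketch: raising the activation threshold with a shifted ReLU.
import torch

def shifted_relu(x: torch.Tensor, b: float = 0.5) -> torch.Tensor:
    # Zeroes everything below b instead of below 0, producing more exact zeros
    # at the cost of slightly altering the surviving activations.
    return torch.relu(x - b)

x = torch.randn(100_000)
print(f"ReLU sparsity:         {(torch.relu(x) == 0).float().mean().item():.2%}")
print(f"shifted ReLU sparsity: {(shifted_relu(x) == 0).float().mean().item():.2%}")
```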

Conclusion

The research advocates for a reassessment of activation function preferences in LLMs, emphasizing activation sparsity as a means to reconcile robust performance with computational efficiency. By reviving ReLU, the paper provides a practical pathway to more resource-efficient LLMs, potentially broadening deployment across various hardware environments. The insights into activation patterns and strategies to exploit them pave the way for future research aimed at enhancing the efficiency of AI systems through architectural innovations.

Authors (8)
  1. Iman Mirzadeh
  2. Keivan Alizadeh
  3. Sachin Mehta
  4. Carlo C Del Mundo
  5. Oncel Tuzel
  6. Golnoosh Samei
  7. Mohammad Rastegari
  8. Mehrdad Farajtabar