
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models (2402.13516v6)

Published 21 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, activation sparsity has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most LLMs adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity while maintaining comparable performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along the multi-stage sine curves. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, achieving comparable performance to their original Swish-activated versions. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52× inference speedup.

An Expert Review of "ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity Within LLMs"

The paper presents "ProSparse," a methodology designed to introduce and enhance intrinsic activation sparsity in LLMs. The research addresses a critical challenge in deploying LLMs: the significant computational cost of inference. The paper's central claim is that by incorporating intrinsic activation sparsity, these costs can be substantially reduced without compromising model performance.

Background and Motivation

The foundational concept of activation sparsity is rooted in the observation that many elements of a neural network's activation outputs contribute only minimally to the final result. Early LLMs used ReLU and therefore exhibited activation sparsity naturally, but this paper identifies an industry trend where contemporary models, such as LLaMA and Falcon, favor activation functions like GELU and Swish that lack this sparsity property.

The authors posit that ReLUfication, that is, substituting the non-ReLU activation functions of LLMs with ReLU, could enhance sparsity and thus improve inference efficiency. However, they point out that straightforward function substitution has limitations, typically yielding either insufficient sparsity or degraded model performance. This issue forms the core motivation for the development of ProSparse.

Methodology: The ProSparse Framework

ProSparse comprises three principal steps (combined in the code sketch after this list):

  1. Activation Function Substitution: Initially, the Swish or GELU activation functions in LLMs are substituted with ReLU. This step introduces a basic level of activation sparsity, but on its own it does not reach the high sparsity the method targets.
  2. Progressive Sparsity Regularization: The paper's novel contribution lies in its progressive regularization approach. Rather than remaining static, the regularization factor escalates gradually in multiple stages along smooth sine curves. This gives the model ample time to adapt to changes in its activation distribution, mitigating the abrupt performance losses that typically accompany heavy regularization.
  3. Activation Threshold Shifting: Finally, to prune non-essential activations further, the activation threshold of ReLU is shifted to a positive value so that low-magnitude, low-impact activations are zeroed out as well, increasing sparsity further.
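To make these steps concrete, below is a minimal PyTorch-style sketch that combines them for a gated feed-forward block: a shifted-threshold ReLU replaces the original activation, and an L1 penalty on the intermediate activations grows along staged sine curves. The function names, the stage tuples, and the mean-absolute form of the penalty are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch

def shifted_relu(x: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Steps 1 and 3: ReLU with an optional positive threshold.
    With threshold > 0, weak positive activations are also zeroed out."""
    return torch.where(x > threshold, x, torch.zeros_like(x))

def sine_reg_factor(step: int, stages: list[tuple[int, int, float]]) -> float:
    """Step 2 (sketch): the L1 factor rises from the previous stage's peak to the
    current stage's peak along a quarter sine curve; values are illustrative."""
    prev_peak = 0.0
    for start, end, peak in stages:
        if step < start:
            return prev_peak
        if step <= end:
            t = (step - start) / max(end - start, 1)
            return prev_peak + (peak - prev_peak) * math.sin(0.5 * math.pi * t)
        prev_peak = peak
    return prev_peak

def relufied_ffn(x, w_gate, w_up, w_down, step, stages, threshold=0.0):
    """Gated FFN whose gating activation is a (shifted) ReLU instead of Swish,
    returning both the output and the sparsity regularization term."""
    h = shifted_relu(x @ w_gate.T, threshold) * (x @ w_up.T)   # sparse intermediate
    reg_loss = sine_reg_factor(step, stages) * h.abs().mean()  # L1 penalty on it
    return h @ w_down.T, reg_loss

# Hypothetical schedule: three stages with increasing peak factors.
stages = [(0, 1000, 1e-4), (1000, 3000, 5e-4), (3000, 6000, 1e-3)]
```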

Applied to ReLUfication of the LLaMA2 models, the methodology achieves high sparsity (e.g., 89.32% for LLaMA2-7B) with negligible performance degradation.
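For reference, the sparsity figures quoted here are ratios of inactive entries in the FFN intermediate activations, averaged over tokens; a minimal way to compute such a ratio is sketched below (the paper's exact evaluation protocol may differ).

```python
import torch

@torch.no_grad()
def activation_sparsity(intermediate: torch.Tensor, eps: float = 0.0) -> float:
    """Fraction of (near-)zero entries in FFN intermediate activations.
    `intermediate` has shape (num_tokens, d_ff); with a strict ReLU, eps = 0
    counts exact zeros."""
    return (intermediate.abs() <= eps).float().mean().item()

# Illustrative usage with random inputs (LLaMA2-7B uses d_ff = 11008):
acts = torch.relu(torch.randn(1024, 11008))
print(f"sparsity = {activation_sparsity(acts):.2%}")  # roughly 50% for Gaussian noise
```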

Key Results and Implications

The most compelling outcome of this research is the effective transformation of LLaMA2 models to achieve high activation sparsity without compromising capability: the sparsified models closely match the original non-sparse architectures on standard NLP benchmarks. Notably, ProSparse reached performance parity with the original Swish-activated baselines on benchmarks such as MMLU and AGIEval, while improving computational efficiency.

The robustness of ProSparse is further demonstrated through hardware-level acceleration tests. Experiments with both approximate and accurate acceleration algorithms reveal that the higher sparsity obtained through ReLUfication substantially benefits predictor-based acceleration frameworks at inference time. This empirical evidence supports the practical viability of ProSparse in reducing inference time and computational cost, reinforcing its significance for deploying LLMs efficiently.
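To illustrate why higher sparsity helps predictor-based frameworks, the sketch below shows the core idea for a single decoded token: a lightweight activation predictor (hypothetical here; in systems such as Deja Vu or PowerInfer it is a small network trained offline) selects the intermediate channels expected to fire, and the FFN is computed only over those channels. The sparser the activations, the smaller the active set and the larger the saving.

```python
import torch

@torch.no_grad()
def predictor_gated_ffn(x, w_gate, w_up, w_down, predictor):
    """Sparse FFN inference for one token (x has shape (1, d_model)).
    `predictor` scores the d_ff intermediate channels; only channels with a
    positive score are computed. Real systems fuse this into sparse kernels;
    this dense-indexing version is purely illustrative."""
    scores = predictor(x)                         # shape (1, d_ff)
    idx = (scores > 0).nonzero(as_tuple=True)[1]  # predicted-active channels
    h = torch.relu(x @ w_gate[idx].T) * (x @ w_up[idx].T)
    return h @ w_down[:, idx].T                   # back to (1, d_model)
```

A simple predictor for this sketch could be a small low-rank MLP trained offline to match the sign of the true gate pre-activation; the payoff grows directly with the fraction of channels that stay inactive.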

Future Directions and Theoretical Implications

While the outcomes of this research are promising, the authors acknowledge the scalability limits of their findings, having tested models of up to 13 billion parameters, and they encourage further exploration with both smaller and larger models to generalize the conclusions. Additionally, they identify a need to make sparse computation frameworks more practical, particularly for the front-end stages of feed-forward networks and for attention mechanisms.

ProSparse represents a significant leap towards alleviating the financial and environmental burdens tied to AI model deployment. By tackling sparsity through a methodically progressive and adaptable framework, this work underscores the critical balance between maintaining high computational efficiency and preserving the innate intelligence of LLMs.

In summary, the ProSparse framework is a notable contribution to the field, offering an innovative, scalable approach to enhancing LLM inference efficiency. Given the growing interest in sustainable AI solutions, its implications for both academic and industry applications are profound, providing a sophisticated yet feasible method to meet the dual demands of efficiency and performance in modern neural architectures.

Authors (11)
  1. Chenyang Song
  2. Xu Han
  3. Zhengyan Zhang
  4. Shengding Hu
  5. Xiyu Shi
  6. Kuai Li
  7. Chen Chen
  8. Zhiyuan Liu
  9. Guangli Li
  10. Tao Yang
  11. Maosong Sun