
ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models (2402.13516v6)

Published 21 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Activation sparsity refers to the existence of considerable weakly-contributed elements among activation outputs. As a prevalent property of the models using the ReLU activation function, activation sparsity has been proven a promising paradigm to boost model inference efficiency. Nevertheless, most LLMs adopt activation functions without intrinsic activation sparsity (e.g., GELU and Swish). Some recent efforts have explored introducing ReLU or its variants as the substitutive activation function to help LLMs achieve activation sparsity and inference acceleration, but few can simultaneously obtain high sparsity and comparable model performance. This paper introduces a simple and effective sparsification method named "ProSparse" to push LLMs for higher activation sparsity while maintaining comparable performance. Specifically, after substituting the activation function of LLMs with ReLU, ProSparse adopts progressive sparsity regularization with a factor smoothly increasing along the multi-stage sine curves. This can enhance activation sparsity and mitigate performance degradation by avoiding radical shifts in activation distributions. With ProSparse, we obtain high sparsity of 89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size MiniCPM-1B, respectively, achieving comparable performance to their original Swish-activated versions. These present the most sparsely activated models among open-source LLaMA versions and competitive end-size models, considerably surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference acceleration experiments further demonstrate the significant practical acceleration potential of LLMs with higher activation sparsity, obtaining up to 4.52× inference speedup.

An Expert Review of "ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity Within LLMs"

The paper presents "ProSparse," a methodology designed to introduce and enhance intrinsic activation sparsity in LLMs. The research addresses a critical challenge in deploying LLMs: the significant computational cost of inference. The paper's central claim is that by incorporating intrinsic activation sparsity, these costs can be substantially reduced without compromising model performance.

Background and Motivation

The foundational concept of activation sparsity is rooted in the observation that many elements of a neural network's activation outputs contribute only minimally to the final result. Early LLMs used ReLU and therefore exhibited activation sparsity naturally, but this paper identifies an industry trend where contemporary models, such as LLaMA and Falcon, favor activation functions like GELU and Swish that lack this sparsity property.

The authors posit that ReLUfication, that is, substituting the non-ReLU activation functions of LLMs with ReLU, could enhance sparsity and thus improve inference efficiency. However, they point out that straightforward function substitution has limitations, typically yielding either insufficient sparsity or degraded model performance. This issue forms the core motivation for the development of ProSparse.

Methodology: The ProSparse Framework

ProSparse comprises three principal steps (combined in the code sketch after this list):

  1. Activation Function Substitution: Initially, the Swish or GELU activation functions in LLMs are substituted with ReLU. This step introduces a basic level of activation sparsity, but on its own it does not reach the high sparsity the method targets.
  2. Progressive Sparsity Regularization: The paper's novel contribution lies in its progressive regularization approach. Rather than remaining static, the regularization factor escalates gradually in multiple stages along smooth sine curves. This gives the model ample time to adapt to changes in its activation distribution, mitigating the abrupt performance losses that typically accompany heavy regularization.
  3. Activation Threshold Shifting: Finally, to prune non-essential activations further, the activation threshold of ReLU is shifted to a positive value so that low-magnitude, low-impact activations are zeroed out as well, increasing sparsity further.
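To make these steps concrete, below is a minimal PyTorch-style sketch that combines them for a gated feed-forward block: a shifted-threshold ReLU replaces the original activation, and an L1 penalty on the intermediate activations grows along staged sine curves. The function names, the stage tuples, and the mean-absolute form of the penalty are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch

def shifted_relu(x: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Steps 1 and 3: ReLU with an optional positive threshold.
    With threshold > 0, weak positive activations are also zeroed out."""
    return torch.where(x > threshold, x, torch.zeros_like(x))

def sine_reg_factor(step: int, stages: list[tuple[int, int, float]]) -> float:
    """Step 2 (sketch): the L1 factor rises from the previous stage's peak to the
    current stage's peak along a quarter sine curve; values are illustrative."""
    prev_peak = 0.0
    for start, end, peak in stages:
        if step < start:
            return prev_peak
        if step <= end:
            t = (step - start) / max(end - start, 1)
            return prev_peak + (peak - prev_peak) * math.sin(0.5 * math.pi * t)
        prev_peak = peak
    return prev_peak

def relufied_ffn(x, w_gate, w_up, w_down, step, stages, threshold=0.0):
    """Gated FFN whose gating activation is a (shifted) ReLU instead of Swish,
    returning both the output and the sparsity regularization term."""
    h = shifted_relu(x @ w_gate.T, threshold) * (x @ w_up.T)   # sparse intermediate
    reg_loss = sine_reg_factor(step, stages) * h.abs().mean()  # L1 penalty on it
    return h @ w_down.T, reg_loss

# Hypothetical schedule: three stages with increasing peak factors.
stages = [(0, 1000, 1e-4), (1000, 3000, 5e-4), (3000, 6000, 1e-3)]
```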

Applied to ReLUfication of the LLaMA2 models, the methodology achieves high sparsity (e.g., 89.32% for LLaMA2-7B) with negligible performance degradation.
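For reference, the sparsity figures quoted here are ratios of inactive entries in the FFN intermediate activations, averaged over tokens; a minimal way to compute such a ratio is sketched below (the paper's exact evaluation protocol may differ).

```python
import torch

@torch.no_grad()
def activation_sparsity(intermediate: torch.Tensor, eps: float = 0.0) -> float:
    """Fraction of (near-)zero entries in FFN intermediate activations.
    `intermediate` has shape (num_tokens, d_ff); with a strict ReLU, eps = 0
    counts exact zeros."""
    return (intermediate.abs() <= eps).float().mean().item()

# Illustrative usage with random inputs (LLaMA2-7B uses d_ff = 11008):
acts = torch.relu(torch.randn(1024, 11008))
print(f"sparsity = {activation_sparsity(acts):.2%}")  # roughly 50% for Gaussian noise
```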

Key Results and Implications

The most compelling outcome of this research is the effective transformation of LLaMA2 models to achieve high activation sparsity without compromising capability: the sparsified models closely match the original non-sparse architectures on standard NLP benchmarks. Notably, ProSparse reached performance parity with the original Swish-activated baselines on benchmarks such as MMLU and AGIEval, while improving computational efficiency.

The robustness of ProSparse is further demonstrated through hardware-level acceleration tests. Experiments with both approximate and accurate acceleration algorithms reveal that the higher sparsity obtained through ReLUfication substantially benefits predictor-based acceleration frameworks at inference time. This empirical evidence supports the practical viability of ProSparse in reducing inference time and computational cost, reinforcing its significance for deploying LLMs efficiently.
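To illustrate why higher sparsity helps predictor-based frameworks, the sketch below shows the core idea for a single decoded token: a lightweight activation predictor (hypothetical here; in systems such as Deja Vu or PowerInfer it is a small network trained offline) selects the intermediate channels expected to fire, and the FFN is computed only over those channels. The sparser the activations, the smaller the active set and the larger the saving.

```python
import torch

@torch.no_grad()
def predictor_gated_ffn(x, w_gate, w_up, w_down, predictor):
    """Sparse FFN inference for one token (x has shape (1, d_model)).
    `predictor` scores the d_ff intermediate channels; only channels with a
    positive score are computed. Real systems fuse this into sparse kernels;
    this dense-indexing version is purely illustrative."""
    scores = predictor(x)                         # shape (1, d_ff)
    idx = (scores > 0).nonzero(as_tuple=True)[1]  # predicted-active channels
    h = torch.relu(x @ w_gate[idx].T) * (x @ w_up[idx].T)
    return h @ w_down[:, idx].T                   # back to (1, d_model)
```

A simple predictor for this sketch could be a small low-rank MLP trained offline to match the sign of the true gate pre-activation; the payoff grows directly with the fraction of channels that stay inactive.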

Future Directions and Theoretical Implications

While the outcomes of this research are promising, the authors acknowledge the scalability limits of their findings, having tested models of up to 13 billion parameters, and they encourage further exploration with both smaller and larger models to generalize the conclusions. Additionally, they identify a need to make sparse computation frameworks more practical, particularly for the front-end stages of feed-forward networks and for attention mechanisms.

ProSparse represents a significant leap towards alleviating the financial and environmental burdens tied to AI model deployment. By tackling sparsity through a methodically progressive and adaptable framework, this work underscores the critical balance between maintaining high computational efficiency and preserving the innate intelligence of LLMs.

In summary, the ProSparse framework is a notable contribution to the field, offering an innovative, scalable approach to enhancing LLM inference efficiency. Given the growing interest in sustainable AI solutions, its implications for both academic and industry applications are profound, providing a sophisticated yet feasible method to meet the dual demands of efficiency and performance in modern neural architectures.

Authors (11)
  1. Chenyang Song
  2. Xu Han
  3. Zhengyan Zhang
  4. Shengding Hu
  5. Xiyu Shi
  6. Kuai Li
  7. Chen Chen
  8. Zhiyuan Liu
  9. Guangli Li
  10. Tao Yang
  11. Maosong Sun