ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models (2310.04564v1)

Published 6 Oct 2023 in cs.LG and cs.AI

Abstract: LLMs with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, known for increased computation, this study strongly advocates for reinstating ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer. This reduction is particularly valuable during the memory-bound inference step, where efficiency is paramount. Exploring sparsity patterns in ReLU-based LLMs, we unveil the reutilization of activated neurons for generating new tokens and leveraging these insights, we propose practical strategies to substantially reduce LLM inference computation up to three times, using ReLU activations with minimal performance trade-offs.

Exploiting Activation Sparsity in LLMs: A Case for ReLU

LLMs have transformed artificial intelligence applications, but the computational demands during inference create challenges for deployment in resource-constrained environments. This paper investigates the role of activation functions and re-evaluates the potential use of the Rectified Linear Unit (ReLU) in LLMs. The paper explores activation sparsity to enhance model efficiency without significantly sacrificing performance, making the case for leveraging ReLU activations over alternatives like GELU and SiLU.

Activation Functions and Computational Load

The paper first challenges the trend favoring smoother activation functions in modern LLMs. Historically, alternatives such as GELU and SiLU have been preferred due to their marginal improvements in convergence and accuracy. However, through an experimental setup comparing these to ReLU, the paper finds that the performance differences are negligible when models are trained on substantial datasets. The authors argue that while smoother activation functions may offer slight performance gains, the increased computational cost during inference outweighs these benefits when efficiency is prioritized.
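
To make the distinction concrete, the short sketch below (not from the paper; it assumes PyTorch and random Gaussian pre-activations) shows why only ReLU produces exact zeros that an inference engine can skip, while GELU and SiLU merely produce small nonzero values.

```python
# Sketch: fraction of exact zeros produced by each activation on random inputs.
import torch
import torch.nn.functional as F

x = torch.randn(100_000)  # hypothetical pre-activation values

for name, act in [("relu", F.relu), ("gelu", F.gelu), ("silu", F.silu)]:
    y = act(x)
    zero_frac = (y == 0).float().mean().item()
    print(f"{name}: exact-zero fraction = {zero_frac:.2f}")
# relu gives ~0.50 here; gelu and silu give ~0.00, so only ReLU yields
# activation sparsity that can be exploited to skip computation.
```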

Activation Sparsity: Theoretical Insights and Empirical Results

A key element of this research is the discussion of activation sparsity, a phenomenon in which a substantial portion of neurons remains inactive (zeroed out) during the forward pass of the network. The paper illustrates that ReLU induces significant activation sparsity, thereby reducing the number of floating-point operations (FLOPs) required during inference. For example, in an OPT model using ReLU, the sparsity in some layers can exceed 90%, translating into roughly a 32% reduction in inference computation compared to baseline models using GELU or SiLU.
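
As a rough illustration of where the savings come from, the sketch below is a toy example rather than the paper's implementation (assumptions: PyTorch, a plain two-layer FFN, random weights). Once the hidden activation is sparse, the down-projection only needs the columns corresponding to active neurons.

```python
# Sketch: skipping inactive neurons in a ReLU FFN (toy sizes, random weights).
import torch

d_model, d_ff = 1024, 4096
W_up = torch.randn(d_ff, d_model) / d_model**0.5
W_down = torch.randn(d_model, d_ff) / d_ff**0.5
x = torch.randn(d_model)                      # one token's hidden state

h = torch.relu(W_up @ x)                      # exact zeros after ReLU
active = h.nonzero(as_tuple=True)[0]          # indices of active neurons
print(f"activation sparsity: {1 - active.numel() / d_ff:.2%}")
# ~50% with random weights; the paper reports >90% in some trained OPT layers.

# The dense and the sparsity-aware down-projection give the same output,
# but the latter touches only the active columns of W_down.
y_dense = W_down @ h
y_sparse = W_down[:, active] @ h[active]
assert torch.allclose(y_dense, y_sparse, atol=1e-4)
```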

Practical Efficiency Gains Through "Relufication"

The authors introduce the concept of "relufication," which involves replacing existing activation functions with ReLU in pretrained LLMs and further optimizing the network structure. The paper describes two stages of this process:

  1. Replacement of Activation Functions: the non-ReLU activations of a pretrained model are swapped for ReLU and the model is fine-tuned, which substantially increases activation sparsity.
  2. Insertion of Additional ReLU Layers: extra ReLU layers are placed after the normalization layers in both the attention and feed-forward components, further increasing sparsity and reducing FLOPs without notable accuracy loss (see the sketch after this list).
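
The toy block below is a minimal sketch of both stages (assumptions: PyTorch; the module and attribute names such as `ffn_act` and `post_norm_act` are hypothetical and not taken from the paper's code or any particular LLM implementation).

```python
# Sketch: "relufication" of a toy transformer block in two stages.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d_model=256, d_ff=1024):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn_up = nn.Linear(d_model, d_ff)
        self.ffn_act = nn.GELU()            # stage 1 replaces this
        self.ffn_down = nn.Linear(d_ff, d_model)
        self.post_norm_act = nn.Identity()  # stage 2 turns this into a ReLU

    def forward(self, x):
        h = self.post_norm_act(self.attn_norm(x))
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.post_norm_act(self.ffn_norm(x))
        return x + self.ffn_down(self.ffn_act(self.ffn_up(h)))

block = ToyBlock()
block.ffn_act = nn.ReLU()        # stage 1: swap the FFN activation for ReLU
block.post_norm_act = nn.ReLU()  # stage 2: add ReLU after the normalization layers
out = block(torch.randn(2, 8, 256))  # sanity check on a (batch, seq, d_model) input
# The modified model is then fine-tuned so accuracy recovers while sparsity rises.
```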

Models subjected to this relufication process showed substantial efficiency improvements. For large models, relufication reduced inference FLOPs by up to a factor of three, lowering computational and memory requirements while maintaining competitive performance on standard NLP benchmarks.

Leveraging Aggregated Sparsity and Future Directions

The paper introduces the notion of aggregated sparsity, a measure of neuron utilization across a window of several tokens. It reveals that neurons activated while generating one token tend to be reactivated for subsequent tokens, so the corresponding weights can be reused rather than reloaded, and inference optimizations such as speculative decoding benefit from this overlap, yielding additional speedup.
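
The sketch below (assumptions: PyTorch; random activation masks standing in for a real model's ReLU outputs) illustrates how aggregated sparsity can be measured over a window of tokens and why reuse matters: with independently random activations the union of active neurons grows quickly, whereas the paper observes that trained ReLU models keep reusing largely the same neurons, so aggregated sparsity decays far more slowly.

```python
# Sketch: per-token vs. aggregated sparsity over a window of generated tokens.
import torch

n_tokens, d_ff = 16, 4096
# Hypothetical binary activation masks (1 = neuron active for that token),
# drawn independently at ~90% per-token sparsity.
acts = (torch.rand(n_tokens, d_ff) > 0.9).float()

per_token = 1.0 - acts.mean().item()
# Aggregated sparsity: fraction of neurons never activated across the window.
aggregated = 1.0 - acts.any(dim=0).float().mean().item()
print(f"per-token: {per_token:.2%}, aggregated over {n_tokens} tokens: {aggregated:.2%}")
# With independent masks, aggregated sparsity collapses toward 0.9**16 (about 19%);
# the paper's point is that real ReLU LLMs reuse neurons, so it stays much higher,
# which lets weights loaded for one token be reused for the following ones.
```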

The authors also explore modified ReLU activations, such as a shifted ReLU, to further increase sparsity without compromising model quality, suggesting that additional efficiency gains can be obtained by strategically raising the activation threshold.
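
As an illustration of that idea, the snippet below sketches one common form of shifted ReLU, max(0, x - b), with an arbitrary shift b chosen only to show the effect on sparsity (the value is not from the paper).

```python
# Sketch: raising the activation threshold with a shifted ReLU.
import torch

def shifted_relu(x: torch.Tensor, b: float = 0.5) -> torch.Tensor:
    # Zeroes everything below b instead of below 0, producing more exact zeros
    # at the cost of slightly altering the surviving activations.
    return torch.relu(x - b)

x = torch.randn(100_000)
print(f"ReLU sparsity:         {(torch.relu(x) == 0).float().mean().item():.2%}")
print(f"shifted ReLU sparsity: {(shifted_relu(x) == 0).float().mean().item():.2%}")
```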

Conclusion

The research advocates for a reassessment of activation function preferences in LLMs, emphasizing activation sparsity as a means to reconcile robust performance with computational efficiency. By reviving ReLU, the paper provides a practical pathway to more resource-efficient LLMs, potentially broadening deployment across various hardware environments. The insights into activation patterns and strategies to exploit them pave the way for future research aimed at enhancing the efficiency of AI systems through architectural innovations.

Authors (8)
  1. Iman Mirzadeh
  2. Keivan Alizadeh
  3. Sachin Mehta
  4. Carlo C Del Mundo
  5. Oncel Tuzel
  6. Golnoosh Samei
  7. Mohammad Rastegari
  8. Mehrdad Farajtabar