ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models (2310.04564v1)
Abstract: Large language models (LLMs) with billions of parameters have drastically transformed AI applications. However, their demanding computation during inference has raised significant challenges for deployment on resource-constrained devices. Despite recent trends favoring alternative activation functions such as GELU or SiLU, which incur more computation, this study strongly advocates for reinstating the ReLU activation in LLMs. We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer. This reduction is particularly valuable during the memory-bound inference step, where efficiency is paramount. Exploring sparsity patterns in ReLU-based LLMs, we unveil the reuse of activated neurons across consecutive tokens and, leveraging these insights, propose practical strategies that reduce LLM inference computation by up to three times using ReLU activations, with minimal performance trade-offs.
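To make the abstract's mechanism concrete, here is a minimal NumPy sketch (not code from the paper) of how ReLU activation sparsity in a feed-forward block lets inference skip the down-projection rows of inactive neurons. The names `ffn_dense`, `ffn_sparse`, `W_up`, and `W_down` are illustrative assumptions; realizing the savings in practice also requires sparse weight loading on the target hardware.

```python
import numpy as np

# Dense ReLU feed-forward block: y = ReLU(x @ W_up) @ W_down.
def ffn_dense(x, W_up, W_down):
    h = np.maximum(x @ W_up, 0.0)      # ReLU zeroes out inactive hidden units
    return h @ W_down

# Sparsity-aware variant: only the rows of W_down whose corresponding
# ReLU outputs are non-zero need to be read and multiplied.
def ffn_sparse(x, W_up, W_down):
    h = np.maximum(x @ W_up, 0.0)
    active = np.nonzero(h)[0]          # indices of activated neurons
    return h[active] @ W_down[active]  # skip zeroed rows entirely

# Toy equivalence check. With random weights roughly half the units fire;
# trained ReLU LLMs exhibit far higher sparsity, which is where the
# compute and weight-transfer savings come from.
rng = np.random.default_rng(0)
d_model, d_ff = 64, 256
x = rng.standard_normal(d_model)
W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))
assert np.allclose(ffn_dense(x, W_up, W_down), ffn_sparse(x, W_up, W_down))
```

Because the skipped rows of `W_down` never need to be fetched from memory, the technique targets exactly the memory-bound decoding step highlighted in the abstract.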
Authors: Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar