Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference (2403.20306v1)
Abstract: With the ubiquitous use of modern LLMs across industries, inference serving for these models is expanding rapidly. Given the high compute and memory requirements of modern LLMs, ever more top-of-the-line GPUs are being deployed to serve them. Energy availability has emerged as the biggest challenge to expanding the data centers that serve these models. In this paper, we present the trade-offs that arise when energy efficiency becomes the primary goal of LLM serving under performance SLOs. We show that, depending on the inputs, the model, and the service-level agreements, LLM inference providers have several knobs available to improve energy efficiency. We characterize the impact of these knobs on latency, throughput, and energy consumption. By exploring these trade-offs, we offer insights into optimizing energy usage without compromising performance, paving the way for sustainable and cost-effective LLM deployment in data center environments.
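
The knobs the abstract refers to (for example, GPU frequency or power caps) trade performance for energy. The sketch below is an illustrative, hypothetical example of that trade-off, not the paper's implementation: given offline-profiled per-request latency and energy at a few frequency caps (all numbers are placeholders), it picks the lowest-energy cap that still meets a tail-latency SLO.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    freq_mhz: int         # GPU core frequency cap (hypothetical values)
    p99_latency_s: float  # measured tail latency at this cap
    energy_j: float       # measured energy per request at this cap

def pick_energy_optimal_cap(profiles: list[Profile], slo_s: float) -> Profile:
    """Return the minimum-energy profile among those meeting the latency SLO."""
    feasible = [p for p in profiles if p.p99_latency_s <= slo_s]
    if not feasible:
        # No cap meets the SLO: fall back to the fastest (highest-frequency) setting.
        return max(profiles, key=lambda p: p.freq_mhz)
    return min(feasible, key=lambda p: p.energy_j)

if __name__ == "__main__":
    # Hypothetical profiling data for one model and input mix.
    profiles = [
        Profile(freq_mhz=1980, p99_latency_s=0.80, energy_j=120.0),
        Profile(freq_mhz=1410, p99_latency_s=0.95, energy_j=95.0),
        Profile(freq_mhz=1005, p99_latency_s=1.40, energy_j=88.0),
    ]
    best = pick_energy_optimal_cap(profiles, slo_s=1.0)
    print(f"Chosen cap: {best.freq_mhz} MHz, energy/request: {best.energy_j} J")
```

In this toy example, the 1410 MHz cap is chosen: it saves energy relative to the top frequency while still meeting the 1-second SLO, whereas the lowest cap would violate it.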
- Jovan Stojkovic
- Esha Choukse
- Chaojie Zhang
- Josep Torrellas
- Íñigo Goiri