Efficient and Economic Large Language Model Inference with Attention Offloading (2405.01814v1)

Published 3 May 2024 in cs.LG and cs.DC

Abstract: Transformer-based LLMs exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases. To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop Lamina, an LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48x-12.1x higher estimated throughput per dollar than homogeneous solutions.

Efficient LLM Inference with Attention Offloading

Introduction

Transformer-based LLMs have proven remarkably capable at generative NLP tasks, from chatbots to advanced code completion tools. Yet their heavy computational demands, particularly during inference, pose significant cost and efficiency challenges when deployed at scale. Attention offloading is a recently proposed approach that addresses these challenges by rethinking how computational resources are allocated during LLM inference.

The Issue at Hand

In typical setups, LLM inference runs on specialized, high-performance accelerators such as NVIDIA's A100 or TPUs. These devices excel at heavy computation, but they are expensive and are not utilized efficiently throughout the inference process. The inefficiency is most apparent in the attention operator, which is memory-intensive rather than compute-intensive.

To give a clearer picture, modern accelerators bundle enormous compute throughput with high-bandwidth memory (HBM). The demands of attention during the token generation phase (for example, while conversing with a chatbot or generating code) do not align with this design: attention needs memory bandwidth far more than raw compute, so the costly accelerator can sit underutilized.
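
To make the mismatch concrete, here is a back-of-envelope sketch (not from the paper) that compares the arithmetic intensity of decode-phase attention with the compute-to-bandwidth ratio of a typical accelerator; the model shape, fp16 storage, and hardware figures are illustrative assumptions.

```python
# Back-of-envelope arithmetic intensity of decode-phase attention (illustrative).
# Assumptions: fp16 storage (2 bytes/element), one query token per sequence per
# step, plain multi-head attention, and example hardware numbers.

def attention_decode_intensity(batch, heads, head_dim, context_len, bytes_per_elem=2):
    d = heads * head_dim
    # FLOPs: q @ K^T and scores @ V for every sequence in the batch.
    flops = 2 * 2 * batch * heads * context_len * head_dim
    # Bytes: the whole KV cache (K and V) must be streamed from memory each step;
    # q and the output are negligible by comparison.
    kv_bytes = 2 * batch * context_len * d * bytes_per_elem
    return flops / kv_bytes  # FLOPs per byte


# Example: a 13B-class shape with a 2k-token context.
intensity = attention_decode_intensity(batch=32, heads=40, head_dim=128, context_len=2048)
print(f"attention arithmetic intensity ~ {intensity:.1f} FLOPs/byte")

# A compute-optimized accelerator (illustrative): ~300 TFLOPS fp16, ~2 TB/s HBM.
accelerator_ratio = 300e12 / 2e12
print(f"accelerator compute/bandwidth ratio ~ {accelerator_ratio:.0f} FLOPs/byte")
# Decode attention stays near 1 FLOP/byte regardless of batch size, roughly two
# orders of magnitude below the accelerator's ratio, so it is memory-bandwidth-bound.
```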

Enter Attention Offloading

The core idea here is straightforward yet ingenious: separate the memory-intensive tasks from the purely computational ones by using two distinct sets of devices. This approach uses cheaper, memory-optimized devices for the attention component, while reserving the powerful, expensive accelerators for other computational tasks within the LLM workflow.
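
The dataflow can be sketched with a toy, single-layer decode loop. The snippet below is purely illustrative, not Lamina's implementation: the `ComputeDevice` and `MemoryDevice` classes, the layer shapes, and the omitted details (residuals, normalization, batching, real device transfers) are all simplifying assumptions.

```python
# Toy single-layer decode loop illustrating the attention-offloading dataflow.
# "Devices" here are plain Python objects, not real hardware.
import numpy as np

HIDDEN, HEADS, HEAD_DIM = 1024, 16, 64  # assumed toy shapes


class MemoryDevice:
    """Cheap, memory-optimized device: holds the KV cache and runs attention."""
    def __init__(self):
        self.k_cache, self.v_cache = [], []  # grows with context length

    def attention(self, q, k, v):
        self.k_cache.append(k)
        self.v_cache.append(v)
        K = np.stack(self.k_cache, axis=1)       # (heads, ctx, head_dim)
        V = np.stack(self.v_cache, axis=1)
        scores = (q[:, None, :] * K).sum(-1) / np.sqrt(HEAD_DIM)  # (heads, ctx)
        probs = np.exp(scores - scores.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        return (probs[..., None] * V).sum(1)     # (heads, head_dim)


class ComputeDevice:
    """Expensive accelerator: runs the dense, compute-heavy projections and MLP."""
    def __init__(self, rng):
        self.wqkv = rng.standard_normal((HIDDEN, 3 * HIDDEN)) * 0.02
        self.wo = rng.standard_normal((HIDDEN, HIDDEN)) * 0.02
        self.mlp = rng.standard_normal((HIDDEN, HIDDEN)) * 0.02

    def qkv(self, x):
        q, k, v = (t.reshape(HEADS, HEAD_DIM) for t in np.split(x @ self.wqkv, 3))
        return q, k, v

    def finish(self, attn_out):
        h = attn_out.reshape(HIDDEN) @ self.wo
        return np.maximum(h @ self.mlp, 0.0)


rng = np.random.default_rng(0)
compute, memory = ComputeDevice(rng), MemoryDevice()

x = rng.standard_normal(HIDDEN)
for _ in range(4):                     # generate a few tokens
    q, k, v = compute.qkv(x)           # dense projections on the accelerator
    attn = memory.attention(q, k, v)   # q/k/v shipped to the memory device
    x = compute.finish(attn)           # attention output shipped back
```

The point of the split is that the KV cache, which dominates memory traffic, lives only on the memory-optimized device; the accelerator only ever sees small per-token activations.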

Why does this matter? By routing each operation to the hardware best suited to it, the system improves not only cost efficiency (throughput per dollar spent) but also the utilization of the expensive compute accelerators. The authors report an estimated throughput-per-dollar improvement of 1.48x to 12.1x over homogeneous, non-offloaded systems.
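
To be clear about what the metric means, the tiny calculation below divides aggregate token throughput by hourly hardware cost for a homogeneous and a heterogeneous deployment; every price and token rate in it is a made-up placeholder rather than a measurement from the paper.

```python
# Throughput-per-dollar comparison (all numbers are illustrative placeholders).

def throughput_per_dollar(tokens_per_sec, hourly_cost_usd):
    return tokens_per_sec / hourly_cost_usd


# Homogeneous: one high-end accelerator doing both attention and dense compute.
homogeneous = throughput_per_dollar(tokens_per_sec=2_000, hourly_cost_usd=4.0)

# Heterogeneous: the same accelerator, now kept busy on dense compute, plus two
# cheap memory-optimized devices handling attention for a much larger batch.
heterogeneous = throughput_per_dollar(tokens_per_sec=6_000, hourly_cost_usd=4.0 + 2 * 1.0)

print(f"homogeneous:   {homogeneous:.0f} tokens/sec per $/hr")
print(f"heterogeneous: {heterogeneous:.0f} tokens/sec per $/hr")
print(f"improvement:   {heterogeneous / homogeneous:.2f}x")
```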

Practical Considerations and Results

The implementation of this method isn't without hurdles. The key challenge is managing communication between the heterogeneous devices, since activations must move between the memory-focused and compute-focused devices at every step. The balance is critical: too much communication overhead would negate the benefits of offloading.
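
A rough estimate shows why the traffic is manageable: only per-token activations (the query/key/value vectors going out and the attention output coming back) cross the link, while the KV cache stays resident on the memory devices. The model shape, batch size, and decode rate below are illustrative assumptions, not numbers from the paper.

```python
# Rough estimate of link bandwidth needed for attention offloading (illustrative
# shapes and rates; assumes fp16 activations and that the KV cache never moves).

def link_bandwidth_gbps(batch, layers, hidden, steps_per_sec, bytes_per_elem=2):
    # Per layer and per sequence, q/k/v (3 * hidden) go out and the attention
    # output (hidden) comes back: 4 * hidden elements across the link.
    elems_per_step = 4 * hidden * layers * batch
    bytes_per_sec = elems_per_step * bytes_per_elem * steps_per_sec
    return bytes_per_sec * 8 / 1e9  # Gbit/s


# Example: a 13B-class model (40 layers, hidden size 5120), 64 sequences in the
# batch, 30 decode steps per second.
print(f"~{link_bandwidth_gbps(batch=64, layers=40, hidden=5120, steps_per_sec=30):.0f} Gbit/s")
# The result is on the order of tens of Gbit/s, within reach of commodity
# datacenter networking, because the large KV cache itself never crosses the link.
```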

In experiments with models of up to 33 billion parameters, the offloading approach proved effective, and the inter-device communication could be handled with prevalent networking technologies rather than exotic, high-bandwidth interconnects. This suggests that such a system is deployable in current data center environments.

In terms of actual performance, pairing memory-optimized devices with high-end compute units let the system serve much larger batches without a drop in per-token speed. That capability translates directly into better handling of simultaneous user requests in real-world applications, such as many concurrent queries to a chatbot or code assistant.
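
The reason batch size, rather than compute, is the binding constraint on a single accelerator comes down to the KV cache, whose footprint grows linearly with both batch size and context length. The estimate below uses an assumed 13B-class shape and fp16 storage for illustration.

```python
# KV cache footprint vs. batch size (illustrative 13B-class shape, fp16).

def kv_cache_gib(batch, context_len, layers=40, hidden=5120, bytes_per_elem=2):
    # K and V per token per layer: 2 * hidden elements.
    return 2 * hidden * layers * context_len * batch * bytes_per_elem / 2**30


for batch in (8, 32, 128):
    print(f"batch {batch:>3}: {kv_cache_gib(batch, context_len=2048):.1f} GiB of KV cache")
# The footprint grows linearly: roughly 12.5, 50, and 200 GiB. A single 80 GiB
# accelerator, which must also hold ~26 GB of fp16 weights, caps the batch size
# long before its compute is saturated; pooling cheap memory-optimized devices
# lifts that cap.
```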

Future Directions and Implications

This research not only provides a compelling method to reduce the costs associated with LLM inference but also opens the door to more specialized uses of hardware in the field of AI and machine learning. As hardware technology evolves and more specialized units enter the market, the principles demonstrated here could guide more nuanced approaches to system architecture in AI deployments. Furthermore, as models continue to grow in size and complexity, innovations like attention offloading will be crucial for maintaining and improving the accessibility and sustainability of AI technologies.

In conclusion, attention offloading represents a practical and impactful advancement in optimizing LLM inference. By marrying the strengths of different classes of hardware, it lets serving systems do more with less, and do it better and cheaper.

Authors (4)
  1. Shaoyuan Chen (3 papers)
  2. Yutong Lin (15 papers)
  3. Mingxing Zhang (10 papers)
  4. Yongwei Wu (5 papers)
Citations (5)