POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference (2410.18038v1)

Published 23 Oct 2024 in cs.LG and cs.DC

Abstract: Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. Hybrid batching works well for linear operations as it amortizes the cost of loading model weights from HBM. However, attention computation in hybrid batches remains inefficient because existing attention kernels are optimized for either prefill or decode. In this paper, we present POD-Attention -- the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. We integrate POD-Attention in a state-of-the-art LLM inference scheduler Sarathi-Serve. POD-Attention speeds up attention computation by up to 75% (mean 28%) and increases LLM serving throughput by up to 22% in offline inference. In online inference, POD-Attention enables lower time-to-first-token (TTFT), time-between-tokens (TBT), and request execution latency versus Sarathi-Serve.

An Essay on POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

The paper "POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference" introduces a novel approach aimed at enhancing the efficiency of LLM serving systems. The authors present POD-Attention, a specialized GPU kernel designed to address inefficiencies in existing methods by concurrently computing prefill and decode phases during LLM inference.

Motivation and Contributions

LLM inference consists of two distinct phases: a compute-bound prefill and a memory-bandwidth-bound decode. Recent serving systems use hybrid batching, which combines the prefill and decode phases of different requests into a single batch; this works well for linear operations because it amortizes the cost of loading model weights from HBM. Attention computation in hybrid batches, however, remains inefficient because existing attention kernels are optimized for either prefill or decode, leaving either the compute units or the memory bandwidth underutilized.
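
To make the distinction concrete, a back-of-the-envelope arithmetic-intensity estimate (a sketch for intuition, not taken from the paper) with sequence length L and head dimension d shows why the two phases stress different resources:

```latex
% Rough arithmetic intensity (useful work per element moved) of one attention
% head; L = sequence length, d = head dimension. Constant factors are omitted.
\[
  \underbrace{\frac{\mathcal{O}(L^{2} d)\ \text{FLOPs}}{\mathcal{O}(L d)\ \text{elements}}
      \approx \mathcal{O}(L)}_{\text{prefill: compute-bound}}
  \qquad
  \underbrace{\frac{\mathcal{O}(L d)\ \text{FLOPs}}{\mathcal{O}(L d)\ \text{elements}}
      \approx \mathcal{O}(1)}_{\text{decode: memory-bandwidth-bound}}
\]
```

Prefill processes all L queries at once and reuses each loaded key/value across many queries, while decode issues a single query against the entire KV cache, so its attention is limited by HBM bandwidth rather than by compute.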

POD-Attention is presented as the first GPU kernel tailored to compute attention for hybrid batches efficiently. It maximizes GPU utilization by running prefill and decode attention concurrently within the same Streaming Multiprocessor (SM), so that compute and memory bandwidth are exercised at the same time. Integrated into Sarathi-Serve, a state-of-the-art LLM inference scheduler, POD-Attention speeds up attention computation by up to 75% (mean 28%) and improves offline inference throughput by up to 22%.

Technical Approach

POD-Attention relies on a novel SM-aware CTA scheduling technique to guarantee that prefill and decode operations execute concurrently: CTAs performing prefill and decode work are deliberately co-located on the same SM, avoiding the bottlenecks of existing approaches that compute the two phases separately. The kernel further tunes its GPU resource allocation, for example by varying tile sizes and the number of CTAs per SM, so that the compute-bound prefill and the bandwidth-bound decode complement rather than contend with each other. This concurrency keeps tensor cores, shared memory, and memory bandwidth busy simultaneously, which is the source of the performance gains.
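
As a rough illustration of the co-scheduling idea, the sketch below (a minimal CUDA example using assumed placeholder names such as prefill_tile and decode_tile, and a simple even/odd split rather than the paper's tuned policy) shows how a fused kernel can read the %smid special register and assign each CTA a prefill or decode role on its own SM:

```cuda
// Minimal sketch of SM-aware CTA scheduling (illustrative only; prefill_tile,
// decode_tile, and the even/odd split are assumptions, not the paper's actual
// implementation or resource-allocation policy).
#include <cuda_runtime.h>

// Read the ID of the SM this CTA is running on via the %smid special register.
__device__ __forceinline__ unsigned int smid() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Placeholders for the two kinds of attention work in a hybrid batch.
__device__ void prefill_tile(int tile) { /* compute-bound attention tile */ }
__device__ void decode_tile(int tile)  { /* memory-bandwidth-bound tile  */ }

// One fused kernel: each CTA discovers which SM it landed on, claims the next
// slot on that SM, and picks a role so prefill and decode CTAs coexist per SM.
// The sketch assumes SM IDs fall in [0, num_sms).
__global__ void fused_attention(int* per_sm_slot) {
    __shared__ int role;
    if (threadIdx.x == 0) {
        int slot = atomicAdd(&per_sm_slot[smid()], 1);
        role = slot & 1;  // even slots do prefill work, odd slots do decode work
    }
    __syncthreads();
    if (role == 0) prefill_tile(blockIdx.x);
    else           decode_tile(blockIdx.x);
}

int main() {
    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0);

    int* per_sm_slot = nullptr;
    cudaMalloc(&per_sm_slot, num_sms * sizeof(int));
    cudaMemset(per_sm_slot, 0, num_sms * sizeof(int));

    // Launch enough CTAs that several land on every SM.
    fused_attention<<<num_sms * 4, 128>>>(per_sm_slot);
    cudaDeviceSynchronize();
    cudaFree(per_sm_slot);
    return 0;
}
```

A production kernel would instead choose the prefill/decode split and the per-role resource budgets (tile sizes, CTAs per SM, shared memory) based on the composition of the hybrid batch rather than hard-coding them.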

Numerical and Empirical Results

The reported results show attention computation speedups of up to 75% (mean 28%), with POD-Attention consistently outperforming FlashAttention and FlashInfer across a range of workload configurations. Integrated with Sarathi-Serve, it raises offline serving throughput by up to 22% and, in online inference, reduces key latency metrics such as time-to-first-token (TTFT), time-between-tokens (TBT), and per-request execution latency, demonstrating the efficacy of the approach in both settings.
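
For readers unfamiliar with these metrics, the standard definitions (paraphrased, not quoted from the paper) for a request arriving at time t_0 whose i-th output token is emitted at time t_i are:

```latex
% Standard serving-latency metrics; t_0 = request arrival time,
% t_i = emission time of the i-th output token.
\[
  \mathrm{TTFT} = t_1 - t_0, \qquad
  \mathrm{TBT}_i = t_i - t_{i-1} \quad (i \ge 2).
\]
```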

Implications and Future Directions

POD-Attention has significant implications for LLM serving systems, particularly as context lengths continue to grow in modern applications. By addressing the inefficiencies inherent in optimizing each phase in isolation, the work charts a path toward more robust and scalable LLM deployments, and its approach could inspire similar concurrency and resource-allocation techniques in other model architectures.

Looking forward, extending the kernel to newer hardware generations such as NVIDIA's Hopper architecture, and integrating it with other advanced inference schedulers, could further improve LLM inference performance and keep pace with the evolving demands of AI workloads.

In conclusion, POD-Attention represents a crucial step towards optimizing LLM inference by effectively maximizing within-batch resource utilization, setting a promising benchmark for future research in efficient AI model serving.

Authors (6)
  1. Aditya K Kamath
  2. Ramya Prabhu
  3. Jayashree Mohan
  4. Simon Peter
  5. Ramachandran Ramjee
  6. Ashish Panwar