POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference (2410.18038v1)
Abstract: Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. Hybrid batching works well for linear operations as it amortizes the cost of loading model weights from HBM. However, attention computation in hybrid batches remains inefficient because existing attention kernels are optimized for either prefill or decode. In this paper, we present POD-Attention -- the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. We integrate POD-Attention into a state-of-the-art LLM inference scheduler, Sarathi-Serve. POD-Attention speeds up attention computation by up to 75% (mean 28%) and increases LLM serving throughput by up to 22% in offline inference. In online inference, POD-Attention enables lower time-to-first-token (TTFT), time-between-tokens (TBT), and request execution latency than Sarathi-Serve.
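The key mechanism described above, co-locating compute-bound prefill work and memory-bandwidth-bound decode work on the same streaming multiprocessor (SM), starts from fusing both kinds of work into a single kernel launch. The sketch below illustrates only that fusion pattern and is not the paper's kernel: the kernel name, the block split, and the proxy workloads (an FMA chain standing in for prefill attention, a streaming copy standing in for decode attention) are assumptions made for this example. POD-Attention itself computes real attention and additionally steers thread blocks with SM-aware scheduling (see the %smid entry in the references).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fused kernel: thread blocks with index < num_prefill_blocks act as
// compute-bound "prefill" workers; the remaining blocks act as memory-bandwidth-bound
// "decode" workers. Issuing both roles from one launch lets blocks of both kinds be
// resident on the GPU at the same time (and, depending on occupancy, on the same SM).
__global__ void fused_prefill_decode(float* prefill_out, int prefill_iters,
                                     const float* decode_in, float* decode_out,
                                     size_t decode_elems, int num_prefill_blocks) {
  if (blockIdx.x < num_prefill_blocks) {
    // Compute-heavy stand-in for prefill attention: a long FMA chain per thread.
    float acc = 1e-3f * threadIdx.x;
    for (int i = 0; i < prefill_iters; ++i) {
      acc = fmaf(acc, 1.0001f, 0.5f);
    }
    prefill_out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
  } else {
    // Memory-heavy stand-in for decode attention: a grid-stride streaming copy.
    int decode_blocks = gridDim.x - num_prefill_blocks;
    size_t tid = (size_t)(blockIdx.x - num_prefill_blocks) * blockDim.x + threadIdx.x;
    size_t stride = (size_t)decode_blocks * blockDim.x;
    for (size_t i = tid; i < decode_elems; i += stride) {
      decode_out[i] = decode_in[i];
    }
  }
}

int main() {
  const int threads = 256, prefill_blocks = 32, decode_blocks = 32;
  const size_t decode_elems = 1 << 24;  // ~16M floats streamed by the decode-role blocks

  float *prefill_out = nullptr, *decode_in = nullptr, *decode_out = nullptr;
  cudaMalloc(&prefill_out, prefill_blocks * threads * sizeof(float));
  cudaMalloc(&decode_in, decode_elems * sizeof(float));
  cudaMalloc(&decode_out, decode_elems * sizeof(float));
  cudaMemset(decode_in, 0, decode_elems * sizeof(float));

  // One launch carries both roles.
  fused_prefill_decode<<<prefill_blocks + decode_blocks, threads>>>(
      prefill_out, 1 << 16, decode_in, decode_out, decode_elems, prefill_blocks);
  cudaDeviceSynchronize();
  printf("fused kernel: %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaFree(prefill_out);
  cudaFree(decode_in);
  cudaFree(decode_out);
  return 0;
}
```

Timing this fused launch against the same two proxy workloads issued as two back-to-back kernels gives a rough feel for the benefit of keeping both roles resident on the SMs at once; the actual kernel goes further by deciding how prefill and decode work is co-located on each SM.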
- ccdv/arxiv-summarization. https://huggingface.co/datasets/ccdv/arxiv-summarization.
- CUDA C Programming Guide – Hardware Implementation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#hardware-implementation.
- Parallel Thread Execution ISA Version 8.5 – Cooperative Thread Arrays. https://docs.nvidia.com/cuda/parallel-thread-execution/#cooperative-thread-arrays.
- Parallel Thread Execution ISA Version 8.5 – Special Registers: %smid. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-smid.
- vLLM. https://github.com/vllm-project/vllm.
- Programming Tensor Cores in CUDA 9. https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/, 2017.
- FlashAttention. https://github.com/Dao-AILab/flash-attention, 2022.
- Flash-Decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023.
- TensorRT-LLM: A TensorRT toolbox for optimized large language model inference. https://github.com/NVIDIA/TensorRT-LLM, 2023.
- AI Infrastructure Spending Forecast to Be Over a Trillion Dollars Over the Next Five Years. https://www.delloro.com/news/ai-infrastructure-spending-forecast-to-be-over-a-trillion-dollars-over-the-next-five-years/, 2024.
- Llama-2-7B. https://huggingface.co/meta-llama/Llama-2-7b-hf, 2024.
- Merged PR 1865: Critical bug fixes related to sampling. https://github.com/microsoft/sarathi-serve/commit/50e59c51b85b1157e001bb8ee7a1b049d551955d#diff-450b0de5cce8a2341140afed859dc5dd3b913fa6e62d27988fccefeacc7b33ec, 2024.
- Meta-Llama-3-8B. https://huggingface.co/meta-llama/Meta-Llama-3-8B, 2024.
- Multi-Process Service. https://docs.nvidia.com/deploy/mps/index.html, 2024.
- NVIDIA Multi-Instance GPU. https://www.nvidia.com/en-us/technologies/multi-instance-gpu/, 2024.
- NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass, 2024.
- Performance and Tuning. https://docs.vllm.ai/en/v0.6.0/models/performance.html, 2024.
- Sarathi-Serve. https://github.com/microsoft/sarathi-serve, 2024.
- The State of AI Infrastructure at Scale 2024. https://ai-infrastructure.org/wp-content/uploads/2024/03/The-State-of-AI-Infrastructure-at-Scale-2024.pdf, 2024.
- Unify the kernel used in flash attention backend. https://github.com/vllm-project/vllm/pull/6052, 2024.
- Upstream Chunked Prefill. https://github.com/vllm-project/vllm/issues/3130, 2024.
- Yi-6B-200K. https://huggingface.co/01-ai/Yi-6B-200K, 2024.
- Mnemosyne: Parallelization strategies for efficiently serving multi-million context length LLM inference requests without approximations, 2024.
- Vidur: A large-scale simulation framework for LLM inference. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, Santa Clara, CA, 2024.
- Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, Santa Clara, CA, July 2024. USENIX Association.
- Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills, 2023.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
- Flux: Fast software-based communication overlap on GPUs through kernel fusion, 2024.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.
- DeepSpeed-FastGen: High-throughput text generation for LLMs via MII and DeepSpeed-Inference, 2024.
- FlashDecoding++: Faster large language model inference on GPUs, 2024.
- Inference without interference: Disaggregate LLM inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.
- Towards MoE deployment: Mitigating inefficiencies in mixture-of-expert (MoE) inference, 2023.
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22, page 402–416, New York, NY, USA, 2022. Association for Computing Machinery.
- A framework for fine-grained synchronization of dependent gpu kernels. In Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’24, page 93–105. IEEE Press, 2024.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
- Automatic horizontal fusion for GPU kernels. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 14–27, 2022.
- Stream-K: Work-centric parallel decomposition for dense matrix-matrix multiplication on the GPU. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’23, page 429–431, New York, NY, USA, 2023. Association for Computing Machinery.
- Improving GPGPU concurrency with elastic kernels. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, page 407–418, New York, NY, USA, 2013. Association for Computing Machinery.
- Splitwise: Efficient generative LLM inference using phase splitting. In ISCA, June 2024.
- vAttention: Dynamic memory management for serving LLMs without PagedAttention, 2024.
- Lean attention: Hardware-aware scalable attention mechanism for the decode-phase of transformers, 2024.
- FlashAttention-3: Fast and accurate attention with asynchrony and low-precision, 2024.
- Fairness in serving large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 965–988, Santa Clara, CA, July 2024. USENIX Association.
- PowerInfer: Fast large language model serving with a consumer-grade GPU, 2023.
- DynamoLLM: Designing LLM inference clusters for performance and energy efficiency, 2024.
- Scalable kernel fusion for memory-bound GPU applications. In SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 191–202, 2014.
- Overlap communication with dependent computation via decomposition in large deep learning models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ASPLOS 2023, page 93–106, New York, NY, USA, 2022. Association for Computing Machinery.
- LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism, 2024.
- Fast distributed inference serving for large language models, 2023.
- Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, page 119–130, New York, NY, USA, 2015. Association for Computing Machinery.
- Kernel Weaver: Automatically fusing database primitives for efficient GPU computation. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 107–118, 2012.
- Accelerating self-attentions for LLM serving with FlashInfer, February 2024.
- Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association.
- ISPA: Exploiting intra-SM parallelism in GPUs via fine-grained resource management. IEEE Transactions on Computers, 72(5):1473–1487, 2023.
- SGLang: Efficient execution of structured language model programs, 2024.
- DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association.
- NanoFlow: Towards optimal large language model serving throughput, 2024.
Authors: Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar