POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference (2410.18038v1)
Abstract: Each request in LLM inference goes through two phases: compute-bound prefill and memory-bandwidth-bound decode. To improve GPU utilization, recent systems use hybrid batching that combines the prefill and decode phases of different requests into the same batch. Hybrid batching works well for linear operations as it amortizes the cost of loading model weights from HBM. However, attention computation in hybrid batches remains inefficient because existing attention kernels are optimized for either prefill or decode. In this paper, we present POD-Attention -- the first GPU kernel that efficiently computes attention for hybrid batches. POD-Attention aims to maximize the utilization of both compute and memory bandwidth by carefully allocating the GPU's resources such that prefill and decode operations happen concurrently on the same multiprocessor. We integrate POD-Attention into a state-of-the-art LLM inference scheduler, Sarathi-Serve. POD-Attention speeds up attention computation by up to 75% (mean 28%) and increases LLM serving throughput by up to 22% in offline inference. In online inference, POD-Attention enables lower time-to-first-token (TTFT), time-between-tokens (TBT), and request execution latency than Sarathi-Serve.
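The key mechanism described above, co-locating compute-bound prefill work and memory-bandwidth-bound decode work on the same streaming multiprocessor (SM), starts from fusing both kinds of work into a single kernel launch. The sketch below illustrates only that fusion pattern and is not the paper's kernel: the kernel name, the block split, and the proxy workloads (an FMA chain standing in for prefill attention, a streaming copy standing in for decode attention) are assumptions made for this example. POD-Attention itself computes real attention and additionally steers thread blocks with SM-aware scheduling (see the %smid entry in the references).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fused kernel: thread blocks with index < num_prefill_blocks act as
// compute-bound "prefill" workers; the remaining blocks act as memory-bandwidth-bound
// "decode" workers. Issuing both roles from one launch lets blocks of both kinds be
// resident on the GPU at the same time (and, depending on occupancy, on the same SM).
__global__ void fused_prefill_decode(float* prefill_out, int prefill_iters,
                                     const float* decode_in, float* decode_out,
                                     size_t decode_elems, int num_prefill_blocks) {
  if (blockIdx.x < num_prefill_blocks) {
    // Compute-heavy stand-in for prefill attention: a long FMA chain per thread.
    float acc = 1e-3f * threadIdx.x;
    for (int i = 0; i < prefill_iters; ++i) {
      acc = fmaf(acc, 1.0001f, 0.5f);
    }
    prefill_out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
  } else {
    // Memory-heavy stand-in for decode attention: a grid-stride streaming copy.
    int decode_blocks = gridDim.x - num_prefill_blocks;
    size_t tid = (size_t)(blockIdx.x - num_prefill_blocks) * blockDim.x + threadIdx.x;
    size_t stride = (size_t)decode_blocks * blockDim.x;
    for (size_t i = tid; i < decode_elems; i += stride) {
      decode_out[i] = decode_in[i];
    }
  }
}

int main() {
  const int threads = 256, prefill_blocks = 32, decode_blocks = 32;
  const size_t decode_elems = 1 << 24;  // ~16M floats streamed by the decode-role blocks

  float *prefill_out = nullptr, *decode_in = nullptr, *decode_out = nullptr;
  cudaMalloc(&prefill_out, prefill_blocks * threads * sizeof(float));
  cudaMalloc(&decode_in, decode_elems * sizeof(float));
  cudaMalloc(&decode_out, decode_elems * sizeof(float));
  cudaMemset(decode_in, 0, decode_elems * sizeof(float));

  // One launch carries both roles.
  fused_prefill_decode<<<prefill_blocks + decode_blocks, threads>>>(
      prefill_out, 1 << 16, decode_in, decode_out, decode_elems, prefill_blocks);
  cudaDeviceSynchronize();
  printf("fused kernel: %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaFree(prefill_out);
  cudaFree(decode_in);
  cudaFree(decode_out);
  return 0;
}
```

Timing this fused launch against the same two proxy workloads issued as two back-to-back kernels gives a rough feel for the benefit of keeping both roles resident on the SMs at once; the actual kernel goes further by deciding how prefill and decode work is co-located on each SM.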
- ccdv/arxiv-summarization. https://huggingface.co/datasets/ccdv/arxiv-summarization.
- CUDA C Programming Guide – Hardware Implementation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#hardware-implementation.
- Parallel Thread Execution ISA Version 8.5 – Cooperative Thread Arrays. https://docs.nvidia.com/cuda/parallel-thread-execution/#cooperative-thread-arrays.
- Parallel Thread Execution ISA Version 8.5 – Special Registers: %smid. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-smid.
- vLLM. https://github.com/vllm-project/vllm.
- Programming Tensor Cores in CUDA 9. https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/, 2017.
- FlashAttention. https://github.com/Dao-AILab/flash-attention, 2022.
- Flash-Decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023.
- TensorRT-LLM: A TensorRT toolbox for optimized large language model inference. https://github.com/NVIDIA/TensorRT-LLM, 2023.
- AI Infrastructure Spending Forecast to Be Over a Trillion Dollars Over the Next Five Years. https://www.delloro.com/news/ai-infrastructure-spending-forecast-to-be-over-a-trillion-dollars-over-the-next-five-years/, 2024.
- Llama-2-7B. https://huggingface.co/meta-llama/Llama-2-7b-hf, 2024.
- Merged PR 1865: Critical bug fixes related to sampling. https://github.com/microsoft/sarathi-serve/commit/50e59c51b85b1157e001bb8ee7a1b049d551955d#diff-450b0de5cce8a2341140afed859dc5dd3b913fa6e62d27988fccefeacc7b33ec, 2024.
- Meta-Llama-3-8B. https://huggingface.co/meta-llama/Meta-Llama-3-8B, 2024.
- Multi-Process Service. https://docs.nvidia.com/deploy/mps/index.html, 2024.
- NVIDIA Multi-Instance GPU. https://www.nvidia.com/en-us/technologies/multi-instance-gpu/, 2024.
- NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass, 2024.
- Performance and Tuning. https://docs.vllm.ai/en/v0.6.0/models/performance.html, 2024.
- Sarathi-Serve. https://github.com/microsoft/sarathi-serve, 2024.
- The State of AI Infrastructure at Scale 2024. https://ai-infrastructure.org/wp-content/uploads/2024/03/The-State-of-AI-Infrastructure-at-Scale-2024.pdf, 2024.
- Unify the kernel used in flash attention backend. https://github.com/vllm-project/vllm/pull/6052, 2024.
- Upstream Chunked Prefill. https://github.com/vllm-project/vllm/issues/3130, 2024.
- Yi-6B-200K. https://huggingface.co/01-ai/Yi-6B-200K, 2024.
- Mnemosyne: Parallelization strategies for efficiently serving multi-million context length LLM inference requests without approximations, 2024.
- Vidur: A large-scale simulation framework for LLM inference. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, Santa Clara, CA, 2024.
- Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, Santa Clara, CA, July 2024. USENIX Association.
- Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills, 2023.
- GQA: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
- Flux: Fast software-based communication overlap on GPUs through kernel fusion, 2024.
- Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.
- DeepSpeed-FastGen: High-throughput text generation for LLMs via MII and DeepSpeed-Inference, 2024.
- FlashDecoding++: Faster large language model inference on GPUs, 2024.
- Inference without interference: Disaggregate LLM inference for mixed downstream workloads. arXiv preprint arXiv:2401.11181, 2024.
- Towards MoE deployment: Mitigating inefficiencies in mixture-of-expert (MoE) inference, 2023.
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22, page 402–416, New York, NY, USA, 2022. Association for Computing Machinery.
- A framework for fine-grained synchronization of dependent gpu kernels. In Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’24, page 93–105. IEEE Press, 2024.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
- Automatic horizontal fusion for GPU kernels. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 14–27, 2022.
- Stream-K: Work-centric parallel decomposition for dense matrix-matrix multiplication on the GPU. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP ’23, page 429–431, New York, NY, USA, 2023. Association for Computing Machinery.
- Improving GPGPU concurrency with elastic kernels. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, page 407–418, New York, NY, USA, 2013. Association for Computing Machinery.
- Splitwise: Efficient generative LLM inference using phase splitting. In ISCA, June 2024.
- vAttention: Dynamic memory management for serving LLMs without PagedAttention, 2024.
- Lean attention: Hardware-aware scalable attention mechanism for the decode-phase of transformers, 2024.
- FlashAttention-3: Fast and accurate attention with asynchrony and low-precision, 2024.
- Fairness in serving large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 965–988, Santa Clara, CA, July 2024. USENIX Association.
- PowerInfer: Fast large language model serving with a consumer-grade GPU, 2023.
- DynamoLLM: Designing LLM inference clusters for performance and energy efficiency, 2024.
- Scalable kernel fusion for memory-bound GPU applications. In SC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 191–202, 2014.
- Overlap communication with dependent computation via decomposition in large deep learning models. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, ASPLOS 2023, page 93–106, New York, NY, USA, 2022. Association for Computing Machinery.
- LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism, 2024.
- Fast distributed inference serving for large language models, 2023.
- Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations. In Proceedings of the 29th ACM on International Conference on Supercomputing, ICS ’15, page 119–130, New York, NY, USA, 2015. Association for Computing Machinery.
- Kernel Weaver: Automatically fusing database primitives for efficient GPU computation. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 107–118, 2012.
- Accelerating self-attentions for LLM serving with FlashInfer, February 2024.
- Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association.
- ISPA: Exploiting intra-SM parallelism in GPUs via fine-grained resource management. IEEE Transactions on Computers, 72(5):1473–1487, 2023.
- SGLang: Efficient execution of structured language model programs, 2024.
- DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association.
- NanoFlow: Towards optimal large language model serving throughput, 2024.
Authors: Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar