Analyzing Sarathi: Enhancing LLM Inference Efficiency
The paper "Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" presents a novel approach aimed at addressing performance inefficiencies in LLM inference, which has become a significant GPU workload due to the scaling of LLMs. Sarathi introduces two central techniques—chunked-prefills and decode-maximal batching—to optimize the inference process, thereby improving GPU utilization and reducing pipeline bubbles which are critical bottlenecks in LLM deployment.
At the core of LLM inference are two phases: a prefill phase that processes the input prompt and a decode phase that generates output tokens autoregressively. The decode phase typically under-utilizes compute because each request produces only one token per iteration, offering little parallelism. Sarathi's chunked-prefills technique splits a large prefill into smaller chunks of roughly uniform compute, so that each scheduling step carries a predictable amount of work. Decode-maximal batching then constructs each batch from a single prefill chunk and fills the remaining slots with ongoing decode requests, letting the decodes "piggyback" on the compute-saturating prefill chunk and raising GPU utilization at little additional cost.
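A minimal sketch (not the authors' implementation) of how such a scheduler might assemble a hybrid batch is shown below; the names `PendingRequest`, `token_budget`, and `build_hybrid_batch` are illustrative assumptions rather than Sarathi's actual API.

```python
from dataclasses import dataclass

@dataclass
class PendingRequest:
    request_id: int
    prompt_len: int     # total prompt tokens to prefill
    prefilled: int = 0  # prompt tokens already processed

def build_hybrid_batch(prefill_queue, decode_queue, token_budget, chunk_size):
    """Assemble one batch: a single prefill chunk plus piggybacked decodes.

    The prefill chunk contributes up to `chunk_size` tokens; each decode
    contributes one token. In practice, `chunk_size` would be chosen so that
    the chunk plus the expected decodes fit the per-iteration token budget.
    """
    batch = []
    used = 0
    if prefill_queue:
        req = prefill_queue[0]
        length = min(chunk_size, req.prompt_len - req.prefilled)
        batch.append(("prefill", req.request_id, req.prefilled, length))
        req.prefilled += length
        used += length
        if req.prefilled == req.prompt_len:
            prefill_queue.pop(0)  # prompt fully processed; request moves on to decoding
    # Decode-maximal batching: add decodes (one token each) until the budget is hit.
    for dec in decode_queue:
        if used + 1 > token_budget:
            break
        batch.append(("decode", dec.request_id))
        used += 1
    return batch
```

In an actual serving loop, a scheduler along these lines would run once per iteration and hand the resulting hybrid batch to a single forward pass, so the lightweight decodes ride along with the compute-heavy prefill chunk.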
Sarathi's explicit focus on minimizing the resource disparity between the prefill and decode phases is an insightful way to optimize the LLM serving pipeline on existing hardware. The technique also addresses a common challenge in model-parallel, and especially pipeline-parallel, LLM deployments: by equalizing the compute of successive micro-batches despite variable sequence lengths, it smooths out the imbalances that cause pipeline stalls, or 'bubbles.'
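To see why uniform micro-batch compute matters for pipeline parallelism, here is a toy simulation (my own illustration, not taken from the paper) that measures per-stage stall time in an in-order pipeline for two micro-batch streams with the same total work:

```python
def pipeline_bubble_time(micro_batch_times, num_stages):
    """Total idle time between consecutive micro-batches at each stage,
    excluding the unavoidable pipeline-fill bubble.

    micro_batch_times[i] is the compute time of micro-batch i, assumed equal
    at every stage. A stage starts micro-batch i once the previous stage has
    finished it and the stage itself is free.
    """
    n = len(micro_batch_times)
    finish = [[0.0] * n for _ in range(num_stages)]
    bubble = 0.0
    for s in range(num_stages):
        for i, t in enumerate(micro_batch_times):
            ready = finish[s - 1][i] if s > 0 else 0.0  # input available from upstream
            free = finish[s][i - 1] if i > 0 else 0.0   # this stage's previous finish
            start = max(ready, free)
            if i > 0:
                bubble += start - free                  # stall while waiting for upstream
            finish[s][i] = start + t
    return bubble

imbalanced = [1.0, 4.0, 1.0, 4.0, 1.0, 4.0]  # alternating decode-only / prefill-heavy micro-batches
balanced = [2.5] * 6                          # same total work, equalized by chunking

print(pipeline_bubble_time(imbalanced, num_stages=4))  # > 0: stages stall before heavy micro-batches
print(pipeline_bubble_time(balanced, num_stages=4))    # 0.0: uniform micro-batches keep the pipeline full
```

The imbalanced stream accumulates stall time at every downstream stage, while the equalized stream keeps all stages busy once the pipeline is full, which is the effect Sarathi's chunking is designed to produce.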
The paper reports significant throughput improvements across hardware and model configurations. For the LLaMA-13B model on an A6000 GPU, Sarathi improves decode throughput by up to 10x and end-to-end throughput by up to 1.33x; for LLaMA-33B on an A100 GPU, it achieves 1.25x higher end-to-end throughput and up to 4.25x higher decode throughput. These results substantiate the claim that Sarathi makes better use of GPU resources by addressing both the compute-bound prefill phase and the memory-bound decode phase.
The implications of this research extend beyond raw throughput. Sarathi's approach suggests a scalable path for managing LLM inference at the multi-GPU node level, with potential deployment benefits for cloud-based AI services and edge settings that require efficient model execution. It may also motivate further serving-system innovations for transformer models, or adaptations of Sarathi's ideas to architectures beyond transformers, as demand for real-time language processing continues to rise.
On the theoretical side, Sarathi sheds light on the fundamental trade-offs among model partitioning, parallelism, and the balancing of execution resources. It lays the groundwork for further exploration of chunked-computation strategies that deliberately balance compute phases within AI workloads on contemporary heterogeneous architectures, and for deeper integration with emerging AI hardware accelerators.
Future work may focus on determining chunk sizes dynamically from live system metrics, or on applying Sarathi to other neural network architectures with similar phase-based bottlenecks. Investigating how Sarathi composes with other optimization layers, such as quantization and pruning, could further strengthen its utility in inference workloads with stringent latency and throughput requirements.
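As a rough illustration of the dynamic chunk-sizing direction, a controller might scale the next chunk size by how far the last batch's latency sits from a target; the heuristic and all names below are hypothetical, not something proposed in the paper:

```python
def choose_chunk_size(measured_latency_ms, target_latency_ms,
                      current_chunk_size, min_chunk=64, max_chunk=1024):
    """Scale the next prefill chunk size toward a per-iteration latency target,
    clamped to a sane range. Purely a speculative heuristic."""
    if measured_latency_ms <= 0:
        return current_chunk_size
    scale = target_latency_ms / measured_latency_ms
    proposed = int(current_chunk_size * scale)
    return max(min_chunk, min(max_chunk, proposed))
```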
Overall, Sarathi presents a robust methodology to navigate the complexities of LLM inference—a relevant contribution as AI systems continue to scale in capability and deployment.