Chunked Prefill: Efficient LLM Inference
- Chunked prefill is a batching strategy that partitions long input sequences into fixed-size chunks, enabling sequential processing through transformer layers and efficient KV caching.
- It optimizes hardware utilization by overlapping prefill and decode phases, reducing time-to-first-token and boosting throughput by factors as high as 10× in piggybacked decoding.
- Adoption across dense, sparse, and MoE models has shown significant improvements in service-level objectives, though it introduces trade-offs in latency, memory traffic, and scheduling complexity.
Chunked prefill is a scheduling and batching strategy for transformer-based LLM inference in which long input sequences are partitioned into smaller, fixed-size chunks and processed sequentially through all model layers. Each chunk populates the key–value (KV) cache used by subsequent decode operations, which enables efficient use of hardware resources during the prefill (prompt ingestion) phase and reduces time-to-first-token (TTFT) latency. Its adoption across dense, sparse, and Mixture-of-Experts (MoE) models has made chunked prefill a foundational technique for modern LLM serving, but it introduces nontrivial trade-offs in latency, throughput, memory traffic, and scheduling complexity.
1. Formulation and Algorithmic Structure
Chunked prefill splits a prompt of length L into contiguous chunks of size S. For each chunk, forward computation is performed through all transformer layers, appending the newly computed KV vectors to the accumulated cache, then continuing to the next chunk. After all chunks are processed, autoregressive decode resumes using the completed cache (Agrawal et al., 2023, Lee et al., 9 Oct 2025). The scheduling loop typically interleaves chunked-prefill tasks with decode tasks in hybrid micro-batches, exploiting fused matmul kernels and maximizing hardware throughput (Agrawal et al., 2023).
Pseudocode (Abstracted):
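A minimal Python sketch of the loop described above. The layer function is a toy stand-in (a causal running average over cached scalars), not a real transformer layer, and the names `split_into_chunks`, `toy_layer`, and `chunked_prefill` are hypothetical. The point is only the control flow: each chunk passes through every layer and extends that layer's cache before the next chunk starts.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signature: (layer index, chunk hidden states, that layer's cache) -> new hidden states.
LayerFn = Callable[[int, List[float], List[float]], List[float]]


def split_into_chunks(tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Partition the prompt into contiguous chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def toy_layer(layer_idx: int, hidden: List[float], cache: List[float]) -> List[float]:
    """Toy stand-in for a transformer layer: causal averaging over cached entries."""
    out = []
    for h in hidden:
        cache.append(h)  # append this position's (toy) KV entry before attending
        out.append(sum(cache) / len(cache) + layer_idx)
    return out


def chunked_prefill(tokens: List[int], num_layers: int, chunk_size: int,
                    layer_fn: LayerFn = toy_layer) -> Tuple[List[List[float]], Optional[float]]:
    """Run every chunk through all layers, growing one cache per layer."""
    kv_cache: List[List[float]] = [[] for _ in range(num_layers)]
    last_hidden = None
    for chunk in split_into_chunks(tokens, chunk_size):
        hidden = [float(t) for t in chunk]  # toy embedding
        for layer in range(num_layers):
            hidden = layer_fn(layer, hidden, kv_cache[layer])
        last_hidden = hidden[-1]  # only the final position feeds the first decode step
    return kv_cache, last_hidden


prompt = list(range(1, 8))
chunked = chunked_prefill(prompt, num_layers=2, chunk_size=3)
unchunked = chunked_prefill(prompt, num_layers=2, chunk_size=len(prompt))
assert chunked == unchunked  # chunking changes the schedule, not the result
```

Because every position only reads cache entries at or before itself, the chunked and unchunked runs produce the same caches in this toy setting.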
Each chunk operation is compute-bound and admits compute–decode piggybacking: decodes utilize the already-loaded model weights, amortizing GPU resource usage.
2. Chunked Prefill in Serving Architectures
Chunked prefill is the default iteration granularity in several LLM serving frameworks, including Sarathi-Serve (Agrawal et al., 2023), vLLM, SGLang, and RServe (Guo et al., 29 Sep 2025), and it is core to pipeline-parallel, data-parallel, and prefill–decode (PD) disaggregated architectures (Shi et al., 9 Jul 2025, Wang et al., 4 Aug 2025). In Sarathi-Serve, decode-maximal batching fuses one prefill chunk with as many decodes as possible, allowing compute-intensive prefill and memory-bound decode to share large matmuls efficiently. In RServe’s multimodal, pipeline-disaggregated architecture, chunked prefill enables fine-grained overlap between encoder outputs and LLM prefill, supporting both intra-request and inter-request pipelines for improved parallelism (Guo et al., 29 Sep 2025).
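The batching idea can be sketched as a per-iteration batch builder. This is an illustration under stated assumptions, not Sarathi-Serve's actual scheduler: the per-iteration `token_budget`, the decode-first priority, and the names `PrefillJob` and `build_hybrid_batch` are all hypothetical. Each running decode is assumed to cost one token slot, and whatever budget is left goes to a single prefill chunk.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PrefillJob:
    request_id: str
    prompt_len: int
    done: int = 0  # prompt tokens already prefilled

    @property
    def remaining(self) -> int:
        return self.prompt_len - self.done


def build_hybrid_batch(
    decode_ids: List[str],
    prefill: Optional[PrefillJob],
    token_budget: int,
) -> Tuple[List[str], Optional[Tuple[str, int, int]]]:
    """Return (decodes scheduled, (request_id, start, end) prefill slice or None)."""
    decodes = decode_ids[:token_budget]  # each decode occupies one token slot
    leftover = token_budget - len(decodes)
    if prefill is None or prefill.remaining == 0 or leftover == 0:
        return decodes, None
    take = min(leftover, prefill.remaining)
    start = prefill.done
    prefill.done += take  # advance the job so the next call continues from here
    return decodes, (prefill.request_id, start, start + take)


job = PrefillJob("r0", prompt_len=1000)
while job.remaining:
    print(build_hybrid_batch(["d1", "d2", "d3"], job, token_budget=256))
```

In this sketch the effective chunk size is not fixed: it shrinks as more decodes occupy the budget, which keeps the work per iteration roughly constant.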
Prefill–decode disaggregation frameworks (such as TaiChi (Wang et al., 4 Aug 2025)) use chunk size as an explicit control lever to trade off TTFT and time-per-output-token (TPOT) across different pools of GPU hardware, thus generalizing chunked prefill as a fundamental axis for balancing latency and throughput SLOs across a wide range of deployment regimes.
3. Performance, Latency, and Throughput Trade-Offs
The chunk size S directly modulates key SLO metrics:
- TTFT (Time-to-First-Token): Small S improves responsiveness by reducing the single-iteration time to process the initial portion of a long prompt, mitigating head-of-line blocking and allowing high-priority requests to preempt ongoing work (Hsieh et al., 18 Feb 2026). However, smaller chunks entail more kernel launches and iteration overhead, limiting throughput.
- TPOT (Time-Per-Output-Token): Large S can cause phase interference between prefill and decode, where memory-bound decode operations wait on compute-bound prefill, increasing TBT (time-between-tokens) (Shi et al., 9 Jul 2025, Lee et al., 9 Oct 2025). Small S better isolates decode, keeping TBT low.
Empirically, chunk sizes of 256–512 tokens yield strong throughput–latency performance for dense models (Agrawal et al., 2023), whereas tightly constrained TBT SLOs may require even smaller S (Lee et al., 9 Oct 2025).
Quantitative impacts (representative):
- For LLaMA-13B/A6000: Up to 10× higher decode throughput (piggybacked) and up to 1.33× higher end-to-end throughput relative to decode-only or pure prefill (Agrawal et al., 2023).
- In production workloads (Qwen3-30B, 9k-token arXiv): TTFT mean reduced from 4.50s (no chunking) to 2.80s (chunked S=512), with TBT mean reduced from 45ms to 32.9ms (Lee et al., 9 Oct 2025).
The trade-off forms a continuum: decreasing S reduces TTFT but adds overhead, while increasing S improves throughput but exacerbates decode blocking (Hsieh et al., 18 Feb 2026).
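The continuum can be made concrete with a deliberately simple cost model. It assumes each iteration pays a fixed overhead plus a cost linear in the tokens it processes; attention's dependence on context length is ignored, and the function name, parameter names, and numeric values are illustrative assumptions rather than measurements.

```python
import math


def prefill_cost_model(prompt_len: int, chunk_size: int,
                       per_token_us: int, per_iter_overhead_us: int) -> dict:
    """Toy linear cost model for chunked prefill (all times in microseconds)."""
    num_iters = math.ceil(prompt_len / chunk_size)
    # Longest single iteration: how long a co-batched decode or a waiting
    # request is held up before the scheduler can act again.
    iteration_us = per_iter_overhead_us + min(chunk_size, prompt_len) * per_token_us
    # Total time to ingest the prompt: the fixed overhead is paid once per chunk.
    total_us = num_iters * per_iter_overhead_us + prompt_len * per_token_us
    return {"iterations": num_iters, "iteration_us": iteration_us, "total_prefill_us": total_us}


for s in (128, 512, 2048):
    print(s, prefill_cost_model(8192, s, per_token_us=50, per_iter_overhead_us=2000))
```

Under these assumed numbers, S = 128 gives 64 iterations of 8.4 ms each (537.6 ms in total), while S = 2048 gives 4 iterations of 104.4 ms each (417.6 ms in total): the small chunk blocks other work for far less time per iteration but pays the fixed overhead sixteen times as often.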
4. System-Level Challenges and Optimizations
4.1. Phase Interference and Intra-GPU Disaggregation
When prefill and decode are co-batched in the same GPU streams, kernel-level profiling shows memory-bound decode kernels can be slowed by as much as 8–10× compared to decode-only batches, as prefill matmuls monopolize streaming multiprocessors (SMs) and off-chip bandwidth (Shi et al., 9 Jul 2025). Systems such as Nexus decouple GPU resources dynamically between prefill and decode, partitioning SMs and using two independent schedulers (prefill: shortest-prompt-first; decode: FCFS) to virtually disaggregate the phases within a single device, substantially reducing TTFT and TBT while achieving up to 2.2× higher throughput (Shi et al., 9 Jul 2025).
4.2. Scheduling and Batching
Chunked prefill enables advanced batching techniques—e.g., decode-maximal batching in Sarathi (Agrawal et al., 2023) and token-budgeted micro-batching in RServe (Guo et al., 29 Sep 2025)—to align hardware saturation with workload variability. In multi-GPU or disaggregated settings, chunk scheduling can be dynamically tuned across heterogeneous instances for optimal SLO attainment (Wang et al., 4 Aug 2025).
Slack-aware scheduling (e.g., S-EDF in FlowPrefill (Hsieh et al., 18 Feb 2026)) uses per-request slack windows and batch-token budgets to optimize admission control, minimizing wasted compute on requests destined to miss SLO deadlines while still achieving high utilization. Fine-grained preemption at operator or layer level (rather than at chunk boundaries) can further reduce preemption blocking by 3.5–4.2× (Hsieh et al., 18 Feb 2026).
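The general shape of slack-aware admission can be sketched with a generic earliest-deadline-first pass. This is not FlowPrefill's S-EDF algorithm, whose details are not given here; the `Request` fields, the `admit` function, and the constant `tokens_per_ms` throughput estimate are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    request_id: str
    deadline_ms: int        # absolute TTFT deadline
    remaining_tokens: int   # prompt tokens still to prefill


def admit(requests: List[Request], now_ms: int,
          token_budget: int, tokens_per_ms: int) -> List[Tuple[str, int]]:
    """Grant prefill tokens in deadline order, skipping requests with negative slack."""
    admitted = []
    for req in sorted(requests, key=lambda r: r.deadline_ms):  # earliest deadline first
        if token_budget == 0:
            break
        # Optimistic finish time if the request ran alone from now on.
        finish_ms = now_ms + math.ceil(req.remaining_tokens / tokens_per_ms)
        slack_ms = req.deadline_ms - finish_ms
        if slack_ms < 0:
            continue  # would miss its deadline anyway; spend the budget elsewhere
        grant = min(req.remaining_tokens, token_budget)
        admitted.append((req.request_id, grant))
        token_budget -= grant
    return admitted


pending = [
    Request("A", deadline_ms=100, remaining_tokens=4000),
    Request("B", deadline_ms=300, remaining_tokens=1000),
    Request("C", deadline_ms=200, remaining_tokens=600),
]
print(admit(pending, now_ms=0, token_budget=1024, tokens_per_ms=20))
```

Request A has the earliest deadline but cannot finish in time even when run alone, so the batch-token budget goes to C and then B instead.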
4.3. Limitations in MoE, Memory, and Bandwidth
In Mixture-of-Experts (MoE) models, chunked prefill with small S erodes the benefit of weight sparsity: each chunk reloads a largely redundant set of expert weights, raising memory traffic by up to 39% and energy per token by 22% versus naive batching (Lee et al., 9 Oct 2025). This is especially acute for long contexts and large expert sets, where redundant parameter loads and sparse-expert underutilization render chunked prefill suboptimal.
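A back-of-the-envelope count shows why redundant expert loads grow with the number of chunks. The sketch assumes each token independently picks `top_k` distinct experts uniformly at random and that no expert weights stay resident between chunks. Both are simplifying assumptions: real routers are skewed and caches retain some weights, so this overstates the effect and is not meant to reproduce the measured figures above. The function names are hypothetical.

```python
def expected_experts_touched(num_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected number of distinct experts activated by num_tokens tokens."""
    # A given expert is missed by one token with probability 1 - top_k / num_experts.
    p_untouched = (1 - top_k / num_experts) ** num_tokens
    return num_experts * (1 - p_untouched)


def expert_loads_per_layer(prompt_len: int, chunk_size: int,
                           num_experts: int, top_k: int) -> float:
    """Expected expert-weight loads for one MoE layer over the whole prompt."""
    full_chunks, tail = divmod(prompt_len, chunk_size)
    loads = full_chunks * expected_experts_touched(chunk_size, num_experts, top_k)
    if tail:
        loads += expected_experts_touched(tail, num_experts, top_k)
    return loads


for s in (128, 512, 8192):
    print(s, round(expert_loads_per_layer(8192, s, num_experts=128, top_k=8), 1))
```

Even a 128-token chunk touches nearly every expert under these assumptions, so the number of expert loads scales almost linearly with the number of chunks.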
Memory bottlenecks, particularly during the prefill stage, motivated the development of schemes like MOM, which partition intermediate activations into mini-sequences internal to each MLP and offload KV caches to CPU, reducing prefill memory consumption by over 50% and enabling significantly longer context lengths (e.g., 455k tokens vs 338k for chunked prefill) (Zhang et al., 16 Apr 2025).
5. Specialized Techniques: Chunked Prefill as Substrate
5.1. Sparse Attention and KV Selection
Chunked prefill serves as a substrate for various acceleration techniques:
- QUOKA performs two-stage token-level KV selection during chunked prefill, maintaining near-dense accuracy while reducing the number of KV pairs by 88% and attention compute by up to 7× (Jones et al., 9 Feb 2026). Its query-wise sub-selection is compatible with common chunked prefill regimes.
- CompactAttention further exploits block-union KV selection: starting from 2D block-sparse masks (from selectors like FlashPrefill), it performs Q-block union and GQA-group intra-group union to derive minimal per-group KV block tables. This enables zero-copy paged attention, eliminating explicit KV gathering, and yields up to 2.72× attention speedup (128k context) with accuracy within 0.3 pp of dense attention (Song et al., 16 May 2026).
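The block-union step can be illustrated with plain sets. This is a sketch of the union logic only, not CompactAttention's implementation: the mask layout (`mask[head][q_block]` as a set of KV-block indices), the assumption that consecutive query heads share a KV head, and the name `kv_block_tables` are all hypothetical.

```python
from typing import Dict, List, Set

# mask[head][q_block] = set of KV-block indices that this head's query block attends to.
BlockMask = Dict[int, Dict[int, Set[int]]]


def kv_block_tables(mask: BlockMask, heads_per_group: int) -> Dict[int, List[int]]:
    """Union selected KV blocks over Q blocks and over heads in each GQA group."""
    tables: Dict[int, Set[int]] = {}
    for head, per_q_block in mask.items():
        group = head // heads_per_group  # assumed head-to-group mapping
        selected = tables.setdefault(group, set())
        for kv_blocks in per_q_block.values():
            selected |= kv_blocks
    # Sorted block indices can serve as a page table into the paged KV cache.
    return {group: sorted(blocks) for group, blocks in tables.items()}


mask = {
    0: {0: {0, 3}, 1: {1}},
    1: {0: {3, 5}},
    2: {0: {2}},
    3: {0: {2, 7}},
}
print(kv_block_tables(mask, heads_per_group=2))
```

Each group ends up with one table of block indices, so an attention kernel can read the listed pages in place rather than gathering selected KV entries into a new buffer.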
Table: Chunked prefill acceleration approaches
| Approach | Key Mechanism | Speedup / KV Reduction | Accuracy Impact |
|---|---|---|---|
| QUOKA | Query-oriented KV | 5–7× attention | <3% drop (typical) |
| CompactAttention | Block-union, paged zero-copy | 2.72× at 128k context | ≲0.3 pp from dense |
| MOM | Mini-sequence + offload | 1.5–2× memory reduction | None (identical) |
5.2. Multimodal and Multi-turn Scenarios
Chunked prefill is integral to multimodal serving engines (RServe), where it enables overlapping multimodal encoding with LLM prefill, both intra-request (streamed embeddings) and inter-request (token-budgeted batching) (Guo et al., 29 Sep 2025). In multi-turn conversational settings, append-prefill mechanisms reuse cached KV states for successive turns, amortizing the quadratic prefill cost and, when dynamically routed, can yield a 68% reduction in Turn-2+ TTFT (Li et al., 9 Mar 2026).
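The append-prefill idea for multi-turn conversations can be sketched as a planning step that finds the longest cached prefix and schedules chunks only for the new suffix. The function names and the choice to always recompute at least the final token (so that the first decode step has an output to sample from) are assumptions of this sketch, not details taken from the cited systems.

```python
from typing import List, Tuple


def common_prefix_len(cached: List[int], prompt: List[int]) -> int:
    """Length of the longest shared token prefix."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def plan_append_prefill(cached_tokens: List[int], new_prompt: List[int],
                        chunk_size: int) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (tokens reused from cache, [(start, end) chunks still to prefill])."""
    if not new_prompt:
        return 0, []
    reuse = common_prefix_len(cached_tokens, new_prompt)
    # Always leave the last token to compute, even if the whole prompt is cached.
    reuse = min(reuse, len(new_prompt) - 1)
    chunks = [(start, min(start + chunk_size, len(new_prompt)))
              for start in range(reuse, len(new_prompt), chunk_size)]
    return reuse, chunks


turn1 = list(range(100))        # tokens whose KV entries are already cached
turn2 = turn1 + [7] * 50        # previous context plus the new user turn
print(plan_append_prefill(turn1, turn2, chunk_size=32))
```

Only the 50 appended tokens are scheduled here, in two chunks, while the first 100 positions reuse their cached KV entries.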
6. Practical Guidelines and Future Directions
Choosing chunk size S: The main guidelines, drawn from Sarathi, TaiChi, and Layered Prefill, are:
- For strict TTFT SLOs (second-scale budgets), large S (e.g., 512–1024) amortizes iteration overhead.
- For strict TPOT SLOs (≤100 ms/token), small S (128–256) isolates decode-sensitive batches.
- For MoE or long-prompt workloads, hybrid chunk+layer or block-union schemes offer better trade-offs to avoid memory “explosion” (Lee et al., 9 Oct 2025, Song et al., 16 May 2026).
Hybrid architectures (TaiChi) adapt chunk size and resource assignment per hardware pool and workload regime to maximize overall goodput under joint TTFT/TPOT SLOs, obtaining 20–77% higher request-attainment rates compared to pure aggregation or disaggregation (Wang et al., 4 Aug 2025).
Ongoing work explores dynamic per-layer or per-request chunk sizing, fused attention/MLP partitioning, and further integration with offloading and quantization. Memory-efficient management (e.g. ContiguousKV's chunk-aligned prefetch) now targets decode-stage KV bottlenecks as prefill pressure is relieved (Zou et al., 20 Jan 2026, Zhang et al., 16 Apr 2025).
Chunked prefill remains a central paradigm in LLM serving and extends directly to dense, sparse, MoE, multimodal, and multi-turn workloads. Effective deployment, however, requires careful balancing of chunk size, pipeline structure, resource scheduling, and integration with advanced acceleration and caching techniques to avoid phase interference, memory inefficiency, and bandwidth bottlenecks.