PagedAttention: Efficient Memory Management for LLMs
- PagedAttention is a memory management approach that partitions KV caches into fixed-size pages, enabling efficient transformer LLM inference.
- It dynamically allocates fixed-size blocks and uses per-sequence block tables to minimize memory fragmentation and support long-context decoding.
- Empirical results demonstrate up to 24× throughput improvements and significant memory savings in high-concurrency serving and inference tasks.
PagedAttention is a memory management and caching framework for efficient key-value (KV) cache allocation in transformer-based LLM inference. Inspired by operating system virtual memory paging, PagedAttention partitions each sequence’s KV cache into fixed-size blocks that are dynamically allocated and referenced through per-sequence block tables. This approach dramatically reduces the internal and external fragmentation caused by monolithic allocation schemes, thereby enabling higher throughput, larger effective batch sizes, and support for long-context decoding on resource-constrained hardware.
1. Principles of PagedAttention
PagedAttention maps the logical KV cache for each sequence into fixed-size, physically noncontiguous blocks (or "pages"). Each block contains KV pairs for a fixed number of tokens, typically 16–128, and may reside anywhere in device memory. A per-sequence table maintains the mapping from logical (token position) to physical (memory address) blocks, analogous to page tables in operating systems (Kwon et al., 2023).
Key steps:
- When a new token fills up the active block, a new physical block is allocated and registered in the page table.
- Attention kernels gather KV states for a sequence by traversing page tables and fetching the appropriate blocks.
- When a sequence completes or requires memory reclamation, all blocks registered in its table can be promptly released.
This structure enables on-demand allocation and fine-grained reclamation, resulting in minimal idle memory. Additionally, prefix blocks can be shared between sequences (e.g., beam search, parallel sampling), leveraging copy-on-write semantics for block modifications (Kwon et al., 2023, Kolluru, 17 Nov 2025).
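A minimal sketch of this bookkeeping (hypothetical helper and variable names, not the vLLM implementation) shows how a global pool of physical blocks and per-sequence page tables interact:

```python
from collections import defaultdict

BLOCK_SIZE = 16  # tokens per block (b); illustrative value

class BlockAllocator:
    """Toy allocator: a global pool of physical blocks plus per-sequence page tables."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # physical block ids not currently in use
        self.page_table = defaultdict(list)         # sequence id -> ordered physical block ids

    def grow(self, seq: int) -> int:
        """Allocate one more physical block for a sequence and record it in its page table."""
        block = self.free_blocks.pop()
        self.page_table[seq].append(block)
        return block

    def release(self, seq: int) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.page_table.pop(seq, []))

    def physical_slot(self, seq: int, token_pos: int) -> tuple[int, int]:
        """Map a logical token position to a (physical block id, offset within block) pair."""
        return self.page_table[seq][token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE
```

Allocation is triggered only when a sequence's last block fills up, and `release` makes all of a sequence's memory reusable immediately after completion.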
2. Mathematical and Algorithmic Formulation
Let $B$ be the number of concurrent sequences, $L_{\max}$ the maximum sequence length, $d$ the hidden dimension, $H$ the number of attention heads, and $b$ the block size (tokens per block). Instead of statically reserving $B \cdot L_{\max}$ token slots per batch, PagedAttention only allocates memory for active (populated) blocks:

$$M_{\text{paged}} = \sum_{i=1}^{B} \left\lceil \frac{\ell_i}{b} \right\rceil \cdot b \cdot c,$$

where $\ell_i$ is the current length of sequence $i$ and $c$ is the per-token KV storage cost in bytes. In steady-state generation, internal fragmentation per sequence is limited to at most $b - 1$ tokens. Memory efficiency is quantified as the ratio of utilized to reserved cache,

$$\eta = \frac{M_{\text{used}}}{B \cdot L_{\max} \cdot c},$$

where $c$ is the per-token storage cost and $M_{\text{used}}$ is the number of bytes in use (Prabhu et al., 7 May 2024).
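As a concrete illustration, the following back-of-the-envelope calculation (assumed 7B-class model dimensions and a hypothetical batch; not measured figures) compares static reservation with paged allocation:

```python
import math

# Assumed model/runtime parameters (illustrative only).
b = 16                              # block size in tokens
d, layers = 4096, 32                # hidden dimension and layer count (7B-class model)
bytes_per_elem = 2                  # fp16 KV entries
c = 2 * layers * d * bytes_per_elem # per-token KV cost: K and V across all layers = 512 KiB

L_max = 4096
seq_lens = [700, 1500, 300, 2600]   # hypothetical batch of B = 4 sequences

reserved = len(seq_lens) * L_max * c                        # static per-sequence reservation
paged = sum(math.ceil(l / b) * b * c for l in seq_lens)     # PagedAttention allocation
used = sum(seq_lens) * c                                    # bytes actually holding KV states

print(f"reserved: {reserved / 2**30:.2f} GiB")              # 8.00 GiB
print(f"paged:    {paged / 2**30:.2f} GiB")                 # 2.50 GiB
print(f"eta (used / reserved) = {used / reserved:.2f}")     # ~0.31 with static reservation
print(f"used / paged          = {used / paged:.3f}")        # ~0.996 with paging
```

In this scenario static reservation leaves roughly two thirds of the cache idle, while paged allocation wastes at most $b - 1$ tokens per sequence.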
The allocation and retrieval algorithms operate as follows:
```python
# Write the KV state of the newly generated token into the paged cache.
if ctx_len[seq] % b == 0:                 # active block is full: start a new one
    block_id = alloc_block()
    page_table[seq].append(block_id)
block_id = page_table[seq][-1]            # current (last) physical block
offset = ctx_len[seq] % b                 # slot within that block
K_blocks[block_id][offset] = current_K
V_blocks[block_id][offset] = current_V
ctx_len[seq] += 1                         # advance the sequence length
```
PagedAttention supports rapid block (page) allocation and deallocation, and batch attention kernels gather all relevant blocks using the page table indirection.
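In reference (non-fused) form, the read path is a gather over the page table; the production kernels perform this indirection inside a fused attention kernel rather than materializing contiguous copies. Tensor names and shapes below are illustrative:

```python
import torch

def gather_kv(K_blocks, V_blocks, page_table, ctx_len):
    """Reference gather: follow one sequence's page table and trim to its true length.
    K_blocks / V_blocks: [num_physical_blocks, block_size, H, head_dim] paged pools."""
    K = K_blocks[page_table].reshape(-1, *K_blocks.shape[2:])[:ctx_len]
    V = V_blocks[page_table].reshape(-1, *V_blocks.shape[2:])[:ctx_len]
    return K, V

# Single-query decode step against the gathered cache (illustrative shapes).
H, head_dim, block_size = 8, 64, 16
K_blocks = torch.randn(128, block_size, H, head_dim)   # paged K pool
V_blocks = torch.randn(128, block_size, H, head_dim)   # paged V pool
page_table = torch.tensor([5, 42, 7])                  # logical blocks 0..2 -> physical 5, 42, 7

K, V = gather_kv(K_blocks, V_blocks, page_table, ctx_len=40)
q = torch.randn(H, head_dim)                           # current query, one vector per head
scores = torch.einsum("hd,thd->ht", q, K) / head_dim**0.5
out = torch.einsum("ht,thd->hd", scores.softmax(dim=-1), V)
```

The fused kernels avoid the intermediate `K`/`V` copies by dereferencing block pointers directly while computing attention.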
3. Integration with Serving Systems and Attention Kernels
PagedAttention is centrally featured in the vLLM LLM serving system (Kwon et al., 2023), wherein the block engine manages a global pool of free device memory blocks and per-sequence block tables. The system supports:
- KV block sharing across multiple sequences and requests via reference counting.
- Copy-on-write semantics for block mutation in branched decoding (beam search, few-shot, RAG); see the sketch after this list.
- Custom CUDA attention kernels that accept a list of block pointers and treat the concatenated blocks as input to fused matrix multiplication routines.
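A toy sketch of reference-counted sharing with copy-on-write (hypothetical class and method names, not the vLLM block engine API) illustrates how forked sequences share prefix blocks until one of them writes:

```python
from collections import defaultdict

class SharedBlockPool:
    """Toy reference-counted block pool with copy-on-write for forked sequences."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refs = defaultdict(int)        # physical block id -> reference count
        self.tables = defaultdict(list)     # sequence id -> ordered physical block ids

    def append_block(self, seq: int) -> int:
        """Allocate a fresh block at the end of a sequence's table."""
        blk = self.free.pop()
        self.refs[blk] = 1
        self.tables[seq].append(blk)
        return blk

    def fork(self, parent: int, child: int) -> None:
        """Share all of the parent's blocks with a child (e.g., one beam hypothesis)."""
        self.tables[child] = list(self.tables[parent])
        for blk in self.tables[child]:
            self.refs[blk] += 1

    def writable_block(self, seq: int, logical_idx: int, kv_store) -> int:
        """Before mutating a block, copy it if another sequence still references it."""
        blk = self.tables[seq][logical_idx]
        if self.refs[blk] > 1:                           # shared: trigger copy-on-write
            new_blk = self.free.pop()
            kv_store[new_blk] = kv_store[blk].clone()    # copy K/V contents (torch-style store)
            self.refs[blk] -= 1
            self.refs[new_blk] = 1
            self.tables[seq][logical_idx] = new_blk
            blk = new_blk
        return blk                                       # safe for this sequence to write into
```

In branched decoding, only the block a branch actually mutates is duplicated; fully shared prefix blocks continue to back all branches at a single physical copy.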
PyTorch’s FlexAttention API offers a programmable interface for implementing “paged” attention via `mask_mod` and `score_mod` hooks that accept arbitrary page-table-based masking and scoring. The FlexAttention kernel fuses these hooks into Triton or TorchInductor-generated GPU code, optimizing block-wise data movement and reducing the cost of indirection (Dong et al., 7 Dec 2024, Joshi et al., 8 Jun 2025). Empirical results show that FlexAttention with paging adds less than 2% kernel overhead and achieves up to a 30% reduction in peak GPU memory at 16K–64K contexts compared to standard full-KV caching (Dong et al., 7 Dec 2024).
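The sketch below shows how such a hook can express paged causal attention (requires PyTorch 2.5+; `PAGE`, `perm`, and `physical_to_logical` are illustrative assumptions rather than FlexAttention APIs, and this is not the library's shipped paged-attention integration):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

device = "cuda" if torch.cuda.is_available() else "cpu"
PAGE = 16                                       # tokens per page (assumed)
n_phys_pages, n_logical_pages = 16, 8
seq_len, kv_len = n_logical_pages * PAGE, n_phys_pages * PAGE

# Hypothetical page table for one sequence: logical page i lives in physical page perm[i].
perm = torch.tensor([5, 0, 12, 3, 9, 1, 14, 7], device=device)
physical_to_logical = torch.full((n_phys_pages,), -1, device=device)
physical_to_logical[perm] = torch.arange(n_logical_pages, device=device)

def paged_causal(b, h, q_idx, kv_idx):
    # Recover the logical token position of this physical KV slot; mask out
    # unused physical pages (logical id -1) and future logical positions.
    logical_page = physical_to_logical[kv_idx // PAGE]
    logical_kv = logical_page * PAGE + kv_idx % PAGE
    return (logical_page >= 0) & (logical_kv <= q_idx)

H, head_dim = 4, 64
q = torch.randn(1, H, seq_len, head_dim, device=device)   # queries in logical order
k = torch.randn(1, H, kv_len, head_dim, device=device)    # physically paged K pool
v = torch.randn(1, H, kv_len, head_dim, device=device)    # physically paged V pool

block_mask = create_block_mask(paged_causal, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=kv_len, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)      # [1, H, seq_len, head_dim]
```

Because the page-table lookup happens inside the fused kernel, no contiguous copy of the KV cache is ever materialized.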
4. Quantitative Performance and Memory Efficiency
PagedAttention realigns memory usage to closely match actual token demand with minimal reserve overhead. Evaluations demonstrate the following:
| Model | Baseline Throughput | PagedAttention Throughput | TGI Throughput | Peak Memory (GB) | p99 Latency (s) |
|---|---|---|---|---|---|
| LLaMA-2-7B | 9,170 | 15,243 | 4,156 | 24.3 | 14.2 |
PagedAttention achieves up to 24× throughput improvement over HuggingFace TGI for high-concurrency workloads, reduces peak GPU KV memory by 20%, and consistently scales with batch size and sequence length (Kolluru, 17 Nov 2025). Benchmarks on FlexAttention kernels indicate decoding latency per token remains nearly flat as sequence length scales up to 64K (see Table 1 in (Dong et al., 7 Dec 2024)).
Memory gains are more pronounced for heterogeneous-length batches, long-context tasks, and when sharing prefixes. For batches whose average length $\bar{\ell}$ falls well below the reserved maximum $L_{\max}$, the theoretical reduction ratio approaches $1 - \bar{\ell}/L_{\max}$ (up to 88% in the reported configurations) (Kolluru, 17 Nov 2025).
5. Extensions: Eviction, Compression, and Quantization
Block-Wise Eviction
PagedEviction augments PagedAttention with structured block-wise KV cache pruning: entire pages are evicted based on token- or block-level "importance" scores. This enables tight KV cache budgets with negligible accuracy loss on memory-constrained GPUs, without any kernel changes (Chitty-Venkata et al., 4 Sep 2025).
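A sketch of the eviction step under an assumed placeholder importance score (the actual PagedEviction scoring and scheduling are defined in the cited work):

```python
import torch

def evict_blocks(page_table, block_scores, budget_blocks, free_list):
    """Prune one sequence's page table to a block budget by dropping the
    lowest-scoring logical blocks (placeholder metric); real systems typically
    protect the most recent blocks and track which positions were evicted."""
    n_evict = max(0, len(page_table) - budget_blocks)
    if n_evict == 0:
        return page_table
    evict_idx = set(torch.topk(-block_scores, n_evict).indices.tolist())
    kept = []
    for i, blk in enumerate(page_table):
        if i in evict_idx:
            free_list.append(blk)       # physical block returns to the global pool
        else:
            kept.append(blk)
    return kept

# Example: a 10-block sequence pruned to a 6-block budget.
free_list = []
table = list(range(100, 110))           # physical block ids
scores = torch.rand(10)                 # placeholder per-block importance scores
table = evict_blocks(table, scores, budget_blocks=6, free_list=free_list)
```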
Variable Compression
KV-Compress introduces variable compression rates per attention head, evicting contiguous KV blocks selected by per-head "least important" metrics; the post-eviction memory footprint is determined by the number of blocks retained per head, rounded up to full pages. In its vLLM integration, substantial throughput gains and high compression ratios with >90% accuracy retention are reported (Rehg, 30 Sep 2024). The entire compression scheduling executes as custom on-GPU kernels.
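A schematic of allocating a global block budget non-uniformly across heads (placeholder proportional heuristic; the actual KV-Compress metrics and scheduling run as on-GPU kernels as described in the cited work):

```python
import torch

def per_head_block_budgets(head_scores, total_budget_blocks):
    """Split a global KV block budget across heads in proportion to a placeholder
    per-head importance score, so less important heads are compressed harder."""
    weights = head_scores / head_scores.sum()
    keep = torch.floor(weights * total_budget_blocks).long()
    remainder = total_budget_blocks - keep.sum()
    keep += (torch.arange(len(keep)) < remainder)   # hand out leftover blocks one by one
    return keep                                     # blocks retained per head

H = 8
head_scores = torch.rand(H) + 0.1                   # placeholder importance per head
print(per_head_block_budgets(head_scores, total_budget_blocks=40))  # sums to 40
```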
Low-Precision KV Caching
The PagedAttention memory layout is amenable to aggressive quantization. In practical systems, each KV page (block) can be quantized to 4 bits with group-wise min-max or Hessian-aware scaling (Srinivas et al., 30 May 2025). Streaming attention kernels decode and reconstitute these quantized pages on the fly with negligible (<0.1%) accuracy loss while quadrupling effective cache capacity.
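A minimal sketch of group-wise min-max 4-bit quantization of one KV page (illustrative group size; codes are stored unpacked in `uint8` for clarity, whereas real kernels pack two 4-bit values per byte and fuse dequantization into attention):

```python
import torch

def quantize_page_4bit(page, group_size=32):
    """Quantize a KV page to 4-bit codes with per-group asymmetric min-max scaling.
    page: [tokens, heads, head_dim] float tensor; returns (codes, scales, mins)."""
    flat = page.float().reshape(-1, group_size)
    mins = flat.min(dim=1, keepdim=True).values
    maxs = flat.max(dim=1, keepdim=True).values
    scales = (maxs - mins).clamp(min=1e-8) / 15.0            # 4-bit code range: 0..15
    codes = ((flat - mins) / scales).round().clamp(0, 15).to(torch.uint8)
    return codes, scales, mins

def dequantize_page_4bit(codes, scales, mins, shape):
    """Reconstruct an approximate floating-point page from the 4-bit codes."""
    return (codes.float() * scales + mins).reshape(shape)

# Round-trip on a random page: reconstruction error stays small relative to the data scale.
page = torch.randn(16, 8, 64)                                # [tokens, heads, head_dim]
codes, scales, mins = quantize_page_4bit(page)
recon = dequantize_page_4bit(codes, scales, mins, page.shape)
print((recon - page).abs().max())                            # typically on the order of 0.1
```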
6. Trade-Offs, Alternatives, and Limitations
PagedAttention requires noncontiguous tensor layouts, necessitating custom CUDA kernels or programmable GPU attention kernels with block-pointer support. This design adds complexity for kernel development, maintenance, and integration with new kernel optimizations (e.g., grouped-query attention, FlashInfer) (Prabhu et al., 7 May 2024). For each attention algorithm or optimization, new paged-compatible kernels may be required, increasing engineering overhead.
Alternatives such as vAttention employ CUDA virtual memory APIs to decouple virtual from physical memory, retaining contiguous layouts for the KV cache and enabling unmodified attention kernel reuse (Prabhu et al., 7 May 2024). This approach achieves similar near-zero fragmentation, often with lower runtime and programming complexity at a mild cost in practical flexibility (e.g., porting to non-NVIDIA backends).
PagedAttention introduces kernel overheads (10–30% in some benchmarks (Prabhu et al., 7 May 2024)) due to the additional indirection. CPU-side management of block tables can also impact tail latencies. Block size tuning ($b$) is critical to balancing internal waste, GPU utilization, and eviction/reactivity overheads (Kwon et al., 2023, Chitty-Venkata et al., 4 Sep 2025).
7. Applications and Deployment Considerations
PagedAttention is deployed broadly in high-throughput, memory-bound LLM serving scenarios, including:
- vLLM (primary reference implementation), Foundation Model Stack (FMS), and other open-source LLM inference systems (Kwon et al., 2023, Kolluru, 17 Nov 2025, Joshi et al., 8 Jun 2025).
- Multi-user API endpoints and batch generators, maximizing served request rates under finite device memory.
- RAG, beam search, and parallel sampling tasks where prefix sharing yields additional memory compaction.
- Industrial agentic pipelines integrating quantized paged KV caches with speculative decoding and multi-pass inference for constrained environments (Srinivas et al., 30 May 2025).
PagedAttention is most beneficial as context lengths continue to increase and models adopt more advanced attention variants or high-concurrency serving patterns. Ongoing development includes further extensions for multi-tier cache hierarchies, background page eviction, efficient quantization, and integration with dynamic attention pruning frameworks (Chitty-Venkata et al., 4 Sep 2025, Rehg, 30 Sep 2024). PagedAttention remains a foundational mechanism for scalable LLM inference in contemporary production and research settings.