PagedAttention and Fine-Grained Caching

Updated 8 February 2026
  • PagedAttention divides the KV cache into fixed-size pages, enabling dynamic memory allocation that minimizes internal fragmentation and supports efficient multi-sequence processing.
  • Fine-grained caching uses small page sizes to precisely match sequence requirements, reducing GPU memory waste to below 5% while boosting throughput by up to 40%.
  • PagedEviction applies block-level importance scoring for structured cache pruning, achieving higher ROUGE scores and optimized resource utilization in LLM inference.

PagedAttention is a memory management paradigm for LLM inference that partitions the key-value (KV) cache into fixed-size blocks (“pages”) and maintains an explicit logical-to-physical mapping via per-sequence page tables. Fine-grained caching refers to the use of small page/block sizes and structured page allocation/eviction to minimize both internal and external fragmentation, enabling dense utilization of GPU memory resources during dynamic, multi-request LLM serving. This approach, as developed in the vLLM system and subsequently extended and integrated with frameworks like IBM’s Foundation Model Stack (FMS), addresses the primary KV cache bottleneck that restricts batch sizes and throughput in LLM deployments. The algorithmic core is the separation of logical sequence positions from the physical backing memory, facilitating efficient allocation, sharing (e.g., for forked/beam-searched sequences), and block-wise eviction—all with minimal compute overhead.

1. PagedAttention Architecture and Principles

PagedAttention divides each layer's KV cache into pages (blocks) of size $B$ tokens. For a sequence of length $S$, there are $P = \lceil S/B \rceil$ pages. Each token $i$, $1 \leq i \leq S$, resides in page $p(i) = \lfloor (i-1)/B \rfloor$ with an in-page offset $o(i) = (i-1) \bmod B$. For each active sequence, a page table maps logical page indices to physical GPU memory buffers. Upon request, new blocks are allocated and added to this mapping; when a block is no longer required (e.g., due to sequence end or explicit eviction), its entry is returned to a global freelist. This structure eliminates the need for large contiguous KV memory allocations, substantially reducing internal fragmentation and enabling on-demand memory growth tailored to actual sequence lengths. The layout also supports copy-on-write for forking, parallel sampling, and beam search, with reference counting on pages enabling safe and efficient shared-memory management (Kwon et al., 2023, Joshi et al., 8 Jun 2025).
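
The allocation and copy-on-write mechanics above can be sketched as a minimal host-side allocator. This is an illustrative model, not vLLM's actual API: the class and method names (`BlockAllocator`, `Sequence.fork`, etc.) are hypothetical.

```python
# Minimal sketch of PagedAttention-style block management (illustrative
# names; not vLLM's real implementation). Logical pages map to physical
# blocks drawn from a global freelist; refcounts enable copy-on-write.

class BlockAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.freelist = list(range(num_blocks))   # free physical block ids
        self.refcount = [0] * num_blocks

    def allocate(self):
        block = self.freelist.pop()
        self.refcount[block] = 1
        return block

    def free(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.freelist.append(block)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.page_table = []   # logical page index -> physical block id
        self.length = 0

    def append_token(self):
        # A new physical block is allocated only at a page boundary.
        if self.length % self.allocator.block_size == 0:
            self.page_table.append(self.allocator.allocate())
        self.length += 1

    def fork(self):
        # Copy-on-write: the child shares all parent blocks; the shared
        # blocks' reference counts are incremented.
        child = Sequence(self.allocator)
        child.page_table = list(self.page_table)
        child.length = self.length
        for block in self.page_table:
            self.allocator.refcount[block] += 1
        return child

alloc = BlockAllocator(num_blocks=8, block_size=4)
seq = Sequence(alloc)
for _ in range(6):          # 6 tokens -> ceil(6/4) = 2 pages
    seq.append_token()
print(len(seq.page_table))  # 2
```

Forking (as in beam search) copies only the page table, not the KV data; a real system would additionally copy a shared page before any in-place write.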

2. Fine-Grained Caching and Block Granularity

Fine-grained caching refers to the use of small page/block sizes (e.g., $B = 16$, $32$, or $64$ tokens per page) in the KV cache, which allows the serving system to match resource allocation precisely to sequence requirements. Internal fragmentation per sequence is thus bounded above by $B$ tokens, and overall GPU memory waste is reduced to less than 5% in multi-sequence, long-context workloads (Joshi et al., 8 Jun 2025, Kwon et al., 2023). Ablations on page size demonstrate that $B = 16$–$32$ achieves the optimal trade-off between eviction overhead and memory efficiency: smaller $B$ leads to more frequent page-table updates, whereas larger $B$ increases worst-case slack when the last block is only partially used (Chitty-Venkata et al., 4 Sep 2025). The blockwise structure further enables block-level pruning and recycling of cache slots across divergent sequence workloads.
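
The per-sequence fragmentation bound can be checked with simple arithmetic: the waste is the unused tail of the last page, $B \lceil S/B \rceil - S < B$ tokens. A small sketch, using hypothetical sequence lengths rather than workloads from the cited papers:

```python
import math

# Worst-case internal fragmentation per sequence is the unused tail of
# the last page: waste(S, B) = B * ceil(S/B) - S, always strictly < B.

def internal_waste(seq_len, block_size):
    return block_size * math.ceil(seq_len / block_size) - seq_len

# Illustrative mixed-length batch (lengths chosen for the example only).
lengths = [100, 513, 2047]
for B in (16, 32, 64):
    total = sum(internal_waste(s, B) for s in lengths)
    alloc = sum(B * math.ceil(s / B) for s in lengths)
    print(f"B={B}: {total} wasted tokens ({100 * total / alloc:.2f}%)")
```

Even in this toy batch the waste stays in the low single-digit percent range for $B \leq 64$, consistent with the sub-5% figure reported for paged schemes.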

3. Structured Block-wise KV Cache Pruning: PagedEviction

PagedEviction operates as a fine-grained, structured KV cache pruning strategy that directly leverages paged memory layouts. Unlike token-wise schemes that require introspection into attention matrices or costly cross-block data movement, PagedEviction defines an importance score per token, $S_i = \|V_i\|_2 / \|K_i\|_2$, and aggregates per-block (page) importance scores $S_j = \frac{1}{B} \sum_{i \in \text{page } j} S_i$. During the prefill phase (prompt), all $L$ token scores are computed and the lowest-scoring $L - C$ tokens are pruned to enforce a cache budget of $C$, following a single $O(L \log L)$ sort per layer. In decode (autoregressive generation), evictions happen at page boundaries: when a block fills, $S_j$ is computed for all resident pages, and the page with the lowest score is evicted, freeing both its buffer and its page-table entry. Results on LongBench (Llama-3.x models, budget $1024$) show PagedEviction maintains higher ROUGE than StreamingLLM and KeyDiff at the same memory bound, with $+15$–$20\%$ absolute improvement and throughput gains of $30$–$40\%$ over baselines. This approach entirely avoids costly CUDA kernel changes, operating exclusively at the host/runtime level and interfacing directly with vLLM's page management (Chitty-Venkata et al., 4 Sep 2025).
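
The scoring and decode-time eviction rule can be sketched in a few lines. This is a simplified host-side model over plain Python lists; the actual method operates on GPU tensors inside vLLM's runtime, and the function names here are illustrative.

```python
import math

# Sketch of PagedEviction's block-level scoring (illustrative only).
# Per-token importance: S_i = ||V_i||_2 / ||K_i||_2.

def token_score(k_vec, v_vec):
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return norm(v_vec) / norm(k_vec)

def block_scores(keys, values, block_size):
    # Mean token score per page: S_j = (1/B) * sum of S_i over page j.
    scores = [token_score(k, v) for k, v in zip(keys, values)]
    return [sum(scores[j:j + block_size]) / len(scores[j:j + block_size])
            for j in range(0, len(scores), block_size)]

def evict_lowest(page_table, keys, values, block_size):
    # Decode-time rule: when a block fills, evict the resident page with
    # the lowest aggregate score, freeing its buffer and table entry.
    sj = block_scores(keys, values, block_size)
    victim = min(range(len(sj)), key=sj.__getitem__)
    del page_table[victim]
    return victim

# Toy example: 4 tokens, B = 2, so two pages.
keys = [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
values = [[3.0, 0.0], [3.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
page_table = [10, 11]            # hypothetical physical block ids
victim = evict_lowest(page_table, keys, values, block_size=2)
print(victim, page_table)        # the lower-scoring second page is evicted
```

Because the decision is made once per filled block rather than once per token, the eviction overhead is amortized over $B$ decode steps.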

4. Integration in Flexible Attention Backends

The paged cache design is compatible with diverse attention kernel implementations, notably PyTorch's FlexAttention. Integration involves fusing the gather operation (collecting scattered KV blocks for the active sequence range) as an on-the-fly index-based read in the attention kernel. The page table is maintained in GPU global memory, and indirection vectors $id_k, id_v$ specify for each token $t$ the offset $p(t) = \text{PT}[b(t)] \cdot B + o(t)$. Hooks such as mask_mod in FlexAttention restrict attention to within-sequence and within-page contexts, preserving correctness in batched, multi-sequence inference. Empirical benchmarks on IBM's FMS show that with PagedAttention and FlexAttention fusion, inference latency grows only $\sim 2\times$ when sequence length increases $16\times$ ($128 \to 2048$ tokens), while uncached or monolithic approaches suffer much steeper latency scaling. Memory overhead remains below $5\%$ for all realistic workloads (Joshi et al., 8 Jun 2025).
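
The indirection formula above can be made concrete with a short sketch of how a logical token position resolves to a flat offset in the physical KV buffer. The names (`flat_offset`, `PT`) are illustrative, not FlexAttention's API:

```python
# Sketch of the gather indirection fused into a paged attention kernel:
# logical token t lives in logical page b(t) = t // B at offset
# o(t) = t % B, and the page table PT maps b(t) to a physical block id,
# giving flat offset p(t) = PT[b(t)] * B + o(t).

def flat_offset(page_table, t, block_size):
    b = t // block_size          # logical page index b(t)
    o = t % block_size           # in-page offset o(t)
    return page_table[b] * block_size + o

# Example: logical pages 0, 1, 2 scattered to physical blocks 5, 2, 7.
PT = [5, 2, 7]
B = 16
print(flat_offset(PT, 0, B))    # 5*16 + 0 = 80
print(flat_offset(PT, 17, B))   # 2*16 + 1 = 33
print(flat_offset(PT, 40, B))   # 7*16 + 8 = 120
```

In the fused kernel this arithmetic happens per read, so consecutive logical tokens can hit physically non-adjacent blocks without any preparatory copy.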

5. Fragmentation, Contiguity, and Performance Trade-offs

PagedAttention achieves near-zero external fragmentation, as all free space is contained within the last (potentially incomplete) page of each active sequence. Internal fragmentation per sequence is at most $B$ tokens. Compared to monolithic allocations, which suffer $60$–$80\%$ dead memory in mixed-length batches, block-paged schemes keep waste $< 5\%$. However, the loss of virtual contiguity (i.e., non-contiguous per-request memory) introduces small but measurable kernel throughput penalties (GPU-side: $20$–$28\%$ slower than contiguous-input kernels; CPU-side: up to $10\%$ added decode latency for block-table construction) (Prabhu et al., 2024). Alternative systems like vAttention address contiguity by using CUDA virtual memory APIs to preserve the virtual address view for kernels, supporting contiguous-input attention kernels unmodified and hiding most physical allocation granularity from user code. The contiguity ratio for vAttention is always $1$, whereas for PagedAttention it approaches $B/L_\text{max}$. In practice, all systems based on paged caches benefit from lower overall contention and higher throughput at large batch sizes compared to static pre-allocation (Prabhu et al., 2024).
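
The gap between monolithic and paged waste can be illustrated with a back-of-the-envelope calculation; the batch lengths below are hypothetical, not measurements from the cited papers:

```python
# Illustrative dead-memory comparison: a monolithic allocator reserves
# L_max tokens per sequence up front, while a paged allocator wastes at
# most the unused tail of each sequence's last page.

L_MAX, B = 2048, 16
lengths = [300, 700, 1500]   # actual decoded lengths in a mixed batch

monolithic_waste = sum(L_MAX - s for s in lengths)
paged_waste = sum(-s % B for s in lengths)   # tail slack of last page

monolithic_total = L_MAX * len(lengths)
print(f"monolithic: {100 * monolithic_waste / monolithic_total:.1f}% dead")
print(f"paged:      {100 * paged_waste / monolithic_total:.3f}% dead")
```

Even this small example lands squarely in the $60$–$80\%$ dead-memory regime for monolithic pre-allocation while the paged layout wastes a fraction of a percent.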

6. Advanced Applications, Extensions, and Quantitative Results

PagedAttention and its fine-grained caching enable advanced features such as copy-on-write page sharing across forks or beams, deferred reclamation for low-latency serving, hierarchical or mixed page sizes, and the possibility of external swapping or rematerialization policies under resource pressure. Evaluation across representative models (Llama-1B, Llama-3B, Llama-8B) and benchmarks (LongBench, ShareGPT, WikiText-103) consistently shows:

  • Batch size capacity doubles or quadruples versus contiguous allocators at the same memory footprint (Kwon et al., 2023);
  • Latency per output token is reduced by $10$–$12\%$ with block-sized amortization of eviction overhead (Chitty-Venkata et al., 4 Sep 2025);
  • No empirical loss in perplexity compared to monolithic caches (e.g., 7.32 vs. 7.31 for LLaMA-7B on WikiText-103) (Joshi et al., 8 Jun 2025);
  • Throughput improvements of $1.2\times$–$2.0\times$ across high-throughput, long-context scenarios (Prabhu et al., 2024, Kwon et al., 2023).

These results establish paged, fine-grained KV caching as the prevailing architecture for high-capacity LLM serving, with ongoing research into minimizing kernel-side penalties (e.g., via vAttention), hardware/driver API support for finer page sizes, and further amortization strategies.


Table 1: Comparative Metrics Across Approaches

| Scheme | Max Internal Frag. | Contiguity Ratio ($\mathcal{C}$) | Required Kernel Changes |
|---|---|---|---|
| Monolithic | $O(M_\text{max} - L)$ tokens | $1$ | None |
| PagedAttention | $\leq B$ tokens | $B/L_\text{max}$ | Yes (paged kernels) |
| vAttention | $\leq K'$ tokens | $1$ | No (virtual memory) |

Here $B$ is the block/page size, $K'$ is the page size in tokens for vAttention, $L$ is the actual decoded length, and $M_\text{max}$ is the maximum sequence length reserved up front by the monolithic allocator.

Extensions under development include hierarchical block/page sizes, transparent GPU-CPU demand paging for cache spillover, and application of similar fine-grained paging to activations or optimizer states in training and fine-tuning. Integration with other frameworks (FlexAttention, FlashInfer) and abstraction via virtual memory APIs (as in vAttention) open further cross-hardware and cross-vendor opportunities. The principles of fine-grained paging and page-aware address segmentation (as in TransFetch for hardware prefetch (Zhang et al., 2022)) are finding applications in both ML system software and ML-accelerated hardware architectures.

PagedAttention and fine-grained caching have, within a short timespan, become cornerstones of efficient, scalable LLM inference and memory management, enabling unprecedented throughput, minimal hardware waste, and tractable support for extreme-length contexts in future LLM systems.
