PagedEviction: Block-Aligned KV Cache Management
- PagedEviction is a block-wise KV cache management method that evicts entire memory pages based on aggregated importance scores to maintain structured cache budgets.
- The algorithm uses token-level scoring during the prefill phase and block-level aggregation during the decode phase to efficiently mitigate cache memory growth in long-context LLM inference.
- Empirical evaluations show significant accuracy, throughput, and latency improvements while integrating seamlessly with systems like vLLM PagedAttention without kernel modifications.
PagedEviction is a structured, block-wise algorithm designed for efficient Key-Value (KV) cache management in LLM inference, particularly within paged memory layouts as utilized by frameworks such as vLLM's PagedAttention. The technique addresses the KV cache memory bottleneck that arises during long-context autoregressive generation by evicting less important attention states at the granularity of memory blocks ("pages"), thereby maintaining strict cache budgets and high inference accuracy. Unlike prior approaches based on token-level importance or page-crossing evictions, PagedEviction aligns eviction decisions to the underlying block structure and integrates seamlessly with PagedAttention—requiring no modifications to optimized CUDA kernels.
1. Motivation and Design Rationale
PagedEviction was developed to mitigate the rapid growth of KV cache memory as sequence length increases during LLM inference. In the context of autoregressive models, the KV cache (storing key and value vectors for each processed token) can surpass the memory footprint of model weights, especially for long-context tasks. Existing cache management strategies either rely on attention-based token importance scores or perform evictions across page boundaries, often leading to severe memory fragmentation and frequent cache updates. PagedEviction circumvents these limitations by employing block-aligned eviction—evicting entire memory pages—thus preserving cache structure and reducing computational overhead while retaining critical inference states.
The strategy suits the paged memory layouts of vLLM and similar systems, in which the KV cache is partitioned into blocks (pages) to mitigate fragmentation. By design, PagedEviction maintains alignment between algorithmic eviction actions and the underlying memory mapping, enabling efficient and deterministic cache reasoning.
2. Block-wise Eviction Algorithm
PagedEviction operates in two distinct phases: the prefill (prompt) phase and the decode (generation) phase, each targeting different points in the inference workflow.
A. Prefill Phase (Prompt Processing)
- Upon receiving an input sequence (prompt), the algorithm computes a token importance score for each token i as the ratio of its value-vector norm to its key-vector norm, S_i = ||v_i|| / ||k_i||, where v_i and k_i are the value and key vectors of token i, respectively.
- The tokens with the lowest importance scores are evicted, in sufficient number that the retained tokens conform to a specified cache budget C.
- After this one-time token-level pruning, the remaining tokens are partitioned into pages (blocks) of fixed size B. No further token-level evictions occur; all subsequent actions are block-aligned (a minimal sketch of this phase follows this list).
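To make the prefill step concrete, the following Python sketch shows token scoring, budget-constrained pruning, and block partitioning. It assumes single-head NumPy arrays and an illustrative helper name (prefill_prune); it is a sketch of the idea under those assumptions, not the paper's implementation.

    import numpy as np

    def prefill_prune(keys, values, budget, block_size):
        """Token-level pruning after prefill, then block-aligned partitioning.

        keys, values : (seq_len, head_dim) arrays for one head/layer (toy shapes).
        budget       : maximum number of tokens retained in the KV cache.
        block_size   : number of tokens per page/block.
        """
        # Token importance: ratio of value-vector norm to key-vector norm.
        scores = np.linalg.norm(values, axis=-1) / np.linalg.norm(keys, axis=-1)

        # Evict the lowest-scoring tokens, i.e. keep the `budget` highest scorers,
        # preserving their original order so blocks cover contiguous survivors.
        keep = np.sort(np.argsort(scores)[-budget:])
        keys, values, scores = keys[keep], values[keep], scores[keep]

        # Partition the retained tokens into fixed-size blocks (pages).
        blocks = [keep[i:i + block_size] for i in range(0, len(keep), block_size)]
        return keys, values, scores, blocks

After this one-time call, only block-level actions occur; the per-token scores are retained so they can later be aggregated per block during decoding.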
B. Decode Phase (Autoregressive Generation)
- After each generation step, once the current sequence length reaches an exact multiple of the block size B (i.e., a new block has just been completed), the algorithm computes a block importance score for each block b as the mean of its token scores: S_b = (1/|b|) Σ_{i∈b} ||v_i|| / ||k_i||.
- The block with the lowest aggregated score is designated for eviction in its entirety.
- Eviction is only triggered when a new block is filled, minimizing cache update frequency and maintaining block alignment; a sketch of this decode-phase step follows.
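Continuing the same assumptions (NumPy arrays, illustrative names), a minimal sketch of the decode-phase trigger and block selection might look as follows.

    import numpy as np

    def maybe_evict_block(token_scores, block_size, seq_len):
        """Called after each generated token; picks a whole block once one fills.

        token_scores : np.ndarray of per-token scores ||v_i|| / ||k_i|| in the cache.
        Returns the index of the block to evict, or None if no eviction is due.
        """
        # Act only when the newest token has just completed a block.
        if seq_len % block_size != 0:
            return None

        # Block score: mean of the token scores inside each full block.
        n_blocks = len(token_scores) // block_size
        block_scores = token_scores[:n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)

        # Evict the block with the lowest aggregated importance.
        return int(np.argmin(block_scores))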
Algorithmic Structure
The algorithm is specified in the paper's pseudocode (Algorithm 1–3):
- Importance estimation pseudocode:
    for token in tokens:
        S[token] = norm(V[token]) / norm(K[token])
    for block in blocks:
        S[block] = average([S[token] for token in block])
- Prefill phase:
- Compute S_i = ||v_i|| / ||k_i|| for all tokens
- Select and evict the least important tokens until the retained cache meets the budget C
- Partition the remaining tokens into blocks of size B
- Decode phase:
- For every decode step at which the current sequence length is a multiple of the block size B (a new block has just been filled):
- Compute the block score S_b for all blocks
- Evict the block with the minimum S_b (an end-to-end toy example follows)
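Tying the two phases together, the toy driver below reuses the prefill_prune and maybe_evict_block sketches above, with random data standing in for real keys and values. It shows the overall control flow: prune once after prefill, then evict one whole block each time a new block fills during decode.

    import numpy as np

    rng = np.random.default_rng(0)
    head_dim, block_size, budget = 64, 16, 128

    # Prefill: score all prompt tokens, prune to the budget, partition into blocks.
    k = rng.standard_normal((512, head_dim))
    v = rng.standard_normal((512, head_dim))
    k, v, scores, blocks = prefill_prune(k, v, budget, block_size)

    # Decode: append one token per step; evict a whole block whenever one fills.
    for step in range(64):
        k_new = rng.standard_normal((1, head_dim))
        v_new = rng.standard_normal((1, head_dim))
        k, v = np.vstack([k, k_new]), np.vstack([v, v_new])
        scores = np.append(scores, np.linalg.norm(v_new) / np.linalg.norm(k_new))

        victim = maybe_evict_block(scores, block_size, seq_len=len(scores))
        if victim is not None:
            start = victim * block_size
            keep = np.r_[0:start, start + block_size:len(scores)]
            k, v, scores = k[keep], v[keep], scores[keep]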
3. Integration with vLLM PagedAttention
PagedEviction is explicitly designed for compatibility with PagedAttention, which organizes the KV cache into physical memory pages to avoid fragmentation and optimize memory allocation. The algorithm:
- Evicts blocks strictly within page boundaries, matching the allocation and mapping of PagedAttention.
- Performs importance computations using only key and value vectors, requiring no access to attention weights, which may not be available under optimized kernel implementations (e.g., FlashAttention).
- Integrates without any changes to CUDA attention kernels—allowing immediate deployment in vLLM or similar systems.
This design ensures minimal impact on underlying memory management or inference pipelines and supports efficient error-free cache operation across varying context lengths.
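To illustrate why block alignment matters for a paged allocator, the toy snippet below mimics a block table that maps a sequence's logical blocks to physical pages: evicting one logical block returns exactly one whole page to the free pool and leaves every other mapping intact. The data structures here are simplified stand-ins, not vLLM's actual block-table API.

    # Hypothetical, simplified block table: logical block index -> physical page id.
    block_table = [7, 2, 9, 4]   # four logical blocks of one sequence
    free_pages = [0, 1]          # pool of unused physical pages

    def evict_logical_block(block_table, free_pages, victim):
        """Release the physical page behind one logical block and compact the table."""
        page = block_table.pop(victim)   # later logical blocks shift down by one slot
        free_pages.append(page)          # the whole page returns to the allocator
        return block_table, free_pages

    evict_logical_block(block_table, free_pages, victim=1)
    # block_table == [7, 9, 4]; free_pages == [0, 1, 2]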
4. Empirical Evaluation and Performance
The efficacy of PagedEviction was evaluated on Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct using the LongBench benchmark suite. Several findings are noted:
A. Accuracy Under Constraint
- PagedEviction maintains higher model accuracy as measured by ROUGE score compared to StreamingLLM, KeyDiff, and token-level importance methods under tight cache budgets. For example, at a 1024-token cache budget on GovReport using Llama-3.2-1B, the algorithm achieves ROUGE ≈ 24.5—representing a 15–20% improvement over baselines.
- On MultiNews (Llama-3.2-3B), PagedEviction achieves ROUGE ≈ 23.6 at budget 1024, outperforming unstructured Inverse Key L2-Norm by 1.1 points.
B. Throughput and Latency
- Throughput increases substantially under constrained memory: Llama-3.2-1B achieves up to 3020 tokens/sec at a 1024-token budget (a ~37% increase over the Full Cache baseline's 2200 tokens/sec).
- Latency per output token is decreased by 10–12% across all model scales.
- These performance gains result from block-wise evictions minimizing update and fragmentation overhead.
C. Structural Cache Integrity
Block-wise eviction prevents fragmentation endemic to token-level eviction in paged memory layouts. It also reduces the complexity and frequency of cache modifications—leading to better system throughput and lower time per inference step.
5. Implications and Research Directions
PagedEviction advances memory management for LLM inference by providing:
- A practical method for enforcing strict memory budgets without severely degrading model accuracy, particularly on long-context tasks.
- Seamless integration with existing PagedAttention/out-of-core caching frameworks.
- Significant throughput and latency improvements, validated across multiple model scales and datasets.
Future research directions proposed in the paper include:
- Investigating more advanced proxy importance metrics or adaptive block scoring functions, potentially refining the balance between memory usage and inference quality.
- Exploring layer-wise budget allocation and synergies with KV cache quantization or hybrid eviction strategies.
- Dynamic adaptation of block size based on runtime inference statistics.
A plausible implication is that PagedEviction's principles could generalize to other structured memory management domains, including hierarchical block caches, distributed inference, and resource-constrained on-device deployment scenarios.
6. Methodological Significance
PagedEviction exemplifies the transition from unstructured to structure-aware cache management in machine learning systems. By synchronizing algorithmic eviction logic with memory subsystem boundaries and leveraging a statistically meaningful proxy signal (the ratio of value- to key-vector norms, ||v|| / ||k||), the method provides a reproducible, explainable, and robust approach to KV cache pruning for stateful sequence modeling.
In summary, PagedEviction is a principled, empirically validated, block-wise KV cache pruning strategy that enhances LLM inference efficiency under tight memory constraints, fully compatible with paged memory layouts and requiring no kernel-level alteration—enabling scalable and accurate long-context generation.