PagedEviction: Block-Aligned KV Cache Management

Updated 8 October 2025
  • PagedEviction is a block-wise KV cache management method that evicts entire memory pages based on aggregated importance scores to maintain structured cache budgets.
  • The algorithm uses token-level scoring during the prefill phase and block-level aggregation during the decode phase to efficiently mitigate cache memory growth in long-context LLM inference.
  • Empirical evaluations show significant accuracy, throughput, and latency improvements while integrating seamlessly with systems like vLLM PagedAttention without kernel modifications.

PagedEviction is a structured, block-wise algorithm designed for efficient Key-Value (KV) cache management in LLM inference, particularly within paged memory layouts as utilized by frameworks such as vLLM's PagedAttention. The technique addresses the KV cache memory bottleneck that arises during long-context autoregressive generation by evicting less important attention states at the granularity of memory blocks ("pages"), thereby maintaining strict cache budgets and high inference accuracy. Unlike prior approaches based on token-level importance or page-crossing evictions, PagedEviction aligns eviction decisions to the underlying block structure and integrates seamlessly with PagedAttention—requiring no modifications to optimized CUDA kernels.

1. Motivation and Design Rationale

PagedEviction was developed to mitigate the rapid growth of KV cache memory as sequence length increases during LLM inference. In the context of autoregressive models, the KV cache (storing key and value vectors for each processed token) can surpass the memory footprint of model weights, especially for long-context tasks. Existing cache management strategies either rely on attention-based token importance scores or perform evictions across page boundaries, often leading to severe memory fragmentation and frequent cache updates. PagedEviction circumvents these limitations by employing block-aligned eviction—evicting entire memory pages—thus preserving cache structure and reducing computational overhead while retaining critical inference states.

The strategy suits the paged memory layouts of vLLM and similar systems, in which the KV cache is partitioned into blocks (pages) to mitigate fragmentation. By design, PagedEviction maintains alignment between algorithmic eviction actions and the underlying memory mapping, enabling efficient and deterministic cache reasoning.

2. Block-wise Eviction Algorithm

PagedEviction operates in two distinct phases: the prefill (prompt) phase and the decode (generation) phase, each targeting different points in the inference workflow.

A. Prefill Phase (Prompt Processing)

  • Upon receiving an input sequence (prompt), the algorithm computes a token importance score for each token $i$:

$$S_i = \frac{\Vert V_i \Vert_2}{\Vert K_i \Vert_2}$$

where $V_i$ and $K_i$ are the value and key vectors of token $i$, respectively.

  • The $E$ tokens with the lowest importance scores are evicted so that the cache conforms to the specified budget $C$.
  • After this one-time token-level pruning, the remaining tokens are partitioned into pages (blocks) of fixed size $B$. No further token-level evictions occur; all subsequent actions are block-aligned (see the sketch below).
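
The following minimal sketch illustrates the prefill-phase pruning described above. The PyTorch usage, the tensor shapes, and the function name prefill_prune are assumptions made for illustration, not the paper's reference implementation.

    import torch

    def prefill_prune(K, V, cache_budget, block_size):
        """K, V: [seq_len, head_dim] key/value tensors for the prompt."""
        # Token importance: S_i = ||V_i||_2 / ||K_i||_2
        scores = V.norm(dim=-1) / K.norm(dim=-1)
        # Keep the highest-scoring tokens (evict E = seq_len - budget), preserving order
        num_keep = min(cache_budget, K.shape[0])
        keep = torch.topk(scores, num_keep).indices.sort().values
        K, V = K[keep], V[keep]
        # Partition the survivors into fixed-size blocks (pages) of size B
        return [(K[i:i + block_size], V[i:i + block_size])
                for i in range(0, K.shape[0], block_size)]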

B. Decode Phase (Autoregressive Generation)

  • After each generation step, when the current length $L$ satisfies $L \bmod B = 0$ (i.e., a new block has just been completed), the algorithm computes a block importance score $S_j$ for each block $j$:

$$S_j = \frac{1}{B} \sum_{i \in \text{block } j} \frac{\Vert V_i \Vert_2}{\Vert K_i \Vert_2}$$

  • The block with the lowest aggregated score is designated for eviction in its entirety.
  • Eviction is triggered only when a new block is filled, minimizing cache update frequency and maintaining block alignment; a decode-phase sketch follows below.
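
A corresponding decode-phase sketch is given below. It assumes the cache is held as a Python list of (K, V) block tensors; this layout and the helper name maybe_evict_block are illustrative only, not the paper's code.

    def maybe_evict_block(blocks, current_len, block_size):
        """blocks: list of (K, V) tensor pairs, each holding up to B tokens."""
        # Only act when a new block has just been completed (L mod B == 0)
        if current_len % block_size != 0:
            return blocks
        # Block importance: mean of per-token ||V_i||_2 / ||K_i||_2 within each block
        block_scores = [float((V.norm(dim=-1) / K.norm(dim=-1)).mean())
                        for (K, V) in blocks]
        victim = min(range(len(block_scores)), key=block_scores.__getitem__)
        # Evict the entire lowest-scoring page; no per-token compaction is needed
        return blocks[:victim] + blocks[victim + 1:]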

Algorithmic Structure

The algorithm is specified in the paper's pseudocode (Algorithm 1–3):

  • Importance estimation pseudocode:

    # Token-level importance: ratio of value-vector to key-vector L2 norms
    for token in tokens:
        S[token] = norm(V[token]) / norm(K[token])
    # Block-level importance: mean of token scores within each block
    for block in blocks:
        S[block] = average([S[token] for token in block])
  • Prefill phase:
    • Compute $S_i$ for all prompt tokens
    • Select and evict the $E$ least important tokens to meet the cache budget $C$
    • Partition the remaining tokens into blocks of size $B$
  • Decode phase:
    • For every step where $L \bmod B = 0$:
      • Compute $S_j$ for all blocks
      • Evict the block with the minimum $S_j$ (a combined sketch of both phases follows this outline)
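
Putting the two phases together, a hypothetical driver loop might look like the following, reusing the two helpers sketched above. The stand-in random tensors, dimensions, and the way new tokens are appended are assumptions for illustration; in a real serving stack the equivalent logic lives inside the engine's cache manager.

    import torch

    block_size, cache_budget = 16, 1024
    K_prompt, V_prompt = torch.randn(4096, 128), torch.randn(4096, 128)  # stand-in prompt K/V

    blocks = prefill_prune(K_prompt, V_prompt, cache_budget, block_size)
    current_len = sum(K.shape[0] for K, _ in blocks)

    for _ in range(256):  # simulated autoregressive decode steps
        k_new, v_new = torch.randn(1, 128), torch.randn(1, 128)  # stand-in new-token K/V
        if blocks and blocks[-1][0].shape[0] < block_size:
            # Append the new token's K/V to the current partially filled block
            K_last, V_last = blocks[-1]
            blocks[-1] = (torch.cat([K_last, k_new]), torch.cat([V_last, v_new]))
        else:
            # Start a new block (page) for the new token
            blocks.append((k_new, v_new))
        current_len += 1
        blocks = maybe_evict_block(blocks, current_len, block_size)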

3. Integration with vLLM PagedAttention

PagedEviction is explicitly designed for compatibility with PagedAttention, which organizes the KV cache into physical memory pages to avoid fragmentation and optimize memory allocation. The algorithm:

  • Evicts blocks strictly within page boundaries, matching the allocation and mapping of PagedAttention.
  • Performs importance computations using only key and value vectors, requiring no access to attention weights, which may not be available under optimized kernel implementations (e.g., FlashAttention).
  • Integrates without any changes to CUDA attention kernels—allowing immediate deployment in vLLM or similar systems.

This design ensures minimal impact on underlying memory management or inference pipelines and supports efficient error-free cache operation across varying context lengths.
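
To make the page-alignment point concrete, the toy data structure below shows what block-level eviction amounts to against a logical-to-physical block table. It deliberately mirrors the PagedAttention idea of page mapping but is not vLLM's actual internal API.

    class PagedKVCacheTable:
        def __init__(self):
            self.block_table = []   # logical block index -> physical page id
            self.free_pages = []    # pool of physical pages available for reuse

        def evict_block(self, logical_idx):
            # Drop one logical block's mapping and recycle its whole physical page.
            # No per-token compaction and no attention-kernel changes are needed;
            # the surviving blocks keep pointing at their original pages.
            page_id = self.block_table.pop(logical_idx)
            self.free_pages.append(page_id)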

4. Empirical Evaluation and Performance

The efficacy of PagedEviction was evaluated on Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct using the LongBench benchmark suite. Several findings are noted:

A. Accuracy Under Constraint

  • PagedEviction maintains higher model accuracy as measured by ROUGE score compared to StreamingLLM, KeyDiff, and token-level importance methods under tight cache budgets. For example, at a 1024-token cache budget on GovReport using Llama-3.2-1B, the algorithm achieves ROUGE ≈ 24.5—representing a 15–20% improvement over baselines.
  • On MultiNews (Llama-3.2-3B), PagedEviction achieves ROUGE ≈ 23.6 at budget 1024, outperforming unstructured Inverse Key L2-Norm by 1.1 points.

B. Throughput and Latency

  • Throughput increases substantially under constrained memory: Llama-1B achieves up to 3020 tokens/sec at budget 1024 (a ~37% increase over Full Cache's 2200 tokens/sec).
  • Latency per output token is decreased by 10–12% across all model scales.
  • These performance gains result from block-wise evictions minimizing update and fragmentation overhead.

C. Structural Cache Integrity

Block-wise eviction prevents fragmentation endemic to token-level eviction in paged memory layouts. It also reduces the complexity and frequency of cache modifications—leading to better system throughput and lower time per inference step.

5. Implications and Research Directions

PagedEviction advances memory management for LLM inference by providing:

  • A practical method for enforcing strict memory budgets without severely degrading model accuracy, particularly on long-context tasks.
  • Seamless integration with existing PagedAttention/out-of-core caching frameworks.
  • Significant throughput and latency improvements, validated across multiple model scales and datasets.

Future research directions proposed in the paper include:

  • Investigating more advanced proxy importance metrics or adaptive block scoring functions, potentially refining the balance between memory usage and inference quality.
  • Exploring layer-wise budget allocation and synergies with KV cache quantization or hybrid eviction strategies.
  • Dynamic adaptation of block size based on runtime inference statistics.

A plausible implication is that PagedEviction's principles could generalize to other structured memory management domains, including hierarchical block caches, distributed inference, and resource-constrained on-device deployment scenarios.

6. Methodological Significance

PagedEviction exemplifies the transition from unstructured to structure-aware cache management in machine learning systems. By synchronizing algorithmic eviction logic with memory subsystem boundaries and leveraging statistically meaningful proxy signals (the ratio $\Vert V \Vert_2 / \Vert K \Vert_2$), the method provides a reproducible, explainable, and robust approach to KV cache pruning for stateful sequence modeling.

In summary, PagedEviction is a principled, empirically validated, block-wise KV cache pruning strategy that enhances LLM inference efficiency under tight memory constraints, fully compatible with paged memory layouts and requiring no kernel-level alteration—enabling scalable and accurate long-context generation.
