Block-wise Prefill and Episodic KV Compression

Updated 28 September 2025
  • Block-wise prefill and episodic KV compression are strategies that segment the input and conversational history to reduce KV cache memory overhead in large language model inference.
  • Block-wise prefill bounds KV cache memory by processing fixed-size blocks and evicting immediately after each block, keeping peak usage constant regardless of input length.
  • Empirical results report memory reductions of up to 3.5× and throughput improvements of up to 37%, highlighting practical benefits for efficient LLM inference.

Block-wise prefill and episodic KV compression are key strategies for controlling the memory overhead of key-value (KV) caching in transformer-based LLM inference, especially in settings involving long sequences and multi-turn interactions. As LLM context windows grow, the KV cache rapidly becomes a dominant memory and computation bottleneck, motivating a range of structurally aware, throughput-optimized compression and eviction algorithms. Below is an overview of the principles, methods, and empirical advances in block-wise prefill and episodic KV compression, integrating findings from leading work in the literature.

1. Foundations: Motivation and Structural Framework

KV caching enables LLMs to reuse the outputs of previously processed tokens, avoiding redundant attention computations. The cache size increases linearly with the product of context length, number of layers, and number of attention heads, ultimately imposing scaling and multi-user concurrency constraints due to limited GPU resources. Traditional token-level pruning and sliding window methods sparsify the sequence dimension but do not address the inherent structure of modern inference frameworks, such as the paged/block-wise memory layout (e.g., vLLM’s PagedAttention), nor do they fully exploit the dependencies and redundancies present across attention heads, layers, and conversation episodes (Rehg, 30 Sep 2024, Chitty-Venkata et al., 4 Sep 2025, Kim et al., 22 Sep 2025).
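
To make this scaling concrete, the back-of-the-envelope calculation below shows how quickly the cache grows with context length. It is a sketch assuming standard multi-head attention with FP16 keys and values, no grouped-query attention or quantization, and illustrative model dimensions that are not taken from the cited papers.

```python
def kv_cache_bytes(seq_len, num_layers, num_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence: keys and values are stored
    for every token, at every layer, for every attention head."""
    return 2 * seq_len * num_layers * num_heads * head_dim * bytes_per_elem

# Illustrative 7B-class configuration: 32 layers, 32 heads of dimension 128.
gib = kv_cache_bytes(seq_len=128_000, num_layers=32, num_heads=32, head_dim=128) / 2**30
print(f"KV cache for a single 128k-token sequence: {gib:.1f} GiB")  # ~62.5 GiB
```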

Block-wise prefill refers to segmenting the input or conversational history into fixed-size blocks and applying cache-building and eviction operations immediately after processing each block, thus bounding memory growth throughout inference. Episodic KV compression extends this by grouping context (or conversational turns) into semantically coherent “episodes” and applying cache pruning — commonly at the block or page level — specific to those episodes, optimizing relevance and reducing memory footprint without compromising accuracy in multi-turn question-answering or reasoning tasks (Kim et al., 22 Sep 2025).

2. Block-wise Prefill: Algorithms and Memory Control

Block-wise prefill guarantees that the peak memory usage of the KV cache is strictly limited, regardless of raw input length. The process operates in an incremental cycle:

  • The context is divided into blocks of tokens, each of size $M_\text{block}$.
  • After prefilling (i.e., running the transformer stack over one block and updating the KV cache), an immediate eviction step occurs, pruning the cache back to a fixed budget $M$ before proceeding to the next block (Kim et al., 22 Sep 2025, Chitty-Venkata et al., 4 Sep 2025).
  • Eviction is typically driven by token or block importance scores, which can be based on attention weights, norm ratios (e.g., $\|\mathbf{V}_i\|_2 / \|\mathbf{K}_i\|_2$), or aggregated attentions with task-aware prompts (Chitty-Venkata et al., 4 Sep 2025).

A representative algorithm is as follows (a minimal code sketch appears after the list):

  1. For each block, process block tokens to expand the KV cache.
  2. Compute per-token or per-block importance scores (e.g., by max-attention from a “patched prompt” appended after the block).
  3. Evict the least important tokens or entire pages to maintain the fixed budget.
  4. Proceed to the next block, repeating steps 1–3.
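
The sketch below illustrates this loop. It is a simplified illustration rather than a reproduction of EpiCache or PagedEviction: `score_fn` is a hypothetical stand-in for the attention-guided (or norm-ratio) importance scoring described above, and a real implementation would operate on the engine's paged KV tensors rather than on position indices.

```python
import numpy as np

def blockwise_prefill(num_tokens, score_fn, block_size=512, budget=1024):
    """Process the context block by block, evicting low-importance entries
    immediately after each block so the cache never exceeds `budget` positions.

    `score_fn(positions)` is a placeholder for an attention-derived importance
    score over all currently cached positions (higher = more important).
    """
    kept = np.empty(0, dtype=np.int64)                 # positions retained in the KV cache
    for start in range(0, num_tokens, block_size):
        new = np.arange(start, min(start + block_size, num_tokens))
        cached = np.concatenate([kept, new])           # cache grows by one block
        if cached.size > budget:                       # immediate eviction back to the budget
            scores = score_fn(cached)
            keep_idx = np.argsort(scores)[-budget:]    # keep the highest-scoring positions
            cached = np.sort(cached[keep_idx])
        kept = cached
    return kept

# Toy usage: random scores stand in for attention-based importance.
rng = np.random.default_rng(0)
retained = blockwise_prefill(num_tokens=4096, score_fn=lambda pos: rng.random(pos.size))
assert retained.size <= 1024                           # peak cache size is bounded by the budget
```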

Constructed this way, block-wise prefill prevents the unbounded memory spikes typical of approaches that compress only after full-context prefill, and it frees cache memory for reuse earlier, benefiting parallel, high-throughput inference servers (Chitty-Venkata et al., 4 Sep 2025).

3. Episodic KV Compression: Contextual and Semantic Partitioning

Episodic KV compression targets the challenge of selectively preserving topic-relevant, semantically coherent context in long multi-turn conversations. The conversation is partitioned into "episodes": subsets of the conversational history clustered by semantic similarity, e.g., via K-means over sentence-embedding cosine similarity (Kim et al., 22 Sep 2025). For each episode, a compressed KV cache specific to its content is constructed, typically anchored on the episode's medoid (its most central, representative segment).
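
A minimal sketch of this clustering step is shown below. It assumes utterance-group embeddings have already been produced by some sentence encoder, and it treats the number of episodes as a free hyperparameter; both are assumptions for illustration rather than details from the cited paper.

```python
import numpy as np

def cluster_into_episodes(embeddings, num_episodes, iters=20, seed=0):
    """Spherical K-means over L2-normalized utterance-group embeddings.
    Returns per-group episode assignments and, for each episode, the medoid:
    the member closest to the episode centroid, which later serves as the
    anchor prompt for building that episode's compressed KV cache."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), num_episodes, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(x @ centroids.T, axis=1)       # cosine-similarity assignment
        for k in range(num_episodes):
            members = x[assign == k]
            if len(members):
                c = members.mean(axis=0)
                centroids[k] = c / np.linalg.norm(c)      # keep centroids unit-norm
    medoids = [int(np.flatnonzero(assign == k)[np.argmax(x[assign == k] @ centroids[k])])
               if np.any(assign == k) else None
               for k in range(num_episodes)]
    return assign, medoids

# Toy usage: 12 utterance groups with 384-dim embeddings from a hypothetical encoder.
emb = np.random.default_rng(1).normal(size=(12, 384))
assignments, medoid_ids = cluster_into_episodes(emb, num_episodes=3)
```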

The process includes:

  • Segmenting the history into groups of utterances; embedding and clustering these to discover episodes.
  • For each episode, building an “episodic” KV cache by keeping the most attention-relevant tokens with respect to its medoid prompt (again, using attention-guided eviction).
  • During decoding, embedding the current query and selecting the most semantically similar episodic cache, thereby focusing both memory and compute only on episode-relevant context (a sketch of this routing step follows the list).
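
The decode-time routing in the last bullet reduces to a nearest-neighbor lookup over episode embeddings. In the sketch below, `episode_caches` is a hypothetical mapping from episode index to its compressed KV cache, and the query is assumed to be embedded by the same encoder used for clustering.

```python
import numpy as np

def select_episodic_cache(query_emb, episode_embs, episode_caches):
    """Pick the compressed KV cache whose episode embedding (e.g. the medoid
    of its utterance-group embeddings) is most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    e = episode_embs / np.linalg.norm(episode_embs, axis=1, keepdims=True)
    best = int(np.argmax(e @ q))
    return best, episode_caches[best]

# Toy usage: 4 episodes with 384-dim embeddings; strings stand in for real KV caches.
rng = np.random.default_rng(3)
embs = rng.normal(size=(4, 384))
caches = {i: f"compressed_kv_for_episode_{i}" for i in range(4)}
idx, cache = select_episodic_cache(rng.normal(size=384), embs, caches)
```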

This approach avoids the "query-narrowing" effect of baseline methods that retain only the cache subset most relevant to a single query, maintaining high answer accuracy, and it enables memory reductions of 4–6× without significant quality loss (Kim et al., 22 Sep 2025).

4. Layer-wise and Block-wise Adaptive Budget Allocation

Adaptive budgeting further refines memory efficiency by allocating more cache resources to layers or blocks shown empirically to be more sensitive to eviction. Sensitivity is often measured by the cosine similarity between Key states computed with and without block-based pruning; layers with higher sensitivity (greater change under pruning) receive a larger share of the cache budget:

$$M_\text{alloc}^{(\ell)} = \frac{s_\ell^{\alpha}}{\sum_{j=1}^{L} s_j^{\alpha}} \, (L \cdot M)$$

where $s_\ell$ is the measured sensitivity of layer $\ell$, $L$ is the number of layers, $M$ is the average per-layer cache budget, and $\alpha$ controls the sharpness of the allocation (Kim et al., 22 Sep 2025). Similar strategies are adopted in structured block-wise pruning approaches, which aggregate token or block importance metrics at the block/page level and evict the least informative blocks as cache budgets are reached (Chitty-Venkata et al., 4 Sep 2025). This kind of allocation preserves cache capacity where it is most impactful, reducing the risk of performance loss.
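
A direct transcription of this allocation rule is sketched below; the sensitivity values themselves (cosine-similarity changes in Key states under pruning) are assumed to have been measured offline and are passed in as a plain vector.

```python
import numpy as np

def allocate_layer_budgets(sensitivities, avg_budget_per_layer, alpha=1.0):
    """Split the total cache budget (L layers * M tokens on average) across
    layers in proportion to sensitivity**alpha, so layers whose Key states
    change most under pruning retain more of their cache."""
    s = np.asarray(sensitivities, dtype=float) ** alpha
    weights = s / s.sum()
    total_budget = len(s) * avg_budget_per_layer          # L * M
    return np.round(weights * total_budget).astype(int)   # rounding may shift a token or two

# Example: 4 layers, average per-layer budget M = 1024 tokens, alpha = 2.
print(allocate_layer_budgets([0.2, 0.4, 0.6, 0.8], avg_budget_per_layer=1024, alpha=2))
```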

5. Structured Block-wise and Page-level Eviction

Paged or block-wise memory layouts, as in vLLM’s PagedAttention, facilitate efficient cache eviction and recycling while preserving hardware compatibility and high GPU utilization. Instead of evicting tokens scattered throughout the memory pool, entire blocks (pages) are evicted, which:

  • Maintains physical contiguity in memory, reducing fragmentation and housekeeping overhead.
  • Avoids the need to reorder or reconstruct tensor layouts, ensuring compatibility with pre-compiled CUDA kernels for attention computation (Chitty-Venkata et al., 4 Sep 2025).
  • Requires only minimal auxiliary metadata and can be implemented without changes to low-level inference engine internals.

The block-wise eviction algorithm may be summarized as follows (a code sketch appears after the list):

  • In the decode phase, periodically (e.g., every time a new block fills), calculate an aggregated importance score per block and evict the block with the lowest score.
  • During prefill, perform token-level selection first, then partition into blocks, avoiding reordering post-allocation (Chitty-Venkata et al., 4 Sep 2025).
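
The decode-phase step can be sketched as follows. Here `page_size` mirrors the fixed block size of a paged allocator such as vLLM's, the per-token scores are a placeholder for attention-derived importance, and the last-page protection is an illustrative detail rather than a documented feature of the cited method.

```python
import numpy as np

def evict_one_page(token_scores, page_size, protect_last_page=True):
    """Aggregate per-token importance into per-page scores and return the
    index of the page to evict (the page with the lowest total importance).
    The most recently filled page can be protected so in-flight decoding
    context is never dropped."""
    n_pages = len(token_scores) // page_size
    pages = token_scores[: n_pages * page_size].reshape(n_pages, page_size)
    page_scores = pages.sum(axis=1)
    if protect_last_page and n_pages > 1:
        page_scores[-1] = np.inf                    # never evict the newest page
    return int(np.argmin(page_scores))

# Toy usage: 8 full pages of 16 tokens each, scores standing in for attention mass.
rng = np.random.default_rng(2)
victim = evict_one_page(rng.random(8 * 16), page_size=16)
```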

These techniques yield throughput improvements of up to 37% and latency reductions of 10–12% while maintaining near full-cache accuracy in most common scenarios (Chitty-Venkata et al., 4 Sep 2025).

6. Performance and Empirical Outcomes

Block-wise and episodic KV compression approaches consistently outperform token-level and global post-prefill methods in both memory efficiency and generation quality on diverse benchmarks. Empirical results from recent works indicate:

  • Memory usage reduced by up to 3.5× (e.g., EpiCache (Kim et al., 22 Sep 2025), PagedEviction (Chitty-Venkata et al., 4 Sep 2025)).
  • Decoding latency improvements of 2.4× and throughput enhancements up to 37% over full-cache baselines.
  • Sustained high accuracy even under aggressive (e.g., 4–6×) compression regimes on multi-turn QA, summarization, and long-context understanding tasks.
  • Stability and deployment feasibility due to structured, block/page-aligned operations compatible with popular inference engines.

The table below summarizes representative results:

| Method / Work | Peak Memory Reduction | Latency / Throughput | Accuracy Loss |
|---|---|---|---|
| EpiCache (Kim et al., 22 Sep 2025) | 3.5× | 2.4× faster decoding | ≤10% at 4–6× compression |
| PagedEviction (Chitty-Venkata et al., 4 Sep 2025) | up to 2× | 37% higher throughput | Minimal, near full-KV |
| KV-Compress (Rehg, 30 Sep 2024) | up to 8× | 5× higher throughput | <10% at 8× compression |

These empirical findings highlight the practicality and robustness of block-wise and episodic strategies in production environments.

7. Integration, Practical Implications, and Extensions

Block-wise prefill and episodic KV compression provide a principled framework to balance memory usage, throughput, and answer accuracy. Notable aspects include:

  • Compatibility with paged memory layouts (such as vLLM's PagedAttention), enabling high GPU occupancy with minimal kernel customization.
  • Efficient management of conversational histories and multi-turn interactions under fixed memory constraints by clustering and encoding only the most relevant context per episode (Kim et al., 22 Sep 2025).
  • Support for large batch sizes and long sequence inference, which is essential for modern LLM deployment scenarios such as conversational search, chatbots, or long-document summarization.

Emerging research directions include adaptive per-task or per-user budgeting, integration of quantization or cross-layer sharing with block-wise pruning, and fine-grained access patterns for knowledge-intensive or retrieval-augmented LLMs.

References

  • (Rehg, 30 Sep 2024): "KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head"
  • (Chitty-Venkata et al., 4 Sep 2025): "PagedEviction: Structured Block-wise KV Cache Pruning for Efficient LLM Inference"
  • (Kim et al., 22 Sep 2025): "EpiCache: Episodic KV Cache Management for Long Conversational Question Answering"
  • Additional context from related works on layer- and episode-adaptive budgeting, semantic- and block-level chunking, and block-structured pruning.

Block-wise prefill and episodic KV compression represent state-of-the-art solutions for scalable, efficient, and accurate LLM inference with long and dynamic contexts. Their adoption in modern systems underpins practical deployment of large models in resource-constrained and latency-sensitive settings.
