
Amortized KV Cache Compaction

Updated 12 June 2026
  • Amortized KV cache compaction is a technique to distribute the overhead of cache compression across tokens, ensuring low per-token cost during transformer inference.
  • It leverages methods such as merging, quantization, and reconstruction to maintain robust generation quality under strict memory budgets.
  • This approach enables scalable transformer models with longer contexts and higher throughput by controlling memory growth while preserving computational efficiency.

Amortized KV cache compaction refers to a family of methods that reduce the memory consumption of the key-value (KV) cache in transformer decoders during inference while keeping per-token overhead low and generation quality robust. The amortization principle spreads the computational cost of compaction (via merging, reconstruction, quantization, eviction, or synthesis) across many tokens or steps, which makes deployment at scale practical. Amortized strategies differ from naïve greedy or one-shot techniques in their system design, algorithm selection, or learned encoding, and these choices enable consistent throughput for long-context and high-throughput inference.

1. Formal Principles and Compaction Objectives

The transformer decoder's KV cache, which stores the key and value vectors of every previous token for each layer and head, grows linearly with context length $T$ and quickly dominates the serving footprint: for example, an 8K-token context in LLaMA-3-70B is reported to use several hundred GB of GPU memory if left uncompressed (Tian et al., 14 Apr 2025). Amortized compaction techniques seek to bound this memory growth, often to a strict budget $R \in (0,1]$ expressed as a fraction of the full cache, while guaranteeing that the cost of compaction plus retrieval is spread across tokens, i.e., the per-token amortized overhead remains $O(1)$ and generation accuracy is preserved.
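
The linear growth described above can be made concrete with a small sizing calculation. This is a minimal sketch: the function name `kv_cache_bytes` and the configuration values are illustrative assumptions, not figures taken from any cited paper or model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer, head, and token."""
    # The factor 2 accounts for storing both a key and a value vector.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Illustrative configuration (assumed values, 16-bit storage).
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
budget_ratio = 0.25                      # the budget R in (0, 1]
compacted = int(full * budget_ratio)

print(full / 2**30)        # 1.0 GiB for one 8K-token sequence
print(compacted / 2**30)   # 0.25 GiB after compaction to budget R
```

Because the footprint is a product of sequence length and batch size, a fixed budget $R$ translates directly into longer contexts or larger batches for the same memory.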

A loss-based formalism is common: given a memory budget, the objective is to produce a compressed cache $K', V'$ such that for every generation step $t$, the output $o_t = \mathrm{softmax}(q_t {K'}^\top)V'$ closely approximates the full-cache output. In many recent frameworks, the compaction operation is either stateless (query-agnostic, e.g., via token importance ranking) or lightweight per-context (e.g., amortized residual reconstruction), often implemented with explicit guarantees on the stability of the output distribution or end-to-end error (O'Neill et al., 5 Jun 2026, Tian et al., 14 Apr 2025, Wang et al., 24 Mar 2026).
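
To illustrate the objective, the sketch below measures how far the attention output of a compacted cache drifts from the full-cache output for one query. The helper names (`attention`, `compaction_error`), the random data, and the top-16 selection rule are assumptions chosen for illustration; they do not reproduce any specific method from the cited papers.

```python
import math
import random


def attention(q, keys, values):
    """Single-query softmax attention: softmax(q K^T) V."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                                   # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def compaction_error(q, keys, values, keep):
    """L2 distance between the full-cache output and the output using only `keep`."""
    full = attention(q, keys, values)
    small = attention(q, [keys[i] for i in keep], [values[i] for i in keep])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(full, small)))


rng = random.Random(0)
T, d = 64, 8
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
q = [rng.gauss(0, 1) for _ in range(d)]

# Keep the 16 entries with the highest score for this query (R = 0.25).
scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
top = sorted(range(T), key=lambda i: scores[i], reverse=True)[:16]
print(compaction_error(q, keys, values, top))
print(compaction_error(q, keys, values, list(range(T))))  # keeping everything: 0.0
```

A query-agnostic method cannot select entries per query as this toy does; it has to keep the error small for future queries it has not yet seen, which is what makes the problem hard.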

2. Methodological Taxonomy

Amortized KV cache compaction encompasses multiple algorithmic classes, each designed to minimize both memory and computational cost while ensuring that compaction work does not become a bottleneck:

2.1 Merging-based and Vote-aware Compaction

KeepKV (Tian et al., 14 Apr 2025) introduces a merging strategy that records “Electoral Votes” and utilizes Zero Inference-Perturbation (ZIP) Merging, ensuring no output perturbation upon compaction. Each time two KV entries are merged, their vote counts $p_i$ are accumulated, and attention is reweighted proportionally. A closed-form ZIP rule recalculates the merged $k_r$ and $v_r$ such that exact attention consistency is maintained, a feature shown analytically via Theorem 3.1. The merging and reweighting are applied infrequently (amortized over the token sequence), resulting in per-token overhead $O(d)$ and negligible generation quality loss at budgets as tight as 10% of the full cache.
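
The idea of vote-weighted merging can be sketched as follows. This is not KeepKV's published closed form: it is one construction, assumed here for illustration, in which entry $i$ contributes weight $p_i e^{q \cdot k_i}$ and the merged key is shifted along $q$ so that the output for the current query is reproduced exactly. The names `vote_attention` and `merge_pair` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def vote_attention(q, keys, values, votes):
    """Attention where entry i counts p_i times: weights are p_i * exp(q . k_i)."""
    logits = [dot(q, k) + math.log(p) for k, p in zip(keys, votes)]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]


def merge_pair(q, k_i, v_i, p_i, k_j, v_j, p_j):
    """Merge two entries so that the output for query q is unchanged."""
    w_i = p_i * math.exp(dot(q, k_i))
    w_j = p_j * math.exp(dot(q, k_j))
    p_r = p_i + p_j                                    # votes accumulate
    v_r = [(w_i * a + w_j * b) / (w_i + w_j) for a, b in zip(v_i, v_j)]
    k_avg = [(w_i * a + w_j * b) / (w_i + w_j) for a, b in zip(k_i, k_j)]
    # Shift k_avg along q so that p_r * exp(q . k_r) equals w_i + w_j.
    target = math.log((w_i + w_j) / p_r)
    shift = (target - dot(q, k_avg)) / dot(q, q)
    k_r = [a + shift * b for a, b in zip(k_avg, q)]
    return k_r, v_r, p_r


q = [0.5, -1.0, 0.25]
keys = [[1.0, 0.0, 2.0], [0.9, 0.1, 1.8], [-1.0, 0.5, 0.0]]
values = [[1.0, 2.0], [3.0, 0.0], [-2.0, 1.0]]
votes = [1, 1, 1]

before = vote_attention(q, keys, values, votes)
k_r, v_r, p_r = merge_pair(q, keys[0], values[0], votes[0],
                           keys[1], values[1], votes[1])
after = vote_attention(q, [k_r, keys[2]], [v_r, values[2]], [p_r, votes[2]])
print(max(abs(a - b) for a, b in zip(before, after)))  # zero up to rounding
```

In this sketch the match is exact only for the query used at merge time; for later queries the merged entry is an approximation, which is where multi-step error bounds become relevant.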

2.2 Reconstruction-based Amortization

EchoKV (Wang et al., 24 Mar 2026) performs groupwise cache dropout (e.g., discarding the majority of heads in selected layers) and reconstructs dropped components on demand via ultra-lightweight, trained linear layers. Inference amortizes the cost of pre-fill compression and per-token residual reconstruction: for each token, a small number of linear transformations are performed, incurring only a small overhead per token per group. This approach leverages inter- and intra-layer similarity and maintains accuracy across high compression ratios.

DeltaKV (Hao et al., 8 Feb 2026) recasts compaction as residual encoding: each new KV pair is stored as a low-dimensional latent representing the residual relative to a small strided reference set, with decompression invoked only for the sparse subset of tokens used as attention references. The overall memory footprint and computational overhead grow sublinearly, and the process is inherently amortized because decompression is required for only a fraction of tokens.

2.3 Blockwise and Region-aware Token Eviction

Batch-Max (Metel et al., 2024) implements blockwise eviction at both pre-fill and decode stages, maintaining running sums of attention weights and pruning at fixed token intervals by evicting the slots with the lowest average attention. All maintenance and eviction computations occur in constant time for each block, yielding a low amortized per-token cost governed by the cache size after compaction. The reduced peak memory per sample and head enables up to 3–10× larger batches.
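
A blockwise eviction loop of this kind might look like the sketch below. The class `BlockwiseEvictor`, the synthetic `toy_attention` weights, and the parameter values are assumptions for illustration, not the Batch-Max implementation; in particular, this version updates running sums with a plain Python loop rather than a fused kernel.

```python
class BlockwiseEvictor:
    """Keeps at most `budget` slots after each pruning pass, run every `block` tokens."""

    def __init__(self, budget, block):
        self.budget = budget
        self.block = block
        self.slots = []          # token ids currently cached, in arrival order
        self.attn_sum = {}       # token id -> running sum of attention received
        self.steps_seen = {}     # token id -> number of steps it has been scored
        self.since_prune = 0
        self.peak = 0            # largest cache size observed

    def step(self, token_id, attn_weights):
        """Add one token; `attn_weights` maps cached token ids to this step's attention."""
        self.slots.append(token_id)
        self.attn_sum[token_id] = 0.0
        self.steps_seen[token_id] = 0
        for tid in self.slots:
            self.attn_sum[tid] += attn_weights.get(tid, 0.0)
            self.steps_seen[tid] += 1
        self.peak = max(self.peak, len(self.slots))
        self.since_prune += 1
        if self.since_prune == self.block:
            self._prune()
            self.since_prune = 0

    def _prune(self):
        def avg(tid):
            return self.attn_sum[tid] / self.steps_seen[tid]
        keep = set(sorted(self.slots, key=avg, reverse=True)[: self.budget])
        for tid in self.slots:
            if tid not in keep:
                del self.attn_sum[tid], self.steps_seen[tid]
        self.slots = [tid for tid in self.slots if tid in keep]


def toy_attention(cached, t):
    """Synthetic weights: token 0 acts as a sink, the newest token attends to itself."""
    if t == 0:
        return {0: 1.0}
    others = [tid for tid in cached if tid not in (0, t)]
    weights = {0: 0.5, t: 0.3}
    for tid in others:
        weights[tid] = 0.2 / len(others)
    return weights


ev = BlockwiseEvictor(budget=2, block=4)
for t in range(8):
    ev.step(t, toy_attention(ev.slots + [t], t))
print(ev.slots, ev.peak)
```

Pruning runs once per block, so the cache never holds more than `budget + block` entries, and the sorting cost is paid once every `block` tokens instead of at every step.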

AMS (Adaptive Mass-Segmented) (Yang et al., 22 May 2026) segments tokens based on cumulative attention mass and allocates retention quotas per region, enforcing region awareness and a guaranteed minimum coverage. The segmentation boundaries are smoothed with exponential moving averages, and compaction routines (in-segment top-$k$) run at fixed intervals, keeping amortized per-token overhead low for an appropriately chosen interval.
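
The region-aware idea can be sketched by splitting the sequence into contiguous segments of roughly equal cumulative attention mass and keeping the top entries inside each segment. The functions `mass_segments` and `segment_topk` and the example scores are hypothetical, and the exponential-moving-average smoothing of boundaries is omitted for brevity.

```python
def mass_segments(scores, num_segments):
    """Split token indices into contiguous segments of roughly equal attention mass."""
    total = sum(scores)
    if total <= 0:
        return [list(range(len(scores)))]
    segments, current, running, boundary = [], [], 0.0, 1
    for i, s in enumerate(scores):
        current.append(i)
        running += s
        threshold = boundary * total / num_segments
        # Close the segment once cumulative mass reaches the next boundary.
        if len(segments) < num_segments - 1 and running >= threshold:
            segments.append(current)
            current = []
            while running >= boundary * total / num_segments:
                boundary += 1
    if current:
        segments.append(current)
    return segments


def segment_topk(scores, num_segments, k_per_segment):
    """Keep the k highest-scoring tokens inside every segment."""
    keep = []
    for seg in mass_segments(scores, num_segments):
        best = sorted(seg, key=lambda i: scores[i], reverse=True)[:k_per_segment]
        keep.extend(sorted(best))
    return keep


scores = [4, 3, 2, 1, 1, 1, 9, 8, 7]
global_top3 = sorted(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3])
print(global_top3)                     # [6, 7, 8]: the early region is dropped entirely
print(segment_topk(scores, 3, 1))      # [0, 6, 8]: every region keeps a representative
```

A global top-$k$ rule spends the whole budget on the high-mass tail here, while the per-segment quota guarantees that the early region retains at least one token.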

2.4 Streaming and Discrepancy-based Summarization

BalanceKV (Han et al., 11 Feb 2025) introduces a streaming, discrepancy-minimizing merge-and-reduce algorithm based on Banaszczyk's vector-balancing walk. Each batch is recursively split in half, which keeps the number of stored pairs small enough to approximate attention within a target error. The per-token cost is amortized over the recursion depth, and reported empirical results outperform i.i.d. sampling-based cache pruning methods.

2.5 Learning-based Synthesis

Still (O'Neill et al., 5 Jun 2026) employs a per-layer small Perceiver network trained offline, which, after observing a full per-layer KV cache, synthesizes compact keys and values for all heads in a single forward pass. The amortization arises because the Perceiver compactor is called once per layer per chunk, not per token, and is applicable to arbitrarily long contexts by iterative application. The design allows for reuse across architectures and chunk sizes, with latency remaining sublinear in context length and exact workload.

3. Complexity and Amortization Analysis

Across methods, amortized compaction keeps per-token cost low by relying on infrequent, constant-time kernel launches, periodic groupwise selection, or fast local reconstructions. The cost structure is typically:

  • Total compaction time: incurred once per block of tokens rather than at every decoding step.
  • Memory usage after compaction: a fraction $R$ of the full cache (typically 10–50%), plus marginal metadata (vote counts, latent indices, or quantization schemes).
  • Amortized cost per token: for lazy or blockwise schemes, the cost of one compaction pass divided by the number of tokens between passes.
  • No-gather, slot-reuse: ThinKV (Ramachandran et al., 1 Oct 2025) and LeanKV (Zhang et al., 2024) replace gather-based $O(N)$ compaction with $O(1)$-amortized in-place page/slot reuse through GPU-resident block tables and page coordination, eliminating compaction spikes entirely.
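
The contrast between gather-based compaction and in-place slot reuse in the last bullet can be sketched with a free list. The class `SlotPool` and the function `gather_compact` are hypothetical stand-ins; a real engine would keep the block table and pages on the GPU rather than in Python lists.

```python
class SlotPool:
    """Fixed-size KV slot pool: evicted slots are recycled in place via a free list."""

    def __init__(self, capacity):
        self.storage = [None] * capacity             # stands in for preallocated KV pages
        self.free = list(range(capacity - 1, -1, -1))
        self.slot_of = {}                            # token id -> slot index (block table)

    def insert(self, token_id, kv):
        slot = self.free.pop()                       # O(1); raises IndexError when full
        self.storage[slot] = kv
        self.slot_of[token_id] = slot
        return slot

    def evict(self, token_id):
        slot = self.slot_of.pop(token_id)            # O(1); no other entry is moved
        self.storage[slot] = None
        self.free.append(slot)


def gather_compact(entries, keep):
    """Gather-based alternative: copies every surviving entry, O(N) per compaction."""
    return [entries[i] for i in keep]


pool = SlotPool(capacity=4)
for t in range(4):
    pool.insert(t, ("k%d" % t, "v%d" % t))
pool.evict(1)                                        # frees slot 1 without moving others
print(pool.insert(4, ("k4", "v4")))                  # 1: the freed slot is reused
print(pool.slot_of)
```

With slot reuse, the surviving entries never move, so there is no periodic copy whose cost grows with cache size; the price is that attention kernels must follow the block table instead of reading a contiguous buffer.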

System-level optimizations, e.g., fused decompression+mat-vec in KVComp (Jiang et al., 30 Aug 2025), further ensure that decompression and attention steps are executed jointly, minimizing memory movement and maximizing bandwidth in memory-bound regimes.
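
The benefit of fusing decompression with the attention product can be illustrated with a simple stand-in codec. The sketch assumes per-row symmetric 8-bit quantization, which is not claimed to be KVComp's compression scheme; the function names are hypothetical. The fused path applies each row's scale to the finished dot product, so the dequantized key matrix is never written out.

```python
def quantize_rows(matrix):
    """Per-row symmetric int8 quantization: returns (integer rows, per-row scales)."""
    q_rows, scales = [], []
    for row in matrix:
        scale = max(abs(x) for x in row) / 127 or 1.0   # avoid a zero scale
        q_rows.append([round(x / scale) for x in row])
        scales.append(scale)
    return q_rows, scales


def fused_scores(q, q_rows, scales):
    """Compute q . k for every key without materializing dequantized keys."""
    return [scale * sum(a * b for a, b in zip(q, row))
            for row, scale in zip(q_rows, scales)]


def unfused_scores(q, q_rows, scales):
    """Reference path: dequantize the whole key matrix first, then take dot products."""
    keys = [[scale * x for x in row] for row, scale in zip(q_rows, scales)]
    return [sum(a * b for a, b in zip(q, k)) for k in keys]


keys = [[0.5, -1.0, 0.25], [2.0, 0.0, -2.0]]
q = [1.0, 2.0, 3.0]
q_rows, scales = quantize_rows(keys)
print(fused_scores(q, q_rows, scales))
print(unfused_scores(q, q_rows, scales))
```

Both paths give the same scores up to rounding, but the unfused path allocates an intermediate matrix the size of the key cache, which is exactly the memory traffic a fused kernel avoids.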

4. Empirical Outcomes

Reported results show that amortized compaction methods enable 2–10× memory reductions with negligible or no loss in generation quality on summarization, QA, code, and mathematical reasoning tasks, along with gains in batch throughput:

| Method | Achievable Budget | Quality Retention | Throughput Gain | Notable Result |
|---|---|---|---|---|
| KeepKV | 10% | >95% ROUGE | ~2× | 0 output perturbation at merge step (Tian et al., 14 Apr 2025) |
| EchoKV | 30–50% | >98% avg score | up to 2× | SOTA on LongBench/RULER (Wang et al., 24 Mar 2026) |
| DeltaKV | 29% | Near-lossless | up to 2× | 512K context, 2× vLLM throughput (Hao et al., 8 Feb 2026) |
| AMS-TOVA | 10–20% | +3–13 points | Slightly better | Eliminates Region Wipe-out, up to 44s decode/sample (Yang et al., 22 May 2026) |
| LeanKV | 18–26% (perf.-neutral) | ≥99.5% acc | 1.9–2.5× | 5–6× at <5% drop, 2–3× GPU gain (Zhang et al., 2024) |
| ThinKV | 2.5% | Near-lossless | ~5.8× | No $O(N)$ compaction, <1% per-step overhead (Ramachandran et al., 1 Oct 2025) |
| Still | 8–200× compress | On Pareto front | Sublinear, 0.4s | +8–22 pts over baselines (128K context) (O'Neill et al., 5 Jun 2026) |

Amortized schemes also tend to be more robust to changes in head importance, context nonstationarity, batch size, and memory pressure than non-amortized, greedy, or one-shot techniques.

5. Limitations, System Integration, and Extension Pathways

While amortized strategies excel in steady state, several limitations are recognized:

  • Attention score non-locality: Rapid or context-shifting attention may impair the predictive accuracy of heuristics or EMA predictors (e.g., KeepKV's multi-step bounds are tightest under locality).
  • Region fragmentation: Token-level compaction can fragment chains of thought—handled by AMS, but still a concern for heavily interleaved contexts.
  • ANN search scaling: Residual-based schemes (DeltaKV) may need ANN retrieval or reference hierarchy for million-token scales.
  • Chunk-iterative horizon: Learned synthesizers (Still) must be trained across the intended context length horizon; naive extrapolation can degrade sharply at lengths beyond training.

Plug-and-play compatibility is emphasized in recent frameworks: AMS, LeanKV, and GraphKV explicitly wrap arbitrary baseline eviction or scoring methods and integrate with paged-KV engines (e.g., vLLM) via API contracts that emit gather indices or block-table rewrites.

Extensions include: adaptive thresholding per head/layer, hybrid or hierarchical merging strategies, non-linear or attention-aware residual decoders, and online adaptation of compaction/prediction parameters guided by empirical output divergence.

6. Concluding Synthesis

Amortized KV cache compaction unifies a spectrum of memory-reduction strategies under a shared principle: distribute the high overhead of compaction across many tokens or requests so that long-context inference stays efficient, scalable, and accurate. The choice of strategy depends on specific deployment constraints: whether output perturbation can be tolerated (KeepKV when merging must not perturb the output), how reversible compression must be (EchoKV’s fallback to full cache), compatibility with paged memory management (LeanKV, ThinKV), or the need for layer-wise synthesis (Still). The development of amortized compaction underpins progress toward practical multi-hundred-K-token LLMs, enabling both batch scaling and low-latency reasoning on resource-bound hardware (Tian et al., 14 Apr 2025, Wang et al., 24 Mar 2026, Hao et al., 8 Feb 2026, Metel et al., 2024, Yang et al., 22 May 2026, Ramachandran et al., 1 Oct 2025, Zhang et al., 2024, O'Neill et al., 5 Jun 2026).
