Amortized KV Cache Compaction
- Amortized KV cache compaction is a technique to distribute the overhead of cache compression across tokens, ensuring low per-token cost during transformer inference.
- It leverages methods such as merging, quantization, and reconstruction to maintain robust generation quality under strict memory budgets.
- This approach enables scalable transformer models with longer contexts and higher throughput by controlling memory growth while preserving computational efficiency.
Amortized KV cache compaction refers to a family of methods that reduce the memory consumption of the key-value (KV) cache in transformer decoders during inference while keeping per-token overhead low and generation quality robust. The amortization principle spreads the computational cost of compaction (whether by merging, reconstruction, quantization, eviction, or synthesis) across many tokens or steps, which makes deployment at scale practical. Amortized strategies differ from naïve greedy or one-shot techniques in their system design, algorithm selection, or learned encoding, and they aim for consistent throughput in long-context and high-throughput inference.
1. Formal Principles and Compaction Objectives
The transformer decoder's KV cache stores the key and value vectors of every previous token for each layer and head. It grows linearly with context length and batch size and quickly dominates the serving footprint: for example, serving LLaMA-3-70B with 8K-token contexts at large batch sizes can require several hundred GB of GPU memory if the cache is left uncompressed (Tian et al., 14 Apr 2025). Amortized compaction techniques seek to bound this memory growth, often to a strict budget expressed as a fraction of the full cache, while spreading the combined cost of compaction and retrieval across tokens, so that the amortized per-token overhead stays small and generation accuracy is preserved.
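The linear growth described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the shape used (80 layers, 8 KV heads under grouped-query attention, head dimension 128, 16-bit elements) is an assumed LLaMA-3-70B-like configuration rather than a figure taken from the cited work.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, KV head, and token."""
    # Factor 2 accounts for storing both a key and a value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed shape (not from the source): 80 layers, 8 KV heads, head_dim 128, fp16.
single = kv_cache_bytes(80, 8, 128, seq_len=8192)
batched = kv_cache_bytes(80, 8, 128, seq_len=8192, batch_size=128)

print(f"one 8K sequence : {single / 2**30:.1f} GiB")
print(f"batch of 128    : {batched / 2**30:.1f} GiB")
```

Under these assumptions a single 8K-token sequence needs a few GiB, and the footprint reaches hundreds of GiB only once many sequences are batched, which is the regime where a fixed cache budget matters most.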
A loss-based formalism is common: given a memory budget, the objective is to produce a compressed cache such that, at every generation step, the output computed from it closely approximates the full-cache output. In many recent frameworks, the compaction operation is either stateless (query-agnostic, e.g., via token importance ranking) or lightweight per context (e.g., amortized residual reconstruction), and is often accompanied by explicit guarantees on the stability of the output distribution or on end-to-end error (O'Neill et al., 5 Jun 2026, Tian et al., 14 Apr 2025, Wang et al., 24 Mar 2026).
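One way to write this objective down is sketched below. All symbols are introduced here for illustration and are not the notation of any cited paper: $C$ is the full cache, $\tilde{C}$ the compacted cache, $B$ the memory budget, $o_t(\cdot)$ the attention output at decoding step $t$, and $\epsilon$ an error tolerance.

```latex
\text{find } \tilde{C}
\quad \text{such that} \quad
|\tilde{C}| \le B
\quad \text{and} \quad
\bigl\| o_t(\tilde{C}) - o_t(C) \bigr\| \le \epsilon
\quad \text{for every decoding step } t .
```

Individual methods differ in which norm or divergence they bound and in whether the guarantee holds exactly, in expectation, or only empirically.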
2. Methodological Taxonomy
Amortized KV cache compaction encompasses multiple algorithmic classes, each designed to minimize both memory and computational cost while ensuring that compaction work does not become a bottleneck:
2.1 Merging-based and Vote-aware Compaction
KeepKV (Tian et al., 14 Apr 2025) introduces a merging strategy that records “Electoral Votes” and uses Zero Inference-Perturbation (ZIP) Merging so that compaction does not perturb the output at the merge step. Each time two KV entries are merged, their vote counts are accumulated and attention is reweighted proportionally. A closed-form ZIP rule recalculates the merged key and value such that attention consistency is maintained exactly, a property shown analytically in the paper's Theorem 3.1. Merging and reweighting are applied infrequently (amortized over the token sequence), which keeps per-token overhead low, with negligible generation quality loss at budgets as tight as 10% of the full cache.
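To make the idea of a perturbation-free merge concrete, the sketch below merges two cache entries so that the attention output for the current query is unchanged. It is not KeepKV's ZIP rule, whose closed form is not reproduced in this article. It is a generic construction under stated assumptions: each entry carries a log-bias added to its logit (standing in for a vote count), and the function names `attend` and `merge_pair` are hypothetical.

```python
import math


def _logit(q, key, bias):
    """Scaled dot-product logit plus the entry's log-bias."""
    return sum(a * b for a, b in zip(q, key)) / math.sqrt(len(q)) + bias


def attend(q, keys, values, log_bias):
    """Softmax attention output for a single query vector."""
    logits = [_logit(q, k, b) for k, b in zip(keys, log_bias)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    norm = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(len(values[0]))]


def merge_pair(q, keys, values, log_bias, i, j):
    """Merge entries i and j; the output for query q stays the same."""
    s_i = math.exp(_logit(q, keys[i], log_bias[i]))
    s_j = math.exp(_logit(q, keys[j], log_bias[j]))
    total = s_i + s_j
    # Merged key and value are score-weighted averages of the originals.
    k_m = [(s_i * a + s_j * b) / total for a, b in zip(keys[i], keys[j])]
    v_m = [(s_i * a + s_j * b) / total for a, b in zip(values[i], values[j])]
    # Choose the bias so the merged entry carries unnormalized weight s_i + s_j.
    b_m = math.log(total) - _logit(q, k_m, 0.0)
    keep = [t for t in range(len(keys)) if t not in (i, j)]
    return ([keys[t] for t in keep] + [k_m],
            [values[t] for t in keep] + [v_m],
            [log_bias[t] for t in keep] + [b_m])
```

The merged entry contributes the same numerator and denominator terms as the two originals did, so the output for `q` matches up to floating-point error. For later queries the match is only approximate, which is why multi-step error bounds matter in practice.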
2.2 Reconstruction-based Amortization
EchoKV (Wang et al., 24 Mar 2026) performs groupwise cache dropout (e.g., discarding the majority of heads in selected layers) and reconstructs the dropped components on demand via ultra-lightweight trained linear layers. Inference amortizes the cost of pre-fill compression and per-token residual reconstruction: each token requires only a small number of linear transformations, which adds a small cost per token per group. This approach exploits inter- and intra-layer similarity and maintains accuracy at high compression ratios.
DeltaKV (Hao et al., 8 Feb 2026) recasts compaction as residual encoding: each new KV pair is stored as a low-dimensional latent representing its residual relative to a small strided reference set, and decompression is invoked only for the sparse subset of tokens used as attention references. The overall memory footprint and computational overhead grow sublinearly, and the process is inherently amortized because only a fraction of tokens ever needs decompression.
2.3 Blockwise and Region-aware Token Eviction
Batch-Max (Metel et al., 2024) implements blockwise eviction at both pre-fill and decode stages. It maintains running sums of attention weights and prunes once per fixed-size block of tokens, evicting the slots with the lowest average attention. All maintenance and eviction computations occur in constant time for each block, so the per-token cost is amortized over the block and depends on the cache size after compaction. Peak memory use per sample and head stays bounded accordingly, enabling up to 3–10× larger batches.
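The blockwise pattern can be sketched as follows. This is a minimal illustration of pruning once per block using running attention sums, not the Batch-Max implementation; the class name, its fields, and the tie-breaking behaviour are assumptions made for the example.

```python
class BlockwiseEvictor:
    """Keep at most `budget` slots, pruning only once per `block` new tokens."""

    def __init__(self, budget, block):
        self.budget = budget
        self.block = block
        self.slots = []        # cached token ids, oldest first
        self.score_sum = {}    # token id -> accumulated attention weight
        self.steps_seen = {}   # token id -> number of steps it was scored in
        self.since_prune = 0
        self.prune_calls = 0

    def step(self, token_id, attn_weights):
        """Record one decode step; attn_weights maps cached token id -> weight."""
        for tid, w in attn_weights.items():
            if tid in self.score_sum:
                self.score_sum[tid] += w
                self.steps_seen[tid] += 1
        self.slots.append(token_id)
        self.score_sum[token_id] = 0.0
        self.steps_seen[token_id] = 0
        self.since_prune += 1
        if self.since_prune >= self.block:
            self._prune()

    def _average(self, tid):
        seen = self.steps_seen[tid]
        # Tokens that were never scored are protected from eviction.
        return self.score_sum[tid] / seen if seen else float("inf")

    def _prune(self):
        self.prune_calls += 1
        self.since_prune = 0
        if len(self.slots) <= self.budget:
            return
        ranked = sorted(self.slots, key=self._average, reverse=True)
        keep = set(ranked[: self.budget])
        self.slots = [tid for tid in self.slots if tid in keep]
        self.score_sum = {tid: self.score_sum[tid] for tid in self.slots}
        self.steps_seen = {tid: self.steps_seen[tid] for tid in self.slots}
```

In this sketch the cache never holds more than `budget + block` entries, and the sort runs once every `block` steps, so its cost is shared by the tokens of that block. A larger block lowers the amortized cost but raises peak memory.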
AMS (Adaptive Mass-Segmented) (Yang et al., 22 May 2026) segments tokens based on cumulative attention mass and allocates retention quotas per region, enforcing region awareness and a guaranteed minimum coverage. The segmentation boundaries are smoothed with exponential moving averages, and the compaction routines (in-segment top-k) run at fixed intervals, which keeps amortized per-token overhead low for an appropriately chosen interval.
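The sketch below illustrates region-aware selection in its simplest form: tokens are split into contiguous segments of roughly equal attention mass, each segment is guaranteed a minimum number of survivors, and the rest of the budget goes to the highest-scoring remaining tokens. It omits the exponential-moving-average smoothing and is not the AMS algorithm itself; the function names and the quota rule are assumptions for illustration.

```python
def segment_by_mass(scores, n_segments):
    """Split indices into contiguous segments of roughly equal total score."""
    total = sum(scores)
    bounds, acc, seg = [], 0.0, 1
    for i, s in enumerate(scores):
        acc += s
        if seg < n_segments and acc >= total * seg / n_segments:
            bounds.append(i + 1)
            # Skip any further thresholds this single token already crossed.
            while seg < n_segments and acc >= total * seg / n_segments:
                seg += 1
    edges = [0] + bounds + [len(scores)]
    return [list(range(a, b)) for a, b in zip(edges, edges[1:]) if b > a]


def select_with_quotas(scores, budget, n_segments, min_per_segment=1):
    """Indices to keep: a per-segment floor first, then global top scores."""
    segments = segment_by_mass(scores, n_segments)
    quotas = [min(min_per_segment, len(seg)) for seg in segments]
    remaining = budget - sum(quotas)
    if remaining < 0:
        raise ValueError("budget is too small for the guaranteed coverage")
    keep = set()
    for seg, quota in zip(segments, quotas):
        keep.update(sorted(seg, key=lambda i: scores[i], reverse=True)[:quota])
    leftovers = sorted((i for i in range(len(scores)) if i not in keep),
                       key=lambda i: scores[i], reverse=True)
    keep.update(leftovers[:remaining])
    return sorted(keep)


scores = [5, 5, 5, 5] + [1] * 10
plain_top4 = sorted(sorted(range(len(scores)), key=lambda i: scores[i],
                           reverse=True)[:4])
region_aware = select_with_quotas(scores, budget=4, n_segments=3)
```

With these scores a plain top-4 keeps only the first four tokens and drops the long low-scoring tail entirely, while the quota-based selection retains one token from it.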
2.4 Streaming and Discrepancy-based Summarization
BalanceKV (Han et al., 11 Feb 2025) introduces a streaming, discrepancy-minimizing merge-and-reduce algorithm based on Banaszczyk's vector-balancing walk. Each batch is recursively split in half, which keeps the number of stored pairs small while achieving an ε-approximate attention output. The per-token cost is amortized over the recursion depth, and the reported empirical results outperform i.i.d. sampling-based cache pruning methods.
2.5 Learning-based Synthesis
Still (O'Neill et al., 5 Jun 2026) employs a per-layer small Perceiver network trained offline, which, after observing a full per-layer KV cache, synthesizes compact keys and values for all heads in a single forward pass. The amortization arises because the Perceiver compactor is called once per layer per chunk, not per token, and is applicable to arbitrarily long contexts by iterative application. The design allows for reuse across architectures and chunk sizes, with latency remaining sublinear in context length and exact workload.
3. Complexity and Amortization Analysis
Across methods, amortized compaction keeps per-token cost low by relying on infrequent, constant-time kernel launches, periodic groupwise selection, or fast local reconstructions. The costs typically break down as follows:
- Total compaction time: incurred once per block of tokens rather than at every decoding step.
- Memory usage after compaction: a fixed fraction of the full cache (budgets of 10–50%), plus marginal metadata (vote counts, latent indices, or quantization schemes).
- Amortized cost per token: for lazy or blockwise schemes, the cost of one compaction pass divided by the number of tokens between passes, which shrinks as that interval grows.
- No-gather, slot-reuse: ThinKV (Ramachandran et al., 1 Oct 2025) and LeanKV (Zhang et al., 2024) replace gather-based O(N) compaction with O(1)-amortized in-place page/slot reuse through GPU-resident block tables and page coordination, eliminating compaction spikes entirely.
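The difference between gather-based compaction and slot reuse in the list above can be illustrated with a free list. The sketch is a CPU-side toy under stated assumptions, not the ThinKV or LeanKV design: `SlotPool`, its fields, and `compact_by_gather` are hypothetical names, and a real engine would keep the table on the GPU and work at page granularity.

```python
def compact_by_gather(cache, keep_indices):
    """Gather-style compaction: copies every surviving entry into a new buffer."""
    return [cache[i] for i in keep_indices]


class SlotPool:
    """Fixed pool of physical slots; eviction frees a slot without moving data."""

    def __init__(self, n_slots):
        self.free = list(range(n_slots - 1, -1, -1))  # stack of free slot ids
        self.table = {}                               # token id -> physical slot

    def insert(self, token_id):
        """Place a token in any free slot: O(1), no other entry is touched."""
        if not self.free:
            raise MemoryError("no free slot; evict an entry first")
        slot = self.free.pop()
        self.table[token_id] = slot
        return slot

    def evict(self, token_id):
        """Release the token's slot for reuse: O(1), nothing is copied."""
        self.free.append(self.table.pop(token_id))


pool = SlotPool(4)
for tok in range(4):
    pool.insert(tok)
pool.evict(1)                # slot 1 becomes free
reused = pool.insert(4)      # token 4 takes over the freed slot
```

Because surviving entries never move, the work per eviction does not depend on cache size. The cost is that logical order is no longer physical order, so attention kernels must read through the table.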
System-level optimizations, e.g., fused decompression+mat-vec in KVComp (Jiang et al., 30 Aug 2025), further ensure that decompress and attention steps are jointly executed, minimizing memory movement and maximizing bandwidth in memory-bound regimes.
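The benefit of fusing decompression with the attention product can be seen in a small example. The sketch below uses symmetric 8-bit quantization as a stand-in compression scheme, which is an assumption made for illustration and not a description of KVComp's format or kernels. The function names are hypothetical.

```python
def quantize(vec, bits=8):
    """Symmetric per-vector quantization: integer codes plus one float scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in vec) / qmax or 1.0  # all-zero vectors get scale 1
    return [round(x / scale) for x in vec], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def scores_unfused(q, packed_keys):
    """Materialize every key in full precision, then take dot products."""
    keys = [dequantize(codes, scale) for codes, scale in packed_keys]
    return [sum(a * b for a, b in zip(q, k)) for k in keys]


def scores_fused(q, packed_keys):
    """Dot products taken on the codes; the scale is applied once per key."""
    return [scale * sum(a * c for a, c in zip(q, codes))
            for codes, scale in packed_keys]
```

Both functions return the same scores up to rounding, but the fused version never writes a decompressed copy of the keys. On memory-bound hardware that avoided round trip is where the saving comes from.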
4. Empirical Outcomes
Amortized compaction methods typically enable 2–10× memory reductions, and in some cases considerably more, with negligible reported impact on generation quality across summarization, QA, code, and mathematical reasoning, while also improving batch throughput:
| Method | Achievable Budget | Quality Retention | Throughput Gain | Notable Result |
|---|---|---|---|---|
| KeepKV | 10% | >95% ROUGE | ~2× | 0 output perturbation at merge step (Tian et al., 14 Apr 2025) |
| EchoKV | 30–50% | >98% avg score | up to 2× | SOTA on LongBench/RULER (Wang et al., 24 Mar 2026) |
| DeltaKV | 29% | Near-lossless | up to 2× | 512K context, 2× vLLM throughput (Hao et al., 8 Feb 2026) |
| AMS-TOVA | 10–20% | +3–13 points | Slightly better | Eliminates Region Wipe-out, up to 44s decode/sample (Yang et al., 22 May 2026) |
| LeanKV | 18–26% (perf.-neutral) | ≥99.5% acc | 1.9–2.5× | 5–6× at <5% drop, 2–3× GPU gain (Zhang et al., 2024) |
| ThinKV | 2.5% | Near-lossless | ~5.8× | No O(N) compaction, <1% per-step overhead (Ramachandran et al., 1 Oct 2025) |
| Still | 8–200× compression | On Pareto front | Sublinear, 0.4s | +8–22 pts over baselines (128K context) (O'Neill et al., 5 Jun 2026) |
Amortized schemes also tend to be more robust to changes in head importance, context nonstationarity, batch size, and memory pressure than non-amortized, greedy, or one-shot techniques.
5. Limitations, System Integration, and Extension Pathways
While amortized strategies excel in steady-state, several limitations are recognized:
- Attention score non-locality: Rapid or context-shifting attention may impair the predictive accuracy of heuristics or EMA predictors (e.g., KeepKV's multi-step bounds are tightest under locality).
- Region fragmentation: Token-level compaction can fragment chains of thought—handled by AMS, but still a concern for heavily interleaved contexts.
- ANN search scaling: Residual-based schemes (DeltaKV) may need approximate nearest-neighbor (ANN) retrieval or a reference hierarchy at million-token scales.
- Chunk-iterative horizon: Learned synthesizers (Still) must be trained across the intended context length horizon; naive extrapolation can degrade sharply at lengths beyond training.
Plug-and-play compatibility is emphasized in recent frameworks: AMS, LeanKV, and GraphKV explicitly wrap arbitrary baseline eviction or scoring methods and integrate with paged-KV engines (e.g., vLLM) via API contracts that emit gather indices or block-table rewrites.
Extensions include: adaptive thresholding per head/layer, hybrid or hierarchical merging strategies, non-linear or attention-aware residual decoders, and online adaptation of compaction/prediction parameters guided by empirical output divergence.
6. Concluding Synthesis
Amortized KV cache compaction unifies a spectrum of memory-reduction strategies under a shared principle: distribute the high overhead of compaction across many tokens or requests so that long-context inference stays efficient, scalable, and accurate. The choice of strategy depends on deployment constraints: whether output perturbation can be tolerated (KeepKV when merges must not perturb the output), how reversible compression must be (EchoKV’s fallback to full cache), compatibility with paged memory management (LeanKV, ThinKV), or the need for layer-wise synthesis (Still). The development of amortized compaction underpins progress toward practical multi-hundred-K-token LLMs, enabling both batch scaling and low-latency reasoning on resource-bound hardware (Tian et al., 14 Apr 2025, Wang et al., 24 Mar 2026, Hao et al., 8 Feb 2026, Metel et al., 2024, Yang et al., 22 May 2026, Ramachandran et al., 1 Oct 2025, Zhang et al., 2024, O'Neill et al., 5 Jun 2026).