Adaptive KV Aggregation in LLMs
- Adaptive KV Aggregation is a method for dynamically compressing and retaining key-value pairs in Transformer caches to maximize memory efficiency under long contexts.
- It leverages techniques such as layer-wise selection, head-wise budgeting, and similarity-based merging to preserve essential information while reducing resource usage.
- Empirical results demonstrate improved throughput and reduced memory consumption compared to uniform fixed-budget methods across varied tasks.
Adaptive Key-Value (KV) Aggregation refers to a suite of compression and retention strategies for the KV cache in Transformer-based LLMs and Vision-LLMs (VLMs), focused on maximizing inference efficiency under stringent memory constraints without compromising contextual fidelity. The KV cache—storing keys and values from prior tokens for efficient attention computation—rapidly becomes a memory and bandwidth bottleneck as input or output sequence length grows. Adaptive aggregation methods tackle this challenge by dynamically selecting, partitioning, merging, or scoring cache entries based on per-layer, per-head, semantic, or predictive criteria. These methods outperform fixed-budget (uniform, pyramid, sentence-level) approaches by aligning compression strategies with the actual context, activation, and importance profiles observed during inference across diverse tasks.
1. Motivations and Problem Formulation
Autoregressive inference in LLMs demands retention of all prior key and value projections in a multi-layer, multi-head Transformer. The naive FullKV policy scales memory linearly with context length, leading to prohibitive resource use (e.g., >50 GB for 100K tokens, LLaMA-2-7B (Zhou et al., 19 Dec 2024), >1 TB for GPT-3-class models). Standard compression methods (eviction, truncation) apply a uniform or monotonic budget across layers, heads, or tokens—often ignoring the heterogeneity in importance distributions induced by task, data modality, and layer-specific attentional behavior.
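The >50 GB figure is consistent with a back-of-the-envelope estimate for LLaMA-2-7B (32 layers, 32 KV heads, head dimension 128, fp16 storage):

```latex
\underbrace{2}_{K,V}\times\underbrace{32}_{\text{layers}}\times\underbrace{32}_{\text{heads}}\times\underbrace{128}_{d_{\text{head}}}\times\underbrace{2\,\text{bytes}}_{\text{fp16}}
\approx 0.5\,\text{MB per token}
\;\Rightarrow\; 10^{5}\ \text{tokens} \approx 50\,\text{GB}.
```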
Recent research exposes two key deficiencies in classical approaches:
- Non-uniform attention mass: Early layers and heads typically concentrate attention on a narrow subset of tokens; higher layers or specific tasks (e.g., code completion, multi-doc QA) require larger or more diffuse retention (Zhou et al., 19 Dec 2024, Zuo et al., 23 Mar 2025).
- Semantic and temporal coherence: Fragmented token-level retention degrades linguistic or visual continuity, especially in instruction-following, summarization, or vision-language alignment (Zuo et al., 23 Mar 2025, Chen et al., 26 Oct 2025).
Adaptive KV aggregation thus frames the problem as dynamic allocation of cache entries under a global or local memory budget, maximizing a fidelity or importance objective. The fidelity metric varies by method: cumulative retained attention mass, perplexity reduction, retrieval accuracy, or ROUGE/F1. A schematic form of the objective is given below.
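In generic notation (assumed here for illustration rather than drawn from any single paper), the budgeted allocation problem can be written as:

```latex
\max_{S_1,\dots,S_L}\;\sum_{l=1}^{L} F_l(S_l)
\qquad \text{s.t.} \qquad \sum_{l=1}^{L} \lvert S_l \rvert \le B ,
```

where S_l denotes the set of KV entries retained at layer l, F_l is the layer's fidelity score (e.g., the attention mass recovered by the retained entries), and B is the global cache budget; head-wise and segment-wise variants replace the per-layer sets with per-head or per-segment sets.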
2. Methodological Taxonomy
A diverse ecosystem of adaptive strategies has emerged, grouped as follows:
2.1. Layer-wise Adaptive Aggregation
PrefixKV (Wang et al., 4 Dec 2024) introduces per-layer prefix retention. Each layer scores tokens by attention-derived importance. The method globally searches for a threshold such that the cumulative priorities across layers yield a desired compression ratio, allocates a token budget per layer accordingly, and prunes at fixed distances during decoding. Layers with sharply concentrated scores receive smaller prefixes; "difficult" layers retain more.
2.2. Task-aware and Per-layer Budgeting
DynamicKV (Zhou et al., 19 Dec 2024) periodically recalibrates per-layer budgets according to task-specific activation maps. Budgets are distributed where observed attention concentration is highest, via periodic normalization over pooled attention scores. For example, summarization tasks favor small upper-layer budgets, whereas code completion demands larger mid-to-upper-layer budgets.
2.3. Head-wise Adaptive Allocation
Ada-KV (Feng et al., 16 Jul 2024) optimizes the budget per attention head, proving that a global top-k allocation across heads strictly improves the upper bound on retained attention mass compared to uniform head-wise splits. Adaptive selection ranks the concatenated head-specific attention weights, assigns each head the number of entries it wins in the global top-k, and smooths the allocation toward a uniform split via a mixing coefficient. This method generalizes to any eviction procedure (SnapKV, PyramidKV) and empirically yields higher accuracy at low budgets.
2.4. Merger-based Aggregation
KVMerger (Wang et al., 11 Jul 2024) pivots from pure eviction to similarity-driven merging. Adjacent key states in a sequence are greedily clustered when their cosine similarity exceeds a threshold, forming persistent "merging sets." These are condensed by attention-weighted Gaussian-kernel aggregation, yielding representatives with minimal context loss. Empirical evidence shows robust task-wise performance even at moderate (35–50%) compression.
2.5. Predictive/Monte-Carlo Importance Sampling
GVote (Tang et al., 3 Sep 2025) replaces manual budget tuning with union voting over sampled plausible future queries. Hidden states are modeled as Gaussian; each sample yields a query whose top-k attended keys are collected, and the union over samples is retained. The union size adapts naturally to the actual context complexity, delivering a robust efficiency/accuracy trade-off across reasoning and retrieval tasks.
2.6. Semantic and Block-level Compression
SABlock (Chen et al., 26 Oct 2025) employs input segmentation (by punctuation) to group tokens into contiguous semantic blocks, then performs segment-guided attention scoring to upweight globally salient spans. For a fixed cache budget, each segment is adaptively assigned the coarsest block size that preserves a target fraction of the ideal segment attention mass. Critically, semantic continuity is respected even under severe memory reduction.
2.7. Evict-then-Merge Cascade
EMS (Li et al., 11 Dec 2024) proposes a hybrid framework: first evict "irrelevant" tokens (those with the lowest global-local score), then merge candidates into per-head class centers if their redundancy (joint key/value similarity) exceeds a threshold, otherwise assign them to a zero-class and evict. The global-local score mitigates positional or recency biases, boosting long-range dependency and reducing perplexity under extreme budgets.
2.8. Hierarchical Chunked Management (Hardware-aware)
LeoAM (Sun et al., 25 Jun 2025) extends aggregation to hierarchical GPU–CPU–Disk architectures, adaptively chunking tokens by attention mass bounds, and employing lightweight key abstractions for disk-side summarization. The system dynamically partitions and compresses KV chunks, minimizing transmission and storage overhead for commodity hardware while retaining near full-cache accuracy.
3. Algorithmic Details and Implementation
This section contrasts the main algorithmic innovations found in adaptive KV aggregation methods.
3.1. PrefixKV: Binary Search over Layer-wise Priority Curves
- Compute importance scores for all tokens, per layer.
- Define a per-layer cumulative priority curve over the scores sorted in descending order.
- Binary-search a shared threshold on these curves so that the total number of retained KV pairs matches the target compression ratio, to a small tolerance.
- Post-prefill: sort scores, retain the top-scoring KV pairs per layer, and prune the oldest entries during decoding (a minimal sketch follows).
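The following sketch illustrates the layer-wise budget search described above, assuming per-layer importance scores are already available; the function name, bisection tolerance, and example data are illustrative, not from the paper.

```python
import numpy as np

def prefixkv_budgets(layer_scores, target_ratio, tol=1e-3):
    """Sketch of a PrefixKV-style layer-wise budget search.

    layer_scores: list of 1-D arrays, one importance score per cached token per layer.
    target_ratio: desired fraction of KV pairs to keep overall.
    Returns the number of KV pairs to retain at each layer.
    """
    # Per-layer cumulative priority curves over scores sorted in descending order.
    curves = []
    for s in layer_scores:
        s_sorted = np.sort(s)[::-1]
        curves.append(np.cumsum(s_sorted) / s_sorted.sum())  # monotone curve in [0, 1]

    total = sum(len(s) for s in layer_scores)
    target = target_ratio * total

    def retained(threshold):
        # Smallest prefix per layer whose cumulative priority reaches the threshold.
        return [int(np.searchsorted(c, threshold) + 1) for c in curves]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:                      # binary search on the shared threshold
        mid = (lo + hi) / 2
        if sum(retained(mid)) > target:
            hi = mid
        else:
            lo = mid
    return retained(lo)

# Example: 4 layers, 1,000 tokens each, keep roughly 20% of the cache overall.
rng = np.random.default_rng(0)
budgets = prefixkv_budgets([rng.random(1000) ** (i + 1) for i in range(4)], 0.2)
print(budgets)  # layers with more concentrated scores receive smaller prefixes
```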
3.2. DynamicKV: Periodic Attention-driven Normalization
- During prefill, score all tokens via pooled attention mechanisms.
- At fixed layer intervals, update per-layer budgets by distributing the global budget in proportion to recently observed attention mass (see the sketch below).
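A minimal sketch of the redistribution step, assuming pooled per-layer attention mass has already been measured; the pooling window and the per-layer floor are illustrative choices, not values from the paper.

```python
import numpy as np

def redistribute_budgets(attn_mass_per_layer, global_budget, floor=1):
    """Sketch of DynamicKV-style budget redistribution.

    attn_mass_per_layer: recent pooled attention mass observed at each layer.
    global_budget: total number of KV pairs to keep across all layers.
    floor: minimum entries kept per layer so no layer is starved.
    """
    mass = np.asarray(attn_mass_per_layer, dtype=float)
    weights = mass / mass.sum()                                  # normalize attention mass
    budgets = np.maximum(np.floor(weights * global_budget).astype(int), floor)
    return budgets

# Example: during prefill, recompute budgets periodically from fresh attention maps.
print(redistribute_budgets([5.0, 1.0, 9.0, 3.0], global_budget=1024))
```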
3.3. Ada-KV: Global Top- Head-wise Allocation
- Concatenate head-wise scores.
- Pick the global top-k entries across all heads.
- Count per-head selections to get the budget allocation.
- Smooth toward a uniform split; the resulting allocation plugs into any eviction scheme (sketched below).
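A minimal sketch of the head-wise allocation, assuming per-head importance scores are available; the function name, the mixing coefficient `alpha`, and the example sizes are illustrative rather than taken from the paper.

```python
import numpy as np

def adaptive_head_budgets(head_scores, total_budget, alpha=0.2):
    """Sketch of Ada-KV-style head-wise budget allocation.

    head_scores: list of 1-D arrays, one importance score per cached token per head.
    total_budget: total number of KV entries to keep across all heads.
    alpha: mixing coefficient toward a uniform split (safeguard; name illustrative).
    """
    num_heads = len(head_scores)
    # Flatten all head scores and remember which head each entry belongs to.
    owners = np.concatenate([np.full(len(s), h) for h, s in enumerate(head_scores)])
    flat = np.concatenate(head_scores)

    # Global top-k selection across every head's entries.
    top = np.argpartition(flat, -total_budget)[-total_budget:]
    counts = np.bincount(owners[top], minlength=num_heads)

    # Smooth toward a uniform allocation so no head is starved entirely.
    uniform = total_budget / num_heads
    budgets = (1 - alpha) * counts + alpha * uniform
    return np.round(budgets).astype(int)      # may be off by a few entries after rounding

# Example: 8 heads, 2,000 cached tokens each, keep 512 entries in total.
rng = np.random.default_rng(1)
print(adaptive_head_budgets([rng.random(2000) for _ in range(8)], total_budget=512))
```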
3.4. KVMerger: Cosine-similarity Clustering and Gaussian Merging
- For each head/layer, cluster contiguous tokens whose key states exceed a cosine-similarity threshold.
- Within cluster, select pivotal token by attention score.
- Merge each cluster via Gaussian weights decaying with embedding distance from the pivot (see the sketch below).
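A minimal sketch of the merge step for a single head, under the stated steps; the similarity threshold, kernel width, and example data are illustrative assumptions.

```python
import numpy as np

def merge_adjacent_keys(keys, attn, sim_threshold=0.9, sigma=1.0):
    """Sketch of KVMerger-style similarity-driven merging for one head.

    keys: (T, d) key states; attn: (T,) attention scores used to pick pivots.
    Adjacent keys whose cosine similarity exceeds sim_threshold join one merging set;
    each set collapses into a single Gaussian-weighted representative.
    """
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)

    # Greedily grow contiguous merging sets along the sequence.
    sets, current = [], [0]
    for t in range(1, len(keys)):
        if float(normed[t] @ normed[t - 1]) >= sim_threshold:
            current.append(t)
        else:
            sets.append(current)
            current = [t]
    sets.append(current)

    merged = []
    for idx in sets:
        idx = np.array(idx)
        pivot = idx[np.argmax(attn[idx])]              # most-attended token anchors the set
        dist = np.linalg.norm(keys[idx] - keys[pivot], axis=1)
        w = np.exp(-dist ** 2 / (2 * sigma ** 2))      # Gaussian kernel around the pivot
        w /= w.sum()
        merged.append((w[:, None] * keys[idx]).sum(axis=0))
    return np.stack(merged)

rng = np.random.default_rng(2)
k = rng.normal(size=(64, 16))
print(merge_adjacent_keys(k, rng.random(64)).shape)    # no more rows than the input
```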
3.5. GVote: Future-Sampling and Voting Union
- Sample synthetic future queries from a Gaussian fitted to the layer-normalized hidden states.
- For each sampled query, compute its top-k attended keys.
- Retain the union of the selected keys (a minimal sketch follows).
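A minimal sketch of the sampling-and-voting step for one head; the sample count, k, and the Gaussian fit over observed queries are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def gvote_keep_set(past_queries, keys, k=32, num_samples=16, seed=0):
    """Sketch of GVote-style predictive selection for one head.

    past_queries: (Q, d) observed query states used to fit a Gaussian.
    keys: (T, d) cached key states. Returns indices of KV entries to retain.
    """
    rng = np.random.default_rng(seed)
    mu = past_queries.mean(axis=0)
    cov = np.cov(past_queries, rowvar=False) + 1e-6 * np.eye(past_queries.shape[1])

    keep = set()
    for _ in range(num_samples):
        q = rng.multivariate_normal(mu, cov)           # plausible future query
        scores = keys @ q
        topk = np.argpartition(scores, -k)[-k:]        # that query's top-k attended keys
        keep.update(topk.tolist())                     # vote by union
    return sorted(keep)                                # size adapts to context complexity

rng = np.random.default_rng(3)
kept = gvote_keep_set(rng.normal(size=(100, 64)), rng.normal(size=(4096, 64)))
print(len(kept))  # anywhere between k and num_samples * k, depending on overlap
```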
3.6. SABlock: Segment-wise Adaptive Block Selection
- Segment compressible region by punctuation.
- Score tokens within each segment, boosted by the segment's global attention strength.
- For a given global token budget, search for the largest per-segment block size that maintains the target attention fidelity (sketched below).
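A minimal sketch of the per-segment block-size search, assuming segment-boosted token scores are already computed; the fidelity threshold, candidate block sizes, and example data are illustrative assumptions.

```python
import numpy as np

def coarsest_block_size(token_scores, seg_budget, fidelity=0.95,
                        candidate_sizes=(1, 2, 4, 8, 16)):
    """Sketch of SABlock-style block-size search for one semantic segment.

    token_scores: importance of each token in the segment (already segment-boosted).
    seg_budget: number of tokens this segment may keep.
    Returns the largest block size whose block-level selection still preserves at
    least `fidelity` of the attention mass kept by ideal token-level selection.
    """
    s = np.asarray(token_scores, dtype=float)
    ideal = np.sort(s)[::-1][:seg_budget].sum()        # token-level (finest) selection

    best = 1
    for b in candidate_sizes:
        n_blocks = max(seg_budget // b, 1)             # roughly the same token budget
        padded = np.pad(s, (0, (-len(s)) % b))         # pad to a multiple of the block size
        block_scores = padded.reshape(-1, b).sum(axis=1)
        top_blocks = np.argsort(block_scores)[::-1][:n_blocks]
        kept = block_scores[top_blocks].sum()
        if kept >= fidelity * ideal:
            best = b                                   # coarser blocks remain faithful enough
    return best

# Example: locally coherent (bursty) importance lets coarser blocks pass the check.
rng = np.random.default_rng(4)
scores = np.repeat(rng.random(25), 8)                  # roughly constant within bursts
print(coarsest_block_size(scores, seg_budget=32))
```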
3.7. EMS: Evict-then-Merge with Global-Local Score and Zero-Class
- Score tokens with per-head global-local attention metrics.
- Evict the lowest-ranked tokens; merge similar TBM candidates into per-head class centers if the redundancy threshold is met (see the sketch below).
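A minimal sketch of one evict-then-merge pass for a single head; the global-local score mix, the redundancy threshold, and the running-mean center update are simplifying assumptions rather than the paper's exact formulation.

```python
import numpy as np

def evict_then_merge(keys, values, global_attn, local_attn,
                     keep_ratio=0.25, redundancy=0.85):
    """Sketch of an EMS-style evict-then-merge pass for one head.

    keys, values: (T, d) cached states; global_attn / local_attn: (T,) scores.
    Tokens with the lowest combined global-local score are evicted; survivors that
    are highly similar to an existing class center are merged into it.
    """
    score = 0.5 * global_attn + 0.5 * local_attn       # illustrative global-local mix
    keep = np.argsort(score)[::-1][: int(len(keys) * keep_ratio)]

    centers_k, centers_v, counts = [], [], []
    for t in sorted(keep):
        k = keys[t] / np.linalg.norm(keys[t])
        sims = [float(k @ (c / np.linalg.norm(c))) for c in centers_k]
        if sims and max(sims) >= redundancy:
            j = int(np.argmax(sims))                   # fold into the closest class center
            counts[j] += 1
            centers_k[j] += (keys[t] - centers_k[j]) / counts[j]   # running-mean update
            centers_v[j] += (values[t] - centers_v[j]) / counts[j]
        else:
            centers_k.append(keys[t].copy())
            centers_v.append(values[t].copy())
            counts.append(1)
    return np.stack(centers_k), np.stack(centers_v)

rng = np.random.default_rng(5)
K, V = rng.normal(size=(512, 16)), rng.normal(size=(512, 16))
ck, cv = evict_then_merge(K, V, rng.random(512), rng.random(512))
print(ck.shape)   # at most keep_ratio * T class centers remain
```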
3.8. LeoAM: Tiered Chunk Partitioning and Lightweight Abstraction
- Partition tokens into variable-sized chunks, assign to GPU, CPU, disk by dynamic bound evaluations.
- Summarize disk chunks by max/min key-vectors.
- Schedule compression, decompression, and transfer to overlap with the compute pattern (sketched below).
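A minimal sketch of the tiering decision for one head, using max/min key vectors as the lightweight chunk abstraction; the attention upper bound, chunk size, and tier cutoffs are illustrative assumptions about how such a system could be organized.

```python
import numpy as np

def tier_chunks(keys, query, chunk_size=128, gpu_frac=0.25, cpu_frac=0.5):
    """Sketch of LeoAM-style hierarchical chunk placement for one head.

    keys: (T, d) cached keys; query: (d,) current query.
    Each chunk is summarized by element-wise max/min key vectors; an upper bound
    on its attention logit decides whether it lives on GPU, CPU, or disk.
    """
    T, d = keys.shape
    chunks = [keys[i:i + chunk_size] for i in range(0, T, chunk_size)]

    bounds = []
    for c in chunks:
        hi, lo = c.max(axis=0), c.min(axis=0)          # lightweight chunk abstraction
        # q . k <= sum_j max(q_j * hi_j, q_j * lo_j) for every key k in the chunk.
        bounds.append(np.maximum(query * hi, query * lo).sum())
    order = np.argsort(bounds)[::-1]                   # most promising chunks first

    n_gpu = max(int(len(chunks) * gpu_frac), 1)
    n_cpu = max(int(len(chunks) * cpu_frac), 1)
    placement = {int(i): ("gpu" if r < n_gpu else "cpu" if r < n_gpu + n_cpu else "disk")
                 for r, i in enumerate(order)}
    return placement

rng = np.random.default_rng(6)
print(tier_chunks(rng.normal(size=(2048, 64)), rng.normal(size=64)))
```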
4. Empirical Results and Performance Benchmarks
Adaptive aggregation methods systematically outperform fixed-budget baselines across major long-context QA, summarization, synthetic retrieval, and code tasks.
4.1. Layer/Head Adaptive Methods
- PrefixKV (Wang et al., 4 Dec 2024): At 20% cache, up to 1.8× throughput improvement on LLaVA-7B, avoids out-of-memory in high batch settings.
- DynamicKV (Zhou et al., 19 Dec 2024): At 1.7% cache, 85–90% of full-cache accuracy; at a 0.9% budget, needle-retrieval accuracy 11% higher than the prior SOTA.
4.2. Block/Window/Semantic Aggregation
- WindowKV (Zuo et al., 23 Mar 2025): 12% cache, <1 point performance drop, 17% speedup.
- SABlock (Chen et al., 26 Oct 2025): 99.9% retrieval at 96 entries, 46% memory reduction, 9.5× speedup at 128K context.
4.3. Similarity/Merge-based Methods
- KVMerger (Wang et al., 11 Jul 2024): 50% budget, within 0.6% F1 of full cache; >90% retrieval accuracy, outperforms eviction under tight budgets.
- EMS (Li et al., 11 Dec 2024): At 2% context budget, 95.9% retrieval, lowest perplexity, 3.9–6.7× speedup, supports large batches without OOM.
4.4. Predictive Sampling
- GVote (Tang et al., 3 Sep 2025): 2× memory reduction, matching or exceeding baseline quality on GSM8K, RULER, LongBench.
4.5. Hardware-aware Aggregation
- LeoAM (Sun et al., 25 Jun 2025): <1 pp accuracy drop at 25% cache, average speedup 3.46×, up to 5.47× at batch 8.
5. Analysis, Limitations, and Future Research Directions
5.1. Design Trade-offs
- Layer/head adaptive methods maximize attention-mass retention but may require offline or online curve fitting.
- Merge-based schemes preserve semantic continuity, but the choice of clustering thresholds (cosine similarity, redundancy) introduces sensitivity.
- Predictive voting is robust to workload variance, yet its accuracy depends on hidden-state modeling quality.
- Segment/block methods (SABlock) optimally resolve semantic boundaries, balancing block size and attention-fidelity.
5.2. Limitations
- Most current methods rely on fixed hyperparameters or thresholding; per-head/layer dynamic tuning is not always considered.
- Hardware-aware strategies are bounded by disk I/O, with abstractions limited to max/min sketches.
- Merge and segment methods may introduce approximation, especially in highly non-stationary or cross-modal contexts.
5.3. Research Directions
- Distributionally robust risk control for cache eviction (DefensiveKV, Layer-DefensiveKV (Feng et al., 15 Oct 2025))—shifting from average-case to worst-case utility.
- Orthogonal integration with quantization and sparse attention for further efficiency gains.
- Extension to non-transformer, retrieval-augmented, and encoder–decoder architectures.
- Per-layer/per-head adaptive parameter learning, possibly integrated into continual learning or fine-tuning paradigms.
6. Controversies and Interpretation
A common misconception is that uniform truncation suffices for any long-context task. Counter-evidence from activation distribution profiling (Zhou et al., 19 Dec 2024) and semantic retention studies (Chen et al., 26 Oct 2025) demonstrates non-uniform needs across layers, heads, and tasks. Another point of contention centers on the reliability of attention-based importance scores; alternatives (e.g., value-norm adjustment as in CriticalKV (Feng et al., 15 Oct 2025)) supplement aggregation for downstream accuracy. The trade-off between semantic coherence (block/window) and token-level granularity remains a focus for further studies.
7. Summary Table: Major Adaptive KV Aggregation Strategies
| Approach | Key Innovations | Notable Performance/Trade-off |
|---|---|---|
| PrefixKV | Layer-wise adaptive prefix search | 1.8× speedup at 20% cache, robust quality |
| DynamicKV | Task-aware per-layer allocation | 85–90% accuracy at 1.7% cache; 11% SOTA gain |
| Ada-KV | Head-wise top-k allocation | +1.2–1.6 pts avg. over uniform baselines |
| KVMerger | Cosine-similarity cluster merge | >0.9 F1 near full-cache at 35–50% retention |
| GVote | Predictive future query voting | 2× memory reduction, robust across workloads |
| SABlock | Segment-guided block selection | 99.9% retrieval at 96 entries; 9.5× faster |
| EMS | Global-local evict-then-merge | 95.9% retrieval at 2% budget; lowest perplexity |
| LeoAM | Tiered chunked management | 3.46× avg speedup, <1 pp accuracy drop |
| DefensiveKV | Worst-case risk aggregation | 2.3–4.3× less quality loss @ 20% cache |
Adaptive KV aggregation is a dynamically evolving area that underpins practical LLM/VLM deployment on constrained hardware. Its core principle is the allocation, selection, or merging of cache entries according to real, context-sensitive utility, empirically maximizing inference speedup and memory reduction with minimal fidelity loss.