Fast KV Dimensionality Compression (FDC)

Updated 28 March 2026
  • FDC is a framework for reducing the memory and computational footprint of transformer key-value caches during inference by compressing keys and values.
  • It employs low-rank projections, quantization, and token/slice-selective pruning to manage long-context inputs and high-throughput prompts while preserving accuracy.
  • Practical FDC techniques enable scalable LLM and multimodal model deployments on resource-constrained hardware with minimal performance loss.

Fast KV Dimensionality Compression (FDC) defines a class of algorithmic frameworks and practical techniques for reducing the memory, bandwidth, and computational cost of Key–Value (KV) caches in large transformer models during inference. The primary goal is to allow LLMs and large multimodal models to serve long-context inputs, high-throughput prompts, or resource-constrained deployments by minimizing the dimensional, structural, or bitwidth footprint of the cached K (keys) and V (values) tensors across layers and heads. FDC encompasses a spectrum of approaches, including low-rank projection, quantization, selective token retention, similarity-based reuse, and joint latent-space representations, each often accompanied by strategies that prioritize information preservation or adaptive allocation under memory or accuracy constraints.

1. Foundations and Motivation

The sequence-length–linear scaling of KV-cache size, $O(s \cdot d)$ (with $s$ the sequence length and $d$ the head or model dimension), renders KV cache storage and access a major bottleneck in inference for modern LLMs and LMMs. This memory pressure constrains context length, attainable batch size, and hardware throughput, particularly when models serve real-time applications or operate at scale. Empirical studies have identified substantial redundancy and low-rank structure in cached key/value representations, motivating the search for FDC techniques that can realize significant compression while preserving model output fidelity.
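
As a concrete illustration of this scaling, the sketch below estimates the KV-cache footprint of a hypothetical 7B-class decoder; all configuration values (layer count, head count, head dimension, precision) are assumptions for illustration, not figures from any cited paper.

```python
# Back-of-the-envelope KV-cache size for a hypothetical 7B-class decoder.
# All configuration values are illustrative assumptions.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Total bytes for cached K and V tensors (factor 2 covers keys + values)."""
    return 2 * batch_size * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Example: 32 layers, 32 KV heads, head_dim 128, fp16, 32K-token context.
full = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"full cache:    {full / 2**30:.1f} GiB")      # ~16 GiB per sequence

# A 4x dimensionality (or bitwidth) reduction scales the same expression linearly.
print(f"4x-compressed: {full / 4 / 2**30:.1f} GiB")
```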

A central challenge in FDC is to design methods that integrate with the distinct mechanics of transformer inference, including features such as rotary position embeddings (RoPE), group-query attention (GQA), and hybrid architectural modules (e.g., multimodal encoders), while maintaining compatibility with prevalent inference frameworks and hardware-accelerated kernels (Mu et al., 28 Oct 2025, Lin et al., 2024, Zhang et al., 2024, Yang et al., 2024, Roy et al., 7 Dec 2025, Lesens et al., 5 Dec 2025, Wang et al., 24 Mar 2026, Li et al., 26 Jul 2025, Jegou et al., 12 Jan 2026, Zhang et al., 28 Jul 2025).

2. Low-Rank and Latent-Space Approaches

2.1. Direct Low-Rank Projection

Several frameworks perform dimensionality reduction by projecting cached K/V matrices into lower-dimensional latent subspaces. The representative methods differ primarily in the choice of projection operator, training/calibration regime, and integration with transformer internals.

SALS: Sparse Attention in Latent Space achieves high compression (up to 6.4× at 4K context) and a 5.7× operator speed-up by projecting pre-RoPE K/V into a latent space of rank $r \ll d$, performing RoPE-free query–key scoring in that space, selecting a sparse subset of important tokens, and reconstructing only those tokens in full dimension for downstream RoPE and attention computation. This pipeline avoids the expensive full-key reconstruction that would otherwise negate compression benefits under RoPE, while maintaining negligible accuracy loss (Mu et al., 28 Oct 2025).
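
A minimal sketch of this pipeline follows, under strong simplifying assumptions: a fixed random projection stands in for SALS's calibrated low-rank maps, RoPE is omitted entirely, and a pseudo-inverse plays the role of the learned up-projection; dimensions and the top-k budget are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, seq, top_k = 128, 32, 1024, 64             # head dim, latent rank, tokens kept

# Assumed stand-ins for SALS's calibrated low-rank projections.
P_k = rng.standard_normal((d, r)) / np.sqrt(d)   # key down-projection
P_q = rng.standard_normal((d, r)) / np.sqrt(d)   # query down-projection

K = rng.standard_normal((seq, d))                # pre-RoPE keys
K_lat = K @ P_k                                  # only this (seq, r) tensor is cached

q = rng.standard_normal(d)
scores = (q @ P_q) @ K_lat.T                     # RoPE-free query-key scoring in latent space
keep = np.argsort(scores)[-top_k:]               # sparse subset of important tokens

# Only the selected tokens are reconstructed to full dimension for exact,
# RoPE-aware attention; a pseudo-inverse stands in for the learned up-map.
K_sel = K_lat[keep] @ np.linalg.pinv(P_k)
print(K_lat.shape, K_sel.shape)                  # (1024, 32) vs. (64, 128)
```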

MatryoshkaKV introduces learned, trainable orthogonal projections per layer/head with a Matryoshka-style (nested) ranking schedule, enabling precise tradeoffs between memory and accuracy across a range of budgets. Projections are initialized using PCA, then refined using KL+LM distillation on a calibration set, supporting adaptive, per-head allocation of compression rates. This method achieves >90% original model accuracy at 60–75% compression, with failure of PCA-only projection evident at $r/d < 62.5\%$ (Lin et al., 2024).
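
The nested-rank idea can be illustrated with a PCA-style construction: an SVD of calibration keys yields an orthogonal basis ordered by energy, so any prefix of its columns is itself a valid projection. The sketch below uses synthetic calibration data and omits the KL+LM distillation refinement described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
# Synthetic, correlated "calibration keys"; real usage would collect these from a model.
K_calib = rng.standard_normal((10_000, d)) @ rng.standard_normal((d, d))

# SVD gives an orthogonal basis ordered by explained energy, so any prefix of
# columns is itself a valid projection: the Matryoshka (nested-rank) property.
_, _, Vt = np.linalg.svd(K_calib, full_matrices=False)
U = Vt.T                                          # (d, d) orthogonal basis

for r in (96, 80, 64):                            # candidate per-head/per-layer budgets
    P = U[:, :r]                                  # nested prefix of the same basis
    K_hat = (K_calib @ P) @ P.T                   # project down, then reconstruct
    err = np.linalg.norm(K_calib - K_hat) / np.linalg.norm(K_calib)
    print(f"rank {r}: relative reconstruction error {err:.3f}")
```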

LoRC applies truncated SVD to the attention weights $W_K$ and $W_V$ per head, factoring the key/value computation into smaller matrices and storing only the reduced cache dimensions and the appropriately rewritten projection weights, without the need for model retraining or specialized tuning. A progressive per-layer compression scheme modulates the compression ratio according to the cumulative downstream condition number, mitigating error amplification in shallow layers. This yields up to 60% compression and matches full-cache performance across several LLaMA variants (Zhang et al., 2024).
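
The sketch below illustrates the weight-factorization step on a synthetic, approximately low-rank $W_K$: the truncated SVD splits into a rewritten key projection (producing an $r$-dimensional cache) and a factor absorbed into the score computation. Shapes, rank, and the synthetic weights are assumptions; the progressive per-layer rank schedule is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, r, seq = 1024, 128, 64, 512

# Synthetic, approximately low-rank W_K (real attention weights show similar structure).
W_K = (rng.standard_normal((d_model, 48)) @ rng.standard_normal((48, d_head))
       + 0.01 * rng.standard_normal((d_model, d_head)))
W_Q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

U, S, Vt = np.linalg.svd(W_K, full_matrices=False)
A = U[:, :r] * S[:r]          # (d_model, r): rewritten key projection
B = Vt[:r]                    # (r, d_head): factor absorbed into the score path

X = rng.standard_normal((seq, d_model))          # hidden states feeding the key projection
K_cache = X @ A                                  # stored cache is (seq, r), not (seq, d_head)

x_q = rng.standard_normal(d_model)
q = x_q @ W_Q                                    # query in head dimension
scores = K_cache @ (B @ q)                       # attention logits from the reduced cache
scores_full = (X @ W_K) @ q                      # reference with the uncompressed cache

rel = np.linalg.norm(scores - scores_full) / np.linalg.norm(scores_full)
print(f"relative score error vs. full cache: {rel:.4f}")
```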

2.2. Latent Representation Merging

CLLA (Cross-Layer Latent Attention) unifies multi-axis compression: it projects all key/value pairs into a single low-dimensional latent per-token vector, reuses these latents across $F$ layers (cross-layer sharing), and encodes them in 4-bit quantized form. Reconstruction for attention proceeds by projection from the quantized latent, and the architecture is trained to be lossless, even at $<2\%$ of the original cache size, by fusing all compression components during training. Accuracy is preserved (or slightly improved) over the baseline on 15 downstream tasks (Yang et al., 2024).
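
A rough sketch of the combination of per-token latents, cross-layer sharing, and 4-bit storage follows; random projections stand in for CLLA's jointly trained down- and up-maps, and the layer grouping and dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, seq, n_shared_layers = 128, 16, 256, 4

# Random stand-ins for CLLA's jointly trained projections.
down = rng.standard_normal((2 * d, r)) / np.sqrt(2 * d)                 # [K; V] -> latent
up = [rng.standard_normal((r, 2 * d)) / np.sqrt(r) for _ in range(n_shared_layers)]

KV = rng.standard_normal((seq, 2 * d))           # concatenated keys/values for one layer group
z = KV @ down                                    # one (seq, r) latent per token

# 4-bit asymmetric quantization of the shared latent (per-token scale / zero point).
lo, hi = z.min(axis=1, keepdims=True), z.max(axis=1, keepdims=True)
scale = (hi - lo) / 15.0
codes = np.clip(np.round((z - lo) / scale), 0, 15).astype(np.uint8)     # stored cache
z_hat = codes * scale + lo                       # dequantized on the fly

for layer, up_l in enumerate(up):                # each layer in the group reuses one latent
    KV_rec = z_hat @ up_l                        # reconstructed keys/values for attention
    print(f"layer {layer}: reconstructed KV shape {KV_rec.shape}")
```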

3. Similarity-Based, Adaptive, and Hybrid Methods

EchoKV leverages intra- and inter-layer similarity by explicitly retaining a subset of heads per group, then reconstructing the dropped heads from both the local surviving heads and a "global" cache from an adjacent layer, using lightweight learned linear maps. Training the reconstructors is fast (<1 hour for a 7B model) and does not alter base model weights, facilitating on-demand switching between compressed and full KV inference. EchoKV achieves near-lossless accuracy at 2×–3× reduction, outperforms SVD-based methods at long context, and supports seamless toggling of compression at runtime (Wang et al., 24 Mar 2026).
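
The sketch below illustrates the head-reconstruction idea with a least-squares linear map fit on synthetic data; it omits the inter-layer "global" cache and uses arbitrary head counts and dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, seq = 8, 64, 1024

K = rng.standard_normal((n_heads, seq, d_head))
K[5] = 0.8 * K[1] + 0.2 * rng.standard_normal((seq, d_head))   # head 5 is largely redundant

kept, dropped = [0, 1, 2, 3], 5
K_kept = np.concatenate([K[h] for h in kept], axis=-1)          # (seq, 4 * d_head), the cached part

# Calibration: fit a lightweight linear map so K_kept @ W ~= K[dropped];
# base model weights are never modified.
W, *_ = np.linalg.lstsq(K_kept, K[dropped], rcond=None)

K_rec = K_kept @ W                                              # reconstruction at inference time
err = np.linalg.norm(K_rec - K[dropped]) / np.linalg.norm(K[dropped])
print(f"relative reconstruction error for the dropped head: {err:.3f}")
```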

KV-CAR combines lightweight per-layer autoencoders, which compress keys/values along the embedding dimension, with a similarity-driven "head reuse" policy: heads at adjacent layers sharing high $L_1$ similarity (thresholded at, e.g., 0.9) reuse cached representations, avoiding redundancy. Latents are optionally quantized, further reducing memory. KV-CAR shrinks KV demands by up to 50%, with minimal impact on model perplexity and accuracy (Roy et al., 7 Dec 2025).
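
A minimal sketch of the similarity-driven reuse decision follows; the 0.9 threshold comes from the text above, but the exact $L_1$ similarity normalization is an assumption, and the autoencoder and quantization stages are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head, seq, threshold = 64, 512, 0.9

def l1_similarity(a, b):
    # 1 minus a normalized mean absolute difference; 1.0 means identical tensors.
    return 1.0 - np.abs(a - b).mean() / (np.abs(a).mean() + np.abs(b).mean())

K_prev = rng.standard_normal((seq, d_head))                  # head h, layer l (already cached)
K_curr = K_prev + 0.05 * rng.standard_normal((seq, d_head))  # head h, layer l + 1

if l1_similarity(K_prev, K_curr) >= threshold:
    cache_entry = K_prev       # reuse: no new storage for this head at layer l + 1
else:
    cache_entry = K_curr       # otherwise cache (and optionally autoencode/quantize) it

print(f"similarity = {l1_similarity(K_prev, K_curr):.3f}, reuse = {cache_entry is K_prev}")
```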

4. Token/Slice-Selective and Frequency-Domain Compression

KVzap (and related token-pruning schemes) approaches FDC through adaptive, per-token pruning along the time axis: a small surrogate MLP, trained to imitate the more costly KVzip+ scoring, ranks the information value of each KV cache position and discards those below a threshold, always retaining a specified recency window. KVzap achieves 2–4× compression with negligible accuracy loss (<0.3%), adds minimal inference latency, and yields proportional speedups in memory-bandwidth-limited kernels (Jegou et al., 12 Jan 2026).
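
The sketch below mimics the pruning decision with an untrained stand-in scorer: a small MLP assigns an importance score to each cached position, positions below a quantile threshold are evicted, and a recency window is always kept. The scorer architecture, threshold rule, and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, hidden, recency, keep_ratio = 2048, 128, 32, 64, 0.4

K = rng.standard_normal((seq, d))                      # cached keys for one head

# Untrained stand-in for the surrogate scorer (the real one imitates KVzip+ scores).
W1 = rng.standard_normal((d, hidden)) / np.sqrt(d)
W2 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
scores = (np.maximum(K @ W1, 0.0) @ W2).ravel()        # importance score per position

threshold = np.quantile(scores, 1.0 - keep_ratio)      # budget-derived eviction threshold
keep = scores >= threshold
keep[-recency:] = True                                 # the recency window is never pruned

K_pruned = K[keep]
print(f"kept {K_pruned.shape[0]} of {seq} positions")
```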

FAEDKV operates in the frequency domain, transforming the time axis of each KV vector using an Infinite-Window Discrete Fourier Transform (IWDFT). It retains only the most informative spectral components discovered by a layer-wise frequency ablation study, ensuring position-agnostic information retention without recency bias. Compression and reconstruction employ only the selected frequencies, resulting in state-of-the-art perplexity under extreme cache budgets (a 22% improvement over token-eviction methods at $r \approx 0.1$). FAEDKV supports fixed-budget, unbiased, and task-agnostic compression (Li et al., 26 Jul 2025).
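
The following sketch conveys the frequency-domain idea using a standard real FFT as a stand-in for the IWDFT, retaining the largest-magnitude bins rather than bins chosen by layer-wise ablation; signal shapes and the retained-bin budget are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, n_keep = 1024, 64, 64            # keep 64 of the 513 rFFT bins per channel

K = np.cumsum(rng.standard_normal((seq, d)), axis=0)   # smooth, compressible test signal
spec = np.fft.rfft(K, axis=0)                          # transform along the time axis

# Retain a fixed set of bins shared across channels (here the largest-magnitude ones).
keep = np.argsort(np.abs(spec).sum(axis=1))[-n_keep:]
spec_kept = np.zeros_like(spec)
spec_kept[keep] = spec[keep]                           # this sparse spectrum is the cache

K_rec = np.fft.irfft(spec_kept, n=seq, axis=0)         # reconstruction when attention needs K
err = np.linalg.norm(K - K_rec) / np.linalg.norm(K)
print(f"relative reconstruction error: {err:.3f}")
```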

5. Quantization, Pruning, and Mixed-Precision FDC

Adaptive quantization is another FDC axis: layer-wise quantization of cached tensors with uniform asymmetric quantization functions, e.g., per-layer bitwidth $b_\ell \in \{6, 8, 16\}$, with the optimal allocation discovered via black-box search (e.g., a Tree-structured Parzen Estimator, as in (Zhang et al., 28 Jul 2025)). By allocating higher bitwidths to more sensitive (often early) layers and quantizing deeper layers more aggressively, FDC achieves high compression (e.g., 2–3×) with minor or negligible degradation of perplexity or accuracy (e.g., 0.2–0.3% on VQA tasks). These approaches can be combined with fast unstructured pruning of weights.
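
A simple sketch of per-layer asymmetric uniform quantization under a mixed-precision schedule follows; the bitwidth assignment is hand-picked for illustration rather than discovered by the black-box search described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_dequantize(x, bits):
    """Uniform asymmetric quantization with a single scale/zero point per tensor."""
    lo, hi = x.min(), x.max()
    levels = 2**bits - 1
    scale = (hi - lo) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo

n_layers, seq, d = 8, 256, 128
bit_schedule = [16, 16, 8, 8, 8, 6, 6, 6]    # more precision for early, sensitive layers

for layer, bits in enumerate(bit_schedule):
    kv = rng.standard_normal((seq, d))                       # cached K/V slice for this layer
    kv_hat = kv if bits == 16 else quantize_dequantize(kv, bits)
    err = np.abs(kv - kv_hat).max()
    print(f"layer {layer}: {bits}-bit, max abs error {err:.4f}")
```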

6. Implementation, Performance, and Practical Considerations

The following table summarizes core attributes and performance based on major FDC paradigms:

| Method | Typical Compression | Key ML Primitive | Accuracy / Trade-off | Notable Results / Models |
|---|---|---|---|---|
| SALS (Mu et al., 28 Oct 2025) | 6.4× (4K seq), plus 5.7× operator speed-up | Low-rank projection, token sparsity, latent RoPE-free scoring | Negligible loss; SOTA operator speed | LLaMA2-7b-chat, Mistral-7b |
| MatryoshkaKV (Lin et al., 2024) | 60–75% | Learned orthogonal projections | >90% original accuracy at 60% cache; fails at lower budgets | LLaMA2-7B-base, Mistral-7B-v0.3-base |
| LoRC (Zhang et al., 2024) | 55–60% | SVD, progressive ranking | <1% performance drop (task-, model-dependent) | LLaMA2-13B, LLaMA3-8B/70B |
| CLLA (Yang et al., 2024) | 98% | Multi-axis latent quantization | Lossless, even at 2% cache | English/Chinese downstream tasks, generic |
| EchoKV (Wang et al., 24 Mar 2026) | 2×–3× | Similarity-based reconstruction | <1% drop at 2×; degrades at 3.3× | Llama3.1-8B, Mistral-7B, RULER benchmark |
| KVzap (Jegou et al., 12 Jan 2026) | 2–4× | Input-adaptive pruning | <0.3% drop, high throughput | Qwen3-8B, Llama-3.1-8B-Instruct |
| FAEDKV (Li et al., 26 Jul 2025) | >10× (frequency fraction) | IWDFT, spectral ablation | SOTA perplexity, position-agnostic retrieval | Llama3-8B, LongBench, Needle-in-a-Haystack |
| Adaptive quant. (Zhang et al., 28 Jul 2025) | 2–3×+ | Mixed-precision quantization, pruning | <0.3% VQA drop at 50% KV memory | LLaVA-1.5 Vicuna-7B/13B |

Integration and operational aspects include:

  • Most methods support plug-in compression to any pre-trained model (post-hoc, no retraining required), or require only rapid calibration/fine-tuning of small projection modules, reconstructors, or bit allocation schedules.
  • Integration with inference kernels is typically direct for latent or quantized methods; token/slice-selective approaches may demand variable-length sequence or sparse-aware kernels.
  • RoPE complicates naive low-rank projection due to variance inflation; pre-RoPE compression and latent-space scoring are essential countermeasures (Mu et al., 28 Oct 2025).
  • Quantization and pruning are often orthogonal and composable with latent or token-level FDC.

7. Limitations, Open Problems, and Outlook

While FDC enables dramatic reductions in KV cache size, latency, and energy, several technical considerations persist: naive low-rank projection interacts poorly with RoPE and must be applied pre-RoPE or in latent space; token- and slice-selective schemes may require variable-length or sparse-aware kernels; aggressive compression of shallow, error-amplifying layers risks accuracy degradation; and most methods still depend on per-model calibration, fine-tuning of small modules, or search over bitwidth and rank budgets.

The proliferation of rigorously benchmarked, application-agnostic FDC techniques has rendered high-throughput, long-context LLM and LMM inference tractable even under aggressive hardware and energy limits, marking FDC as a core enabler for the next generation of scalable, practical AI deployments.
