
Sub-Tensor KV Cache Optimization

Updated 30 January 2026
  • Sub-Tensor-Based KV Cache Optimization is a collection of techniques that partition large KV caches into smaller sub-tensors using methods like MPO decomposition and token-group partitioning.
  • It employs adaptive quantization, pruning, and precision allocation strategies to achieve significant memory reductions, with empirical results showing up to 8x savings while maintaining inference accuracy.
  • By integrating fused dequantization kernels and task-aware selection mechanisms, these techniques improve hardware utilization, support longer context windows, and scale efficiently in large language model deployments.

Sub-Tensor-Based KV Cache Optimization is a collection of algorithmic and systems techniques for reducing the memory and bandwidth requirements of key–value (KV) caches in LLMs, by partitioning KV tensors into smaller sub-tensors with targeted quantization, pruning, or computation sharing strategies. This paradigm underlies a diverse body of research, encompassing data-free matrix decomposition schemes, adaptive quantization, token- or channel-wise precision allocation, block-wise cache sharing, task-aware pruning, and hierarchical caching for advanced retrieval-augmented generation (RAG) scenarios. Empirical evidence demonstrates that sub-tensor-based optimizations can deliver order-of-magnitude KV memory reductions (2x–8x and beyond) while maintaining or minimally impacting LLM inference accuracy, facilitating longer context windows, larger batch sizes, and improved hardware utilization.

1. Fundamentals of Sub-Tensor Partitioning and Decomposition

Sub-tensor partitioning divides the original KV cache, which in typical transformer implementations is a large tensor in $\mathbb{R}^{L \times H \times D}$ (sequence length × attention heads × head dimension), into smaller, more manageable blocks, slices, or subgroups. Notable partitioning paradigms include:

  • Matrix Product Operator (MPO) Decomposition: DecoQuant (Liu et al., 2024) decomposes each KV matrix $W \in \mathbb{R}^{I \times J}$ into two local sub-tensors: $T_L$ ("large", ~99% of parameters) and $T_S$ ("small", ~1%). Empirically, $T_L$ exhibits minimal outliers, permitting aggressive quantization, while $T_S$ absorbs the original outliers and is stored at high precision.
  • Token-Group and Channel Block Partitioning: MiniKV (Sharma et al., 2024), PackKV (Jiang et al., 30 Dec 2025), and KV Pareto (Gokhale et al., 1 Dec 2025) quantize per-block or per-group, typically with block sizes of 16–64, to localize scale and zero-point computation. This method balances quantization error and metadata overhead, aligning well with hardware vectorization.
  • Sub-vector Quantization and Polar Transform: PolarQuant (Wu et al., 1 Feb 2025) partitions key vectors into 2D sub-tensors post-RoPE, quantizing each as a radius–angle pair $(r_{t,j}, \theta_{t,j})$, which is then reconstructed via lookup tables to enable memory-bound fast query–key inner products.
  • Head-Partitioning and Pruning: Task-KV (He et al., 25 Jan 2025) and KVCrush (Jha et al., 24 Feb 2025) partition the cache by attention head semantics, adapting storage policies per head based on semantic distance or head-behaviour similarity metrics.
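The token-group partitioning used by the block-wise schemes above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from any of the cited papers; the function name and padding policy are assumptions.

```python
import numpy as np

def partition_kv(kv: np.ndarray, block_size: int) -> np.ndarray:
    """Partition a KV cache of shape (L, H, D) into token-group blocks
    of `block_size` tokens along the sequence axis; each resulting block
    can then carry its own quantization scale and zero-point."""
    L, H, D = kv.shape
    pad = (-L) % block_size  # pad so L divides evenly into blocks
    if pad:
        kv = np.concatenate([kv, np.zeros((pad, H, D), kv.dtype)], axis=0)
    # -> (num_blocks, block_size, H, D)
    return kv.reshape(-1, block_size, H, D)

# 100 tokens, 8 heads, head dim 64, block size 32 -> padded to 4 blocks
blocks = partition_kv(np.random.randn(100, 8, 64).astype(np.float32), 32)
```

The padding step mirrors the alignment constraint noted in Section 7: uniform block sizes keep the layout compatible with vectorized kernels at the cost of a little wasted space on the final block.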

2. Sub-Tensor Quantization and Precision Allocation Schemes

Sub-tensor-based KV optimization leverages mixed and adaptive precision strategies:

  • Data-Free Low-Bit Quantization: DecoQuant applies symmetric uniform quantization (e.g., 2, 4, 8 bits) to the large sub-tensor after MPO, leaving the small tensor in FP16; this yields bounded reconstruction error and calibration-free accuracy (Liu et al., 2024).
  • Dynamic Channel-Wise Precision Boost: Kitty (Xia et al., 23 Nov 2025) ranks channels by activation sensitivity, retaining a small fraction (e.g., 12.5%, 25%) at 4 bits while quantizing the remainder at 2 bits per channel. The composite encoding is mapped into two unified 2-bit tensors, streamlining Triton dequantization kernels.
  • Sub-Bit Vector Quantization with Anchors: AnTKV (Li et al., 24 Jun 2025) keeps high-sensitivity tokens (identified via Anchor Score) in FP16, quantizing all other tokens to sub-bit rates (down to 0.375 bits) using loss-aware codebook clustering. This per-token adaptivity enables extreme compression ratios with only minimal perplexity loss.
  • Block-Wise and Per-Window Quantization: PackKV and KV Pareto use block sizes of 32–64 for quantization granularity, with memory equations reflecting direct scaling: a $b/16$ memory footprint for $b$-bit block quantization relative to FP16 (Gokhale et al., 1 Dec 2025, Jiang et al., 30 Dec 2025).
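The common core of these schemes, per-block symmetric uniform quantization with one floating-point scale per block, can be sketched as follows. This is a minimal illustration under assumed conventions (signed int8 storage, FP16 scales); it is not the exact recipe of any single cited method.

```python
import numpy as np

def quantize_block(x: np.ndarray, bits: int):
    """Symmetric uniform quantization of one block with a single scale."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit
    scale = max(float(np.abs(x).max()) / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, np.float16(scale)                  # scale stored as metadata

def dequantize_block(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float32) * np.float32(scale)

block = np.random.randn(32, 64).astype(np.float32)  # one 32-token block
q, s = quantize_block(block, bits=4)
err = float(np.abs(dequantize_block(q, s) - block).max())
```

Because the scale is computed per block, an outlier only inflates the quantization step within its own block, which is the locality argument made above for block sizes of 16–64.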

3. Sub-Tensor Selection, Pruning, and Structure-Aware Retention

Pruning and selection strategies further reduce memory load by retaining only essential tokens, subgroups, or heads:

  • Representative Token and Bucket Pruning: KVCrush groups tokens by head-behaviour, computing binary similarity metrics (e.g., Hamming distance) and selecting representative tokens from contiguous buckets, yielding up to 4x memory reduction with ≤1% accuracy degradation (Jha et al., 24 Feb 2025).
  • Task-Adaptive Semantic Window Selection: WindowKV (Zuo et al., 23 Mar 2025) forms contiguous semantic windows, scored for importance via task classifiers that balance localization versus aggregation. Budgets are assigned via arithmetic progression ("pyramid allocation"), with intra-group index sharing minimizing computational overhead.
  • Composite Token Structures: KVCompose (Akulov et al., 5 Sep 2025) independently selects importance-based tokens per head and layer, then realigns them into uniform-length composite tokens, preserving standard tensor layout and compatibility with existing inference pipelines.
  • Differential Head-Level Budgeting: Task-KV dynamically distinguishes heterogeneous from non-heterogeneous attention heads using semantic center distance, allocating full KV budget to salient heads and a reduced set (recent, sink, middle activations) to others, balancing memory savings and semantic coverage (He et al., 25 Jan 2025).
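The Hamming-distance bucketing idea behind representative-token pruning can be sketched as below. This is a hedged approximation of the KVCrush-style flow; the anchor construction (per-head majority vote) and bucket assignment here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def pick_representatives(signatures: np.ndarray, num_buckets: int):
    """signatures: (T, H) binary matrix, one head-behaviour bit-vector
    per token. Tokens are ordered by Hamming distance to an anchor,
    split into contiguous buckets, and one representative kept each."""
    anchor = (signatures.mean(axis=0) >= 0.5).astype(np.int8)  # majority vote
    hamming = (signatures != anchor).sum(axis=1)               # distance per token
    order = np.argsort(hamming, kind="stable")                 # nearest first
    buckets = np.array_split(order, num_buckets)
    return sorted(int(b[0]) for b in buckets if b.size)        # one per bucket

sigs = (np.random.rand(64, 16) > 0.5).astype(np.int8)  # 64 tokens, 16 heads
kept = pick_representatives(sigs, num_buckets=8)       # keep 8 of 64 tokens
```

Keeping one token per bucket (rather than simply the top-k closest) preserves coverage of distinct behaviour clusters, which is the stated rationale for bucketing over plain ranking.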

4. Fused Decompression and Hardware-Conscious Kernel Design

Efficient kernel implementations are integral to realizing theoretical memory savings in practical hardware contexts:

  • Fused Dequantization and GEMM Kernels: DecoQuant’s custom GPU kernel unifies unpacking, scaling, multiplication, and accumulation, minimizing data-movement passes and harnessing warp-level parallelism (Liu et al., 2024). PackKV implements one-pass block-wise decompression and attention for maximum throughput.
  • Tensor Core-Oriented Layouts: BitDecoding (Du et al., 24 Mar 2025) aligns packed low-bit blocks with Tensor Core tiling, enabling ldmatrix and wgmma dispatch without reshuffling, yielding up to 8.9x kernel-level speedup on H100.
  • Triton-Compatible Page Dequantization: Kitty separates low- and high-bit packed tensors, using index indirection for boosted channels to ensure uniform on-chip dequantization and exploit hardware coalescing (Xia et al., 23 Nov 2025). MiniKV similarly fuses dequantization and matmul to reduce memory traffic (Sharma et al., 2024).
  • Attention Kernel Instrumentation: AnTKV integrates Anchor Score computation into FlashAttention, with the Triton kernel evaluating per-token sensitivity in parallel and enabling online selection and adaptation (Li et al., 24 Jun 2025).
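The structural idea shared by these fused kernels, dequantizing each block transiently while computing attention scores instead of materializing a full FP cache, can be mimicked in NumPy. This is only a functional sketch of the dataflow; the real kernels perform the same per-block dequantize-and-multiply on-chip in a single pass.

```python
import numpy as np

def scores_streaming(q_vec, key_blocks_int8, scales):
    """Compute q . K^T block by block: each int8 key block is dequantized
    transiently (as a fused kernel would do in registers/shared memory)
    and immediately consumed by the partial matvec."""
    parts = []
    for kq, s in zip(key_blocks_int8, scales):
        k = kq.astype(np.float32) * s    # transient dequantization
        parts.append(k @ q_vec)          # partial block of attention scores
    return np.concatenate(parts)

D, B = 64, 32
q_vec = np.random.randn(D).astype(np.float32)
blocks = [np.random.randint(-8, 8, (B, D), dtype=np.int8) for _ in range(4)]
scales = [np.float32(0.1)] * 4

s_stream = scores_streaming(q_vec, blocks, scales)
# reference: materialize the full FP cache first, then one matvec
k_full = np.concatenate([b.astype(np.float32) * s
                         for b, s in zip(blocks, scales)])
s_ref = k_full @ q_vec
```

The two paths produce identical scores; the fused variant simply never allocates `k_full`, which is where the bandwidth and memory savings come from.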

5. Hierarchical Caching, Cache Sharing, and Advanced RAG

Sub-tensor optimization extends beyond quantization and pruning to shared and hierarchical KV caching:

  • Block-Level Cache Fusion: Joint Encoding (Fast-Fusion) (Kampeas et al., 6 Jan 2026) merges similar blocks across multiple requests/chunks using cosine similarity trees, enabling shared cache representations with up to 4.38x compression and minimal distortion—all implemented via block-table pointer indirection.
  • Subgraph-Level RAG Optimization: SubGCache (2505.10951) clusters graph-retrieved subgraphs into representative sub-tensors, precomputes and reuses their KV caches across clustered queries, achieving up to 6.68x speedup in time-to-first-token without accuracy loss.
  • Two-Stage Permanent and Dynamic Eviction: RocketKV (Behnam et al., 19 Feb 2025) combines permanent coarse-grain token pruning via SnapKV++ with dynamic sparse head/sequence attention approximations, reaching compression ratios up to 400x and peak memory reductions of ~30% at ≤1% accuracy drop.
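Block-level cache sharing via similarity matching can be sketched as a greedy deduplication pass. This is an illustrative simplification of the Fast-Fusion idea (a flat greedy scan rather than a similarity tree); function names and the threshold value are assumptions.

```python
import numpy as np

def dedup_blocks(blocks, threshold=0.98):
    """Greedily map each KV block to the first stored block whose
    flattened cosine similarity exceeds `threshold`; return the unique
    blocks plus a pointer table (block-table indirection)."""
    unique, table = [], []
    for b in blocks:
        v = b.ravel()
        hit = None
        for i, u in enumerate(unique):
            w = u.ravel()
            cos = float(v @ w) / (np.linalg.norm(v) * np.linalg.norm(w) + 1e-9)
            if cos >= threshold:
                hit = i
                break
        table.append(hit if hit is not None else len(unique))
        if hit is None:
            unique.append(b)
    return unique, table

a = np.random.randn(32, 64).astype(np.float32)
b = np.random.randn(32, 64).astype(np.float32)
uniq, table = dedup_blocks([a, a.copy(), b])  # duplicate of `a` is shared
```

Because requests reference blocks through the pointer table, a merged block is stored once and reused by every request whose corresponding block matched it.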

6. Empirical Impact, Trade-Offs, and Integration Guidelines

Sub-tensor-based KV optimization empirically demonstrates:

  • Compression Ratios: 4x–8x reductions for task- and structure-aware schemes, up to 86% for MiniKV (Sharma et al., 2024), and up to 400x in hybrid settings (Behnam et al., 19 Feb 2025).
  • Accuracy Retention: ≤1–3% drop for aggressive quantization (2–4 bits), <0.3 points for joint block fusion, and as little as ~0.01 drop for composite/window retention at 12% memory (Akulov et al., 5 Sep 2025, Zuo et al., 23 Mar 2025).
  • Throughput Gains: Up to 3.5x–7.5x kernel speedup on A100/H100 (BitDecoding), up to 8x larger batches (Kitty), and 40% serving throughput improvement in high-concurrency settings (Fast-Fusion).
  • Integration: The majority of approaches retain compatibility with dense tensor formats, block tables (vLLM, Triton), and do not introduce extra inference-time kernel launches, facilitating drop-in adoption for modern inference engines.
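A back-of-envelope check ties the compression figures above to the $b/16$ footprint rule from Section 2, including the per-block scale metadata discussed in Section 7. The model dimensions and the assumption of one FP16 scale per block are illustrative.

```python
def kv_bytes(num_elems: int, bits: int = 16, block: int = 0) -> float:
    """KV cache size in bytes: quantized payload plus (optionally)
    one FP16 scale of metadata per block of `block` elements."""
    payload = num_elems * bits / 8
    meta = 0.0 if block == 0 else (num_elems / block) * 2  # FP16 scale
    return payload + meta

# Hypothetical model: 4k context, 32 layers, 8 KV heads, head dim 128,
# keys and values both cached:
n = 4096 * 32 * 8 * 128 * 2
fp16 = kv_bytes(n)                   # FP16 baseline
q4 = kv_bytes(n, bits=4, block=32)   # 4-bit, block size 32
ratio = q4 / fp16                    # 4/16 payload + metadata overhead
```

The ratio works out to 0.28125: the 4/16 = 0.25 payload plus a 0.03125 metadata term, showing concretely why larger blocks shrink metadata but coarsen the quantization scale, the trade-off flagged in Section 7.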

7. Current Limitations and Design Constraints

While sub-tensor-based optimization provides substantial gains, it imposes notable constraints:

  • Calibration and Sensitivity: Quantization scales must be tuned per model/workload; outlier migration, mixed-precision allocation, and dynamic anchor retention are essential for avoiding collapse at ultra-low bitwidths (Liu et al., 2024, Li et al., 24 Jun 2025, Xia et al., 23 Nov 2025).
  • Hardware Requirements: Many schemes require advanced CUDA/PTX primitives (ldmatrix, wgmma) and vectorized kernels for peak performance (Du et al., 24 Mar 2025). Portability across GPU families may be limited.
  • Task-Dependence: Empirical gains vary depending on attention structure, prompt locality, and semantic aggregation, highlighting the necessity for adaptive, task-aware indices (Task-KV, WindowKV, SubGCache).
  • Metadata Overhead: Meta-information (scales, indices, pointer tables) scales linearly with blocks/sub-tensors; optimal blocksize must balance accuracy and metadata footprint (Gokhale et al., 1 Dec 2025).
  • Padding and Alignment: Uniform sub-tensor sizes are required for engine compatibility, sometimes requiring padding of smaller layers/tensors (Akulov et al., 5 Sep 2025).
