Sub-Tensor KV Cache Optimization
- Sub-Tensor-Based KV Cache Optimization is a collection of techniques that partition large KV caches into smaller sub-tensors using methods like MPO decomposition and token-group partitioning.
- It employs adaptive quantization, pruning, and precision allocation strategies to achieve significant memory reductions, with empirical results showing up to 8x savings while maintaining inference accuracy.
- By integrating fused dequantization kernels and task-aware selection mechanisms, these techniques improve hardware utilization, support longer context windows, and scale efficiently in large language model deployments.
Sub-Tensor-Based KV Cache Optimization is a collection of algorithmic and systems techniques for reducing the memory and bandwidth requirements of key–value (KV) caches in LLMs, by partitioning KV tensors into smaller sub-tensors with targeted quantization, pruning, or computation sharing strategies. This paradigm underlies a diverse body of research, encompassing data-free matrix decomposition schemes, adaptive quantization, token- or channel-wise precision allocation, block-wise cache sharing, task-aware pruning, and hierarchical caching for advanced retrieval-augmented generation (RAG) scenarios. Empirical evidence demonstrates that sub-tensor-based optimizations can deliver order-of-magnitude KV memory reductions (2x–8x and beyond) while maintaining or minimally impacting LLM inference accuracy, facilitating longer context windows, larger batch sizes, and improved hardware utilization.
1. Fundamentals of Sub-Tensor Partitioning and Decomposition
Sub-tensor partitioning divides the original KV cache—which in typical transformer implementations is a large tensor of shape (sequence length × attention heads × head dimension)—into smaller, more manageable blocks, slices, or subgroups. Notable partitioning paradigms include:
- Matrix Product Operator (MPO) Decomposition: DecoQuant (Liu et al., 2024) decomposes each KV matrix into two local sub-tensors: a large one carrying roughly 99% of the parameters and a small one carrying the remaining 1%. Empirically, the large sub-tensor exhibits minimal outliers, permitting aggressive quantization, while the small sub-tensor absorbs the original outliers and is stored at high precision.
- Token-Group and Channel Block Partitioning: MiniKV (Sharma et al., 2024), PackKV (Jiang et al., 30 Dec 2025), and KV Pareto (Gokhale et al., 1 Dec 2025) quantize per-block or per-group, typically with block sizes of 16–64, to localize scale and zero-point computation. This method balances quantization error and metadata overhead, aligning well with hardware vectorization.
- Sub-vector Quantization and Polar Transform: PolarQuant (Wu et al., 1 Feb 2025) partitions key vectors into 2D sub-vectors post-RoPE, quantizing each as a polar pair (radius, angle); the pairs are reconstructed via lookup tables, enabling fast query–key inner products in the memory-bound decoding path.
- Head-Partitioning and Pruning: Task-KV (He et al., 25 Jan 2025) and KVCrush (Jha et al., 24 Feb 2025) partition the cache by attention head semantics, adapting storage policies per head based on semantic distance or head-behaviour similarity metrics.
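The block-wise partitioning style used by MiniKV, PackKV, and KV Pareto can be sketched as per-block symmetric quantization. The following NumPy snippet is a minimal illustration, not code from any of the cited papers; function names and the block size are illustrative:

```python
import numpy as np

def quantize_blocks(kv: np.ndarray, block: int = 32, bits: int = 4):
    """Split a flat KV tensor into fixed-size blocks and symmetrically
    quantize each block with its own scale (illustrative sketch)."""
    qmax = 2 ** (bits - 1) - 1
    pad = (-kv.size) % block
    flat = np.pad(kv.ravel(), (0, pad))       # pad to a block multiple
    blocks = flat.reshape(-1, block)          # one row per sub-tensor
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                 # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, pad

def dequantize_blocks(q, scales, pad, shape):
    """Inverse transform: rescale each block and restore the shape."""
    flat = (q.astype(np.float32) * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)
```

Keeping the scale per block (rather than per tensor) localizes the effect of outliers, which is why small block sizes of 16–64 trade well against the per-block metadata they introduce.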
2. Sub-Tensor Quantization and Precision Allocation Schemes
Sub-tensor-based KV optimization leverages mixed and adaptive precision strategies:
- Data-Free Low-Bit Quantization: DecoQuant applies symmetric uniform quantization (e.g., 2, 4, 8 bits) to the large sub-tensor after MPO, leaving the small tensor in FP16; this yields bounded reconstruction error and calibration-free accuracy (Liu et al., 2024).
- Dynamic Channel-Wise Precision Boost: Kitty (Xia et al., 23 Nov 2025) ranks channels by activation sensitivity, retaining a small fraction (e.g., 12.5% or 25%) at 4 bits while quantizing the remainder at 2 bits per channel. The composite encoding is mapped into two unified 2-bit tensors, streamlining the Triton dequantization kernels.
- Sub-Bit Vector Quantization with Anchors: AnTKV (Li et al., 24 Jun 2025) keeps high-sensitivity tokens (identified via Anchor Score) in FP16, quantizing all other tokens to sub-bit rates (down to 0.375 bits) using loss-aware codebook clustering. This per-token adaptivity enables extreme compression ratios with only minimal perplexity loss.
- Block-Wise and Per-Window Quantization: PackKV and KV Pareto use block sizes of 32–64 as the quantization granularity; the memory footprint scales directly with bit-width, roughly b bits per element for b-bit block quantization plus per-block scale and zero-point metadata (Gokhale et al., 1 Dec 2025, Jiang et al., 30 Dec 2025).
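A channel-wise precision boost in the spirit of Kitty can be sketched as follows. The sensitivity proxy here (mean absolute activation) is an assumption for illustration—the paper uses its own activation-aware scoring—and all names are hypothetical:

```python
import numpy as np

def channel_precision_plan(keys: np.ndarray, boost_frac: float = 0.125):
    """Rank channels by a simple sensitivity proxy (mean |activation|)
    and mark the top fraction for 4-bit storage, the rest for 2-bit.
    Illustrative only; real schemes use activation-aware scores."""
    sensitivity = np.abs(keys).mean(axis=0)            # one score per channel
    n_boost = max(1, int(round(boost_frac * keys.shape[1])))
    boosted = np.argsort(sensitivity)[::-1][:n_boost]  # most sensitive channels
    bits = np.full(keys.shape[1], 2, dtype=np.int8)
    bits[boosted] = 4
    return bits, boosted

# At boost_frac = 0.125 the average cost is 0.125 * 4 + 0.875 * 2 = 2.25 bits/channel.
```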
3. Sub-Tensor Selection, Pruning, and Structure-Aware Retention
Pruning and selection strategies further reduce memory load by retaining only essential tokens, subgroups, or heads:
- Representative Token and Bucket Pruning: KVCrush groups tokens by head-behaviour, computing binary similarity metrics (e.g., Hamming distance) and selecting representative tokens from contiguous buckets, yielding up to 4x memory reduction with 1% accuracy degradation (Jha et al., 24 Feb 2025).
- Task-Adaptive Semantic Window Selection: WindowKV (Zuo et al., 23 Mar 2025) forms contiguous semantic windows, scored for importance via task classifiers that balance localization versus aggregation. Budgets are assigned via arithmetic progression ("pyramid allocation"), with intra-group index sharing minimizing computational overhead.
- Composite Token Structures: KVCompose (Akulov et al., 5 Sep 2025) independently selects importance-based tokens per head and layer, then realigns them into uniform-length composite tokens, preserving standard tensor layout and compatibility with existing inference pipelines.
- Differential Head-Level Budgeting: Task-KV dynamically distinguishes heterogeneous from non-heterogeneous attention heads using semantic center distance, allocating full KV budget to salient heads and a reduced set (recent, sink, middle activations) to others, balancing memory savings and semantic coverage (He et al., 25 Jan 2025).
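The representative-token selection described for KVCrush can be sketched with binary head-behaviour signatures and Hamming distance. This is a hypothetical simplification of the paper's selection rule, assuming signatures have already been computed per token:

```python
import numpy as np

def bucket_representatives(signatures: np.ndarray, bucket: int = 8):
    """Pick one representative token per contiguous bucket: the token
    whose binary head-behaviour signature has the smallest Hamming
    distance to the bucket's majority-vote signature."""
    keep = []
    for start in range(0, len(signatures), bucket):
        grp = signatures[start:start + bucket]
        majority = (grp.mean(axis=0) >= 0.5).astype(grp.dtype)
        dists = (grp != majority).sum(axis=1)      # Hamming distance
        keep.append(start + int(np.argmin(dists)))
    return keep
```

Retaining one token per bucket of 8 would give roughly the 4x–8x reduction range reported above, before any additional quantization.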
4. Fused Decompression and Hardware-Conscious Kernel Design
Efficient kernel implementations are integral to realizing theoretical memory savings in practical hardware contexts:
- Fused Dequantization and GEMM Kernels: DecoQuant’s custom GPU kernel unifies unpacking, scaling, multiplication, and accumulation, minimizing data-movement passes and harnessing warp-level parallelism (Liu et al., 2024). PackKV implements one-pass block-wise decompression and attention for maximum throughput.
- Tensor Core-Oriented Layouts: BitDecoding (Du et al., 24 Mar 2025) aligns packed low-bit blocks with Tensor Core tiling, enabling ldmatrix and wgmma dispatch without reshuffling, yielding up to 8.9x kernel-level speedup on H100.
- Triton-Compatible Page Dequantization: Kitty separates low- and high-bit packed tensors, utilizing index indirection for boosted channels, ensuring uniform on-chip dequantization and exploiting hardware memory coalescing (Xia et al., 23 Nov 2025). MiniKV similarly fuses dequantization and matmul to reduce memory traffic (Sharma et al., 2024).
- Attention Kernel Instrumentation: AnTKV integrates Anchor Score computation into FlashAttention, with the Triton kernel evaluating per-token sensitivity in parallel and enabling online selection and adaptation (Li et al., 24 Jun 2025).
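The fusion idea behind these kernels—dequantize a tile and consume it immediately instead of materializing the full floating-point key matrix—can be shown with a NumPy stand-in. This is a CPU-side sketch of the data flow only, not the GPU kernels themselves:

```python
import numpy as np

def fused_dequant_matmul(q_keys, scales, query, block=32):
    """Compute per-token scores query . K without materializing the
    full dequantized key matrix: each quantized tile is rescaled into
    a small scratch buffer and consumed immediately, mimicking what a
    fused GPU kernel does per warp/tile."""
    n_tokens = q_keys.shape[0]
    out = np.zeros(n_tokens, dtype=np.float32)
    for start in range(0, n_tokens, block):
        tile = q_keys[start:start + block].astype(np.float32)
        tile *= scales[start:start + block]        # dequantize the tile
        out[start:start + block] = tile @ query    # consume it immediately
    return out
```

On a GPU the scratch tile lives in registers or shared memory, so the quantized cache is the only data ever read from HBM—this is where the bandwidth savings come from.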
5. Hierarchical Caching, Cache Sharing, and Advanced RAG
Sub-tensor optimization extends beyond quantization and pruning to shared and hierarchical KV caching:
- Block-Level Cache Fusion: Joint Encoding (Fast-Fusion) (Kampeas et al., 6 Jan 2026) merges similar blocks across multiple requests/chunks using cosine similarity trees, enabling shared cache representations with up to 4.38x compression and minimal distortion—all implemented via block-table pointer indirection.
- Subgraph-Level RAG Optimization: SubGCache (2505.10951) clusters graph-retrieved subgraphs into representative sub-tensors, precomputes and reuses their KV caches across clustered queries, achieving up to 6.68x speedup in time-to-first-token without accuracy loss.
- Two-Stage Permanent and Dynamic Eviction: RocketKV (Behnam et al., 19 Feb 2025) combines permanent coarse-grain token pruning via SnapKV++ with dynamic sparse head/sequence attention approximations, reaching compression ratios up to 400x and peak memory reductions of 30% at 1% accuracy drop.
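Block-level cache fusion via similarity matching, as in Fast-Fusion, can be sketched as a greedy deduplication over a pointer table. The threshold and greedy policy here are illustrative assumptions; the paper organizes candidates in cosine-similarity trees rather than a linear scan:

```python
import numpy as np

def share_blocks(blocks, threshold=0.98):
    """Greedy block-level cache fusion: map each new KV block to an
    earlier stored block if their cosine similarity exceeds a
    threshold, otherwise store it fresh. Returns a pointer table,
    mimicking block-table indirection (illustrative sketch)."""
    stored, table = [], []
    for b in blocks:
        v = b.ravel()
        v = v / (np.linalg.norm(v) + 1e-8)         # unit-normalize
        for i, s in enumerate(stored):
            if float(v @ s) >= threshold:           # near-duplicate found
                table.append(i)
                break
        else:
            stored.append(v)
            table.append(len(stored) - 1)
    return table, len(stored)
```

Because sharing happens purely through the pointer table, attention kernels that already consume paged block tables (e.g., vLLM-style engines) need no modification.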
6. Empirical Impact, Trade-Offs, and Integration Guidelines
Sub-tensor-based KV optimization empirically demonstrates:
- Compression Ratios: 4x–8x reductions for task- and structure-aware schemes, up to 86% for MiniKV (Sharma et al., 2024), and up to 400x in hybrid settings (Behnam et al., 19 Feb 2025).
- Accuracy Retention: 1–3% drop for aggressive quantization (2–4 bits), minimal degradation for joint block fusion, and as little as a 0.01-point drop for composite/window retention at 12% of the original memory (Akulov et al., 5 Sep 2025, Zuo et al., 23 Mar 2025).
- Throughput Gains: Up to 3.5x–7.5x kernel speedup on A100/H100 (BitDecoding), up to 8x larger batches (Kitty), and 40% serving throughput improvement in high-concurrency settings (Fast-Fusion).
- Integration: The majority of approaches retain compatibility with dense tensor formats, block tables (vLLM, Triton), and do not introduce extra inference-time kernel launches, facilitating drop-in adoption for modern inference engines.
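The memory arithmetic behind these savings is straightforward. A back-of-envelope calculator (the model shape below is a hypothetical Llama-2-7B-like configuration, and per-block metadata is ignored for simplicity):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, bits, batch=1):
    """Total KV cache size: two tensors (K and V) per layer,
    ignoring quantization scale/zero-point metadata."""
    return 2 * layers * heads * head_dim * seq_len * batch * bits // 8

# Hypothetical 32-layer, 32-head, 128-dim model at 4k context:
fp16 = kv_cache_bytes(32, 32, 128, 4096, bits=16)  # 2,147,483,648 B ~ 2 GiB
int2 = kv_cache_bytes(32, 32, 128, 4096, bits=2)   # ~ 0.27 GB, an 8x saving
```

The same arithmetic shows why block metadata matters at low bit-widths: a per-32-element FP16 scale adds 0.5 bits/element, a 25% overhead on a 2-bit cache but only 3% on FP16.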
7. Current Limitations and Design Constraints
While sub-tensor-based optimization provides substantial gains, it imposes notable constraints:
- Calibration and Sensitivity: Quantization scales must be tuned per model/workload; outlier migration, mixed-precision allocation, and dynamic anchor retention are essential for avoiding collapse at ultra-low bitwidths (Liu et al., 2024, Li et al., 24 Jun 2025, Xia et al., 23 Nov 2025).
- Hardware Requirements: Many schemes require advanced CUDA/PTX primitives (ldmatrix, wgmma) and vectorized kernels for peak performance (Du et al., 24 Mar 2025). Portability across GPU families may be limited.
- Task-Dependence: Empirical gains vary depending on attention structure, prompt locality, and semantic aggregation, highlighting the necessity for adaptive, task-aware indices (Task-KV, WindowKV, SubGCache).
- Metadata Overhead: Meta-information (scales, indices, pointer tables) scales linearly with blocks/sub-tensors; optimal blocksize must balance accuracy and metadata footprint (Gokhale et al., 1 Dec 2025).
- Padding and Alignment: Uniform sub-tensor sizes are required for engine compatibility, sometimes requiring padding of smaller layers/tensors (Akulov et al., 5 Sep 2025).
References
- DecoQuant: Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression (Liu et al., 2024)
- SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache (2505.10951)
- PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression (Jiang et al., 30 Dec 2025)
- PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration (Wu et al., 1 Feb 2025)
- KVCrush: Key Value Cache Size-Reduction Using Similarity in Head-Behaviour (Jha et al., 24 Feb 2025)
- Joint Encoding of KV-Cache Blocks for Scalable LLM Serving (Kampeas et al., 6 Jan 2026)
- BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache (Du et al., 24 Mar 2025)
- Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads (He et al., 25 Jan 2025)
- KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference (Gokhale et al., 1 Dec 2025)
- MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache (Sharma et al., 2024)
- KVCompose: Efficient Structured KV Cache Compression with Composite Tokens (Akulov et al., 5 Sep 2025)
- WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference (Zuo et al., 23 Mar 2025)
- AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in LLMs (Li et al., 24 Jun 2025)
- Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost (Xia et al., 23 Nov 2025)
- RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression (Behnam et al., 19 Feb 2025)