Joint Encoding of KV-Cache Blocks
- The paper demonstrates that fusing similar key-value blocks via clustering and quantization can achieve up to 4.4× cache compression with minimal accuracy loss.
- Joint encoding is defined as a method to reduce redundancy by sharing low-dimensional representations across blocks, resulting in up to an 84.8% throughput gain.
- System-level integration of joint encoding enables drop-in enhancements for transformer inference, maintaining per-token attention distributions while optimizing memory and latency.
Joint encoding of KV-cache blocks denotes a class of memory and throughput optimization strategies for Transformer inference that fuse, compress, or otherwise share the representation of key-value (KV) cache data across temporally or contextually related segments. Standard per-token, per-head caching is highly redundant, especially under long-context or high-concurrency serving, motivating research into block-level redundancies, collaborative reuse, and low-dimensional surrogate representations. Methods for joint encoding span lossless sharing via clustering, block fusion by quantized or low-rank approximations, codebook-driven quantization, Huffman/entropy models, and cache-aware system placement. These approaches aim to maximize memory reduction and throughput while strictly controlling impact on per-token attention distributions and task accuracy.
1. Formalization of Joint KV-Cache Encoding
Let a transformer’s per-layer KV cache at decoding step t and layer ℓ consist of keys K_{1:t}^{(ℓ)} and values V_{1:t}^{(ℓ)}. A joint encoding scheme replaces the default store-all regime with a policy that, for groups of tokens (e.g., blocks of B consecutive tokens), encodes them by a (generally many-to-one) mapping into a lower-cardinality or lower-rank set of blocks, often fusing highly similar blocks into a canonical representative or a lower-dimensional subspace.
The core primitives in joint encoding include:
- Block Partitioning: Splitting a KV-cache into contiguous, disjoint blocks of shape B × d_h each (B tokens by head dimension d_h).
- Similarity Measurement: For blocks b_i, b_j (flattened as vectors), defining a similarity s(b_i, b_j), commonly via cosine similarity ⟨b_i, b_j⟩ / (‖b_i‖ ‖b_j‖), and forming similarity graphs or clusters.
- Fusion/Compression: Representing a set of similar blocks by a common unit vector direction u, storing only per-block scaling norms ‖b_i‖ or coefficients (Kampeas et al., 6 Jan 2026), or, more generally, projecting to a shared low-rank basis or quantization codebook (Zhou et al., 3 Mar 2025, Li et al., 23 Jun 2025).
- Mapping Table: All KV-cache index pointers are redirected to the fused or compressed representations for subsequent attention.
This approach preserves native cache layouts (no disruption to block-wise memory), is query- and request-agnostic, and readily integrates with paged-attention systems for high concurrency (Kampeas et al., 6 Jan 2026).
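As a concrete illustration of these primitives, the sketch below (plain Python; the function names and the greedy first-fit fusion strategy are illustrative assumptions, not the cited papers' algorithms) partitions a cache into blocks, normalizes each block to a unit direction, and fuses blocks whose cosine similarity exceeds a threshold, retaining only per-block norms and a mapping table:

```python
import math

def partition(cache, block_len):
    """Split a per-head cache (list of token vectors) into contiguous blocks."""
    return [cache[i:i + block_len] for i in range(0, len(cache), block_len)]

def flatten(block):
    """Flatten a block of token vectors into a single feature vector."""
    return [x for vec in block for x in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse(blocks, tau):
    """Greedy first-fit fusion: map each block to the first stored unit
    direction with cosine similarity >= tau; keep per-block norms so
    magnitudes can be restored at attention time."""
    reps, mapping, norms = [], [], []
    for block in blocks:
        v = flatten(block)
        norm = math.sqrt(sum(x * x for x in v))
        norms.append(norm)
        unit = [x / norm for x in v]
        for j, rep in enumerate(reps):
            if cosine(unit, rep) >= tau:
                mapping.append(j)   # reuse an existing representative
                break
        else:
            reps.append(unit)       # new canonical representative
            mapping.append(len(reps) - 1)
    return reps, mapping, norms
```

Two blocks pointing in nearly the same direction collapse onto one representative, and attention later rescales the shared direction by each block's stored norm.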
2. Block Similarity, Clustering, and Collaborative Filtering
A central mechanism in many joint encoding approaches is the detection of similar or redundant blocks to maximize sharing (Chen et al., 29 Jul 2025, Kampeas et al., 6 Jan 2026). The process involves:
- Computing similarity metrics (typically cosine similarity) between all candidate blocks (potentially across concurrent requests).
- Applying clustering, such as k-means or threshold-based linkage, to form groups where within-group similarity exceeds a threshold τ.
- In some systems, a two-stage filter is applied: lexical or token-level bag-of-words histograms first produce a short-list, and high-precision block-level feature comparisons then confirm candidates (Chen et al., 29 Jul 2025).
Theoretical analysis, e.g., via a Poisson process model, allows for precise control of the tradeoff between the number of shared blocks (rate) and the distortion in resulting attention distributions (distortion). For a given similarity threshold τ, the per-token logit perturbation is tightly bounded, e.g., |⟨q, k⟩ − ⟨q, k̂⟩| ≤ ‖q‖ ‖k‖ √(2(1 − τ)) when a key k is replaced by a surrogate k̂ with cos(k, k̂) ≥ τ, which directly controls the change in the output softmax (Kampeas et al., 6 Jan 2026).
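The distortion control can be made concrete: if a key k is replaced by a surrogate k̂ = ‖k‖·u, where u is a shared unit direction with cos(k, u) ≥ τ, then by Cauchy–Schwarz the logit error satisfies |⟨q, k⟩ − ⟨q, k̂⟩| ≤ ‖q‖ ‖k‖ √(2(1 − τ)). A minimal numerical check (values chosen purely for illustration):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# Query q, key k, and a shared unit direction u; the fused surrogate
# stores only u plus the scalar norm of k.
q, k, u = [1.0, 2.0], [3.0, 4.0], [0.8, 0.6]
k_hat = [norm(k) * x for x in u]          # surrogate key = ||k|| * u

tau_actual = dot(k, u) / norm(k)          # realized cosine similarity
logit_err = abs(dot(q, k) - dot(q, k_hat))
bound = norm(q) * norm(k) * math.sqrt(2 * (1 - tau_actual))
```

Here the realized similarity is 0.96, the logit error is 1.0, and the bound evaluates to about 3.16, so the guarantee holds with room to spare.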
Empirically, batch-wise fast fusion realizes up to 4.4× cache compression with negligible accuracy loss on models up to Qwen-2.5-72B (Kampeas et al., 6 Jan 2026). MemShare demonstrated throughput gains correlated with the inverse memory ratio via such collaborative filtering and block deduplication (Chen et al., 29 Jul 2025).
3. Algorithms for Block Fusion, Mapping, and Reuse
Joint encoding is realized through data-driven block fusion at the cache management and system level without custom hardware. Key algorithmic stages include:
- Block Partition and Normalization: Arrange the blocks across requests/chunks, normalize each block for directionality, and retain per-block norms.
- Recursive or Batched Similarity Matching: Construct matrices of pairwise cosine similarities, recursively partition block sets for scalable fusion. Upon detection of sufficient similarity, fuse blocks and update mapping tables.
- Zero-Copy Reuse: All requests indexed to the fused block use a single in-memory copy, i.e., pointers from logical blocks map to physical storage of shared representatives (Chen et al., 29 Jul 2025).
- Attention Computation: At decode time, attention dot products for each query are computed with the shared direction, scaling outputs by the original block norm to maintain per-block magnitude (Kampeas et al., 6 Jan 2026).
Pseudocode for these procedures defines efficient mechanisms for maintaining reference chains, candidate search, and resource pooling, crucial for practical deployment (Chen et al., 29 Jul 2025).
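A hypothetical minimal version of the mapping-table indirection (names are illustrative; a real allocator would also reclaim orphaned physical blocks) looks like:

```python
class BlockStore:
    """Minimal sketch of zero-copy reuse: logical block IDs from any
    request resolve through a mapping table to shared physical blocks."""

    def __init__(self):
        self.physical = []   # physical block storage
        self.table = {}      # logical block id -> physical index

    def insert(self, logical_id, block):
        self.physical.append(block)
        self.table[logical_id] = len(self.physical) - 1

    def fuse(self, logical_id, target_logical_id):
        # Redirect a logical block onto another block's physical storage;
        # both logical IDs now read from a single in-memory copy.
        self.table[logical_id] = self.table[target_logical_id]

    def lookup(self, logical_id):
        return self.physical[self.table[logical_id]]

    def unique_blocks(self):
        return len(set(self.table.values()))
```

After fusing, every request indexed to the shared representative dereferences the same physical block, which is what makes the reuse zero-copy.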
4. Memory, Throughput, and Rate-Distortion Analysis
Joint encoding fundamentally alters the memory and bandwidth profile of LLM inference:
- Pre-encoding Memory: M_pre = 2 · L · t · n_h · d_h · b bytes for L layers, t cached tokens, n_h heads, head dimension d_h, and b bytes per element.
- Post-encoding Memory: M_post ≈ (N_u / N) · M_pre + N · c, where the N original blocks are fused into N_u unique representatives and c bytes of per-block metadata (norms, pointers) are retained.
- Compression Ratio: ρ = M_pre / M_post.
- Throughput Scaling: Decoding speed scales inversely with memory traffic, so a reduction to M_pre / ρ memory yields an approximately ρ× throughput gain (Chen et al., 29 Jul 2025).
- End-to-End Serving Gains: On vLLM, batch fusion brings up to an 84.8% token throughput gain, with corresponding improvements in time-to-first-token and inter-token latency (Kampeas et al., 6 Jan 2026).
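These quantities reduce to simple arithmetic. The sketch below uses standard KV-cache accounting (keys and values both cached, hence a factor of 2) with illustrative model and fusion statistics, not figures from the cited papers:

```python
def kv_cache_bytes(layers, tokens, heads, head_dim, bytes_per_elem):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * tokens * heads * head_dim * bytes_per_elem

# Pre-encoding memory: illustrative 32-layer model, 16k cached tokens,
# 32 heads of dimension 128, fp16 storage (2 bytes/element) -> 8 GiB.
m_pre = kv_cache_bytes(32, 16384, 32, 128, 2)

# Post-encoding: suppose the cache holds 8192 blocks fused down to 2048
# unique representatives, plus a 2-byte norm kept per original block.
n_blocks, n_unique = 8192, 2048
block_bytes = m_pre // n_blocks
m_post = n_unique * block_bytes + n_blocks * 2

rho = m_pre / m_post   # compression ratio, just under 4x here
```

Under the memory-bandwidth-bound decoding assumption, this ρ of roughly 4 would translate into a roughly 4× decode throughput gain.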
Theoretical analysis via Poisson arrivals of high-similarity blocks and kernel density estimation enables precise rate-distortion optimization, guiding the fusion threshold τ to maximize compression under a strict attention-distortion cap (Kampeas et al., 6 Jan 2026).
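One way to operationalize such a cap (a sketch, not the paper's estimator) uses the Cauchy–Schwarz-style bound ‖q‖ ‖k‖ √(2(1 − τ)) on logit distortion: invert the bound for the smallest admissible threshold, then read the achievable fusion rate off an empirical sample of pairwise similarities:

```python
import math

def min_threshold(eps, q_norm, k_norm):
    """Smallest tau with q_norm * k_norm * sqrt(2 * (1 - tau)) <= eps."""
    return 1.0 - (eps / (q_norm * k_norm)) ** 2 / 2.0

def fusion_rate(similarities, tau):
    """Fraction of candidate block pairs eligible for fusion at threshold tau."""
    return sum(s >= tau for s in similarities) / len(similarities)

# A distortion cap eps = 0.1 on unit-norm queries/keys forces tau = 0.995;
# the empirical similarity sample then determines how much fusion that permits.
tau = min_threshold(eps=0.1, q_norm=1.0, k_norm=1.0)
rate = fusion_rate([0.999, 0.996, 0.99, 0.8, 0.97], tau)
```

Tightening eps raises τ and monotonically shrinks the eligible fusion rate, which is the rate-distortion knob described above.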
5. Comparison to and Integration with Prior Compression Frameworks
Joint encoding complements or subsumes structured quantization, token eviction, entropy coding, and cross-layer sharing paradigms:
- Beyond Prefix Sharing: Unlike deterministic prefix-sharing (which fails with even minor input divergence), joint encoding fuses blocks across diverse requests or input chunks solely based on content similarity (Kampeas et al., 6 Jan 2026).
- Integration with Quantization: Joint encoding can be layered with vector quantization or low-rank dictionary models, leading to further memory reduction (combinatorial rate-distortion frameworks are an open direction) (Zhou et al., 3 Mar 2025).
- Compatibility: No custom kernels or disruption of cache layouts are required; the method operates transparently atop existing paged-attention or serving schedulers (Kampeas et al., 6 Jan 2026).
- System-Level Placement: Methods such as EvicPress use unified utility functions to co-optimize compression with cache tier placement in the presence of variable access frequency, device hierarchy, and end-to-end latency constraints (Feng et al., 16 Dec 2025).
In effect, cluster-based block fusion as in MemShare is the degenerate limit of joint encoding (each cluster reduces to a singleton basis); extending to more expressive factorized or codebook-equipped models constitutes a major avenue for further gains (Chen et al., 29 Jul 2025).
6. Extensions and Future Directions in Joint KV-Block Encoding
The fundamental concept of joint encoding generalizes naturally in several directions:
- Low-Rank and Dictionary Models: Learning shared low-rank or overcomplete dictionaries across all blocks can capture both intra-request and inter-request redundancy, extending beyond hard clustering (Zhou et al., 3 Mar 2025, Chen et al., 29 Jul 2025).
- Hybrid Blockwise–Elementwise Quantization: Combining similarity-based fusion with quantization of the remaining basis or representative blocks offers promising compounded compression.
- Adaptive Fusion Thresholds: Employing online calibration policies to adjust similarity thresholds ensures that compression is tuned dynamically to batch heterogeneity and workload diversity (Kampeas et al., 6 Jan 2026).
- Plug-and-Play Deployment: Joint encoding is a drop-in enhancement for all Paged-Attention–style LLM serving frameworks, with all required structures (block pointer tables, norm vectors) fitting in existing memory models (Kampeas et al., 6 Jan 2026).
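As a sketch of the hybrid blockwise-fusion-plus-quantization direction, the following applies symmetric int8 quantization to a fused representative block; the scheme and names are illustrative assumptions, not a method from the cited works:

```python
def quantize_int8(block):
    """Symmetric per-block int8 quantization of a fused representative."""
    scale = max(abs(x) for x in block) / 127.0 or 1.0   # guard all-zero blocks
    return [round(x / scale) for x in block], scale

def dequantize(qblock, scale):
    return [q * scale for q in qblock]

rep = [0.5, -1.27, 0.0, 1.27]        # illustrative representative block
qblock, scale = quantize_int8(rep)   # 1 byte per element instead of 2-4
approx = dequantize(qblock, scale)   # per-element error bounded by scale/2
```

Because only the N_u representatives are quantized while per-block norms stay in full precision, the quantization error compounds with, but does not multiply, the fusion distortion.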
Open challenges include maximizing fusion in highly heterogeneous batches, refining distortion metrics to enable more aggressive sharing, and formalizing joint encoding schemes that incorporate quantization, block pruning, and dictionary learning within a unified rate-distortion-theoretic framework.