PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

Published 27 Apr 2026 in cs.LG, cs.CL, and cs.DC | (2604.24971v1)

Abstract: We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB -- a 97.7% reduction -- while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.


Authors (2)

Summary

The paper introduces a unified shared KV cache pool that reduces memory overhead by up to 97.7% for multi-agent LLM inference with minimal degradation.
The paper details a novel compression scheme using int8 linear quantization for keys and TurboQuant MSE for values, achieving a stable 2.91x compression ratio across configurations.
The paper demonstrates agent-count invariant performance and suggests that quantization noise may act as a regularizer, enhancing inference quality in coherent contexts.

Motivation and Problem Statement

The memory overhead of transformer-based LLM inference, dominated by the per-layer Key/Value (KV) cache, becomes critical as model scale and context length increase. Multi-agent systems, with $N$ concurrent agents processing identical contexts, conventionally require $N$ full-precision, isolated KV caches, introducing significant redundancy and constraining scaling. Previous research has focused independently on compressing KV caches per agent [liu2024kivi, hooper2024kvquant, zhang2024leankv] or on prefix-sharing strategies for multi-agent workflows [pan2025kvflow, ye2025kvcomm, kim2026lragent]. However, a unified approach accommodating both aggressive KV compression and multi-agent cache sharing was absent.

PolyKV System Architecture

PolyKV introduces a SharedKVPool abstraction that computes and compresses a single KV cache for a shared context, allowing $N$ independent agents to inject decompressed tensors directly into their own inference contexts. This write-once, read-many scheme avoids per-agent copy costs and cache contention, so the pool's memory does not grow with agent count.
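
The write-once, read-many pattern described above can be sketched in a few lines. This is only an illustration: the class layout, the `write`/`read` method names, the `Codec` tuple, and the toy rounding codec are all assumptions made here, not the authors' implementation, and plain Python lists stand in for real tensors.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]               # stand-in for a real K or V tensor
Codec = Tuple[Callable, Callable]  # (compress, decompress), hypothetical


class SharedKVPool:
    """Hypothetical write-once, read-many pool of compressed per-layer KV."""

    def __init__(self, key_codec: Codec, value_codec: Codec) -> None:
        self._key_codec = key_codec
        self._value_codec = value_codec
        self._layers: Tuple[tuple, ...] = ()
        self._written = False

    def write(self, layers: Sequence[Tuple[Vector, Vector]]) -> None:
        """Compress the shared context's (K, V) per layer; allowed only once."""
        if self._written:
            raise RuntimeError("pool is write-once")
        compress_k, compress_v = self._key_codec[0], self._value_codec[0]
        self._layers = tuple((compress_k(k), compress_v(v)) for k, v in layers)
        self._written = True

    def read(self) -> List[Tuple[Vector, Vector]]:
        """Decompress a private copy of every layer for one agent."""
        decompress_k, decompress_v = self._key_codec[1], self._value_codec[1]
        return [(decompress_k(k), decompress_v(v)) for k, v in self._layers]


# Toy lossy codec for illustration: round to one decimal, decompress by copying.
toy_codec: Codec = (lambda x: tuple(round(t, 1) for t in x), list)

pool = SharedKVPool(toy_codec, toy_codec)
pool.write([([0.12, -0.57], [1.04, 0.33]), ([0.91, 0.08], [-0.26, 0.45])])
agent_views = [pool.read() for _ in range(3)]  # three agents, one stored pool
```

Only the compressed tuples are stored; each `read` hands an agent its own decompressed copy, so one agent's later mutations cannot affect another's view.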

Compression Scheme

Keys: Compressed via int8 linear quantization (q8_0). This preserves softmax precision, minimizing attention-related degradation.
Values: Compressed using TurboQuant MSE, combining a normalized FWHT rotation with 3-bit Lloyd-Max scalar quantization whose centroids are tuned to $\mathcal{N}(0,1)$. This achieves distortion near the theoretical minimum [turboquant2026].
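
A minimal, dependency-free sketch of the two quantizers may make the asymmetry concrete. Several details here are assumptions rather than facts from the source: one scale per key block (q8_0-style, with scale equal to the block's maximum magnitude divided by 127), rescaling the rotated values by their RMS so they roughly match a unit Gaussian, and omitting the random sign flips that rotation-based schemes often apply before the Hadamard transform. The centroids are the standard 8-level Lloyd-Max reconstruction points for a unit Gaussian.

```python
import math
from typing import List, Tuple

# 8-level (3-bit) Lloyd-Max reconstruction points for a unit Gaussian.
CENTROIDS = [-2.1520, -1.3439, -0.7560, -0.2451,
             0.2451, 0.7560, 1.3439, 2.1520]


def quantize_int8(block: List[float]) -> Tuple[float, List[int]]:
    """Symmetric int8 quantization of one block with a single scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 1.0
    return scale, [round(x / scale) for x in block]


def dequantize_int8(scale: float, codes: List[int]) -> List[float]:
    return [scale * q for q in codes]


def fwht(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]


def quantize_values(x: List[float]) -> Tuple[float, List[int]]:
    """Rotate, rescale to unit RMS (assumption), snap to nearest centroid."""
    rotated = fwht(x)
    rms = math.sqrt(sum(v * v for v in rotated) / len(rotated)) or 1.0
    codes = [min(range(8), key=lambda j: abs(v / rms - CENTROIDS[j]))
             for v in rotated]
    return rms, codes


def dequantize_values(rms: float, codes: List[int]) -> List[float]:
    """Look up centroids, undo the scaling, and invert the rotation."""
    return fwht([rms * CENTROIDS[c] for c in codes])


keys = [0.5, -1.0, 0.25, 2.0]
k_scale, k_codes = quantize_int8(keys)
keys_hat = dequantize_int8(k_scale, k_codes)

values = [0.3, -1.2, 0.8, 0.1, -0.4, 1.5, -0.9, 0.2]
v_rms, v_codes = quantize_values(values)
values_hat = dequantize_values(v_rms, v_codes)
```

Because the normalized FWHT is its own inverse, the same `fwht` function serves for both rotation and reconstruction.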

Empirical results confirm a stable $2.91\times$ compression ratio across model architectures and context lengths. SharedKVPool's memory footprint is $O(1)$ in agent count.
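
One reading that reproduces the reported ratio, offered as an assumption because the source does not give this breakdown, is a 16-bit baseline for both Keys and Values with 8-bit Keys and 3-bit Values, ignoring per-block scales and other metadata:

$$
\frac{16 + 16}{8 + 3} = \frac{32}{11} \approx 2.91
$$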

DynamicCache Injection

PolyKV leverages HuggingFace DynamicCache’s layer-wise API, bypassing incremental prefill accumulation. Agents receive unique DynamicCache shells populated from the decompressed pool, enabling fully independent generation while maintaining memory economy.
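
The injection step can be sketched as a loop that fills a fresh cache shell one layer at a time. The helper name `make_agent_cache` and the `ListCache` stand-in are hypothetical and exist only so the sketch runs without dependencies; the only interface assumed is a per-layer `update(key_states, value_states, layer_idx)` method, which HuggingFace's `DynamicCache` provides.

```python
from typing import Any, Callable, List, Protocol, Sequence, Tuple


class LayerwiseCache(Protocol):
    """Anything exposing a DynamicCache-style per-layer update method."""

    def update(self, key_states: Any, value_states: Any, layer_idx: int) -> Any:
        ...


class ListCache:
    """Dependency-free stand-in, used only to exercise the sketch."""

    def __init__(self) -> None:
        self.layers: List[Tuple[Any, Any]] = []

    def update(self, key_states: Any, value_states: Any, layer_idx: int) -> Any:
        if layer_idx != len(self.layers):
            raise ValueError("layers must be appended in order")
        self.layers.append((key_states, value_states))
        return key_states, value_states


def make_agent_cache(
    pool_layers: Sequence[Tuple[Any, Any]],
    new_cache: Callable[[], LayerwiseCache],
) -> LayerwiseCache:
    """Fill a fresh cache shell layer by layer from decompressed (K, V) pairs."""
    cache = new_cache()
    for layer_idx, (keys, values) in enumerate(pool_layers):
        cache.update(keys, values, layer_idx)
    return cache


# Placeholder strings stand in for decompressed per-layer tensors.
decompressed = [("K0", "V0"), ("K1", "V1")]
agents = [make_agent_cache(decompressed, ListCache) for _ in range(3)]
```

With `transformers` installed, passing `DynamicCache` as `new_cache` and supplying tensors shaped `(batch, kv_heads, seq_len, head_dim)` should follow the same pattern, though that pairing is not exercised here.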

Experimental Evaluation

Testbed

Two LLMs—SmolLM2-1.7B-Instruct (CPU inference) and Llama-3-8B-Instruct (GQA, bfloat16 KV cache, 32 layers)—were evaluated across three context lengths (600, 1,851, and 7,194 tokens) and up to 15 concurrent agents. Assessed metrics include perplexity (PPL), BERTScore F1 (semantic similarity), token overlap, KV cache memory, and compression ratio.
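
For reference, the PPL figures quoted below can be read as relative changes against the full-precision baseline. The sketch assumes that definition, which the source does not spell out, and uses made-up log-probabilities purely for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the negative mean natural-log token probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_delta_percent(baseline_ppl: float, compressed_ppl: float) -> float:
    """Relative PPL change in percent (assumed definition of 'PPL delta')."""
    return 100.0 * (compressed_ppl - baseline_ppl) / baseline_ppl


# Illustrative numbers only; these are not results from the paper.
baseline = perplexity([-2.0, -1.5, -2.5])
compressed = perplexity([-2.0, -1.6, -2.5])
delta = ppl_delta_percent(baseline, compressed)
```

Under this definition a negative delta means the compressed cache scored a lower perplexity than the baseline.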

Key Numerical Results

Compression Ratio: Stable $2.91\times$ across all configurations.
Memory Reduction: On Llama-3-8B with 15 agents and a 4K context, KV memory drops from 19.8 GB to 0.45 GB (97.7% reduction).
Quality Preservation: PPL delta stays small (e.g., +0.57% at 4K context), and semantic equivalence is maintained (mean BERTScore F1 of 0.928 at 15 agents).
Agent Scaling: PPL and BERTScore are invariant to agent count. The memory reduction grows with agent count (88.5% at 3 agents, 97.7% at 15), because the per-agent baseline grows linearly while the shared pool stays fixed.
Context Length Scaling: Quality improves with longer contexts. On SmolLM2-1.7B at 1,851 coherent tokens, the PolyKV cache achieves a -0.26% PPL delta, outperforming the baseline.
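
The memory figures above follow from simple arithmetic if only the compressed pool is counted against $N$ full-precision per-agent caches: the reduction is $1 - 1/(rN)$ for compression ratio $r$. The short sketch below, with hypothetical function names, checks this against the reported numbers.

```python
def shared_pool_gb(baseline_total_gb: float, num_agents: int, ratio: float) -> float:
    """Size of one compressed pool, given the total size of N isolated caches."""
    per_agent_gb = baseline_total_gb / num_agents
    return per_agent_gb / ratio


def memory_reduction_percent(num_agents: int, ratio: float) -> float:
    """Saving versus N uncompressed per-agent caches: 1 - 1 / (ratio * N)."""
    return 100.0 * (1.0 - 1.0 / (ratio * num_agents))


RATIO = 2.91  # compression ratio reported in the source

reduction_3 = memory_reduction_percent(3, RATIO)    # about 88.5
reduction_15 = memory_reduction_percent(15, RATIO)  # about 97.7
pool_gb = shared_pool_gb(19.8, 15, RATIO)           # about 0.45
```

The reduction approaches 100% as agents are added but never exceeds it, which is why the savings are better described as saturating than as superlinear.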

Contrasting with Prior Art

To the authors' knowledge, PolyKV is the first system to implement and empirically validate a single shared, lossy-compressed KV pool accessed concurrently by multiple agents. Prior compression approaches operated on isolated caches, while multi-agent KV sharing used full precision. Compared with per-agent Q4 isolated caches (e.g., Agent Memory [anon2026agentmemory]), which incur +2.8–3.0% PPL degradation, PolyKV consistently delivers a sub-1.6% delta and perfect or near-perfect semantic overlap.

Analytical and Theoretical Implications

Regularization Hypothesis

The empirical PPL inversion (the compressed cache outperforming full precision on highly coherent, long contexts) suggests that quantization noise from TurboQuant MSE acts as an implicit regularizer, disrupting spurious correlations in the full-precision Value tensors. The effect is most pronounced in documents with high lexical repetition and is analogous to dropout. The hypothesis predicts improved cache quality as context length and coherence increase.

Architectural Generality

PolyKV's compression gains and stability hold across different transformer variants (e.g., GQA vs. MHA), enabling architecture-agnostic adoption. The Gaussian approximation underlying TurboQuant is expected to be even more robust at larger head dimensions.

Scalability

PolyKV unlocks practical scaling of multi-agent inference workflows. By decoupling cache memory requirements from agent count, system-level deployment becomes feasible in resource-constrained environments, including edge devices.

Limitations and Open Questions

Scale: Behavior at 70B+ parameter scale remains unexplored.
Benchmarking: Current WikiText-2 experiments use fixed-window contexts, requiring stride-based evaluation for direct comparison to published results.
System Metrics: Throughput, TTFT, and latency are deferred for future work.
Context Length: Hardware constraints limit tested contexts to ≤8K tokens.
Mechanistic Validation: The regularization hypothesis requires ablation studies and controlled coherence experiments.

Research Directions

Expansion to full stride-based WikiText-2/C4 benchmarks for accurate quality comparison.
Direct system benchmarking against per-agent Q4 and prefix-sharing paradigms.
Generalization to additional and larger LLMs (e.g., Qwen2.5-7B, 70B parameter scale).
Detailed ablation to isolate the contribution of FWHT rotation versus uniform quantization.
Systematic mapping of PPL delta against document coherence and repetition.
Scaling agent count on high-memory platforms.

Conclusion

PolyKV demonstrates that asymmetrically compressed shared KV pools are practical for multi-agent LLM inference, providing a stable $2.91\times$ compression ratio, agent-count-invariant quality, and improved performance with longer coherent documents. No prior system offers concurrent, lossy-compressed cache sharing, marking PolyKV as a significant advancement in memory-efficient LLM inference architectures. These findings pave the way for further scaling and system-level evaluation in future research (2604.24971).