- The paper introduces a unified shared KV cache pool that reduces memory overhead by up to 97.7% for multi-agent LLM inference with minimal degradation.
- The paper details a novel compression scheme using int8 linear quantization for keys and TurboQuant MSE for values, achieving a stable 2.91x compression ratio across configurations.
- The paper demonstrates agent-count invariant performance and suggests that quantization noise may act as a regularizer, enhancing inference quality in coherent contexts.
PolyKV: Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
Motivation and Problem Statement
The memory overhead of transformer-based LLM inference, dominated by the per-layer Key/Value (KV) cache, becomes critical as model scale and context length increase. Multi-agent systems, with N concurrent agents processing identical contexts, conventionally require N full-precision, isolated KV caches, introducing significant redundancy and constraining scaling. Previous research has focused independently on compressing KV caches per agent [liu2024kivi, hooper2024kvquant, zhang2024leankv] or on prefix-sharing strategies for multi-agent workflows [pan2025kvflow, ye2025kvcomm, kim2026lragent]. However, a unified approach accommodating both aggressive KV compression and multi-agent cache sharing was absent.
PolyKV System Architecture
PolyKV introduces a SharedKVPool abstraction that computes and compresses a single KV cache for a shared context, allowing N independent agents to load the decompressed tensors directly into their own inference contexts. This write-once, read-many scheme avoids per-agent copy costs and cache contention, so the pool's memory does not grow with agent count.
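The write-once, read-many pattern can be illustrated with a minimal sketch. The class name `SharedKVPoolSketch`, its methods, and the toy `compress`/`decompress` callables are hypothetical and chosen for illustration; they are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, Optional, TypeVar

Raw = TypeVar("Raw")        # uncompressed KV representation
Packed = TypeVar("Packed")  # compressed KV representation


@dataclass
class SharedKVPoolSketch(Generic[Raw, Packed]):
    """Hypothetical write-once, read-many pool (illustration only)."""

    compress: Callable[[Raw], Packed]
    decompress: Callable[[Packed], Raw]
    _packed: Optional[Packed] = field(default=None, init=False)
    reads: int = field(default=0, init=False)

    def write(self, kv: Raw) -> None:
        """Compress and store the shared-context KV exactly once."""
        if self._packed is not None:
            raise RuntimeError("pool is write-once")
        self._packed = self.compress(kv)

    def read(self) -> Raw:
        """Return a freshly decompressed copy for one agent."""
        if self._packed is None:
            raise RuntimeError("pool is empty")
        self.reads += 1
        return self.decompress(self._packed)


# Toy codec standing in for the real quantizers: one decimal of precision.
pool = SharedKVPoolSketch(
    compress=lambda xs: [round(x * 10) for x in xs],
    decompress=lambda q: [v / 10 for v in q],
)
pool.write([0.11, 0.52])
agent_views = [pool.read() for _ in range(3)]
```

Only one compressed payload is stored regardless of how many agents call `read()`, and each agent receives its own list, so later per-agent changes do not leak into the pool or into other agents.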
Compression Scheme
- Keys: Compressed via int8 linear quantization (q8_0). This preserves softmax precision, minimizing attention-related degradation.
- Values: Compressed using TurboQuant MSE, which combines a normalized FWHT (fast Walsh-Hadamard transform) rotation with 3-bit Lloyd-Max scalar quantization whose centroids are tuned to the standard normal distribution N(0,1). This achieves distortion near the theoretical minimum [turboquant2026].
Empirical results confirm a stable 2.91× compression ratio across model architectures and context lengths. SharedKVPool's memory footprint is O(1) in agent count.
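The two quantizers can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the paper's code: the key quantizer uses one scale per vector (the q8_0 format is block-wise), the value path omits any detail of TurboQuant beyond what is described above, and all function names are hypothetical. The constants are the standard Lloyd-Max levels and thresholds for an 8-level quantizer of a unit Gaussian, rounded to four decimals.

```python
import bisect
import math
import random
from typing import List, Tuple

# 8-level (3-bit) Lloyd-Max quantizer for N(0,1): reconstruction levels
# and the decision thresholds between neighbouring levels.
LEVELS = [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520]
THRESHOLDS = [-1.7480, -1.0500, -0.5006, 0.0, 0.5006, 1.0500, 1.7480]


def quantize_keys_int8(x: List[float]) -> Tuple[List[int], float]:
    """Symmetric linear int8 quantization: x is approximated by scale * q."""
    scale = max(abs(v) for v in x) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    return [round(v / scale) for v in x], scale


def dequantize_keys_int8(q: List[int], scale: float) -> List[float]:
    return [scale * v for v in q]


def fwht(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def quantize_values_3bit(x: List[float]) -> Tuple[List[int], float]:
    """Rotate, rescale coordinates to roughly unit variance, pick a 3-bit code."""
    r = math.sqrt(sum(v * v for v in x)) or 1.0
    gain = math.sqrt(len(x)) / r
    codes = [bisect.bisect_right(THRESHOLDS, v * gain) for v in fwht(x)]
    return codes, r


def dequantize_values_3bit(codes: List[int], r: float) -> List[float]:
    """Look up centroids, undo the rescaling, and apply the inverse rotation."""
    gain = r / math.sqrt(len(codes))
    return fwht([LEVELS[c] * gain for c in codes])


# Round-trip a toy 64-dimensional vector through both paths.
rng = random.Random(0)
vec = [rng.gauss(0.0, 1.0) for _ in range(64)]

k_q, k_scale = quantize_keys_int8(vec)
k_rec = dequantize_keys_int8(k_q, k_scale)
v_codes, v_norm = quantize_values_3bit(vec)
v_rec = dequantize_values_3bit(v_codes, v_norm)

energy = sum(a * a for a in vec)
key_rel_err = sum((a - b) ** 2 for a, b in zip(vec, k_rec)) / energy
val_rel_err = sum((a - b) ** 2 for a, b in zip(vec, v_rec)) / energy
```

In this sketch the key path loses far less precision than the value path, which matches the stated rationale for the asymmetric split: keys feed the softmax and are kept at 8 bits, while values tolerate a coarser 3-bit code.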
DynamicCache Injection
PolyKV leverages HuggingFace DynamicCache’s layer-wise API, bypassing incremental prefill accumulation. Agents receive unique DynamicCache shells populated from the decompressed pool, enabling fully independent generation while maintaining memory economy.
Experimental Evaluation
Testbed
Two LLMs—SmolLM2-1.7B-Instruct (CPU inference) and Llama-3-8B-Instruct (GQA, bfloat16 KV cache, 32 layers)—were evaluated across three context lengths (600, 1,851, and 7,194 tokens) and up to 15 concurrent agents. Assessed metrics include perplexity (PPL), BERTScore F1 (semantic similarity), token overlap, KV cache memory, and compression ratio.
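For reference, the PPL delta reported in the results can be computed as in the sketch below. The function names are hypothetical, and the relative-change definition is an assumption consistent with the signed percentages quoted in this summary.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of the reference tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ppl_delta_percent(baseline_ppl: float, compressed_ppl: float) -> float:
    """Relative PPL change of the compressed cache vs. the baseline, in percent."""
    return 100.0 * (compressed_ppl - baseline_ppl) / baseline_ppl
```

A positive delta means the compressed cache is worse than the full-precision baseline; a negative delta means it scored better.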
Key Numerical Results
- Compression Ratio: Stable 2.91× across all configurations.
- Memory Reduction: On Llama-3-8B with 15 agents and a 4K context, KV memory drops from 19.8 GB to 0.45 GB (97.7% reduction).
- Quality Preservation: PPL delta remains constant across agent counts (e.g., +0.57% at 4K context), and semantic equivalence is maintained (mean BERTScore F1 of 0.928 at 15 agents).
- Agent Scaling: PPL and BERTScore are invariant to agent count. Memory savings grow with agent count and approach 100% asymptotically (88.5% at 3 agents, 97.7% at 15).
- Context Length Scaling: Quality improves with longer contexts. On SmolLM2-1.7B at 1,851 coherent tokens, the PolyKV cache achieves a −0.26% PPL delta, outperforming the baseline.
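The reported memory figures are consistent with simple arithmetic: one pool compressed by ratio c replaces N full-precision caches, so the saved fraction is 1 − 1/(N·c). The sketch below reproduces the quoted numbers; the function names are hypothetical. The bit-width reading of the 2.91× ratio (16-bit keys and values reduced to 8 and 3 bits, ignoring scale and norm metadata) is an assumption that happens to match the reported value, not a statement from the paper.

```python
def nominal_ratio(full_bits: int = 16, key_bits: int = 8, value_bits: int = 3) -> float:
    """Compression ratio from bit widths alone (metadata overhead ignored)."""
    return (2 * full_bits) / (key_bits + value_bits)


def shared_pool_savings(num_agents: int, compression_ratio: float = 2.91) -> float:
    """Fraction of KV memory saved vs. num_agents isolated full-precision caches."""
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    return 1.0 - 1.0 / (num_agents * compression_ratio)


def shared_pool_size_gb(baseline_total_gb: float, num_agents: int,
                        compression_ratio: float = 2.91) -> float:
    """Size of the single shared pool given the N-agent baseline total."""
    return baseline_total_gb / (num_agents * compression_ratio)


for n in (3, 15):
    print(n, f"{100 * shared_pool_savings(n):.1f}%")
print(f"{shared_pool_size_gb(19.8, 15):.2f} GB")
```

Because the saved fraction is 1 − 1/(N·c), each additional agent adds a smaller increment than the last, and the savings saturate toward 100% rather than growing without bound.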
Contrasting with Prior Art
PolyKV is the first to implement and empirically validate a single shared, lossy-compressed KV pool accessed concurrently by multiple agents. All prior compression approaches operated on isolated caches, while multi-agent KV sharing used full precision. Compared with per-agent Q4 isolated caches (e.g., Agent Memory [anon2026agentmemory]), which incur +2.8–3.0% PPL degradation, PolyKV consistently delivers sub-1.6% delta and perfect or near-perfect semantic overlap.
Analytical and Theoretical Implications
Regularization Hypothesis
The empirical PPL inversion (the compressed cache outperforming full precision on highly coherent, long contexts) suggests that quantization noise from TurboQuant MSE acts as an implicit regularizer, disrupting spurious correlations in the full-precision Value tensors. The effect is especially pronounced in documents with high lexical repetition and is analogous to dropout. The hypothesis predicts improved cache quality as context length and coherence increase.
Architectural Generality
PolyKV's compression gains and stability hold across different transformer variants (e.g., GQA vs. MHA), enabling architecture-agnostic adoption. The Gaussian approximation underlying TurboQuant is expected to be even more robust at greater head dimensions.
Scalability
PolyKV unlocks practical scaling of multi-agent inference workflows. By decoupling cache memory requirements from agent count, system-level deployment becomes feasible in resource-constrained environments, including edge devices.
Limitations and Open Questions
- Scale: Behavior at 70B+ parameter scale remains unexplored.
- Benchmarking: Current WikiText-2 experiments use fixed-window contexts, so stride-based evaluation is still needed before the results can be compared directly to published numbers.
- System Metrics: Throughput, time to first token (TTFT), and latency are deferred to future work.
- Context Length: Hardware constraints limit tested contexts to ≤8K tokens.
- Mechanistic Validation: The regularization hypothesis requires ablation studies and controlled coherence experiments.
Research Directions
- Expansion to full stride-based WikiText-2/C4 benchmarks for accurate quality comparison.
- Direct system benchmarking against per-agent Q4 and prefix-sharing paradigms.
- Generalization to other and larger LLMs (e.g., Qwen2.5-7B, 70B parameter scale).
- Detailed ablation to isolate the contribution of FWHT rotation versus uniform quantization.
- Systematic mapping of PPL delta against document coherence and repetition.
- Scaling agent count on high-memory platforms.
Conclusion
PolyKV demonstrates that asymmetrically-compressed shared KV pools are practical for multi-agent LLM inference, providing a stable 2.91× compression ratio with pool memory independent of agent count N, agent-count invariant quality, and improved performance with longer coherent documents. No prior system offers concurrent, lossy-compressed cache sharing, marking PolyKV as a significant advancement in memory-efficient LLM inference architectures. These findings pave the way for further scaling and system-level evaluation in future research (2604.24971).