- The paper proposes selective KV cache compression tailored for multi-agent LLM collaboration to maximize downstream task accuracy under tight communication budgets.
- It introduces an Orthogonal Backfill mechanism that injects low-rank residuals to compensate for discarded latent states, effectively denoising the relay.
- Experiments show that OBF-enhanced compression reduces communication costs by approximately 80–89% while achieving comparable or superior performance to full-KV relay.
Motivation and Background
Current advances in multi-agent systems for LLMs are increasingly focused on high-fidelity communication between agents. Conventional token-level message exchange constrains the expressiveness and utility of inter-agent coordination, collapsing rich internal states into lossy text form. Approaches such as LatentMAS enable direct relay of the full key-value (KV) cache, allowing an agent's successor to continue from its precise internal state rather than a textual approximation. However, full KV relay is prohibitively expensive in bandwidth and memory, especially as the number of agents, the depth of interaction, or the prompt length increases.
The challenge addressed in "When Less Latent Leads to Better Relay: Information-Preserving Compression for Latent Multi-Agent LLM Collaboration" (2604.13349) lies in designing compression operators for the relay of latent (KV) representations in multi-agent LLM chains. Instead of merely minimizing cache size while preserving single-model performance (the focus of prior work such as StreamingLLM, H2O, Scissorhands, WeightedKV, and others), the goal is to transmit a subset of the internal KV states optimized for maximizing downstream usability by the next agent in the relay, under a fixed communication budget.
Methodology
In the context of LatentMAS, inter-agent communication involves sharing the autoregressive decoder's KV cache after each agent's turn. For each relay, the cache is partitioned into four distinct regions:
- Attention Sink: Persistent tokens that stabilize attention distributions, typically derived from the initial agent's prompt.
- Inherited Message History: Past prompt-history relayed by upstream agents.
- Current Prompt Context: Latent states produced by the local instructions/prompt of the current agent.
- Current Latent Reasoning: Latent states produced during the agent's own reasoning/generation.
The central observation is that not all KV states are equally useful for downstream continuation. The relay-specific compression problem is therefore: given a fixed budget (e.g., at most k token positions per relay), select, optionally modify, and transmit prompt KV states so as to maximize final task accuracy across relayed generations.
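To make the four-region partition and the per-relay budget concrete, the following is a minimal sketch. The names `CacheRegions`, `partition`, and `check_budget` are hypothetical, and the contiguous region layout and the example sizes are assumptions for illustration only, not details taken from the paper.

```python
from dataclasses import dataclass

REGION_NAMES = ("sink", "inherited", "prompt", "reasoning")


@dataclass(frozen=True)
class CacheRegions:
    """Half-open [start, end) token-position spans for the four cache regions."""
    sink: tuple
    inherited: tuple
    prompt: tuple
    reasoning: tuple

    def size(self, name):
        start, end = getattr(self, name)
        return end - start

    def total(self):
        return sum(self.size(name) for name in REGION_NAMES)


def partition(n_sink, n_inherited, n_prompt, n_reasoning):
    """Lay the four regions out contiguously (an assumed layout)."""
    bounds = [0]
    for n in (n_sink, n_inherited, n_prompt, n_reasoning):
        bounds.append(bounds[-1] + n)
    spans = [(bounds[i], bounds[i + 1]) for i in range(4)]
    return CacheRegions(*spans)


def check_budget(regions, kept_prompt_positions, k):
    """True if at most k distinct positions are kept, all inside the prompt span."""
    start, end = regions.prompt
    kept = set(kept_prompt_positions)
    return len(kept) <= k and all(start <= p < end for p in kept)


# Illustrative sizes only; real region lengths depend on the agents and prompts.
regions = partition(n_sink=4, n_inherited=100, n_prompt=200, n_reasoning=40)
```

A selector (such as the eviction baselines described next) would then be any function that returns at most k positions from `regions.prompt`; `check_budget` only verifies that constraint and says nothing about which positions are worth keeping.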
Eviction Baselines
Two key baseline strategies adapted for the multi-agent context are:
- MAS-StreamingLLM: Relays only reasoning trajectory KV states (latest generations) and maintains a global sink. All prompt context except for the first agent's sink is discarded after the respective agent's turn.
- MAS-H2O: Selects prompt tokens to retain based on attention mass from the current reasoning states, as per the original H2O technique, with both global and per-layer/per-head variants. Top-k tokens are preserved according to head- or layer-aggregated attention.
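The attention-mass selection used by MAS-H2O can be sketched roughly as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the tensor layout `(layers, heads, n_reasoning, n_prompt)`, the plain sum as the aggregation, and the function names are all hypothetical choices.

```python
import numpy as np


def attention_mass(attn):
    """Sum attention from reasoning queries onto each prompt key.

    attn: array of shape (layers, heads, n_reasoning, n_prompt).
    Returns an array of shape (layers, heads, n_prompt).
    """
    return attn.sum(axis=2)


def top_k_sorted(mass, k):
    """Indices of the k largest entries along the last axis, in ascending position order."""
    order = np.argsort(-mass, axis=-1, kind="stable")[..., :k]
    return np.sort(order, axis=-1)


def select_headwise(attn, k):
    """One top-k set per (layer, head): shape (layers, heads, k)."""
    return top_k_sorted(attention_mass(attn), k)


def select_layerwise(attn, k):
    """Aggregate over heads first, then one top-k set per layer: shape (layers, k)."""
    return top_k_sorted(attention_mass(attn).sum(axis=1), k)


def select_global(attn, k):
    """Aggregate over all layers and heads: a single top-k set of shape (k,)."""
    return top_k_sorted(attention_mass(attn).sum(axis=(0, 1)), k)


# Toy example: 1 layer, 2 heads, 1 reasoning query, 4 prompt tokens.
toy = np.array([[[[0.10, 0.60, 0.20, 0.10]],
                 [[0.50, 0.05, 0.05, 0.40]]]])
headwise = select_headwise(toy, k=2)
layerwise = select_layerwise(toy, k=2)
```

In the toy example the two heads prefer disjoint prompt tokens, so the head-wise selector keeps a different pair for each head, while the layer-wise selector keeps the pair with the largest combined mass.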
Orthogonal Backfill (OBF)
To address the information loss resulting from bulk prompt eviction (as opposed to gradual online cache eviction in classical single-agent settings), the authors introduce Orthogonal Backfill (OBF), a compensation mechanism that injects a low-rank, orthogonal residual from the discarded prompt value vectors back into the retained prompt KV states:
- Decompose the deleted prompt value matrix into components expressible vs. orthogonal to the span of the retained prompt values using QR decomposition.
- Principal Subspace Extraction: Apply SVD to the residual (orthogonal) component and summarize via attention-weighted mean, projecting onto the principal subspace.
- Demand-Driven Scaling: Scale the injection vector by the ratio of cumulative attention mass for deleted vs. retained prompts.
- Injection: Uniformly add the resulting residual vector to each of the retained prompt value states.
OBF is designed to recover latent global context not tightly anchored to any retained token, thereby increasing the informativeness of the compressed latent state for efficient downstream use.
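The four OBF steps can be sketched for a single attention head as below. This is a rough NumPy illustration of the description above, not the paper's reference code: the function name `orthogonal_backfill`, the `rank` and `eps` parameters, the use of per-token attention mass vectors `a_kept` and `a_dropped`, and the assumption that the retained values are linearly independent are all choices made here for the example.

```python
import numpy as np


def orthogonal_backfill(v_kept, v_dropped, a_kept, a_dropped, rank=1, eps=1e-8):
    """Hypothetical single-head OBF sketch.

    v_kept:    (k, d) retained prompt value vectors (assumed linearly independent)
    v_dropped: (n, d) discarded prompt value vectors
    a_kept:    (k,)   attention mass received by each retained token
    a_dropped: (n,)   attention mass received by each discarded token
    """
    # 1. Decompose: QR gives an orthonormal basis of the retained span;
    #    subtracting the projection leaves the orthogonal residual.
    q, _ = np.linalg.qr(v_kept.T)                      # (d, k)
    residual = v_dropped - (v_dropped @ q) @ q.T       # (n, d)

    # 2. Principal subspace: top right-singular vectors of the residual,
    #    then project the attention-weighted mean residual onto them.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    basis = vt[:rank]                                  # (rank, d)
    weights = a_dropped / max(a_dropped.sum(), eps)
    summary = weights @ residual                       # (d,)
    direction = (summary @ basis.T) @ basis            # (d,)

    # 3. Demand-driven scaling: discarded vs. retained attention mass.
    scale = a_dropped.sum() / max(a_kept.sum(), eps)

    # 4. Injection: add the same vector to every retained value state.
    return v_kept + scale * direction


rng = np.random.default_rng(0)
v_kept = rng.normal(size=(3, 8))
v_dropped = rng.normal(size=(10, 8))
a_kept = rng.random(3)
a_dropped = rng.random(10)
backfilled = orthogonal_backfill(v_kept, v_dropped, a_kept, a_dropped)
```

By construction the injected vector lies in the orthogonal complement of the retained values, so in this sketch it adds only information the kept tokens could not already express; when the discarded tokens received no attention, the scale is zero and the retained values pass through unchanged.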
Experimental Evaluation
Experiments use Qwen3-14B as the backbone and evaluate on nine benchmarks spanning mathematical and science reasoning (GSM8K, AIME24, AIME25, ARC-Easy, ARC-Challenge), knowledge-intensive QA (GPQA, MedQA), and code generation (MBPP-Plus, HumanEval-Plus). All compression and relay methods use identical agent and prompt configurations and differ only in latent compression policy.
Strong results are observed:
- Compression Ratios: The proposed approaches consistently achieve sub-20% communication budgets, with an average retention ratio of 14.12%.
- Downstream Accuracy: In all benchmarks, at least one compressed relay variant matches or exceeds the full-KV relay baseline. OBF-enhanced relay achieves the best result on 7 of 9 tasks.
- Tradeoffs: Compressed relay sometimes outperforms full-KV relay. This is attributed to eviction acting as an implicit denoising step, removing spurious or task-irrelevant prompt states.
- Selector Granularity: The choice between head-wise and layer-wise retention does not yield a universally superior method, suggesting that optimal granularity is task-dependent and interacts with OBF augmentation. Head-wise is advantageous when signal is concentrated in specific heads; layer-wise is superior for distributed signal scenarios.
Maximal accuracies (%) at k=32 prompt tokens per agent are summarized below:
| Method | GSM8K | AIME24 | AIME25 | GPQA | ARC-E | ARC-C | MBPP+ | HumanEval+ | MedQA |
|---|---|---|---|---|---|---|---|---|---|
| Full KV | 92.47 | 67.78 | 53.33 | 56.51 | 98.62 | 95.82 | 75.66 | 84.55 | 77.34 |
| OBF-best | 92.57 | 71.11 | 64.44 | 57.19 | 98.64 | 95.76 | 77.07 | 85.37 | 80.45 |
Notably, OBF-augmented compression yields a 79.8–89.4% reduction in communication cost with no significant loss, and often a gain, in accuracy.
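As a quick check on the reported figures, the short script below recomputes the per-benchmark difference between the OBF-best and Full KV rows and converts the stated cost reductions into retention ratios. It uses only the numbers quoted in this summary; the variable names are illustrative.

```python
BENCHMARKS = ["GSM8K", "AIME24", "AIME25", "GPQA", "ARC-E",
              "ARC-C", "MBPP+", "HumanEval+", "MedQA"]
FULL_KV = [92.47, 67.78, 53.33, 56.51, 98.62, 95.82, 75.66, 84.55, 77.34]
OBF_BEST = [92.57, 71.11, 64.44, 57.19, 98.64, 95.76, 77.07, 85.37, 80.45]

# Accuracy difference in percentage points (OBF-best minus Full KV).
gains = {
    name: round(obf - full, 2)
    for name, full, obf in zip(BENCHMARKS, FULL_KV, OBF_BEST)
}
at_or_above = [name for name, gain in gains.items() if gain >= 0]
below = [name for name, gain in gains.items() if gain < 0]

# A cost reduction of r% corresponds to relaying (100 - r)% of the full cache.
retention_range = [round(100 - r, 1) for r in (79.8, 89.4)]

for name, gain in gains.items():
    print(f"{name:<11} {gain:+.2f}")
print("at or above Full KV:", len(at_or_above), "of", len(BENCHMARKS))
print("retention range (%):", retention_range)
```

On these table values, OBF-best is at or above Full KV on eight of the nine benchmarks; ARC-C is the exception, with a gap of 0.06 points. The largest gain is on AIME25, at 11.11 points.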
Practical and Theoretical Implications
This work demonstrates that, for multi-agent LLM relay via latent communication, preserving more information does not guarantee better performance. Selective compression augmented with subspace-level residual injection (OBF) can not only drastically reduce communication cost but also denoise and enhance the utility of the relayed state. This reframes inter-agent communication as a content-aware relay optimization problem, distinct from both text-based coordination and single-agent cache management.
From a theoretical perspective, the result motivates future analysis of which characteristics of the prompt KV cache drive downstream success under various task distributions, and how subspace projections or alternative compression criteria (beyond attention-based selection) might yield further improvements.
Outlook and Future Directions
Key open areas that emerge from this study include:
- Adaptive Compression Strategies: Task-aware or state-aware adaptation of compression granularity and injection, potentially guided by meta-reasoning or reinforcement learning.
- Multimodal and Cross-Architecture Transfer: Extending latent relay approaches to multimodal agents or hybrid systems (e.g., vision-language, retrieval-augmented setups).
- Scaling to Larger Agent Chains: Investigating propagation of compression errors and information bottlenecks in deep or cyclic agent chains.
- Theoretical Limits: Characterizing tight lower and upper bounds on the latent information necessary for collaborative reasoning.
- Integration with Retrieval and Planning: Combining latent relay compression with retrieval-augmented prompting or hierarchical plan decomposition.
Conclusion
The findings of this work position information-preserving, relay-oriented KV compression as a key component of scalable, efficient, and effective multi-agent LLM collaboration. OBF demonstrates that judicious pruning and targeted compensation of the KV cache can lead to superior inter-agent coordination under tight bandwidth constraints, challenging the traditional assumption that maximal state transfer is optimal. These insights can inform design principles for the next generation of collaborative AI systems capable of high-throughput, context-rich, and bandwidth-conscious reasoning.