Cache Sharing in Multi-Agent Systems
- Cache sharing across agents is a strategy that enables multiple autonomous systems to reuse computation results, reducing redundancy and enhancing performance.
- Key mechanisms include semantic vector caches, copy-on-write, quantized handoff, and adaptive prioritization to optimize memory and throughput.
- Advanced eviction, prefetching, and resource management algorithms maintain consistency and security in dynamic, multi-agent workflows.
Cache sharing across agents refers to collaborative strategies, architectural mechanisms, and optimization algorithms that allow multiple autonomous entities—typically LLM-based agents or distributed RL nodes—to read from, write to, or coordinate over shared computational and storage caches. The goal is to substantially reduce redundant computation, memory consumption, and data transfer, while preserving task-specific correctness, agent isolation, and scalability. Modern research defines and implements cache sharing not just via low-level tensor reuse (as in transformer KV caches), but more generally through techniques spanning semantic vector caches, lossy compression, quantized handoff, copy-on-write semantics, asynchronous collective reuse, and adaptive multi-agent prioritization. The field emphasizes pragmatic trade-offs among throughput, hit rate, memory utilization, consistency, and security, especially in large-scale or highly dynamic multi-agent workflows.
1. Design Principles and Fundamental Abstractions
Effective cache sharing relies on a precise specification of what is shared, who may access or mutate the cache, and how correctness and efficiency are maintained. Leading systems introduce the following abstractions:
- Semantic Elements (SEs): Each agent-issued remote call or query is encapsulated as an SE: a tuple comprising the raw query, the response, a semantic embedding, performance metadata (latency, monetary cost, and a staticity score), the response size, and the access frequency. SEs populate the semantic cache (Ruan et al., 22 Sep 2025).
- KV-Cache Blocks: Transformer-based agents maintain key–value (KV) caches at each layer/head for each prefix or trajectory segment. Cache block granularity is critical for inter-agent reuse and fine-grained management (Jeon et al., 1 Feb 2026, Patel et al., 27 Apr 2026, Bian et al., 3 Apr 2026).
- Asymmetrically Compressed Shared Pools: Systems such as PolyKV maintain a single, lossy-compressed cache pool that is concurrently accessed by many agents, decoupling per-agent inference from redundant full-precision cache instantiation (Patel et al., 27 Apr 2026).
- Copy-on-Write (CoW) and Dual-Tier Indexing: ForkKV implements an OS-inspired CoW mechanism, separating shared base KV cache from agent-specific residuals and using scalable radix trees for per-agent divergent writes (Wang et al., 7 Apr 2026).
- Anchor Pools for Offset Correction: KVCOMM enables offset-aligned cache handoff by dynamically maintaining anchor pools mapping context embeddings to empirically observed cache offsets, facilitating cross-context reuse under diverse prefix extensions (Ye et al., 14 Oct 2025).
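The Semantic Element abstraction above can be pictured as a small record type. The following is a minimal sketch only: the class name, field names, units, and the assumed [0, 1] range for the staticity score are illustrative choices, not the schema used by the cited system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SemanticElement:
    """One cached remote call (hypothetical field names, not the cited schema)."""
    query: str               # raw query text issued by the agent
    response: str            # response returned by the remote call
    embedding: List[float]   # semantic embedding of the query
    latency_s: float         # observed latency of the original call, in seconds
    cost_usd: float          # monetary cost of the original call
    staticity: float         # assumed range [0, 1]; higher means the answer changes less often
    access_count: int = 0    # access frequency, updated on each hit

    @property
    def size_bytes(self) -> int:
        # Response size, measured here as the UTF-8 byte length
        return len(self.response.encode("utf-8"))

    def touch(self) -> None:
        # Record one more access to this element
        self.access_count += 1


se = SemanticElement(
    query="weather in Paris today",
    response="héllo",
    embedding=[0.1, 0.9],
    latency_s=0.5,
    cost_usd=0.01,
    staticity=0.9,
)
se.touch()
```

Keeping the performance metadata on the element itself lets an eviction policy rank entries without consulting a separate table.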
2. Semantic Caching and Retrieval Algorithms
Semantically aware agent cache sharing seeks to maximize reuse not just for syntactically identical queries but also for semantically similar or paraphrased statements, tool invocations, or prompts.
- Semantic Retrieval Indexing (Sine): A two-stage retrieval mechanism: (a) vector-similarity search using approximate nearest neighbor (ANN) over semantic embeddings to identify candidate cache elements, followed by (b) an LLM-powered judger that semantically validates the relevance of those candidates, enforcing a calibrated precision threshold (e.g., ≥99%) (Ruan et al., 22 Sep 2025).
- Semantic Cache Hit Definition: A cache hit is realized only if two conditions hold together: the embedding similarity between the incoming query and a cached element clears the similarity threshold, and the semantic judger validates the match.
Requiring both conditions targets precise semantic correctness in multi-agent tool use and suppresses false positives (Ruan et al., 22 Sep 2025).
- Micro-caching and Attention-Guided Reuse: Agent-centric data fabrics rely on encode-lookup-evict policies over semantic micro-caches (small, vector-indexed fragments). Attention-guided routers and cross-agent cache managers coordinate these micro-caches and propagate high-utility fragments to other agents whose contexts overlap semantically (Giurgiu et al., 10 Dec 2025).
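The two-stage retrieval and the joint hit condition described in this section can be sketched as follows. This is an illustration under stated assumptions: brute-force cosine similarity stands in for an ANN index, the `judge` callable stands in for the LLM-powered judger, and the names `lookup`, `sim_threshold`, and `top_k` are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Entry = Tuple[str, Vector, str]  # (cached query, embedding, cached response)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lookup(
    query: str,
    query_vec: Vector,
    entries: List[Entry],
    judge: Callable[[str, str], bool],
    sim_threshold: float = 0.8,  # assumed value, not taken from the source
    top_k: int = 3,
) -> Optional[str]:
    # Stage 1: similarity search (a brute-force scan stands in for an ANN index)
    scored = sorted(
        ((cosine(query_vec, vec), cached_q, resp) for cached_q, vec, resp in entries),
        key=lambda t: t[0],
        reverse=True,
    )
    candidates = [c for c in scored[:top_k] if c[0] >= sim_threshold]
    # Stage 2: a judger must also accept the candidate before it counts as a hit
    for _, cached_q, resp in candidates:
        if judge(query, cached_q):
            return resp
    return None  # cache miss: the caller falls back to the real remote call


entries = [
    ("weather in Paris today", [1.0, 0.0], "sunny"),
    ("capital of France", [0.0, 1.0], "Paris"),
]
hit = lookup("Paris weather now", [0.9, 0.1], entries, judge=lambda q, c: True)
rejected = lookup("Paris weather now", [0.9, 0.1], entries, judge=lambda q, c: False)
```

Here `hit` returns the cached response because both stages agree, while `rejected` is a miss even though the embeddings are close, which is the behavior the joint condition is meant to enforce.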
3. Shared KV-Cache Mechanisms
State-of-the-art transformers in multi-agent systems maintain large-scale KV caches, whose naive per-agent replication is infeasible. Modern solutions include:
- KV-Cache Pooling and Decompression: PolyKV’s SharedKVPool quantizes all key tensors to int8 (q8_0) and compresses value tensors to 3 bits using a Fast Walsh-Hadamard Transform (TurboQuant MSE), achieving a stable 2.91× compression ratio while preserving softmax stability and downstream accuracy (perplexity change typically <1%) (Patel et al., 27 Apr 2026).
- Copy-on-Write and ResidualAttention: ForkKV’s CoW decouples agent-specific low-rank activations from a shared base, and reconstructs disaggregated KV cache directly within on-chip SRAM via ResidualAttention. This achieves >10× memory savings and up to 3× throughput under multi-LoRA workloads, with minimal quality degradation (Wang et al., 7 Apr 2026).
- Low-Rank Adapter KV Sharing: LRAgent decomposes KV into base and low-rank adapter components, sharing the base across agents and, subject to adapter orthogonality, sharing the low-rank cache as well—materializing agent-specific deltas only as needed. The Flash-LoRA-Attention kernel avoids explicit expansion of low-rank cache, maintaining high throughput (Jeon et al., 1 Feb 2026).
- Quantized Handoff and Mixed-Precision Cards: QKVShare formalizes agent-to-agent cache transfer as a quantized CacheCard with token-level mixed-precision (adaptive bit-width from 2–16), enabling up to 2.8× density over FP16 and sublinear handoff latency scaling (TTFT reduced from 1,030 ms to 397 ms at 8K tokens) while preserving multi-hop accuracy within 1–3% of baseline (Honavar et al., 5 May 2026).
- Diff-Aware KV Sharing: TokenDance exploits synchronized all-gather patterns among agents, collectively computing a master KV-cache and storing per-agent block-sparse diffs, yielding 11–17× compression and scaling to 2.7× more agents under SLO than prefix caching (Bian et al., 3 Apr 2026).
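Several of the mechanisms above share one idea: keep a single base cache and store only what each agent changes. The sketch below shows a generic copy-on-write overlay in that spirit. It is not ForkKV's radix-tree index or TokenDance's diff format; the class names are hypothetical, and tuples of floats stand in for real KV tensors.

```python
from typing import Dict, List, Tuple

Block = Tuple[float, ...]  # stand-in for one immutable KV-cache block


class SharedBaseCache:
    """Blocks computed once and read by every agent."""

    def __init__(self, blocks: List[Block]) -> None:
        self.blocks = list(blocks)


class AgentView:
    """Per-agent overlay: reads fall through to the base, writes stay private."""

    def __init__(self, base: SharedBaseCache) -> None:
        self._base = base
        self._diff: Dict[int, Block] = {}  # block index -> agent-specific block

    def read(self, idx: int) -> Block:
        return self._diff.get(idx, self._base.blocks[idx])

    def write(self, idx: int, block: Block) -> None:
        # Copy-on-write: the shared base is never mutated
        if block == self._base.blocks[idx]:
            self._diff.pop(idx, None)  # identical to base, so no private copy is needed
        else:
            self._diff[idx] = block

    def private_blocks(self) -> int:
        return len(self._diff)


base = SharedBaseCache([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
agent_a, agent_b = AgentView(base), AgentView(base)
agent_a.write(1, (9.0, 9.0))  # agent A diverges on one block only
```

After the write, agent A holds one private block and agent B still reads the shared base, so per-agent memory grows with the number of divergent blocks rather than with the full context length.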
4. Eviction, Prefetching, and Dynamic Resource Management
Multi-agent cache sharing necessitates advanced eviction and data movement strategies to avoid catastrophic contention while targeting near-optimal reuse.
- Value-Weighted and Lifecycle-Aware Eviction: Algorithms like LCFU (Least Cost-Efficient and Frequently Used) and PBKV’s Score(c) combine access frequency, cost, retrieval latency, and staticity to prioritize entries. Hierarchical eviction ensures that retired cache (not needed by any live workflow) is always evicted before high-score active cache (Ruan et al., 22 Sep 2025, Zheng et al., 7 May 2026).
- Predictive and Proactive Prefetching: Markov chain models or agent-embedding sequence predictors guide asynchronous background fetches of likely-to-be-used cache entries, always loading into otherwise idle memory to avoid displacing high-value live cache (Ruan et al., 22 Sep 2025, Zheng et al., 7 May 2026).
- Elastic Memory Pooling and Peer-to-Peer Exchange: SwiftCache orchestrates on-server, NVLink-backed sharing of idle GPU memory, streaming prefix cache blocks from low-demand (worker) models to high-demand (master) ones, pipelined per-layer. This decouples context length support from single-model HBM constraints and achieves up to 3.98× extension in supported context length relative to conventional strategies (Hu et al., 15 Jun 2026).
- Agent-Centric Partitioned and Time-Aware Scheduling: Tokencake uses a space scheduler for partitioning GPU cache across critical vs. non-critical agents, and a time scheduler for proactively offloading stalled agent caches (e.g., during function calls) with predictive upload, maximizing effective GPU utilization (boosting cache occupancy by ~17%) and reducing end-to-end latency by 47% (Bian et al., 21 Oct 2025).
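The hierarchical, value-weighted eviction described above can be illustrated with a short sketch. The scoring formula here is an assumption made for illustration (the source names the inputs but not how they are combined), and `CacheEntry`, `score`, `pick_victims`, and `latency_weight` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    key: str
    size: int              # bytes occupied by the entry
    access_count: int      # access frequency
    cost: float            # monetary cost to recompute or refetch
    latency_s: float       # retrieval latency saved by a hit, in seconds
    staticity: float       # assumed range [0, 1]
    retired: bool = False  # True once no live workflow needs the entry


def score(e: CacheEntry, latency_weight: float = 1.0) -> float:
    # Hypothetical value density: savings per byte, scaled by frequency and staticity
    saved = e.cost + latency_weight * e.latency_s
    return e.access_count * saved * e.staticity / e.size


def pick_victims(entries: List[CacheEntry], bytes_needed: int) -> List[str]:
    # Retired entries sort first (False < True); active entries follow by ascending score
    order = sorted(entries, key=lambda e: (not e.retired, score(e)))
    victims: List[str] = []
    freed = 0
    for e in order:
        if freed >= bytes_needed:
            break
        victims.append(e.key)
        freed += e.size
    return victims


entries = [
    CacheEntry("a", 10, 100, 1.0, 1.0, 1.0, retired=True),
    CacheEntry("b", 10, 1, 0.1, 0.1, 0.5),
    CacheEntry("c", 10, 50, 1.0, 1.0, 1.0),
]
victims = pick_victims(entries, bytes_needed=15)
```

Entry `a` is evicted first despite its high score because it is retired; the low-score active entry `b` follows, and the high-value entry `c` survives.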
5. Security, Consistency, and Correctness Considerations
Integrating shared cache mechanisms into multi-agent systems introduces new attack surfaces and necessitates robust correctness guarantees.
- KV-Cache Integrity and Tamper Detection: When latent agents transmit full KV-cache state (e.g., in latent communication protocols), the system must cryptographically bind each cache payload to its session, specialist identity, payload digest, and visible commitment via HMAC-SHA256 manifests. In the reported evaluation, this detects 100% of tampered payloads, whereas purely text-based verification can be bypassed (Brito et al., 27 Jun 2026).
- Cross-Agent Consistency and Risk-Aware Reuse: For judge-centric tasks, naive KV reuse may degrade "judge consistency rate" (JCR), as block-diagonalization weakens cross-candidate interactions. Dedicated interaction-preserving strategies or conservative meta-reasoning gating are necessary to ensure correct candidate selection (Liang et al., 13 Jan 2026).
- Locking and Deadlock Avoidance in Non-LLM Domains: In multi-agent path-finding (e.g., L-MAPF-CM), shared cache cells (for items or paths) are managed with per-agent locking protocols and six-state machines, guaranteeing deadlock- and starvation-freedom even with aggressive agent concurrency (Tang et al., 6 Jan 2025).
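The manifest-based payload binding described above can be sketched with Python's standard `hmac` and `hashlib` modules. The manifest layout, field names, and the treatment of the visible commitment as an opaque string are assumptions made for illustration; the cited work may structure its manifests differently.

```python
import hashlib
import hmac
import json


def _canonical(fields: dict) -> bytes:
    # Deterministic serialization so that signer and verifier hash identical bytes
    return json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_manifest(key: bytes, session_id: str, specialist_id: str,
                  payload: bytes, visible_commitment: str) -> dict:
    fields = {
        "session": session_id,
        "specialist": specialist_id,
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "commitment": visible_commitment,  # opaque string in this sketch
    }
    tag = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    return {**fields, "tag": tag}


def verify(key: bytes, manifest: dict, payload: bytes,
           session_id: str, specialist_id: str) -> bool:
    fields = {k: v for k, v in manifest.items() if k != "tag"}
    expected = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(expected, str(manifest.get("tag", "")))  # constant-time compare
        and fields.get("payload_sha256") == hashlib.sha256(payload).hexdigest()
        and fields.get("session") == session_id
        and fields.get("specialist") == specialist_id
    )


key = b"demo-shared-secret"         # illustrative key only
payload = b"\x00\x01serialized-kv"  # stand-in for a serialized KV-cache payload
manifest = make_manifest(key, "sess-1", "planner", payload, "summary-v1")
ok = verify(key, manifest, payload, "sess-1", "planner")
tampered = verify(key, manifest, payload + b"!", "sess-1", "planner")
```

Because the digest of the payload is covered by the HMAC tag, changing either the payload bytes or any manifest field causes verification to fail.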
6. Applications, Empirical Performance, and Scalability
Empirical results across a range of domains and agent architectures confirm the impact and trade-offs of cache sharing strategies:
- LLM Workloads:
- Asteria shows cache hit rates up to 85–95% with 3.6× throughput gain and 6× improvement in cost-per-request over non-semantic caching on search and code generation tasks (Ruan et al., 22 Sep 2025).
- PolyKV demonstrates >97% memory savings and near-constant perplexity (even improving as context grows) for up to 15 agents with shared compressed cache pools (Patel et al., 27 Apr 2026).
- TokenDance achieves up to 2.7× more agents serviced under SLO and 11–17× per-agent cache compression (Bian et al., 3 Apr 2026).
- Edge and On-Device Scenarios: QKVShare supports high agent density with controlled accuracy loss (≤3%) and handoff latencies hundreds of ms below re-prefill at 8K tokens (Honavar et al., 5 May 2026).
- RL and Caching Networks: AccMER achieves a 20–25% reduction in training time via windowed, high-priority replay buffer sharing, significantly reducing cache and TLB misses while maintaining learning performance in centralized multi-agent RL (Gogineni et al., 2023).
- Wireless Coded Caching: MARL-based coded cache sharing in wireless networks achieves up to 20% lower network load than static policies under nonuniform spatial demand, by learning contiguous, non-redundant codeword placements (Pedersen et al., 2021).
7. Limitations, Open Challenges, and Future Directions
Despite substantial advances, several areas remain active for research and refinement:
- Workflow and Domain Generality: Systems like PBKV target dynamic agent workflows, but broader adoption requires further decoupling from static DAGs and improved cross-agent semantic prediction (Zheng et al., 7 May 2026).
- Consistency vs. Reuse Trade-Offs: Empirically, high reuse may silently degrade decision invariance in judge-centric aggregation; safe deployment requires hybrid approaches that combine on-demand dense prefill with risk-aware, instance-level gating (Liang et al., 13 Jan 2026).
- Security: As cache-sharing extends to latent memory handoff, cryptographic payload binding and auditability become non-negotiable, with ongoing work on trusted execution and semantic anomaly detection (Brito et al., 27 Jun 2026).
- Overhead of Compression/Decompression: Lossy-compressed pools (e.g., PolyKV) may add pipeline latency. Decompression cost is amortized across agents, but it should still be benchmarked under high concurrency.
- Hardware and Heterogeneity: Techniques such as SwiftCache and Tokencake assume fast on-server interconnects (e.g., NVLink), and further research is needed for heterogeneous, cloud-edge, and mobile deployments (Hu et al., 15 Jun 2026, Bian et al., 21 Oct 2025).
- Controller Adaptivity and Scalability: Many quantization and prediction heuristics need further ablation studies before they can be relied on in high-depth, multi-hop, or dynamic-topology agent networks (Honavar et al., 5 May 2026).
Cache sharing across agents is now central to enabling efficient, scalable, and semantically rich multi-agent systems—spanning LLM inference at web and edge scale, reinforcement learning, distributed caching networks, embodied robotics, and beyond. The collective trajectory of the field is toward dynamic, prediction-driven, trust-preserving sharing protocols that adapt fluidly to agent workload, workflow structure, and infrastructure constraints.