Papers
Topics
Authors
Recent
Search
2000 character limit reached

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

Published 3 Apr 2026 in cs.DC | (2604.03143v1)

Abstract: Multi-agent LLM applications organize execution in synchronized rounds where a central scheduler gathers outputs from all agents and redistributes the combined context. This All-Gather communication pattern creates massive KV Cache redundancy, because every agent's prompt contains the same shared output blocks, yet existing reuse methods fail to exploit it efficiently. We present TokenDance, a system that scales the number of concurrent agents by exploiting the All-Gather pattern for collective KV Cache sharing. TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step, so the cost of reusing a shared block is paid once regardless of agent count. Its Diff-Aware Storage encodes sibling caches as block-sparse diffs against a single master copy, achieving 11-17x compression on representative workloads. Evaluation on GenerativeAgents and AgentSociety shows that TokenDance supports up to 2.7x more concurrent agents than vLLM with prefix caching under SLO requirement, reduces per-agent KV Cache storage by up to 17.5x, and achieves up to 1.9x prefill speedup over per-request position-independent caching.

Summary

  • The paper introduces a round-aware prompt interface and collective KV cache reuse that reduces redundant compute and memory usage in multi-agent LLM serving.
  • It presents a Master-Mirror caching technique with diff-based storage that achieves up to 17× compression while preserving output fidelity.
  • Experimental results demonstrate enhanced agent throughput and lower latency compared to existing systems like vLLM under fixed GPU constraints.

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

Motivation and Problem Formulation

Large-scale multi-agent LLM applications, notably in domains like social simulation and collaborative coding, operate by organizing execution into synchronized rounds. In each round, a central scheduler aggregates agent outputs and redistributes the context, manifesting the All-Gather communication pattern. This pattern precipitates massive redundancy in KV Caches: every agent's prompt incorporates the same shared output blocks, although their positions and order may differ due to each agent’s unique private history. Existing LLM serving systems do not efficiently exploit this redundancy, resulting in linear scaling of memory requirements with agent count. This severely restricts concurrency and scalability under fixed GPU memory budgets. Figure 1

Figure 1: The All-Gather prompt structure. All agents receive the same output blocks (OO), but the blocks appear at different positions because each prompt has its own private history (HH) and may use a different block order. This structure arises in any multi-agent application that follows the All-Gather pattern.

TokenDance addresses two orthogonal inefficiencies in current architectures:

  1. Redundant reuse computation: Position-independent caching (PIC) and related methods apply block matching and positional correction per request independently, incurring multiplicative compute cost.
  2. Redundant KV Cache storage: After reuse, each agent still maintains an entire cache copy, although the majority of their content is identical.

System Architecture and Mechanisms

Round-Aware Prompt Interface

TokenDance introduces a round-aware prompt interface where each logical block (private history, shared output, etc.) is delimited by reserved separator tokens (e.g., <TTSEP>), rendering block boundaries explicit post-tokenization. Block-level recognition enables segment-based hashing and matching, permitting efficient identification and management of shared content irrespective of its absolute position in agent-specific prompts. Figure 2

Figure 2: Example of TokenDance's round-aware prompt interface, with per-agent private history blocks and shared output blocks separated by <TTSEP> tokens.

Collective KV Cache Reuse

Traditional per-request PIC methods invoke RoPE rotation and important-position analysis for each agent, even on content-identical shared blocks—repeating computational work N times in an N-agent round. TokenDance’s KV Collector groups compatible requests (via alignment in length, visible cache span, and slot mapping), enabling collective processing:

  • Shared RoPE rotation and important-position selection are executed once for the entire round.
  • Only agent-specific private segments are handled individually.
  • The collective reuse method is agnostic to the underlying PIC backend. Figure 3

    Figure 3: Per-request PIC reuse (top) vs. TokenDance’s collective reuse (bottom). TokenDance groups requests and shares RoPE/important-position selection across the group, paying reuse overhead only once per round.

    Figure 4

    Figure 4: Collective KV Cache reuse for a three-agent All-Gather round. TokenDance shares RoPE and important-position selection tasks across the group, avoiding redundant per-request compute.

Diff-Aware Storage and Fused Restore

To compress storage, TokenDance employs a Master-Mirror layout:

  • A single agent's cache (the Master) is stored densely.
  • All others (Mirrors) are serialized as block-sparse differences (diffs), capturing only positions where private histories or placement cause divergence from the Master. Positions are typically clustered, supporting efficient block-sparse representation.

This design achieves $11$–17×17\times compression, with mirrors occupying only 6%6\%–9%9\% of a full cache in evaluation settings.

Restoration uses Fused Diff Restore: During the cache transfer pipeline, sparse corrections are applied inline to paged memory, obviating the need for a dense intermediate buffer and minimizing critical-path latency. Block-level alignment ensures cache corrections can be integrated efficiently without degrading attention throughput. Figure 5

Figure 5: Diff-aware storage with the Master-Mirror layout. One Master cache is stored densely; Mirrors are represented by sparse diffs. Mirrors are reconstructed on-demand from the Master and diffs.

Figure 6

Figure 6: Fused diff restore at block granularity. Only blocks with diffs undergo correction in shared memory before attention.

Experimental Results

Scalability and Latency

TokenDance’s principal metric of success is the number of concurrently served agents at a fixed latency or QPS SLO.

  • On the GenerativeAgents and AgentSociety benchmarks with Qwen2.5-7B and Qwen2.5-14B, TokenDance supports up to 2.7×2.7\times more concurrent agents below the $1500$ms SLO relative to vLLM with prefix caching.
  • With larger models, TokenDance's deduplication shows greater absolute gains due to the higher per-agent cache footprint. Figure 7

    Figure 7: Scaling capacity overview across two workloads and models. TokenDance consistently achieves higher agent counts within latency constraints than all baselines.

Prefill Speedup and Redundancy

  • TokenDance achieves up to 1.9×1.9\times prefill throughput speedup compared to per-request PIC recovery methods such as CacheBlend, with collective reuse amortizing the computational overhead across agents.
  • Compression ratios for the diff-aware storage reach 17.5×17.5\times on the 14B model, with most per-agent caches exhibiting only minor block-level divergence from the Master. Figure 8

    Figure 8: Collective KV Cache reuse speedup over serial PIC recovery shows consistent improvement with varying agent counts and QPS.

    Figure 9

    Figure 9: Redundancy characterization showing high compression ratios and modest numbers of changed blocks per Mirror.

Restore Latency

  • Fused diff retrieval reduces restore latency by HH0–HH1 compared to naive dense restore approaches, confirming the practical viability of block-sparse compression at inference time. Figure 10

    Figure 10: Latency analysis for Mirror state reconstruction; fused diff retrieval consistently outperforms dense restoration.

Accuracy Assessment

  • Under greedy decoding with temperature zero, TokenDance introduces no additional accuracy degradation compared to the underlying PIC method, with any divergences directly attributable to the nuance of selective recomputation in that backend rather than TokenDance’s collective or compressed architecture. Figure 11

    Figure 11: Accuracy evaluation finds minimal divergence in output between TokenDance and standard vLLM serving; remaining differences are inherited from the underlying PIC backend.

Implications and Future Directions

TokenDance demonstrates that communication structure (notably, the All-Gather pattern) is a valuable abstraction for system-level LLM optimization. Its architectural paradigm—collective KV Cache reuse paired with cross-cache redundancy exploitation—substantially enhances scalability and efficiency for multi-agent workloads. Practically, this allows agent-based simulations, collaborative tools, and multi-user LLM applications to accommodate significantly larger populations under fixed-memory hardware constraints.

From a theoretical perspective, the work suggests that future LLM serving systems should incorporate communication pattern awareness as a first-class system concept, generalizing beyond All-Gather to other collective and point-to-multipoint patterns as agent systems diversify. Diff-based and collective architectures are amenable to composition with further optimizations such as quantization and cross-model sharing, indicating a trajectory toward hybridized, communication-structure-aware serving stacks.

Conclusion

TokenDance presents a principled system for maximizing concurrent agent capacity in multi-agent LLM serving by exploiting collective context redundancy at both compute and memory levels. Through round-aware interfaces, collective reuse mechanisms, and diff-based cache compression with fused restoration, TokenDance delivers substantial improvements in agent scalability, operational latency, and memory efficiency without compromising output fidelity. This approach underlines the critical role of communication structure in efficient LLM serving and sets a foundation for subsequent research into broader communication-aware optimizations in AI system infrastructure.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 2 tweets with 4 likes about this paper.