Cache-to-Cache: Direct Semantic Communication Between Large Language Models (2510.03215v1)

Published 3 Oct 2025 in cs.CL and cs.LG

Abstract: Multi-LLM systems harness the complementary strengths of diverse LLMs, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.

Summary

  • The paper demonstrates that direct transfer of KV-Cache representations improves average accuracy by 8.5–10.5% over individual models and by roughly 3.0–5.0% over traditional text-based communication.
  • The proposed C2C method uses a learnable Fuser that aligns and integrates the two models' caches, delivering an approximately 2x speedup in inference latency.
  • C2C maintains robust performance across heterogeneous LLMs and tasks, illustrating its potential for efficient multi-agent communication.

Cache-to-Cache: Direct Semantic Communication Between LLMs

Motivation and Problem Statement

Multi-LLM systems are increasingly deployed to leverage the complementary strengths of diverse LLMs, including models specialized for code, mathematics, or general reasoning. The prevailing paradigm for inter-model communication is text-to-text (T2T), where models exchange information via explicit text generation and consumption. This approach is fundamentally limited by the lossy compression of high-dimensional internal representations into linear token sequences, inherent ambiguities of natural language, and the sequential latency of token-by-token decoding. These constraints motivate the investigation of direct, non-textual communication channels between LLMs.

Cache-to-Cache (C2C) Paradigm

The paper introduces Cache-to-Cache (C2C), a novel paradigm for direct semantic communication between LLMs. C2C enables one model (the Sharer) to transmit its internal key-value (KV) cache representations directly to another model (the Receiver), bypassing the need for intermediate text. The core hypothesis is that the KV-Cache encodes richer, less ambiguous, and more efficiently transferable semantic information than text.

Oracle Experiments

Two key oracle experiments underpin the C2C design:

  1. Cache Enrichment Oracle: Demonstrates that enriching the KV-Cache (e.g., via few-shot exemplars) improves response quality without increasing cache size, indicating that the cache itself is a potent medium for semantic transfer.
  2. Cache Transformation Oracle: Shows that a learned mapping (e.g., a 3-layer MLP) can project the KV-Cache from a source model into the representational space of a target model, enabling cross-model cache utilization.

These findings establish that (a) cache enrichment yields measurable accuracy gains, and (b) KV-Caches are convertible between models, albeit with some loss due to representational differences.
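
As a concrete illustration of the Cache Transformation Oracle, the sketch below shows a 3-layer MLP that maps per-token cache vectors from a source model's space into a target model's space. The dimensions, hidden size, and activation are placeholder assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CacheProjector(nn.Module):
    """Illustrative 3-layer MLP that maps per-token K or V vectors from a
    source model's cache space to a target model's cache space.
    Dimensions and hidden size are placeholders, not the paper's values."""

    def __init__(self, src_dim: int, tgt_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, tgt_dim),
        )

    def forward(self, src_kv: torch.Tensor) -> torch.Tensor:
        # src_kv: [batch, seq_len, src_dim] flattened cache vectors
        return self.net(src_kv)

# Example: project a toy source cache (dim 896) into a target space (dim 1024).
projector = CacheProjector(src_dim=896, tgt_dim=1024)
fake_src_cache = torch.randn(2, 128, 896)
print(projector(fake_src_cache).shape)  # torch.Size([2, 128, 1024])
```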

C2C Architecture and Implementation

Fuser Module

The C2C Fuser is the central component responsible for integrating the Sharer's and Receiver's KV-Caches. Its design is guided by the following principles (a minimal sketch follows the list):

  • Residual Integration: The Fuser concatenates the Sharer's and Receiver's caches, projects them into a shared space, and fuses them via a feature fusion layer, preserving the Receiver's original information.
  • Dynamic Weighting: An input-aware head modulation layer dynamically reweights the contribution of the Sharer's cache on a per-query basis.
  • Learnable Gating: A per-layer, trainable gate (using Gumbel-sigmoid annealing) determines which layers benefit from cache fusion, enabling selective semantic transfer.
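
Below is a minimal PyTorch sketch of a per-layer fuser built around these three principles. Tensor shapes, module names, and the gating relaxation details are illustrative assumptions; the paper's actual Fuser may differ in structure and hyperparameters.

```python
import torch
import torch.nn as nn

class C2CFuserSketch(nn.Module):
    """Minimal sketch of a per-layer cache fuser following the three
    principles above. Shapes and the gate relaxation are assumptions."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        # Residual integration: project the concatenated caches back to `dim`.
        self.proj = nn.Linear(2 * dim, dim)
        self.fuse = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        # Dynamic weighting: input-aware per-head modulation weights.
        self.head_gate = nn.Linear(dim, num_heads)
        # Learnable gating: one scalar logit deciding whether this layer is fused.
        self.layer_gate_logit = nn.Parameter(torch.zeros(1))

    def forward(self, recv_kv, sharer_kv, tau: float = 1.0):
        # recv_kv, sharer_kv: [batch, seq_len, dim] (K or V, heads flattened)
        fused = self.fuse(self.proj(torch.cat([recv_kv, sharer_kv], dim=-1)))

        # Per-head modulation conditioned on the receiver's own cache.
        b, s, d = recv_kv.shape
        head_w = torch.sigmoid(self.head_gate(recv_kv))            # [b, s, heads]
        head_w = head_w.unsqueeze(-1).expand(b, s, self.num_heads, d // self.num_heads)
        fused = (fused.view(b, s, self.num_heads, -1) * head_w).reshape(b, s, d)

        # Gumbel-sigmoid style gate: soft (noisy) in training, hard at inference.
        if self.training:
            u = torch.rand_like(self.layer_gate_logit)
            noise = torch.log(u + 1e-9) - torch.log1p(-u + 1e-9)   # logistic noise
            gate = torch.sigmoid((self.layer_gate_logit + noise) / tau)
        else:
            gate = (self.layer_gate_logit > 0).float()

        # Residual integration keeps the Receiver's original cache intact.
        return recv_kv + gate * fused

# Example: fuse toy caches at one layer (2 sequences, 64 tokens, dim 512, 8 heads).
fuser = C2CFuserSketch(dim=512, num_heads=8)
recv_k, sharer_k = torch.randn(2, 64, 512), torch.randn(2, 64, 512)
print(fuser(recv_k, sharer_k).shape)  # torch.Size([2, 64, 512])
```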

Alignment Strategies

  • Token Alignment: Handles tokenizer discrepancies by re-encoding Receiver tokens with the Sharer's tokenizer, using maximal string coverage for one-to-many mappings.
  • Layer Alignment: Employs terminal alignment, pairing the deepest layers of both models and proceeding in reverse order, ensuring that high-level semantic representations are aligned (see the sketch below).
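
The sketch below illustrates terminal layer alignment for two models of different depths. The mapping rule is an assumption consistent with the description above, not a reproduction of the paper's exact procedure.

```python
def terminal_layer_alignment(num_sharer_layers: int, num_receiver_layers: int):
    """Illustrative terminal alignment: pair layers from the deepest layer
    backwards so high-level representations line up; unmatched shallow
    layers of the deeper model are left unpaired."""
    pairs = []
    s, r = num_sharer_layers - 1, num_receiver_layers - 1
    while s >= 0 and r >= 0:
        pairs.append((s, r))  # (sharer_layer_idx, receiver_layer_idx)
        s -= 1
        r -= 1
    return list(reversed(pairs))

# Example: a 36-layer Sharer paired with a 28-layer Receiver.
print(terminal_layer_alignment(36, 28)[:3])   # [(8, 0), (9, 1), (10, 2)]
print(terminal_layer_alignment(36, 28)[-1])   # (35, 27)
```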

Training Protocol

  • Both Sharer and Receiver models are frozen; only the C2C Fuser is trained.
  • Training uses standard next-token prediction loss, with the Receiver conditioned on the fused cache.
  • The training pipeline consists of a forward pass (cache extraction), fusion (cache replacement), and supervised prediction (gradient backpropagation through the Fuser); a sketch of a single training step follows.
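
The following sketch outlines one training step under these constraints, using Hugging Face-style model calls. The cache-injection interface and the fuser call signature are placeholders for whatever the released codebase actually exposes.

```python
import torch

def train_fuser_step(sharer, receiver, fuser, optimizer, prompt_ids, answer_ids):
    """One illustrative C2C training step: both LLMs stay frozen and only the
    fuser's parameters receive gradients. How the fused cache is injected
    back into the Receiver is a placeholder assumption."""
    # 1) Forward pass: extract both models' KV-Caches for the shared prompt.
    with torch.no_grad():
        sharer_cache = sharer(prompt_ids, use_cache=True).past_key_values
        receiver_cache = receiver(prompt_ids, use_cache=True).past_key_values

    # 2) Fusion: replace the Receiver's cache with the fused cache
    #    (gradients flow only through the fuser's parameters).
    fused_cache = fuser(receiver_cache, sharer_cache)

    # 3) Supervised prediction: next-token loss on the answer, conditioned
    #    on the fused prompt cache.
    out = receiver(answer_ids, past_key_values=fused_cache, labels=answer_ids)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```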

Empirical Results

Performance and Efficiency

C2C consistently outperforms both individual models and T2T communication across a range of benchmarks (OpenBookQA, MMLU-Redux, ARC-Challenge, C-Eval):

  • Accuracy Gains: C2C achieves 8.5–10.5% higher average accuracy than individual models and 3.0–5.0% higher than T2T communication.
  • Latency: C2C delivers a 2x speedup in inference latency compared to T2T, due to the elimination of sequential text generation.
  • Scalability: C2C's gains increase with the Sharer's model size, and it generalizes across model families, sizes, and specializations.

Ablation and Analysis

  • Fuser Design: Residual fusion and gating are critical; naive projection yields substantially lower accuracy.
  • Effective Rank: C2C increases the effective rank of the Receiver's cache, indicating a richer semantic space post-fusion (see the sketch after this list).
  • Progressive Fusion: Increasing the proportion of fused cache improves performance, especially when fusing deeper layers.
  • Generalization: C2C is robust to model swaps and heterogeneous model pairs, and does not simply overfit to the training set.
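
For reference, effective rank is commonly computed as the exponential of the entropy of the normalized singular values (the Roy and Vetterli definition); the sketch below illustrates this estimator on toy cache matrices. Whether the paper uses precisely this formulation is an assumption.

```python
import torch

def effective_rank(matrix: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank: exp of the entropy of normalized singular values."""
    s = torch.linalg.svdvals(matrix)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

# Toy comparison: a cache spanning more directions has a higher effective
# rank than one concentrated in a few directions.
low = torch.randn(256, 8) @ torch.randn(8, 1024)   # rank <= 8
high = torch.randn(256, 1024)                      # near full rank
print(effective_rank(low), effective_rank(high))
```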

Case Study

In a physics question involving Coulomb's law, both the Sharer and Receiver alone fail, and T2T communication is insufficient due to ambiguous text. C2C enables the Receiver to leverage the Sharer's contextual understanding, resulting in the correct answer—demonstrating the practical value of direct cache transfer.

Theoretical and Practical Implications

C2C challenges the assumption that natural language is the optimal interface for LLM collaboration. By leveraging internal representations, it enables higher-bandwidth, lower-latency, and less ambiguous communication. This paradigm has several implications:

  • Multi-LLM Systems: C2C can be integrated into collaborative agent frameworks, enabling richer and more efficient inter-agent communication.
  • Inference Acceleration: C2C can enhance speculative decoding and token-level routing, reducing computational cost and latency.
  • Privacy and Edge Deployment: Direct cache transfer allows for privacy-preserving cloud-edge collaboration, as only internal representations (not raw text) are transmitted.
  • Multimodal Fusion: The approach can be extended to fuse caches across language, vision, and action models, facilitating more integrated multimodal systems.

Limitations and Future Directions

  • Representational Gaps: The transformed cache occupies only a subset of the target model's space, indicating incomplete semantic transfer.
  • Model Compatibility: While C2C generalizes across families and sizes, extreme architectural differences may require more sophisticated alignment.
  • Security and Interpretability: Direct cache transfer raises new questions about information leakage and the interpretability of fused representations.

Future work may explore more complex Fuser architectures, dynamic multi-agent protocols, and extensions to multimodal and cross-lingual settings.

Conclusion

Cache-to-Cache (C2C) establishes a new paradigm for direct, semantic communication between LLMs by fusing internal KV-Cache representations. Empirical results demonstrate that C2C yields higher accuracy and lower latency than text-based communication, with robust generalization across models and tasks. This work opens new avenues for efficient, high-fidelity multi-LLM systems and challenges the primacy of natural language as the sole medium for inter-model communication.
