Cache-to-Cache: Direct Semantic Communication Between Large Language Models (2510.03215v1)

Published 3 Oct 2025 in cs.CL and cs.LG

Abstract: Multi-LLM systems harness the complementary strengths of diverse LLMs, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.

Summary

  • The paper demonstrates that direct transfer of KV-Cache representations improves average accuracy by 8.5–10.5% over individual models and by roughly 3.0–5.0% over traditional text-based communication.
  • The proposed C2C method uses a learnable Fuser that aligns and integrates the two models' caches, delivering an approximately 2x speedup in inference latency.
  • C2C maintains robust performance across heterogeneous LLMs and tasks, illustrating its potential for efficient multi-agent communication.

Cache-to-Cache: Direct Semantic Communication Between LLMs

Motivation and Problem Statement

Multi-LLM systems are increasingly deployed to leverage the complementary strengths of diverse LLMs, including models specialized for code, mathematics, or general reasoning. The prevailing paradigm for inter-model communication is text-to-text (T2T), where models exchange information via explicit text generation and consumption. This approach is fundamentally limited by the lossy compression of high-dimensional internal representations into linear token sequences, inherent ambiguities of natural language, and the sequential latency of token-by-token decoding. These constraints motivate the investigation of direct, non-textual communication channels between LLMs.

Cache-to-Cache (C2C) Paradigm

The paper introduces Cache-to-Cache (C2C), a novel paradigm for direct semantic communication between LLMs. C2C enables one model (the Sharer) to transmit its internal key-value (KV) cache representations directly to another model (the Receiver), bypassing the need for intermediate text. The core hypothesis is that the KV-Cache encodes richer, less ambiguous, and more efficiently transferable semantic information than text.

Oracle Experiments

Two key oracle experiments underpin the C2C design:

  1. Cache Enrichment Oracle: Demonstrates that enriching the KV-Cache (e.g., via few-shot exemplars) improves response quality without increasing cache size, indicating that the cache itself is a potent medium for semantic transfer.
  2. Cache Transformation Oracle: Shows that a learned mapping (e.g., a 3-layer MLP) can project the KV-Cache from a source model into the representational space of a target model, enabling cross-model cache utilization.

These findings establish that (a) cache enrichment yields measurable accuracy gains, and (b) KV-Caches are convertible between models, albeit with some loss due to representational differences.
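
As a concrete illustration of the Cache Transformation Oracle, the sketch below shows a 3-layer MLP that maps per-token cache vectors from a source model's space into a target model's space. The dimensions, hidden size, and activation are placeholder assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CacheProjector(nn.Module):
    """Illustrative 3-layer MLP that maps per-token K or V vectors from a
    source model's cache space to a target model's cache space.
    Dimensions and hidden size are placeholders, not the paper's values."""

    def __init__(self, src_dim: int, tgt_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, tgt_dim),
        )

    def forward(self, src_kv: torch.Tensor) -> torch.Tensor:
        # src_kv: [batch, seq_len, src_dim] flattened cache vectors
        return self.net(src_kv)

# Example: project a toy source cache (dim 896) into a target space (dim 1024).
projector = CacheProjector(src_dim=896, tgt_dim=1024)
fake_src_cache = torch.randn(2, 128, 896)
print(projector(fake_src_cache).shape)  # torch.Size([2, 128, 1024])
```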

C2C Architecture and Implementation

Fuser Module

The C2C Fuser is the central component responsible for integrating the Sharer's and Receiver's KV-Caches. Its design is guided by the following principles (a minimal sketch follows the list):

  • Residual Integration: The Fuser concatenates the Sharer's and Receiver's caches, projects them into a shared space, and fuses them via a feature fusion layer, preserving the Receiver's original information.
  • Dynamic Weighting: An input-aware head modulation layer dynamically reweights the contribution of the Sharer's cache on a per-query basis.
  • Learnable Gating: A per-layer, trainable gate (using Gumbel-sigmoid annealing) determines which layers benefit from cache fusion, enabling selective semantic transfer.
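
Below is a minimal PyTorch sketch of a per-layer fuser built around these three principles. Tensor shapes, module names, and the gating relaxation details are illustrative assumptions; the paper's actual Fuser may differ in structure and hyperparameters.

```python
import torch
import torch.nn as nn

class C2CFuserSketch(nn.Module):
    """Minimal sketch of a per-layer cache fuser following the three
    principles above. Shapes and the gate relaxation are assumptions."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        # Residual integration: project the concatenated caches back to `dim`.
        self.proj = nn.Linear(2 * dim, dim)
        self.fuse = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        # Dynamic weighting: input-aware per-head modulation weights.
        self.head_gate = nn.Linear(dim, num_heads)
        # Learnable gating: one scalar logit deciding whether this layer is fused.
        self.layer_gate_logit = nn.Parameter(torch.zeros(1))

    def forward(self, recv_kv, sharer_kv, tau: float = 1.0):
        # recv_kv, sharer_kv: [batch, seq_len, dim] (K or V, heads flattened)
        fused = self.fuse(self.proj(torch.cat([recv_kv, sharer_kv], dim=-1)))

        # Per-head modulation conditioned on the receiver's own cache.
        b, s, d = recv_kv.shape
        head_w = torch.sigmoid(self.head_gate(recv_kv))            # [b, s, heads]
        head_w = head_w.unsqueeze(-1).expand(b, s, self.num_heads, d // self.num_heads)
        fused = (fused.view(b, s, self.num_heads, -1) * head_w).reshape(b, s, d)

        # Gumbel-sigmoid style gate: soft (noisy) in training, hard at inference.
        if self.training:
            u = torch.rand_like(self.layer_gate_logit)
            noise = torch.log(u + 1e-9) - torch.log1p(-u + 1e-9)   # logistic noise
            gate = torch.sigmoid((self.layer_gate_logit + noise) / tau)
        else:
            gate = (self.layer_gate_logit > 0).float()

        # Residual integration keeps the Receiver's original cache intact.
        return recv_kv + gate * fused

# Example: fuse toy caches at one layer (2 sequences, 64 tokens, dim 512, 8 heads).
fuser = C2CFuserSketch(dim=512, num_heads=8)
recv_k, sharer_k = torch.randn(2, 64, 512), torch.randn(2, 64, 512)
print(fuser(recv_k, sharer_k).shape)  # torch.Size([2, 64, 512])
```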

Alignment Strategies

  • Token Alignment: Handles tokenizer discrepancies by re-encoding Receiver tokens with the Sharer's tokenizer, using maximal string coverage for one-to-many mappings.
  • Layer Alignment: Employs terminal alignment, pairing the deepest layers of both models and proceeding in reverse order, ensuring that high-level semantic representations are aligned (see the sketch below).
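
The sketch below illustrates terminal layer alignment for two models of different depths. The mapping rule is an assumption consistent with the description above, not a reproduction of the paper's exact procedure.

```python
def terminal_layer_alignment(num_sharer_layers: int, num_receiver_layers: int):
    """Illustrative terminal alignment: pair layers from the deepest layer
    backwards so high-level representations line up; unmatched shallow
    layers of the deeper model are left unpaired."""
    pairs = []
    s, r = num_sharer_layers - 1, num_receiver_layers - 1
    while s >= 0 and r >= 0:
        pairs.append((s, r))  # (sharer_layer_idx, receiver_layer_idx)
        s -= 1
        r -= 1
    return list(reversed(pairs))

# Example: a 36-layer Sharer paired with a 28-layer Receiver.
print(terminal_layer_alignment(36, 28)[:3])   # [(8, 0), (9, 1), (10, 2)]
print(terminal_layer_alignment(36, 28)[-1])   # (35, 27)
```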

Training Protocol

  • Both Sharer and Receiver models are frozen; only the C2C Fuser is trained.
  • Training uses standard next-token prediction loss, with the Receiver conditioned on the fused cache.
  • The training pipeline consists of a forward pass (cache extraction), fusion (cache replacement), and supervised prediction (gradient backpropagation through the Fuser); a sketch of a single training step follows.
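
The following sketch outlines one training step under these constraints, using Hugging Face-style model calls. The cache-injection interface and the fuser call signature are placeholders for whatever the released codebase actually exposes.

```python
import torch

def train_fuser_step(sharer, receiver, fuser, optimizer, prompt_ids, answer_ids):
    """One illustrative C2C training step: both LLMs stay frozen and only the
    fuser's parameters receive gradients. How the fused cache is injected
    back into the Receiver is a placeholder assumption."""
    # 1) Forward pass: extract both models' KV-Caches for the shared prompt.
    with torch.no_grad():
        sharer_cache = sharer(prompt_ids, use_cache=True).past_key_values
        receiver_cache = receiver(prompt_ids, use_cache=True).past_key_values

    # 2) Fusion: replace the Receiver's cache with the fused cache
    #    (gradients flow only through the fuser's parameters).
    fused_cache = fuser(receiver_cache, sharer_cache)

    # 3) Supervised prediction: next-token loss on the answer, conditioned
    #    on the fused prompt cache.
    out = receiver(answer_ids, past_key_values=fused_cache, labels=answer_ids)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```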

Empirical Results

Performance and Efficiency

C2C consistently outperforms both individual models and T2T communication across a range of benchmarks (OpenBookQA, MMLU-Redux, ARC-Challenge, C-Eval):

  • Accuracy Gains: C2C achieves 8.5–10.5% higher average accuracy than individual models and 3.0–5.0% higher than T2T communication.
  • Latency: C2C delivers a 2x speedup in inference latency compared to T2T, due to the elimination of sequential text generation.
  • Scalability: C2C's gains increase with the Sharer's model size, and it generalizes across model families, sizes, and specializations.

Ablation and Analysis

  • Fuser Design: Residual fusion and gating are critical; naive projection yields substantially lower accuracy.
  • Effective Rank: C2C increases the effective rank of the Receiver's cache, indicating a richer semantic space post-fusion (see the sketch after this list).
  • Progressive Fusion: Increasing the proportion of fused cache improves performance, especially when fusing deeper layers.
  • Generalization: C2C is robust to model swaps and heterogeneous model pairs, and does not simply overfit to the training set.
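
For reference, effective rank is commonly computed as the exponential of the entropy of the normalized singular values (the Roy and Vetterli definition); the sketch below illustrates this estimator on toy cache matrices. Whether the paper uses precisely this formulation is an assumption.

```python
import torch

def effective_rank(matrix: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank: exp of the entropy of normalized singular values."""
    s = torch.linalg.svdvals(matrix)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

# Toy comparison: a cache spanning more directions has a higher effective
# rank than one concentrated in a few directions.
low = torch.randn(256, 8) @ torch.randn(8, 1024)   # rank <= 8
high = torch.randn(256, 1024)                      # near full rank
print(effective_rank(low), effective_rank(high))
```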

Case Study

In a physics question involving Coulomb's law, both the Sharer and Receiver alone fail, and T2T communication is insufficient due to ambiguous text. C2C enables the Receiver to leverage the Sharer's contextual understanding, resulting in the correct answer—demonstrating the practical value of direct cache transfer.

Theoretical and Practical Implications

C2C challenges the assumption that natural language is the optimal interface for LLM collaboration. By leveraging internal representations, it enables higher-bandwidth, lower-latency, and less ambiguous communication. This paradigm has several implications:

  • Multi-LLM Systems: C2C can be integrated into collaborative agent frameworks, enabling richer and more efficient inter-agent communication.
  • Inference Acceleration: C2C can enhance speculative decoding and token-level routing, reducing computational cost and latency.
  • Privacy and Edge Deployment: Direct cache transfer allows for privacy-preserving cloud-edge collaboration, as only internal representations (not raw text) are transmitted.
  • Multimodal Fusion: The approach can be extended to fuse caches across language, vision, and action models, facilitating more integrated multimodal systems.

Limitations and Future Directions

  • Representational Gaps: The transformed cache occupies only a subset of the target model's space, indicating incomplete semantic transfer.
  • Model Compatibility: While C2C generalizes across families and sizes, extreme architectural differences may require more sophisticated alignment.
  • Security and Interpretability: Direct cache transfer raises new questions about information leakage and the interpretability of fused representations.

Future work may explore more complex Fuser architectures, dynamic multi-agent protocols, and extensions to multimodal and cross-lingual settings.

Conclusion

Cache-to-Cache (C2C) establishes a new paradigm for direct, semantic communication between LLMs by fusing internal KV-Cache representations. Empirical results demonstrate that C2C yields higher accuracy and lower latency than text-based communication, with robust generalization across models and tasks. This work opens new avenues for efficient, high-fidelity multi-LLM systems and challenges the primacy of natural language as the sole medium for inter-model communication.
