Cache-to-Cache Paradigm
- The Cache-to-Cache paradigm is a set of techniques enabling direct, low-latency data sharing between caches in both neural network systems and multi-core architectures.
- In LLM systems, it employs KV-cache fusion through neural fusers and alignment strategies to enhance semantic communication and reduce latency compared to token-based exchanges.
- In hardware, it optimizes cache coherence by biasing LLC replacement policies, thereby lowering remote cache misses and improving overall system performance.
The Cache-to-Cache (C2C) paradigm encompasses multiple advanced techniques for high-efficiency, direct information transfer between distributed computational agents. In LLM systems, C2C enables semantic-level communication via direct sharing and fusion of internal key-value (KV) caches, eliminating the need for lossy token-based exchanges. In computer architecture, C2C designates optimized hardware protocols for managing coherence and localizing remote data transfers in multi-socket, multi-core systems. Both contexts embrace the central notion of bypassing conventional intermediate representations (text or DRAM) in favor of higher-bandwidth, lower-latency cache-to-cache pathways with application-specific optimization.
1. Semantic Motivation and Historical Background
C2C in LLM Systems
Traditional multi-LLM (“multi-agent”) communication relies on token-level exchanges: each LLM must serialize its internal activations into a linear text sequence, transmitting these tokens to its peer, which decodes and re-embeds them to reconstruct input context. This “text-to-text” (T2T) interface suffers from several fundamental limitations (Fu et al., 3 Oct 2025):
- Irrecoverable semantic loss, as the high-dimensional structure of internal representations (including structural, factual, and chain-of-reasoning signals) cannot be fully captured by token serialization.
- Ambiguity, as natural language inherently admits underspecified references and idiosyncratic phrasing, resulting in misinterpretation on the receiving end.
- Severe latency, since decoding proceeds strictly token-by-token, bottlenecking throughput and inflating inference time, especially on long messages.
C2C in Multi-Core Hardware
Multi-socket, multi-core server architectures maintain cache coherence via hardware protocols (e.g., MOESI) that frequently necessitate “remote cache-to-cache” transfers—moving cache lines directly between LLCs of different sockets when remote data is needed. These transfers avoid the full cost of remote DRAM but still incur significant inter-socket latency and interconnect pressure (Durbhakula, 2019). They present critical performance bottlenecks when remote data is frequently ping-ponged.
2. Mechanisms: Formal Definitions and Architectures
C2C for LLMs: Direct Semantic Fusion
The C2C paradigm for LLMs operationalizes a direct inter-model cache projection and fusion pipeline. Denote the input sequence as $x$ and model $M$'s per-layer KV-cache as $C_M^{(l)}$:
- Given a Sharer model $M_S$ and a Receiver model $M_R$, align their layers (terminal alignment) and tokens (maximal-coverage mapping).
- For each aligned layer $l$, fuse the caches with a fuser $F^{(l)}$: $C_{\text{fused}}^{(l)} = F^{(l)}\big(C_{M_R}^{(l)}, C_{M_S}^{(l)}\big)$.
- The fused cache $C_{\text{fused}}$ is used for autoregressive decoding without reverting back to tokens.
The fuser itself comprises projection via 1×1 convolutions (or linear layers), feature fusion (concatenation followed by a residual connection), dynamic per-head weighting, and a learnable per-layer gating mechanism (with Gumbel-sigmoid annealing). The final cache fusion per layer is
$$C_{\text{fused}}^{(l)} = C_{M_R}^{(l)} + g^{(l)} \, w^{(l)} \odot F^{(l)}\big(C_{M_R}^{(l)}, C_{M_S}^{(l)}\big),$$
where $g^{(l)} \in \{0,1\}$ is a hard gate and $w^{(l)}$ are per-head scaling factors (Fu et al., 3 Oct 2025).
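The following is a minimal PyTorch-style sketch of one such per-layer fuser, illustrating the projection, concatenation-plus-residual fusion, per-head weighting, and hard gating described above. The class name `C2CFuser` and the tensor layout (`[batch, tokens, heads × head_dim]`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class C2CFuser(nn.Module):
    """Illustrative per-layer KV-cache fuser: project the Sharer cache,
    fuse by concatenation with a residual connection, then apply dynamic
    per-head weights and a hard layer gate (sketch, not the paper's code)."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        d = num_heads * head_dim
        self.num_heads, self.head_dim = num_heads, head_dim
        self.proj = nn.Linear(d, d)        # stands in for the 1x1-conv projection
        self.fuse = nn.Linear(2 * d, d)    # feature fusion over [Receiver ; Sharer]
        self.head_weights = nn.Parameter(torch.zeros(num_heads))  # dynamic per-head weighting
        self.gate_logit = nn.Parameter(torch.zeros(1))            # learnable per-layer gate

    def forward(self, recv_kv: torch.Tensor, sharer_kv: torch.Tensor) -> torch.Tensor:
        # Both caches: [batch, tokens, num_heads * head_dim], already token-aligned.
        shared = self.proj(sharer_kv)
        fused = self.fuse(torch.cat([recv_kv, shared], dim=-1)) + recv_kv  # residual
        w = torch.sigmoid(self.head_weights).repeat_interleave(self.head_dim)
        g = (torch.sigmoid(self.gate_logit) > 0.5).float()  # hardened gate at inference
        return recv_kv + g * w * (fused - recv_kv)

# Usage: fuse one layer's (key) cache of a Sharer into the Receiver's cache.
fuser = C2CFuser(num_heads=8, head_dim=64)
recv_keys = torch.randn(1, 128, 8 * 64)
sharer_keys = torch.randn(1, 128, 8 * 64)
fused_keys = fuser(recv_keys, sharer_keys)   # same shape as recv_keys
```

During training the gate would instead be relaxed (the Gumbel-sigmoid annealing mentioned above); the hard threshold shown here corresponds to inference, where layers whose gates close simply pass the Receiver cache through unchanged.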
C2C for Hardware: Biased LLC Replacement for Remote Cache Transfers
C2C optimization in the hardware context involves dynamically biasing replacement policies to reduce costly evictions of cache lines loaded via remote cache-to-cache coherence. The mechanism is as follows (Durbhakula, 2019):
- Each LLC line has an extra SharedBit, set to 1 if imported via a remote cache-to-cache protocol.
- Per cache set, two counters track skipped evictions of local-home and remote-home shared lines, each governed by its own threshold defined in terms of the set associativity.
- Replacement policy selects the LRU line unless it is remote-shared, in which case skip/evict decisions depend on counter status.
Formally, shared lines receive a “stay-alive bonus” relative to plain LRU: a line whose SharedBit is set is spared by the victim search (more aggressively when its home node is remote) until the corresponding per-set counter saturates.
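A minimal Python sketch of this biased victim selection is shown below, assuming an LRU-ordered set; the field and threshold names (`shared_bit`, `remote_home`, `t_local`, `t_remote`) are illustrative, and the exact skip/evict policy in (Durbhakula, 2019) may differ in detail.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    shared_bit: bool = False   # set when the line was filled via a remote C2C transfer
    remote_home: bool = False  # True if the line's home node is a remote socket

@dataclass
class CacheSet:
    lines: list             # ordered from LRU (index 0) to MRU (last)
    skip_local: int = 0     # skipped evictions of local-home shared lines
    skip_remote: int = 0    # skipped evictions of remote-home shared lines

def pick_victim(s: CacheSet, t_local: int, t_remote: int) -> int:
    """Return the index of the line to evict from set `s`.

    Plain LRU is used unless the candidate carries its SharedBit, in which
    case the eviction may be skipped (biasing remote-home lines via a larger
    threshold) until the corresponding per-set counter saturates."""
    for i, line in enumerate(s.lines):                 # scan from LRU towards MRU
        if not line.shared_bit:
            return i                                   # ordinary LRU victim
        if line.remote_home and s.skip_remote < t_remote:
            s.skip_remote += 1                         # spare the costly remote-shared line
            continue
        if not line.remote_home and s.skip_local < t_local:
            s.skip_local += 1                          # spare the local-home shared line
            continue
        return i                                       # counter exhausted: evict anyway
    # Every line in the set was spared this pass; fall back to LRU and reset.
    s.skip_local = s.skip_remote = 0
    return 0
```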
3. Training, Optimization, and Runtime Adaptation
LLM C2C Fusion Training
The Sharer and Receiver LLMs remain frozen during C2C fusion training; only the fuser network parameters $\theta$ are optimized. The training objective is standard next-token cross-entropy on fused cache-augmented contexts,
$$\mathcal{L}(\theta) = -\sum_{t} \log p_{M_R}\!\big(x_{t+1} \mid x_{\le t},\, C_{\text{fused}}(\theta)\big).$$
Regularization includes weight decay (0.01), gradient clipping (1.0), and entropy penalties for gate discretization. Training is performed on high-diversity data (OpenHermes-2.5, MMLU auxiliary, LongBench E), with tuned schedules for batch, learning rate, and warmup (Fu et al., 3 Oct 2025).
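A schematic training step under this recipe might look as follows; `sharer`, `receiver`, `fusers`, and the two helper callables are placeholders standing in for the frozen LLMs, the per-layer fuser modules, and cache extraction/decoding, not the paper's code.

```python
import torch
import torch.nn.functional as F

def make_optimizer(fusers):
    """Only the fuser parameters are trainable; AdamW with 0.01 weight decay."""
    params = [p for f in fusers for p in f.parameters()]
    return torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01), params

def training_step(sharer, receiver, fusers, opt, params,
                  input_ids, labels, get_kv_cache, decode_logits_with_cache):
    """One C2C fusion training step: both LLMs frozen, next-token cross-entropy
    on the fused-cache context, gradient clipping at 1.0.

    `get_kv_cache(model, ids)` and `decode_logits_with_cache(model, ids, kv)`
    are assumed helpers (prefill / cache-conditioned decoding), not an API
    from the paper."""
    for p in list(sharer.parameters()) + list(receiver.parameters()):
        p.requires_grad_(False)                      # keep both LLMs frozen
    with torch.no_grad():
        sharer_kv = get_kv_cache(sharer, input_ids)
        receiver_kv = get_kv_cache(receiver, input_ids)
    # Gradients flow only through the fusers, applied layer by layer.
    fused_kv = [f(r, s) for f, r, s in zip(fusers, receiver_kv, sharer_kv)]
    logits = decode_logits_with_cache(receiver, input_ids, fused_kv)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 1.0)      # gradient clipping at 1.0
    opt.step()
    return loss.item()
```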
Hardware C2C Policy Adaptation
To avoid unnecessary biasing, an adaptive controller tracks the remote miss fraction
$$\mathrm{RMF} = \frac{\text{LLC misses served by remote cache-to-cache transfers}}{\text{total LLC misses}}$$
and enables or disables the biased replacement policy using high/low watermarks (HWM = 0.5, LWM = 0.1). This adaptivity ensures the hardware overhead is incurred only in sharing-intensive regimes (Durbhakula, 2019).
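A minimal sketch of this watermark (hysteresis) logic, with illustrative counter names:

```python
HWM, LWM = 0.5, 0.1   # high/low watermarks from the text

class BiasController:
    """Enable biased replacement only when remote C2C misses dominate."""

    def __init__(self):
        self.bias_enabled = False
        self.remote_c2c_misses = 0
        self.total_llc_misses = 0

    def record_miss(self, served_by_remote_cache: bool):
        self.total_llc_misses += 1
        if served_by_remote_cache:
            self.remote_c2c_misses += 1

    def update(self) -> bool:
        """Call at the end of each sampling interval."""
        if self.total_llc_misses == 0:
            return self.bias_enabled
        rmf = self.remote_c2c_misses / self.total_llc_misses
        if rmf > HWM:
            self.bias_enabled = True    # sharing-intensive phase: bias replacement
        elif rmf < LWM:
            self.bias_enabled = False   # bias no longer pays off
        # Between the watermarks, keep the previous decision (hysteresis).
        self.remote_c2c_misses = self.total_llc_misses = 0
        return self.bias_enabled
```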
4. Experimental Evaluation and Benchmarking
LLM C2C: Accuracy and Latency Metrics
Extensive experiments across OpenBookQA, MMLU-Redux, ARC-Challenge, C-Eval, and LongBenchV1 demonstrate:
- C2C yields +8.5–10.5% absolute accuracy over Receiver-only, and +3.0–5.0% over T2T communication.
- Latency is reduced by approximately 2× (e.g., C2C: 0.40 s, T2T: 1.52 s on A100 GPU, batch=1).
- C2C’s performance gap widens with larger Sharer models and longer contexts, recovering roughly 40% more of the performance lost to T2T in long-context settings (see Table 4, Figure 1).
Ablation studies further show:
- Heterogeneous Sharer-Receiver pairings surpass self-communication and single-model tuning.
- Each fuser component (feature fusion, gating) yields measurable accuracy gains.
- Gate and dynamic weights modulate which layers absorb external context, depending on whether tuning is general-purpose or task-specific.
Hardware C2C: Qualitative and Metric-Based Claims
While the principal evaluation is qualitative, the anticipated effects include:
- Reductions in the remote miss fraction, overall LLC miss rate, and end-to-end execution time for sharing-intensive workloads.
- Remote-shared line biasing is superior to traditional LRU in scenarios where remote data ping-pong is non-negligible, achieving better locality and lower latency without OS or software changes.
- Quantitative metrics proposed for future work include RMF, LLC miss rates, wall-clock runtime, and cache-to-cache transfer counts (Durbhakula, 2019).
| Context | C2C Mechanism | Reported Effects |
|---|---|---|
| LLM Systems | KV-Cache fusion via neural fuser | +8.5–10.5% accuracy; 2× faster than text comm. |
| Coherence HW | Biased LLC replacement | Lower RMF and execution time in sharing-heavy phases |
5. Analyses, Interpretations, and Theoretical Implications
Behavioral experiments for LLM C2C show that the cache fusion step increases the effective rank (intrinsic dimension) of cache tensors (e.g., from 388 to 395 in one configuration and from 532 to 560 in another), indicating injected semantic richness. Layerwise studies reveal a strong correlation between deeper fused layers (closer to the output) and downstream accuracy, with progressive replacement yielding monotonic gains. In heterogeneous pairings, C2C recovers over 72% of correct answers from the stronger Sharer, facilitating robust integration of complementary expertise (see Figures 2 and 8) (Fu et al., 3 Oct 2025).
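For concreteness, one common way to compute such an effective-rank comparison is the entropy-based estimator (Roy and Vetterli) applied to the singular values of a flattened cache tensor; the sketch below uses that estimator, which may differ from the exact measure used in the paper.

```python
import torch

def effective_rank(mat: torch.Tensor) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution (one common estimator)."""
    s = torch.linalg.svdvals(mat.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

# Compare a layer's cache before and after fusion, flattened to [tokens, dim].
before = torch.randn(512, 1024)                 # stand-in for the Receiver-only cache
after = before + 0.3 * torch.randn(512, 1024)   # stand-in for the fused cache
print(effective_rank(before), effective_rank(after))
```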
For hardware C2C, the practical implication is that giving remote-shared cache lines higher eviction thresholds improves hit rates for expensive lines, aligning hardware resource allocation with access cost profiles. This form of “coherence-aware caching” exemplifies refinements possible when the structural semantics of cache traffic are explicitly quantified (Durbhakula, 2019).
6. Limitations, Trade-offs, and Future Directions
LLM C2C
Limitations include:
- Fragility of token and layer alignment for highly disparate model architectures.
- Inference-time compute overhead from fuser projections compared to single-model runs, though still favorable versus T2T latency.
- Use of shallow (1–2-layer) fuser architectures; deeper or more structured fusers may yield further gains but increase complexity.
Potential advances include privacy-preserving communication (e.g., transmitting only gated cache slices) and multimodal fusion, merging caches across vision, language, or policy models (Fu et al., 3 Oct 2025).
Hardware C2C
The hardware scheme incurs minimal area overhead (1 bit per line, two counters per set) but adds moderate verification complexity. The approach is best suited to regimes where the working set fits in LLC and remote sharing is frequent; otherwise, coarse-grained methods such as OS-level page migration may dominate. Extension avenues involve quantitative evaluation with benchmarks (Graph500, memcached), online adaptation of thresholds, and combined OS–hardware strategies (Durbhakula, 2019).
7. Broader Context and Comparative Summary
The Cache-to-Cache paradigm, deployed in both neural and hardware contexts, exemplifies high-bandwidth, high-efficiency intra- and inter-agent communication through direct cache manipulation and dynamic adaptation. Within LLM systems, C2C represents a structured, differentiable, parallel semantic fusion mechanism significantly outperforming text-mediated collaboration in both accuracy and latency. In microarchitecture, C2C replacement biasing prioritizes access-locality for high-cost, remote-shared lines, directly reducing coherence-driven stalls without software intervention.
Across both domains, C2C shifts the communication and resource allocation bottleneck from external intermediates (tokens, DRAM) to internal representations or hardware-managed buffers, opening new directions for scalable, semantically enriched multi-agent and multicore performance (Fu et al., 3 Oct 2025, Durbhakula, 2019).