
Cache-to-Cache Paradigm

Updated 28 November 2025
  • Cache-to-Cache paradigm is a set of techniques enabling direct, low-latency data sharing between caches in both neural network systems and multi-core architectures.
  • In LLM systems, it employs KV-cache fusion through neural fusers and alignment strategies to enhance semantic communication and reduce latency compared to token-based exchanges.
  • In hardware, it optimizes cache coherence by biasing LLC replacement policies, thereby lowering remote cache misses and improving overall system performance.

The Cache-to-Cache (C2C) paradigm encompasses multiple advanced techniques for high-efficiency, direct information transfer between distributed computational agents. In LLM systems, C2C enables semantic-level communication via direct sharing and fusion of internal key-value (KV) caches, eliminating the need for lossy token-based exchanges. In computer architecture, C2C designates optimized hardware protocols for managing coherence and localizing remote data transfers in multi-socket, multi-core systems. Both contexts embrace the central notion of bypassing conventional intermediate representations (text or DRAM) in favor of higher-bandwidth, lower-latency cache-to-cache pathways with application-specific optimization.

1. Semantic Motivation and Historical Background

C2C in LLM Systems

Traditional multi-LLM (“multi-agent”) communication relies on token-level exchanges: each LLM must serialize its internal activations into a linear text sequence, transmitting these tokens to its peer, which decodes and re-embeds them to reconstruct the input context. This “text-to-text” (T2T) interface suffers from several fundamental limitations (Fu et al., 3 Oct 2025):

  • Irrecoverable semantic loss, as the high-dimensional structure of internal representations (including structural, factual, and chain-of-reasoning signals) cannot be fully captured by token serialization.
  • Ambiguity, as natural language inherently admits underspecified references and idiosyncratic phrasing, resulting in misinterpretation on the receiving end.
  • Severe latency, since decoding proceeds strictly token-by-token, bottlenecking throughput and inflating inference time, especially on long messages.

C2C in Multi-Core Hardware

Multi-socket, multi-core server architectures maintain cache coherence via hardware protocols (e.g., MOESI) that frequently necessitate “remote cache-to-cache” transfers—moving cache lines directly between LLCs of different sockets when remote data is needed. These transfers avoid the full cost of remote DRAM but still incur significant inter-socket latency and interconnect pressure (Durbhakula, 2019). They present critical performance bottlenecks when remote data is frequently ping-ponged.

2. Mechanisms: Formal Definitions and Architectures

C2C for LLMs: Direct Semantic Fusion

The C2C paradigm for LLMs operationalizes a direct inter-model cache projection and fusion pipeline. Denote the input as $X = [x_0, \dots, x_{n-1}]$ and model $M$'s per-layer KV-cache as $C^M(X) = \{ K_l^M(X), V_l^M(X) \}_{l=1}^L$:

  • Given Sharer model $S$ and Receiver model $R$, align their layers (terminal alignment) and tokens (maximal-coverage mapping).
  • For each layer $l$, fuse caches $\tilde{K}_l^R, \tilde{V}_l^R$ using a fuser $F_l$:

$$\tilde K_l^R = F_l^K(K_l^R, K_{g(l)}^S), \quad \tilde V_l^R = F_l^V(V_l^R, V_{g(l)}^S)$$

  • The fused cache $C^F(X) = \{ (\tilde K_l^R, \tilde V_l^R) \}_{l=1}^L$ is used for autoregressive decoding without reverting back to tokens.

The fuser itself comprises projection with 1×1 convolutions (or linear layers), feature fusion (by concatenation and residual connection), dynamic per-head weighting, and a learnable per-layer gating mechanism (with Gumbel-sigmoid annealing). The final cache fusion per layer is:

$$\tilde K_l = g_l \cdot (\alpha_l \odot F_K) + (1 - g_l) \cdot K_l^R, \quad \tilde V_l = g_l \cdot (\alpha_l \odot F_V) + (1 - g_l) \cdot V_l^R$$

where $g_l$ is a hard gate and $\alpha_l$ are per-head scaling factors (Fu et al., 3 Oct 2025).
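To make the fuser architecture concrete, the following is a minimal PyTorch-style sketch of a single per-layer fuser implementing the gated fusion above. The tensor layout ([batch, seq, heads, head_dim]), module names, and the simplified Gumbel-sigmoid gate are illustrative assumptions, not the reference implementation of Fu et al.

```python
# Minimal sketch of one per-layer C2C fuser (illustrative, not the authors' code).
import torch
import torch.nn as nn


class C2CLayerFuser(nn.Module):
    """Fuses the Receiver's KV-cache with the aligned Sharer's KV-cache at one layer."""

    def __init__(self, num_heads: int, head_dim: int, tau: float = 1.0):
        super().__init__()
        d = num_heads * head_dim
        self.num_heads, self.head_dim, self.tau = num_heads, head_dim, tau
        # Projection of concatenated [receiver; sharer] features (1x1 conv == per-token linear).
        self.proj_k = nn.Linear(2 * d, d)
        self.proj_v = nn.Linear(2 * d, d)
        # Dynamic per-head scaling (alpha_l) and a learnable per-layer gate logit (g_l).
        self.alpha = nn.Parameter(torch.zeros(num_heads))
        self.gate_logit = nn.Parameter(torch.zeros(1))

    def _gate(self) -> torch.Tensor:
        # Gumbel-sigmoid: noisy, nearly binary gate during training; hard threshold at eval.
        if self.training:
            u = torch.rand_like(self.gate_logit).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)
            return torch.sigmoid((self.gate_logit + noise) / self.tau)
        return (self.gate_logit > 0).float()

    def forward(self, k_r, v_r, k_s, v_s):
        # All inputs: [batch, seq, num_heads, head_dim], already token/layer aligned.
        b, t, h, dh = k_r.shape
        flat = lambda x: x.reshape(b, t, h * dh)
        # Feature fusion: concatenate, project, and add a residual from the receiver cache.
        f_k = self.proj_k(torch.cat([flat(k_r), flat(k_s)], dim=-1)).reshape(b, t, h, dh) + k_r
        f_v = self.proj_v(torch.cat([flat(v_r), flat(v_s)], dim=-1)).reshape(b, t, h, dh) + v_r
        alpha = torch.sigmoid(self.alpha).view(1, 1, h, 1)  # per-head weights in (0, 1)
        g = self._gate()
        # K̃_l = g·(α ⊙ F_K) + (1−g)·K_R, and likewise for V, as in the fusion equation above.
        k_fused = g * (alpha * f_k) + (1 - g) * k_r
        v_fused = g * (alpha * f_v) + (1 - g) * v_r
        return k_fused, v_fused
```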

C2C for Hardware: Biased LLC Replacement for Remote Cache Transfers

C2C optimization in the hardware context involves dynamically biasing replacement policies to reduce costly evictions of cache lines loaded via remote cache-to-cache coherence. The mechanism is as follows (Durbhakula, 2019):

  • Each LLC line has an extra SharedBit, set to 1 if imported via a remote cache-to-cache protocol.
  • Per cache set, two counters $RLC_j$ and $RRC_j$ track skipped evictions for local- and remote-home shared lines, with separate thresholds $T_\text{local}, T_\text{remote}$ (typically $T_\text{local} = A/4$, $T_\text{remote} = A/2$ for associativity $A$).
  • Replacement policy selects the LRU line unless it is remote-shared, in which case skip/evict decisions depend on counter status.

Formally, define a “stay-alive bonus” for line ii:

$$W(i) = R_i \cdot \begin{cases} \lambda_{\text{local}} & H_i = 1 \\ \lambda_{\text{remote}} & H_i = 0 \end{cases}$$

with $R_i = 1$ for SharedBit set and $H_i$ indicating home/locality.
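The Python sketch below illustrates how such a biased victim selection could work for a single LLC set. The field names (shared_bit, local_home), the exact skip rule, and the fallback behavior are assumptions distilled from the description above, not the hardware design of Durbhakula (2019).

```python
# Illustrative victim-selection sketch for the biased LLC replacement policy.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CacheLine:
    tag: int
    shared_bit: bool = False   # set when the line arrived via a remote cache-to-cache transfer
    local_home: bool = True    # H_i: whether the line's home node is the local socket
    lru_rank: int = 0          # 0 = least recently used


@dataclass
class CacheSet:
    associativity: int
    lines: List[CacheLine] = field(default_factory=list)
    rlc: int = 0  # skipped evictions of local-home shared lines (RLC_j)
    rrc: int = 0  # skipped evictions of remote-home shared lines (RRC_j)

    def select_victim(self) -> Optional[CacheLine]:
        t_local = self.associativity // 4   # T_local = A/4
        t_remote = self.associativity // 2  # T_remote = A/2
        # Walk candidates from LRU toward MRU; skip shared lines while the per-set
        # counters stay under their thresholds (the "stay-alive bonus").
        for line in sorted(self.lines, key=lambda l: l.lru_rank):
            if not line.shared_bit:
                return line  # plain LRU victim
            if line.local_home and self.rlc < t_local:
                self.rlc += 1
                continue  # skip: local-home shared line keeps its smaller bonus
            if not line.local_home and self.rrc < t_remote:
                self.rrc += 1
                continue  # skip: remote-home shared line gets the larger bonus
            return line  # counter exhausted: evict even though the line is shared
        # Every candidate was skipped; fall back to plain LRU and reset the counters.
        self.rlc = self.rrc = 0
        return min(self.lines, key=lambda l: l.lru_rank) if self.lines else None
```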

3. Training, Optimization, and Runtime Adaptation

LLM C2C Fusion Training

The Sharer and Receiver LLMs remain frozen during C2C fusion training; only the fuser network parameters are optimized. The training objective is standard next-token cross-entropy on fused cache-augmented contexts:

$$\mathcal{L} = -\sum_{i} \log P_R(y_{i+1} \mid C^F(X) \oplus C(Y_{0:i}))$$

Regularization includes weight decay (0.01), gradient clipping (1.0), and entropy penalties for gate discretization. Training is performed on high-diversity data (OpenHermes-2.5, MMLU auxiliary, LongBench E), with tuned schedules for batch, learning rate, and warmup (Fu et al., 3 Oct 2025).
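As a rough illustration, a single fuser training step might look like the sketch below. The helper build_fused_cache (which would run the frozen Sharer and Receiver prefills and apply the per-layer fusers) and the Hugging Face-style receiver interface (logits, past_key_values) are hypothetical stand-ins; only the gradient-clipping value follows the text, and the optimizer would typically be constructed with weight decay 0.01 as noted above.

```python
# Schematic C2C fuser training step; build_fused_cache and the receiver API are assumed.
import torch


def fuser_training_step(receiver, fusers, optimizer, context_ids, target_ids, max_grad_norm=1.0):
    receiver.eval()                          # Sharer and Receiver stay frozen
    for p in receiver.parameters():
        p.requires_grad_(False)

    # C^F(X): fused per-layer KV tensors produced by the trainable fusers (hypothetical helper).
    fused_cache = build_fused_cache(fusers, context_ids)

    # Next-token cross-entropy on the response, conditioned on the fused cache.
    out = receiver(input_ids=target_ids, past_key_values=fused_cache, use_cache=True)
    logits = out.logits[:, :-1, :]
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids[:, 1:].reshape(-1)
    )

    optimizer.zero_grad()
    loss.backward()                           # gradients flow only into the fuser parameters
    torch.nn.utils.clip_grad_norm_(           # gradient clipping at 1.0, as in the text
        [p for f in fusers for p in f.parameters()], max_grad_norm
    )
    optimizer.step()
    return loss.item()
```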

Hardware C2C Policy Adaptation

To avoid unnecessary biasing, an adaptive controller tracks the remote miss fraction

$$\mathrm{RMF} = \frac{\#\text{remote cache misses}}{\#\text{total LLC misses}}$$

and enables/disables the biased replacement policy using high/low watermarks (HWM=0.5, LWM=0.1). This adaptivity ensures the hardware overhead is only incurred in sharing-intensive regimes (Durbhakula, 2019).
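A minimal sketch of such a watermark controller is shown below; the counter inputs and class name are illustrative, with only the watermark values taken from the text.

```python
# Hysteresis controller: enable biased replacement above the high watermark,
# disable it below the low watermark.
class BiasController:
    def __init__(self, hwm: float = 0.5, lwm: float = 0.1):
        self.hwm, self.lwm = hwm, lwm
        self.bias_enabled = False

    def update(self, remote_cache_misses: int, total_llc_misses: int) -> bool:
        if total_llc_misses == 0:
            return self.bias_enabled
        rmf = remote_cache_misses / total_llc_misses
        if rmf >= self.hwm:
            self.bias_enabled = True      # sharing-intensive phase: bias LLC replacement
        elif rmf <= self.lwm:
            self.bias_enabled = False     # little remote sharing: fall back to plain LRU
        return self.bias_enabled
```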

4. Experimental Evaluation and Benchmarking

LLM C2C: Accuracy and Latency Metrics

Extensive experiments across OpenBookQA, MMLU-Redux, ARC-Challenge, C-Eval, and LongBenchV1 demonstrate:

  • C2C yields +8.5–10.5% absolute accuracy over Receiver-only, and +3.0–5.0% over T2T communication.
  • Latency is reduced by approximately 2× (e.g., C2C: 0.40 s, T2T: 1.52 s on A100 GPU, batch=1).
  • C2C’s performance gap widens with larger Sharer models and longer contexts, recovering roughly 40% more of the performance lost to T2T in long-context settings (see Table 4, Figure 1).

Ablation studies further show:

  • Heterogeneous Sharer-Receiver pairings surpass self-communication and single-model tuning.
  • Each fuser component (feature fusion, gating) yields measurable accuracy gains.
  • Gate and dynamic weights modulate which layers absorb external context, depending on general-purpose versus task-specific tuning.

Hardware C2C: Qualitative and Metric-Based Claims

While the principal evaluation is qualitative, the anticipated effects include:

  • Reductions in the remote miss fraction, overall LLC miss rate, and end-to-end execution time for sharing-intensive workloads.
  • Remote-shared line biasing is superior to traditional LRU in scenarios where remote data ping-pong is non-negligible, achieving better locality and lower latency without OS or software changes.
  • Quantitative metrics proposed for future work include RMF, LLC miss rates, wall-clock runtime, and cache-to-cache transfer counts (Durbhakula, 2019).

| Context | C2C Mechanism | Reported Effects |
|---|---|---|
| LLM Systems | KV-cache fusion via neural fuser | +8.5–10.5% accuracy; 2× faster than text communication |
| Coherence HW | Biased LLC replacement | Lower RMF and execution time in sharing-heavy phases |

5. Analyses, Interpretations, and Theoretical Implications

Behavioral experiments for LLM C2C show the cache fusion step increases the effective rank (intrinsic dimension) of cache tensors (e.g., $K$: 388→395, $V$: 532→560), indicating injected semantic richness. Layerwise studies reveal strong correlation between deeper-fused layers (closer to output) and downstream accuracy, with progressive replacement yielding monotonic gains. In heterogeneous pairings, C2C recovers over 72% of correct answers from the stronger Sharer, facilitating robust integration of complementary expertise (see Figures 2 and 8) (Fu et al., 3 Oct 2025).
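The paper's exact intrinsic-dimension metric is not reproduced here, but one common notion of effective rank, the exponential of the entropy of the normalized singular-value spectrum, can be sketched as follows; the function and its inputs are purely illustrative.

```python
# Illustrative effective-rank measurement for a flattened KV-cache tensor.
import torch


def effective_rank(cache: torch.Tensor) -> float:
    # cache: [seq, hidden], e.g. one layer's K or V cache flattened over heads.
    s = torch.linalg.svdvals(cache.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

# Comparing effective_rank(k_before) with effective_rank(k_fused) would quantify the
# "injected semantic richness" described above.
```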

For hardware C2C, the practical implication is that giving remote-shared cache lines higher eviction thresholds improves hit rates for expensive lines, aligning hardware resource allocation with access cost profiles. This form of “coherence-aware caching” exemplifies refinements possible when the structural semantics of cache traffic are explicitly quantified (Durbhakula, 2019).

6. Limitations, Trade-offs, and Future Directions

LLM C2C

Limitations include:

  • Fragility of token and layer alignment for highly disparate model architectures.
  • Inference-time compute overhead from fuser projections compared to single-model runs, though still favorable versus T2T latency.
  • Use of shallow (1–2-layer) fuser architectures; deeper or more structured fusers may yield further gains but increase complexity.

Potential advances include privacy-preserving communication (e.g., transmitting only gated cache slices) and multimodal fusion, merging caches across vision, language, or policy models (Fu et al., 3 Oct 2025).

Hardware C2C

The hardware scheme incurs minimal area overhead (1 bit per line, two counters per set) but adds moderate verification complexity. The approach is best suited to regimes where the working set fits in LLC and remote sharing is frequent; otherwise, coarse-grained methods such as OS-level page migration may dominate. Extension avenues involve quantitative evaluation with benchmarks (Graph500, memcached), online adaptation of thresholds, and combined OS–hardware strategies (Durbhakula, 2019).

7. Broader Context and Comparative Summary

The Cache-to-Cache paradigm, deployed in both neural and hardware contexts, exemplifies high-bandwidth, high-efficiency intra- and inter-agent communication through direct cache manipulation and dynamic adaptation. Within LLM systems, C2C represents a structured, differentiable, parallel semantic fusion mechanism significantly outperforming text-mediated collaboration in both accuracy and latency. In microarchitecture, C2C replacement biasing prioritizes access-locality for high-cost, remote-shared lines, directly reducing coherence-driven stalls without software intervention.

Across both domains, C2C shifts the communication and resource allocation bottleneck from external intermediates (tokens, DRAM) to internal representations or hardware-managed buffers, opening new directions for scalable, semantically enriched multi-agent and multicore performance (Fu et al., 3 Oct 2025, Durbhakula, 2019).
