Cache-to-Cache (C2C) Mechanisms
- Cache-to-Cache (C2C) is a communication mechanism that directly transfers data between caches in both hardware and LLM environments, reducing latency and bypassing traditional memory bottlenecks.
- In hardware systems, C2C leverages cache coherence protocols to share remote cache lines directly between cores, optimizing memory access and throughput.
- In LLM systems, C2C enables the fusion of key-value caches between models through neural adapters, yielding significant improvements in accuracy and speed.
Cache-to-Cache (C2C) refers to a class of mechanisms enabling the direct transfer, fusion, or exchange of information at the cache level between distributed computation entities. The term is most established in two distinct but technically analogous domains: (1) hardware systems, where remote cache-to-cache transfers minimize DRAM accesses to improve throughput in multi-socket servers, and (2) LLM systems, where direct transfer or fusion of key-value (KV) caches between models enables efficient, semantically enriched multi-LLM communication without the overheads of token-by-token text generation. Across both domains, C2C is defined by its focus on optimizing or repurposing cache state to enhance architectural efficiency or inter-agent interaction.
1. Definitions and Core Principles
In hardware, a remote cache-to-cache (C2C) transfer occurs when a core in socket X requests a memory line not present locally, and the coherence protocol supplies it directly from another socket's cache, bypassing DRAM; this results in lower latency, as the data is retrieved at the speed of inter-socket cache links rather than main memory. The data item is marked “shared-from-remote” in the last-level cache (LLC) with an explicit remote-shared flag (Durbhakula, 2019).
In the context of LLMs, Cache-to-Cache (C2C) communication denotes the direct transfer and learned fusion of deep, context-rich KV-cache tensors from a “Sharer” LLM into the working cache of a “Receiver” LLM, circumventing conventional text-based communication. C2C relies on neural adapters that align, project, and combine the per-layer per-token key and value tensors across models, optionally guided by trainable gating, yielding accuracy and efficiency gains relative to token-level exchange (Fu et al., 3 Oct 2025).
2. Cache-to-Cache in Hardware Multiprocessors
In a cache-coherent NUMA (ccNUMA) multi-socket, multi-core server, each socket features multiple private L1/L2 caches and a shared, inclusive LLC. The MOESI protocol governs coherence, allowing transfer of modified (M) and owned (O) lines directly between LLCs without DRAM roundtrips (Durbhakula, 2019). The access path is:
- Core → L1 → L2 → LLC (local socket)
- If not found, → LLC (remote socket) via C2C
- If not present remotely, → local or remote DRAM
The C2C transfer refers specifically to the second step above, in which latency is minimized and bandwidth is preserved. Optimizing C2C behavior in the LLC has a substantial impact on server workload performance, particularly for sharing-intensive applications such as OLTP databases or producer–consumer kernels.
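The access path above can be sketched as a small lookup routine. This is only an illustration: the `Socket` class, the `lookup` function, and the latency figures are hypothetical and are not taken from Durbhakula (2019).

```python
from dataclasses import dataclass, field

# Illustrative latencies in nanoseconds; these values are assumptions, not measurements.
LATENCY_NS = {"L1": 1, "L2": 4, "local LLC": 15, "remote LLC (C2C)": 60, "DRAM": 100}


@dataclass
class Socket:
    """One socket: private L1/L2 contents for a single core plus the shared LLC."""
    l1: set = field(default_factory=set)
    l2: set = field(default_factory=set)
    llc: set = field(default_factory=set)


def lookup(addr, local, remotes):
    """Return the name of the level that supplies `addr`, following the path above."""
    for name, level in (("L1", local.l1), ("L2", local.l2), ("local LLC", local.llc)):
        if addr in level:
            return name
    # Local miss: the coherence protocol probes remote LLCs before going to DRAM.
    if any(addr in remote.llc for remote in remotes):
        return "remote LLC (C2C)"
    return "DRAM"


local = Socket(l1={0x10}, llc={0x20})
remote = Socket(llc={0x30})
for addr in (0x10, 0x20, 0x30, 0x40):
    source = lookup(addr, local, [remote])
    print(hex(addr), source, LATENCY_NS[source], "ns")
```

Address `0x30` is the C2C case: it misses in every local level but is supplied by the remote LLC rather than DRAM.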
A key mechanism is the marking and preferential retention of lines loaded via remote C2C transfers (using an “R” bit in each LLC line), alongside per-set counters tracking remote-hot lines. The LLC replacement policy is then biased to extend the residency of remote-shared lines, with the bias bounded by per-set thresholds defined relative to the set associativity (Durbhakula, 2019).
3. Cache-to-Cache for LLM Communication
C2C operates at a higher level of abstraction in machine learning systems: it enables LLMs to communicate directly via their internal KV-caches, the layer-wise key (K) and value (V) tensors constructed during Transformer forward passes. Suppose Sharer S and Receiver R are Transformer-based LLMs in which every layer and token position yields one key-value pair. The stack of these tensors over all layers and positions is the model’s KV-cache.
Given two models presented with the same prompt, the C2C approach defines trainable fusion operators that ingest S’s (frozen) KV-cache for the aligned Sharer layer, together with R’s KV-cache for the corresponding Receiver layer, to generate a new fused cache for that Receiver layer.
The layer alignment can be a direct or a terminal (last-layer-to-last-layer) mapping (Fu et al., 3 Oct 2025). The projection and fusion operators are typically MLPs preceded by concatenation and down-projection, with learnable gates selecting which layers admit fusion. Training optimizes only the fusion adapters, using next-token prediction as the objective and keeping the Sharer and Receiver weights fixed.
This direct cache-level communication avoids serial token generation and encoding, reducing both latency and the loss of semantically rich representations.
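To make the two layer-alignment options concrete, the sketch below builds a Receiver-to-Sharer layer map for models of different depth. The function name `align_layers` and the exact pairing rules are assumptions for illustration; the source only names the two modes.

```python
def align_layers(n_receiver, n_sharer, mode="terminal"):
    """Map each Receiver layer index to a Sharer layer index.

    Assumed semantics (hypothetical, for illustration):
      - "direct":   Receiver layer i pairs with Sharer layer i.
      - "terminal": the last layers pair up, then the second-to-last, and so on.
    Receiver layers with no partner are left out, i.e. they receive no fusion.
    """
    shared = min(n_receiver, n_sharer)
    if mode == "direct":
        return {i: i for i in range(shared)}
    if mode == "terminal":
        return {n_receiver - 1 - k: n_sharer - 1 - k for k in range(shared)}
    raise ValueError(f"unknown mode: {mode}")


print(align_layers(4, 3, "direct"))    # {0: 0, 1: 1, 2: 2}
print(align_layers(4, 3, "terminal"))  # {3: 2, 2: 1, 1: 0}
```

With a 4-layer Receiver and a 3-layer Sharer, direct alignment leaves the Receiver's top layer unfused, whereas terminal alignment leaves its bottom layer unfused.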
4. Quantitative Evaluation and Comparative Results
C2C’s benefits in both hardware and LLM domains are substantiated by empirical metrics:
Hardware Multiprocessors
The expected reduction in remote C2C misses in the LLC is modeled as a function of three quantities: the tracking accuracy, the fraction of truly remote-shared lines, and the bias effectiveness. Each avoided miss in this regime directly improves AMAT (average memory access time) and leads to throughput increases quantified by the number of C2C events saved and the inter-socket latency saved relative to DRAM (Durbhakula, 2019).
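The effect of an avoided remote miss on AMAT can be illustrated with the standard probability-weighted latency sum. This is a generic textbook-style calculation, not the model from Durbhakula (2019); the function `amat_ns`, the hit fractions, and the latencies are all assumed values chosen for the example.

```python
def amat_ns(p_local_hit, p_remote_c2c, t_local=15.0, t_c2c=60.0, t_dram=100.0):
    """Average latency of an LLC-level access, in nanoseconds.

    p_local_hit  : fraction of accesses served by the local LLC
    p_remote_c2c : fraction served by a remote cache-to-cache transfer
    The remainder goes to DRAM. All latencies are illustrative assumptions.
    """
    p_dram = 1.0 - p_local_hit - p_remote_c2c
    if p_dram < -1e-12:
        raise ValueError("fractions must sum to at most 1")
    p_dram = max(p_dram, 0.0)  # absorb floating-point rounding
    return p_local_hit * t_local + p_remote_c2c * t_c2c + p_dram * t_dram


baseline = amat_ns(p_local_hit=0.70, p_remote_c2c=0.20)
# Suppose biasing keeps half of the remote-shared lines resident in the local LLC.
biased = amat_ns(p_local_hit=0.80, p_remote_c2c=0.10)
print(f"baseline {baseline:.1f} ns, biased {biased:.1f} ns")
```

Under these assumed numbers, turning half of the remote C2C transfers into local LLC hits lowers the average from 32.5 ns to 28.0 ns.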
Comparative attributes of LLC replacement biasing against alternatives (e.g., DRAM-based caches, OS page migration) are summarized:
| Metric | LLC Replacement Bias | Remote-Access-Cache | OS Page Migration | OS Scheduling |
|---|---|---|---|---|
| Ease of adoption | No SW changes | No SW changes | Requires OS changes | Requires scheduler changes |
| Flexibility | Dynamic on/off | Static DRAM | Configurable | Heuristic tuning |
| Verification | Moderate | High | Low | Low–moderate |
For small-to-medium remote-shared working sets, the biasing strategy is observed to be more effective at retaining hot remote lines than static hardware DRAM caches.
LLM Cache-to-Cache Communication
C2C enables substantial zero-shot accuracy gains over receiver-alone baselines and standard text-to-text (T2T) LLM communication:
- Accuracy vs. Receiver: higher average accuracy than the Receiver model alone
- Gain over T2T: a further accuracy improvement over text-to-text exchange
- Speedup: lower latency relative to T2T, as intermediate token decoding is bypassed (Fu et al., 3 Oct 2025)
Extending C2C, compressed fusion via Latent Cache Flow (LCF) replaces per-type full-width fusers with a single shared projector, shrinking the adapter from C2C’s 956 MB by a compression ratio of up to 24.6× while matching or exceeding accuracy. LCF-X further accommodates mismatched source/receiver contexts via learnable span pooling and global summary fusion (Rossi et al., 19 May 2026).
Results on benchmarks (e.g., MMLU-Redux, HotpotQA-bridge) show LCF-X outperforming C2C and all text-based baselines in both accuracy (Exact Match) and latency (time-to-first-token).
5. Algorithmic and Architectural Details
Hardware C2C Algorithm (LLC Replacement Biasing)
- Each LLC block includes an R (remote-shared) flag.
- Two per-set counters track the number of protected remote lines, one for lines from local home sockets and one for lines from remote home sockets.
- The eviction logic preferentially retains lines whose R flag is set until the counters hit their respective thresholds, at which point the bias is suspended and lines are evicted in LRU order; biasing is dynamically toggled according to the remote-miss fraction measured over time windows (Durbhakula, 2019).
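The victim-selection rule described above might look roughly like the following. This is a simplified sketch under stated assumptions: it uses a single counter and a single hypothetical `threshold` per set rather than the two counters of the original scheme, and the `Line` and `choose_victim` names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    remote: bool = False  # the R (remote-shared) flag


def choose_victim(lru_order, threshold):
    """Pick the index of the line to evict from one LLC set.

    lru_order : lines sorted from least to most recently used
    threshold : hypothetical cap on how many R-flagged lines a set may protect
    """
    protected = sum(line.remote for line in lru_order)
    if protected < threshold:
        # Bias active: skip R-flagged lines and evict the oldest unflagged one.
        for index, line in enumerate(lru_order):
            if not line.remote:
                return index
    # Bias suspended (or every line is flagged): fall back to plain LRU.
    return 0


cache_set = [Line(0xA, remote=True), Line(0xB), Line(0xC, remote=True), Line(0xD)]
print(choose_victim(cache_set, threshold=3))  # 1: the oldest unflagged line
print(choose_victim(cache_set, threshold=2))  # 0: threshold reached, plain LRU
```

A real design would also age or clear the R flag and toggle the bias from the observed remote-miss fraction; those parts are omitted here.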
LLM C2C/LCF Adapter Architecture
- Inputs: Sharer and Receiver per-layer key and value caches.
- C2C fusers: Concatenate per-token tensors; project to receiver hidden size; fuse and gate by per-layer Gumbel-sigmoids.
- LCF: Jointly compress keys/values via a down-project/MLP pipeline, derive residual updates and broadcast gated deltas.
- LCF-X: Handle context misalignment by span-based, pooled key/value summaries, globally fused and injected into the receiver cache.
All adapter parameters are trained with AdamW, using next-token cross-entropy (language tasks) or QA cross-entropy (cross-context), with all LLM weights frozen (Fu et al., 3 Oct 2025, Rossi et al., 19 May 2026).
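As a rough illustration of the concatenate, project, and gate pattern, the sketch below fuses one token's key vector in pure Python. It is not the published architecture: a single linear map stands in for the MLP, the gated residual form is an assumption, and `gumbel_sigmoid`, `matvec`, and `fuse_token` are hypothetical helper names. The weights are random and untrained.

```python
import math
import random


def gumbel_sigmoid(logit, temperature=1.0, rng=None):
    """Relaxed binary gate; with rng=None it reduces to a plain sigmoid."""
    noise = 0.0
    if rng is not None:
        # The difference of two Gumbel samples follows a Logistic distribution.
        u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
        noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / temperature))


def matvec(weight, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def fuse_token(receiver_vec, sharer_vec, weight, gate):
    """Concatenate, project to the Receiver width, and add a gated residual."""
    projected = matvec(weight, receiver_vec + sharer_vec)
    return [r + gate * p for r, p in zip(receiver_vec, projected)]


rng = random.Random(0)
d_r, d_s = 2, 3  # toy Receiver / Sharer widths
weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_r + d_s)] for _ in range(d_r)]
receiver_key = [1.0, -1.0]
sharer_key = [0.5, 0.0, 2.0]

closed = fuse_token(receiver_key, sharer_key, weight, gate=0.0)
opened = fuse_token(receiver_key, sharer_key, weight, gate=gumbel_sigmoid(4.0))
print(closed)  # equals receiver_key: a closed gate leaves the cache untouched
print(opened)
```

In training, only `weight` and the gate logit would receive gradients, matching the frozen-LLM setup described above; the same routine would be applied to value vectors and to every fused layer.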
6. Extensions, Benchmarks, and Future Directions
Extensions
- Hardware: Auto-tuning of per-set thresholds, multi-bit aging flags to differentiate long-lived versus transient remote lines, modeling performance–energy tradeoffs (Durbhakula, 2019).
- LLMs: Adapters can be pruned; ablation studies show that only a minority of LCF adapter layers are critical, allowing a further reduction in adapter size. Span pooling at inference can be varied arbitrarily without retraining.
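The idea of changing the pooling span at inference can be illustrated with plain mean pooling over consecutive token positions. This is a fixed stand-in for the learnable span pooling attributed to LCF-X, and `span_pool` is a hypothetical helper, not part of any published code.

```python
def span_pool(vectors, span):
    """Mean-pool consecutive groups of `span` vectors; the last group may be shorter."""
    if span < 1:
        raise ValueError("span must be at least 1")
    pooled = []
    for start in range(0, len(vectors), span):
        group = vectors[start:start + span]
        width = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(width)])
    return pooled


keys = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
print(span_pool(keys, span=2))       # [[2.0, 1.0], [6.0, 5.0], [9.0, 8.0]]
print(len(span_pool(keys, span=5)))  # 1
```

A larger span yields fewer summary vectors to inject into the Receiver cache, trading detail for a smaller transfer.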
Benchmarks
- Hardware: Benchmarks for evaluating C2C include SPLASH-2, PARSEC, and OLTP workloads.
- LLMs: MMLU-Redux, ARC-C, OBQA, MMLU-Pro for multiple-choice; HotpotQA-bridge for QA under different context regimes.
Metrics
- Hardware: Remote-miss fraction, AMAT, throughput (requests/second), energy per access.
- LLMs: Zero-shot accuracy, F1 / Exact Match (QA), time-to-first-token (TTFT), adapter (parameter) size, compression ratio.
Directions
A plausible implication is that Cache-to-Cache paradigms will enable multi-agent and multi-model systems to transfer rich, situation-specific knowledge efficiently, with potential applications in distributed AI and hybrid computing. For hardware, extensions to heterogeneous systems and adaptive policies may further increase efficiency under variable workloads.
7. Distinction from Related Concepts and Limitations
C2C (hardware) should not be conflated with OS-level page migration or cache partitioning; it provides a microarchitectural, hardware-based mechanism with no software involvement for dynamic, fine-grained retention of remote cache lines.
In LLM systems, C2C (and its successors like LCF) operate directly on internal model states, superseding prior paradigms where LLM communication was strictly mediated by observed text. Notably, C2C adapters require model architectures to be compatible at the tensor and sequence level, and adapter overheads, while now substantially reduced, remain nontrivial for some applications. Extending cache-level communication to models with different tokenizations or architectures (e.g., for cross-lingual inference) remains an open research question.
C2C, as established in both computer architecture and multi-LLM systems, encapsulates a transition towards direct, efficient, and rich transfer of internal representations. The term now straddles hardware and neural paradigms, united by their focus on cache as a fundamental substrate for high-throughput, low-latency inter-component communication (Durbhakula, 2019, Fu et al., 3 Oct 2025, Rossi et al., 19 May 2026).