NVLink-C2C Interconnect Overview

Updated 7 May 2026

NVLink-C2C is a cache-coherent, high-bandwidth, low-latency chip-to-chip interconnect that connects NVIDIA GPUs and CPUs.
It employs a mesh topology and a layered protocol stack to reach bandwidths of up to 900 GB/s per direction with sub-microsecond latencies.
Unified memory support, overlap-friendly task scheduling, and ongoing work on side-channel mitigation make NVLink-C2C central to high-performance multi-GPU systems.

NVIDIA NVLink-C2C (chip-to-chip) is an ultra-high-bandwidth, low-latency, cache-coherent interconnect fabric, associated with the fourth and fifth NVLink generations, that connects GPUs to each other and to CPUs, notably within tightly integrated modules such as the NVIDIA Grace Hopper (GH200) and Grace Blackwell (GB200) Superchips. It provides both scale-up connectivity and unified memory semantics for heterogeneous node designs, with bandwidth and latency that significantly improve on PCIe and earlier interconnects (Ren et al., 2024, Yu et al., 28 Jan 2026, Sensi et al., 2024, Jung, 9 Jul 2025).

1. Architecture and Topology

NVLink-C2C is implemented as a mesh of point-to-point links between NVIDIA GPUs and CPUs residing on the same package/module or across modules interconnected by NVSwitch. Each link is composed of multiple high-speed differential pairs (lanes), with per-lane bitrates of 25–50 Gb/s (NRZ or PAM-4), and per-link effective bandwidths of 25–50 GB/s in each direction. In the GH200 Superchip, each GPU tile exposes 18–22 NVLink-C2C links, reaching an aggregate one-way bandwidth of ≈ 900 GB/s (Grace Hopper) or up to 1.2 TB/s bidirectional in multi-GPU configurations (Ren et al., 2024, Yu et al., 28 Jan 2026).

A typical NVLink-C2C topology within a GH200 module is a direct point-to-point connection between the Grace CPU and the Hopper GPU, enabling uniform, low-latency, high-throughput access in both directions. At larger scales, modules such as GB200 employ NVSwitch crossbars to form Clos/fat-tree topologies across 72 or more GPUs per rack, allowing single-hop routing with minimal switch penalty (Jung, 9 Jul 2025).

Cache coherence is enforced by a hardware protocol that spans both sides. On chip, the fabric integrates with the CPU’s last-level cache and DDR controllers (64 B CPU cacheline granularity) and with the GPU’s L2/memory slices (128 B GPU cacheline granularity). All remote memory accesses (loads, stores, atomics) traverse NVLink-C2C as hardware-coherent, strongly ordered transactions (Schieffer et al., 9 Apr 2026, Li et al., 2024).

2. Protocol Stack and Coherence Mechanisms

The NVLink-C2C protocol stack consists of:

Physical layer: Differential signaling across package traces or silicon interposers, with link training for equalization and deskew; lane count and speed are parameterized by module generation.
Link layer: Credit-based flow control, with per-flit credits, framing, CRC, and optional scrambling. Standard flit sizes are 128 B (8 B header + 8 B CRC + 112 B payload) (Jung, 9 Jul 2025).
Transaction layer: Cache-coherent and transactional support for memory loads/stores, atomics, and DMA, with packetized messaging for ordering and coherence state transitions. The protocol supports directory-less, broadcast-based coherence for CPU-GPU and GPU-GPU sharing (Schieffer et al., 9 Apr 2026, Li et al., 2024).
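The link-layer framing above implies a fixed per-flit overhead. The following minimal sketch works through that arithmetic using only the 128 B flit layout quoted in the list; the function names are illustrative, and the model ignores flow-control and retry traffic.

```python
# Flit layout quoted in the text: 128 B = 8 B header + 8 B CRC + 112 B payload.
FLIT_BYTES = 128
HEADER_BYTES = 8
CRC_BYTES = 8


def flit_payload_bytes(flit=FLIT_BYTES, header=HEADER_BYTES, crc=CRC_BYTES):
    """Payload bytes carried by one flit."""
    return flit - header - crc


def framing_efficiency(flit=FLIT_BYTES, header=HEADER_BYTES, crc=CRC_BYTES):
    """Fraction of raw link bytes that carry payload."""
    return flit_payload_bytes(flit, header, crc) / flit


def flits_needed(message_bytes):
    """Number of flits required to carry a message (ceiling division)."""
    payload = flit_payload_bytes()
    return -(-message_bytes // payload)


print(flit_payload_bytes())   # 112
print(framing_efficiency())   # 0.875
print(flits_needed(4096))     # 37
```

Under these assumptions framing alone caps payload throughput at 87.5% of the raw link rate, before any flow-control overhead is counted.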

Ordering is guaranteed across the coherence domain. Atomic operations are globally visible, and software fences enforce visibility where required. Coherence state transitions (e.g., among dirty, shared, and invalid states) trigger fetch or invalidation transactions across the NVLink-C2C fabric.

3. Performance Models and Microarchitectural Metrics

3.1 Bandwidth and Latency

Theoretical peak bandwidth per direction is the product of the number of links and the per-link bandwidth: $BW_\mathrm{total} = N_\mathrm{links} \times BW_\mathrm{per\,link}$

GH200: $18 \leq N_\mathrm{links} \leq 22$, $BW_\mathrm{per\,link} = 50\,\text{GB/s}$ ⇒ $BW_\mathrm{total} \approx 900\,\text{GB/s}$ per direction at the lower bound of 18 links (Ren et al., 2024, Yu et al., 28 Jan 2026).
Round-trip latencies are typically 0.7–1.0 µs for payloads ≥ 4 KB, roughly half those of PCIe Gen4/5 (Ren et al., 2024).
NVLink 4.0 on H100: six 200 Gb/s links per pair, yielding 1.2 Tb/s (≈ 150 GB/s) one-way (Sensi et al., 2024).
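The peak figures in this list follow directly from the product formula. As a quick check, the sketch below reproduces them, including the Gb/s to GB/s conversion for the H100 case. The helper names are illustrative, and the inputs are the link counts and per-link rates quoted above.

```python
def gbits_to_gbytes(gbits_per_s):
    """Convert Gb/s to GB/s (8 bits per byte)."""
    return gbits_per_s / 8.0


def aggregate_bandwidth(n_links, per_link_gbytes_per_s):
    """BW_total = N_links * BW_per_link, in GB/s per direction."""
    return n_links * per_link_gbytes_per_s


gh200_low = aggregate_bandwidth(18, 50)                    # 18 links at 50 GB/s
gh200_high = aggregate_bandwidth(22, 50)                   # 22 links at 50 GB/s
h100_pair = aggregate_bandwidth(6, gbits_to_gbytes(200))   # six 200 Gb/s links

print(gh200_low, gh200_high, h100_pair)  # 900 1100 150.0
```

The quoted ≈ 900 GB/s corresponds to the 18-link end of the range; the same formula with 22 links would give 1,100 GB/s.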

Application-level attained bandwidth is typically limited by host DRAM (Grace side), measured at 200–384 GB/s per direction for large transfers (≥8 MB) (Yu et al., 28 Jan 2026, Li et al., 2024). Practical overheads (framing, CRC, flow control) reduce peak to ~90% of the theoretical value.

3.2 Compute-Communication Overlap

Performance is often modeled as: $T(N) = \max \left( T_\mathrm{comp}, T_\mathrm{comm} \right )$

$T_\mathrm{comp} = \frac{\alpha N^{3}}{f_\mathrm{peak}}, \quad T_\mathrm{comm} = \frac{\beta N^{2}}{B_\mathrm{link}}$

For Cholesky factorization, $\alpha=1/3, \beta=1/2$ (Ren et al., 2024). The communication bottleneck is hidden for problem sizes where compute dominates, i.e., when

$N \ge \frac{\beta f_\mathrm{peak}}{\alpha B_\mathrm{link}}$

With $f_\mathrm{peak}=20\,\text{TFLOP/s}$ and $B_\mathrm{link}=900\,\text{GB/s}$, the crossover falls at $N \approx 33$ in the model's units, i.e., at very small tile/block sizes, so overlap is easily achieved in practical regimes (Ren et al., 2024).
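The crossover condition can be evaluated numerically. The sketch below implements the two model terms exactly as written above. The `bytes_per_unit` parameter is an assumption added here for illustration: the model does not say whether $\beta N^{2}$ counts bytes or matrix elements, so the default of 1.0 reproduces the formula as stated and a value such as 8 would correspond to double-precision elements.

```python
def modeled_times(n, alpha, beta, f_peak, b_link, bytes_per_unit=1.0):
    """Return (T_comp, T_comm) for problem size n under the max() model."""
    t_comp = alpha * n ** 3 / f_peak
    t_comm = beta * bytes_per_unit * n ** 2 / b_link
    return t_comp, t_comm


def crossover_size(alpha, beta, f_peak, b_link, bytes_per_unit=1.0):
    """Smallest n with T_comp >= T_comm: n = beta * f_peak / (alpha * b_link)."""
    return (beta * bytes_per_unit * f_peak) / (alpha * b_link)


# Cholesky constants and hardware figures quoted in the text.
ALPHA, BETA = 1.0 / 3.0, 1.0 / 2.0
F_PEAK = 20e12   # FLOP/s
B_LINK = 900e9   # B/s

n_star = crossover_size(ALPHA, BETA, F_PEAK, B_LINK)
n_star_fp64 = crossover_size(ALPHA, BETA, F_PEAK, B_LINK, bytes_per_unit=8.0)
print(round(n_star, 1), round(n_star_fp64, 1))  # 33.3 266.7
```

Either way the crossover lies far below typical tile sizes, which is consistent with the claim that communication is easily hidden.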

4. Practical Utilization: Scheduling, Task Decomposition, and Applications

4.1 Static Task Scheduling and Data Movement

For out-of-core or memory-intensive algorithms, concurrency is achieved via static scheduling of overlapping compute and communication. For example, in mixed-precision Cholesky, the task DAG is decomposed into fine-grained POTRF, TRSM, SYRK, and GEMM tasks that are mapped to streams, each issuing asynchronous copies (cudaMemcpyAsync) over the NVLink-C2C fabric (Ren et al., 2024). The high link bandwidth (up to ≈900 GB/s) allows tile transfers to be hidden under kernel execution.
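The effect of overlapping tile transfers with kernels can be illustrated with a toy timing model. This is a sketch under stated assumptions, not the scheduler of the cited work: it assumes one copy engine, one compute engine, and enough staging buffers that a copy never waits for a free tile slot. The tile times are made-up illustrative values.

```python
def serial_time(copy_times, kernel_times):
    """Total time when copies and kernels never overlap."""
    return sum(copy_times) + sum(kernel_times)


def overlapped_time(copy_times, kernel_times):
    """Total time for a two-stage copy/compute pipeline.

    Copies are issued back-to-back; kernel i starts once its tile has
    arrived and kernel i-1 has finished.
    """
    copy_done = 0.0
    kernel_done = 0.0
    for t_copy, t_kernel in zip(copy_times, kernel_times):
        copy_done += t_copy
        kernel_done = max(kernel_done, copy_done) + t_kernel
    return kernel_done


# Hypothetical workload: 4 tiles, 1 ms per copy, 3 ms per kernel.
copies = [1.0] * 4
kernels = [3.0] * 4
print(serial_time(copies, kernels))      # 16.0
print(overlapped_time(copies, kernels))  # 13.0
```

When kernels take longer than copies, only the first transfer remains exposed; every later transfer is hidden under the preceding kernel.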

4.2 Memory Offloading and Unified Memory

Grace Hopper and Blackwell modules support full cache-coherent unified memory. All allocations, whether via cudaMallocManaged or standard host malloc, are accessible from both CPU and GPU; page migration across DRAM and HBM3e is performed over NVLink-C2C (Li et al., 2024). Automatic offload tools for BLAS and LAPACK achieve up to 23× kernel speedup and 3.3× end-to-end acceleration on BLAS-heavy scientific codes (Li et al., 2024).

In mixed-GPU scenarios or when using MIG (Multi-Instance GPU), CPU memory reached over the NVLink-C2C interconnect acts as a second-tier store, allowing finer-grained memory slicing than hardware slices alone. Direct-access kernels can achieve 338 GB/s bandwidth even on the smallest slices, enabling a practical plug-in spill mechanism for memory oversubscription (Schieffer et al., 9 Apr 2026).

4.3 LLM Serving and Memory Rotation

SuperInfer demonstrates real-time LLM inference via SLO-aware rotary scheduling (RotaSched) and a high-efficiency memory rotation engine (DuplexKV) over NVLink-C2C on GH200 (Yu et al., 28 Jan 2026):

KV caches are carved into 4 MB blocks and rotated aggressively between HBM and DRAM.
Kernel launches are batched with full-duplex bidirectional transfers.
Empirical link utilization reaches ≈92–95% for contiguous megabyte-range transfers.
Service-level objectives such as time to first token (TTFT) improve by up to 74.7%, with minimal impact on throughput.
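To give a feel for the scale of such rotations, the sketch below estimates how many blocks a KV cache occupies and how long moving it in one direction would take at a given link utilization. It is a back-of-the-envelope model, not SuperInfer's implementation. It assumes that "4 MB" means 4 MiB, that the link rate is the 900 GB/s peak quoted earlier, and that the 1 GiB cache size is a purely hypothetical example.

```python
BLOCK_BYTES = 4 * 1024 * 1024  # assumed: 4 MB interpreted as 4 MiB


def kv_blocks(cache_bytes, block_bytes=BLOCK_BYTES):
    """Number of fixed-size blocks needed to hold the cache (ceiling division)."""
    return -(-cache_bytes // block_bytes)


def rotation_time_s(cache_bytes, link_bytes_per_s, utilization):
    """One-way transfer time at a given fraction of peak link bandwidth."""
    return cache_bytes / (link_bytes_per_s * utilization)


cache = 1 << 30  # hypothetical 1 GiB KV cache
blocks = kv_blocks(cache)
t = rotation_time_s(cache, 900e9, 0.92)
print(blocks, f"{t * 1e3:.2f} ms")
```

At the measured DRAM-limited rates reported in Section 3.1 the same transfer would take proportionally longer, so the link rate should be treated as a parameter to substitute.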

5. Multi-GPU, Multi-Path, and Software Orchestration

NVLink-C2C hardware supports all-to-all, full-mesh, and Clos topologies. For optimal link utilization under skewed traffic, software frameworks like NIMBLE (Node-Interconnect Multi-path Balancing with Execution-time orchestration) formulate traffic balancing as a minimum-congestion flow problem and allocate multi-path routes dynamically (Yao et al., 31 Mar 2026):

Bandwidth: 4-GPU H100 nodes (NVLink4) saturate each GPU pair at 120 GB/s; multi-path striping yields 2.3× greater link utilization under skew.
Latency: Tree-structured application-level collectives achieve sub-4.5 µs one-way latency at full bandwidth (Sensi et al., 2024).
Optimization: Multiplicative-weights algorithms efficiently distribute traffic to avoid hotspots not handled by static libraries such as NCCL or MPI.

Proper tuning of software parameters (e.g., NCCL channels, peer access, GPUDirect settings) is necessary to reach or approach theoretical bandwidth in practice (Sensi et al., 2024).
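The multi-path balancing idea can be sketched with a generic multiplicative-weights heuristic. The code below is not NIMBLE's algorithm or API; it is a minimal illustration in which each link carries a weight that grows exponentially with the load placed on it, and each flow sends small chunks along its currently cheapest path. The topology, `rounds`, and `eta` are assumptions chosen for the example.

```python
import math


def balance_flows(capacity, flows, rounds=200, eta=10.0):
    """Split flows across candidate paths to keep link utilization low.

    capacity: {link_name: GB/s}
    flows:    list of (demand_GBps, [path, ...]), each path a tuple of link names
    Returns (splits, utilization), where splits[i][j] is the share of
    flow i (in GB/s) routed on its j-th path.
    """
    weight = {link: 1.0 for link in capacity}
    load = {link: 0.0 for link in capacity}
    splits = [[0.0] * len(paths) for _, paths in flows]

    def price(path):
        # A path costs the sum of the weights of the links it uses.
        return sum(weight[link] for link in path)

    for _ in range(rounds):
        for i, (demand, paths) in enumerate(flows):
            best = min(range(len(paths)), key=lambda k: price(paths[k]))
            chunk = demand / rounds
            splits[i][best] += chunk
            for link in paths[best]:
                load[link] += chunk
                # Multiplicative update: congested links become expensive.
                weight[link] *= math.exp(eta * chunk / capacity[link])

    utilization = {link: load[link] / capacity[link] for link in capacity}
    return splits, utilization


# Hypothetical 3-GPU example: GPU0 -> GPU1 directly or via GPU2.
capacity = {"0-1": 120.0, "0-2": 120.0, "2-1": 120.0}
flows = [(120.0, [("0-1",), ("0-2", "2-1")])]
splits, utilization = balance_flows(capacity, flows)
print(splits[0], max(utilization.values()))
```

In this example a single static path would run the direct link at full utilization, whereas the heuristic spreads the flow so that the busiest link carries a little over half the demand. The direct path keeps the larger share because the detour consumes two links.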

6. Comparative Performance, Scalability, and Hierarchical Architectures

Platform/Link	# Links	Raw BW per Pair	App-level BW	Latency (typ.)	Topology
GH200 (C2C)	18–22	900+ GB/s	~900 GB/s	0.7–1.0 µs	Mesh, direct
H100 (NVLink 4.0)	6	1.2 Tb/s (≈150 GB/s)	~150 GB/s	~4.3 µs	Fully connected
A100 (NVLink 3.0)	4	800 Gb/s (≈100 GB/s)	~100 GB/s	3.0–4.0 µs	K₄ complete
PCIe Gen5	—	~64 GB/s	~32 GB/s	~3–9 µs	Crossbar/tree

NVLink-C2C provides a 3–5× advantage over PCIe Gen4/5 for large transfers and up to 10× for small messages (Ren et al., 2024, Sensi et al., 2024). Multi-GPU datacenter topologies utilize NVSwitch and hybrid fabrics (CXL-over-NVLink/XLink) to aggregate up to 72 GPUs per cluster/rack, scaling to thousands of endpoints with CXL at higher latency (Jung, 9 Jul 2025).

Practical caveats include software/library topology awareness, NUMA effects, and hardware-specific link allocation. Application scaling is near-linear in computation-bound workloads where NVLink-C2C hides transfer latency (Ren et al., 2024).

7. Security, Side-channel Implications, and Mitigation

Recent work has identified NVLink (and by extension NVLink-C2C) as susceptible to covert and side-channel attacks (Zhang et al., 22 Mar 2025, Zhang et al., 2024):

Two primary leakage sources: timing variations under contention, and shared performance counters.
Empirical covert channel: up to 70 kbps at a bit error rate (BER) of ~4.8%, effective across VMs.
Application and workload fingerprinting: high accuracy (F₁ up to 97.8%).
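The covert-channel figures can be put in perspective with a standard information-theoretic bound. The sketch below computes the capacity of a binary symmetric channel at the reported raw rate and BER. Treating the channel as binary symmetric with independent bit errors is an assumption made here for illustration; the cited work may model errors differently.

```python
import math


def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity_bps(raw_bps, bit_error_rate):
    """Error-free rate achievable over a binary symmetric channel: R * (1 - H(p))."""
    return raw_bps * (1.0 - binary_entropy(bit_error_rate))


capacity_bps = bsc_capacity_bps(70e3, 0.048)
print(f"{capacity_bps / 1e3:.1f} kbps")
```

Under this assumption, roughly 50 kbps of the 70 kbps raw rate remains usable after ideal error correction, which suggests the reported error rate alone does not neutralize the channel.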

Mitigations include disabling or sandboxing performance counters, reducing timer resolution, and potentially hardware-level dynamic link partitioning or rate limiting. Side-channel resilience remains an open area in the evolution of NVLink-C2C.

In summary, NVLink-C2C is the dominant scale-up, cache-coherent interconnect for modern NVIDIA heterogeneous nodes and multi-GPU clusters. Its extremely high bandwidth, low latency, hardware-enforced coherence, and unified memory integration enable advanced scheduling, memory oversubscription, and fine-grained data and compute co-design. Achieving and sustaining theoretical performance requires both architectural and software-hardware co-optimization, as well as attention to potential security issues in shared, multi-tenant environments (Ren et al., 2024, Yu et al., 28 Jan 2026, Jung, 9 Jul 2025, Schieffer et al., 9 Apr 2026, Sensi et al., 2024, Li et al., 2024, Zhang et al., 22 Mar 2025, Zhang et al., 2024, Yao et al., 31 Mar 2026, Li et al., 2019).