NVLink-C2C: High-Bandwidth CPU-GPU Connection

Updated 27 May 2026

NVLink-C2C is a high-bandwidth interconnect that unifies CPU and GPU memory, enabling direct, zero-copy access for efficient large-scale AI computation.
It supports dynamic, NUMA-aware data movement and workload scheduling, optimizing model inference across heterogeneous memory hierarchies.
Empirical results demonstrate significant latency reductions and bandwidth gains, paving the way for elastic, multi-tenant GPU deployments.

NVLink-Chip-to-Chip (C2C) is a high-bandwidth, low-latency processor interconnect technology developed by NVIDIA, providing a non-uniform, device-mapped memory-coherent link between CPUs and GPUs. As deployed in modern NVIDIA Superchips such as GH200 and GB200, NVLink-C2C enables a unified address space across host (CPU) DRAM and GPU HBM, supporting direct, zero-copy remote memory access. This interconnect fundamentally alters server system design for large-scale AI/ML workloads, allowing resource disaggregation, efficient memory sharing, and advanced workload schedulers for spatially partitioned GPUs and heterogeneous memory hierarchies. NVLink-C2C has been empirically characterized in the context of both multi-GPU scale-up systems and new model-serving software such as C2CServe, which targets LLM inference under serverless, multi-tenant conditions (Luo et al., 19 May 2026, Li et al., 2019).

1. Physical and Logical Architecture

In contemporary architectures, each Superchip (e.g., NVIDIA GH200, GB200) pairs a Grace CPU with Hopper or Blackwell GPUs via a point-to-point mesh of NVLink-C2C links. Unlike earlier PCIe-based CPU-GPU interconnects, C2C exposes a single, coherent address space encompassing both CPU DRAM and GPU HBM. GPU kernels can directly perform load/store operations on host-resident data without explicit staging or DMA initiations.

Key parameters for GH200/GB200 include:

Number of C2C links per GPU: $N_{\text{links}}\approx2$
Per-link peak bidirectional bandwidth: $B_{\text{link}}\approx450$ GB/s
Aggregate C2C throughput: $T_{\text{C2C}}=N_{\text{links}}\times B_{\text{link}}\approx900$ GB/s
Typical one-way device-to-device latency: $\sim$1–2 $\mu$s (comparable to HBM page fetch)

By contrast, PCIe Gen5 ×16 peaks at $\sim$64 GB/s, and previous NVLink generations on DGX-1 (P100/V100) offered per-link bandwidths in the 22–26.5 GB/s range, with 4–6 links per GPU and observed round-trip latency of 4.4–5.1 $\mu$s (Li et al., 2019). The Superchip mesh topology enables every GPU slice (“MIG instance”) to access the full host DRAM bandwidth through shared C2C routers.
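The aggregate figure follows directly from the per-link numbers. The short Python sketch below reproduces that arithmetic and the gap to PCIe, using only the approximate values quoted in this section; the constant and function names are illustrative.

```python
# Approximate bandwidth figures quoted in this section, in GB/s.
N_LINKS = 2                 # C2C links per GPU
B_LINK_GBPS = 450.0         # peak bidirectional bandwidth per C2C link
PCIE_GEN5_X16_GBPS = 64.0   # approximate PCIe Gen5 x16 peak


def aggregate_c2c_gbps(n_links: int = N_LINKS, b_link: float = B_LINK_GBPS) -> float:
    """T_C2C = N_links * B_link."""
    return n_links * b_link


t_c2c = aggregate_c2c_gbps()
ratio_vs_pcie = t_c2c / PCIE_GEN5_X16_GBPS
print(f"T_C2C = {t_c2c:.0f} GB/s, about {ratio_vs_pcie:.1f}x PCIe Gen5 x16")
```

With the quoted values this gives 900 GB/s, roughly 14 times the PCIe Gen5 ×16 peak.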

2. Heterogeneous Memory Hierarchy and Data Movement

In traditional GPU systems, models must be loaded into HBM before execution, because CPU DRAM is connected via PCIe and cannot be accessed at sufficient throughput for real-time ML inference. Superchips with NVLink-C2C treat CPU DRAM as a fast, remotely addressable memory tier. LLM weights can remain in CPU memory (allocated as pinned, mapped memory, for example via cudaHostAlloc with the cudaHostAllocMapped flag), while activations and KV-caches use the scarce HBM space.

The C2C data-movement workflow is:

Weights: Fetched on demand from CPU DRAM over C2C via TMA.Load, directly into GPU L2/shared memory and subsequently to tensor cores.
Lower bound on per-token decode time when streaming weights: $R_{\text{decode},\min} = \frac{S_w}{T_{\text{C2C}}}$, where $S_w$ is the size of the weights read per decode step.
For classic setups, the analogous bound is set by HBM bandwidth $T_{\text{HBM}}$, which is typically $\approx 4$ TB/s per GPU but is partitioned between MIG slices.
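To make the bound concrete, here is a minimal worked example in Python. It assumes a hypothetical 8-billion-parameter dense model stored at 2 bytes per parameter, with every weight read once per generated token; those assumptions, and the helper name `min_decode_seconds`, are illustrative and not taken from the source.

```python
def min_decode_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token decode time: S_w / T."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return weight_bytes / bandwidth_bytes_per_s


GB = 1e9
# Assumption: 8e9 parameters at 2 bytes each, all read once per token.
weight_bytes = 8e9 * 2      # 16 GB
t_c2c = 900 * GB            # aggregate C2C bandwidth (about 900 GB/s)
t_hbm = 4000 * GB           # HBM bandwidth (about 4 TB/s, unpartitioned)

c2c_ms = min_decode_seconds(weight_bytes, t_c2c) * 1e3
hbm_ms = min_decode_seconds(weight_bytes, t_hbm) * 1e3
print(f"C2C-streamed: >= {c2c_ms:.1f} ms/token")
print(f"HBM-resident: >= {hbm_ms:.1f} ms/token")
```

Under these assumptions the C2C-streamed bound is about 17.8 ms per token versus 4 ms for fully HBM-resident weights on an unpartitioned GPU. A MIG slice receives only part of the HBM bandwidth, which narrows that gap.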

This architecture shifts the memory bottleneck away from scarce HBM and allows elastic model residency across a single Superchip (Luo et al., 19 May 2026).

3. Communication Kernel Design: HybridGEMM

C2CServe replaces standard cuBLAS GEMM with a HybridGEMM kernel featuring dual memory paths and a tunable partition parameter. Work allocation divides the matrix columns into two regions, sized by the partition parameter:

Symmetric (output-stationary) region: iterates over both input operands via C2C while keeping the output tile in registers. Favors low HBM usage but triggers repeated C2C fetches.
Asymmetric (weight-stationary) region: pins weight tiles in shared memory and accumulates partial outputs in HBM via TMA.Reduction, reducing C2C traffic for the weights but increasing output-related HBM traffic.

Runtime feedback control adjusts the partition parameter according to measured HBM and C2C utilizations. An imbalance between the two utilizations triggers an update of the partition parameter, with a learning rate responsive to SLO violations in GEMM latency.
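The feedback loop can be pictured with a small Python sketch. This is not the C2CServe update rule: it is a generic proportional controller written under stated assumptions. The class `SplitController`, the field `asym_fraction` (share of columns given to the weight-stationary path), the fixed learning rate, and the tolerance band are all hypothetical, and the SLO-driven learning-rate adaptation is not modeled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SplitController:
    """Hypothetical proportional controller for a two-region column split."""
    asym_fraction: float = 0.5   # share of columns on the weight-stationary path
    learning_rate: float = 0.1   # assumed fixed here
    tolerance: float = 0.05      # utilization gap treated as noise

    def update(self, c2c_util: float, hbm_util: float) -> float:
        # Positive imbalance means C2C is the busier resource.
        imbalance = c2c_util - hbm_util
        if abs(imbalance) > self.tolerance:
            # More weight-stationary work cuts C2C traffic at the cost of
            # HBM traffic, so move the split toward the less loaded resource.
            self.asym_fraction += self.learning_rate * imbalance
            self.asym_fraction = min(1.0, max(0.0, self.asym_fraction))
        return self.asym_fraction

    def partition(self, n_columns: int) -> Tuple[range, range]:
        """Return (symmetric_columns, asymmetric_columns)."""
        n_asym = round(self.asym_fraction * n_columns)
        n_sym = n_columns - n_asym
        return range(0, n_sym), range(n_sym, n_columns)


ctrl = SplitController()
frac = ctrl.update(c2c_util=0.9, hbm_util=0.5)   # C2C is saturating
sym_cols, asym_cols = ctrl.partition(1000)
print(f"asym_fraction={frac:.2f}, symmetric={len(sym_cols)}, asymmetric={len(asym_cols)}")
```

In this sketch a saturated C2C link pushes more columns onto the weight-stationary path, and a saturated HBM pushes them back. Clamping to [0, 1] keeps the split valid when one resource stays overloaded for many steps.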

4. Contention, NUMA Effects, and Scheduling

Because all MIG slices of a GH200/GB200 Superchip share the C2C aggregate bandwidth, multi-tenant scenarios induce contention. Scheduling in C2CServe follows a three-stage hierarchy:

a) Bandwidth-Aware Model Placement: For each model, the C2C demand is estimated from its weight size and per-token target. The system admits models only while the total estimated demand fits within the available C2C bandwidth.
b) MIG-Aware Chunk Sizing: Prefill chunks are sized to balance HBM demand against per-slice HBM bandwidth and the required time-to-first-token (TTFT).
c) Kernel/Chunk Tuning: The HybridGEMM partition parameter is adapted online as described above, using per-layer feedback.
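A minimal Python sketch of bandwidth-aware admission follows. It assumes a simple demand model (all weights streamed once per generated token, so demand equals weight size divided by the per-token target) and a greedy first-fit policy; neither is confirmed as the C2CServe algorithm, and the tenant names and sizes are made up for illustration.

```python
from typing import List, Tuple


def c2c_demand_gbps(weight_gb: float, tpot_target_s: float) -> float:
    """Assumed demand model: stream all weights once per generated token."""
    if tpot_target_s <= 0:
        raise ValueError("per-token target must be positive")
    return weight_gb / tpot_target_s


def admit(models: List[Tuple[str, float, float]],
          capacity_gbps: float) -> Tuple[List[str], float]:
    """Greedily admit (name, weight_gb, tpot_target_s) entries within capacity."""
    admitted: List[str] = []
    used = 0.0
    for name, weight_gb, tpot_s in models:
        demand = c2c_demand_gbps(weight_gb, tpot_s)
        if used + demand <= capacity_gbps:
            admitted.append(name)
            used += demand
    return admitted, used


# Hypothetical tenants: (name, weight size in GB, per-token target in seconds).
tenants = [
    ("model-a", 16.0, 0.05),   # demand 320 GB/s
    ("model-b", 16.0, 0.04),   # demand 400 GB/s
    ("model-c", 26.0, 0.10),   # demand 260 GB/s, would exceed 900 GB/s
]
admitted, used = admit(tenants, capacity_gbps=900.0)
print(admitted, f"{used:.0f} GB/s in use")
```

Here the third tenant is rejected because admitting it would push aggregate demand to 980 GB/s. A production scheduler would also account for prefill traffic and for how HBM bandwidth is divided among MIG slices.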

Earlier work identified NVLink-specific NUMA effects in multi-GPU DGX-1 systems, classifying direct, dual-link, and one-/two-hop paths with up to 2× differences in bandwidth and several $\mu$s in additional latency per hop (Li et al., 2019). In C2C multi-tenant scenarios, resource isolation and NUMA-aware workload mapping are essential for maintaining predictable service quality.

5. Empirical Performance and Comparative Analysis

On a GH200 Superchip with 96 GB HBM, 480 GB CPU memory, and 900 GB/s C2C, C2CServe was benchmarked against ServerlessLLM, Aegaeon, MoE-Infinity, and FineMoE. Results (Luo et al., 19 May 2026):

Model	ServerlessLLM	Aegaeon	C2CServe	Speed-up vs SLLM
Llama-8B (dense)	1.15 s	1.7 s	0.32 s	3.6×
Llama-70B	OOM	OOM	0.45 s	–
Mixtral-8×7B (MoE)	5.0 s	0.105 s	1.1 s	4.6×

C2CServe achieves cold-start latency reductions up to 7.1× (dense) and 4.6× (MoE), supports ≥95% attainment for TTFT and TPOT under contention, and enables serving 70B+ parameter models on small MIG slices without quantization or tensor-parallel sharding.
Dynamic replay traces (>3M requests, 3 weeks) show C2CServe sustaining 95th-percentile TTFT ≈0.7 s, while baselines spike to several seconds.
Ablation: Bandwidth-aware placement alone halves p99 TTFT; chunk control and online tuning deliver further multiplicative reductions.
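For readers unfamiliar with the metrics, SLO attainment and tail percentiles can be computed as in the Python sketch below. The latency samples are synthetic and purely illustrative, the 0.7 s SLO is an arbitrary threshold chosen for the example, and the nearest-rank percentile definition is one common convention rather than the one the cited evaluation necessarily used.

```python
import math
from typing import Sequence


def slo_attainment(latencies_s: Sequence[float], slo_s: float) -> float:
    """Fraction of requests whose latency is within the SLO."""
    if not latencies_s:
        raise ValueError("need at least one sample")
    return sum(1 for x in latencies_s if x <= slo_s) / len(latencies_s)


def percentile_nearest_rank(latencies_s: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not latencies_s or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(latencies_s)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic TTFT samples in seconds (not measured data).
ttft = [0.3] * 90 + [0.6] * 6 + [0.9] * 4
attainment = slo_attainment(ttft, slo_s=0.7)
p95 = percentile_nearest_rank(ttft, 95)
print(f"attainment={attainment:.2f}, p95={p95} s")
```

On this synthetic sample 96 of 100 requests meet the threshold, so attainment is 0.96 and the 95th-percentile TTFT is 0.6 s.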

These results establish NVLink-C2C as a critical enabler for elastic, memory-disaggregated LLM serving in serverless, multi-tenant GPU settings.

6. Comparative Context and Best Practices

Compared to prior NVLink-V1 and NVLink-V2 interconnects, NVLink-C2C offers an order of magnitude higher bandwidth ($\sim$900 GB/s aggregate for C2C vs. $\sim$50 GB/s bidirectional per link for the earlier generations) and enables a host-GPU unified address space, eliminating traditional DMA staging steps (Li et al., 2019). Early NVLink deployments faced NUMA complexities, requiring explicit mapping to maximize bandwidth and minimize latency. Not all GPU-to-GPU paths were equivalent: routing-NUMA and neighbor-NUMA effects could halve or double link throughput, complicating MPI/NCCL topology mappings.

For C2C deployments, best practices include:

Assigning communication-heavy workloads to direct (or dual-link) C2C links.
Dynamic, NUMA- and bandwidth-aware model placement to avoid over-subscribing per-chip links.
Multi-stage scheduling, chunk size adaptation, and kernel autotuning to optimize for per-request SLOs and global bandwidth allocation.
Favoring memory management models that build on the unified address space exposed by NVLink-C2C.

A plausible implication is increased reliance on host memory in large-model inference, with a corresponding reduction in HBM pressure and improvement in resource utilization per GPU, especially for serverless or highly fragmented workloads.

7. Conclusion and Outlook

NVLink-Chip-to-Chip (C2C) transforms heterogeneous server design by enabling direct, high-throughput memory access between CPUs and GPUs. In multi-GPU or spatially partitioned (MIG) settings, C2C permits hosting models in host DRAM and streaming weights on demand, directly addressing recent serverless LLM serving bottlenecks. The C2CServe system demonstrates that with appropriate kernel architecture (HybridGEMM), chunking, and bandwidth-aware scheduling, C2C achieves significant latency and resource utilization gains under competitive, multi-tenant conditions. NVLink-C2C thus represents both an architectural and a systems software inflection point for future large-scale AI/ML infrastructure (Luo et al., 19 May 2026, Li et al., 2019).

References

Luo et al. (2026). C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG.

Li et al. (2019). Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect.
