NVLink-C2C: High-Bandwidth CPU-GPU Connection
- NVLink-C2C is a high-bandwidth interconnect that unifies CPU and GPU memory, enabling direct, zero-copy access for efficient large-scale AI computation.
- It supports dynamic, NUMA-aware data movement and workload scheduling, optimizing model inference across heterogeneous memory hierarchies.
- Empirical results demonstrate significant latency reductions and bandwidth gains, paving the way for elastic, multi-tenant GPU deployments.
NVLink-Chip-to-Chip (C2C) is a high-bandwidth, low-latency processor interconnect technology developed by NVIDIA, providing a non-uniform, device-mapped memory-coherent link between CPUs and GPUs. As deployed in modern NVIDIA Superchips such as GH200 and GB200, NVLink-C2C enables a unified address space across host (CPU) DRAM and GPU HBM, supporting direct, zero-copy remote memory access. This interconnect fundamentally alters server system design for large-scale AI/ML workloads, allowing resource disaggregation, efficient memory sharing, and advanced workload schedulers for spatially partitioned GPUs and heterogeneous memory hierarchies. NVLink-C2C has been empirically characterized in the context of both multi-GPU scale-up systems and new model-serving software such as C2CServe, which targets LLM inference under serverless, multi-tenant conditions (Luo et al., 19 May 2026, Li et al., 2019).
1. Physical and Logical Architecture
In contemporary architectures, each Superchip (e.g., NVIDIA GH200, GB200) pairs a Grace CPU with Hopper or Blackwell GPUs via a point-to-point mesh of NVLink-C2C links. Unlike earlier PCIe-based CPU-GPU interconnects, C2C exposes a single, coherent address space encompassing both CPU DRAM and GPU HBM. GPU kernels can directly perform load/store operations on host-resident data without explicit staging or DMA initiations.
Key parameters for GH200/GB200 include:
- Number of C2C links per GPU:
- Per-link peak bidirectional bandwidth: GB/s
- Aggregate C2C throughput: 900 GB/s (the figure reported for the GH200 testbed in Section 5)
- Typical one-way device-to-device latency: 1–2 µs (comparable to HBM page fetch)
By contrast, PCIe Gen5 ×16 peaks at 64 GB/s, and previous NVLink generations on DGX-1 (P100/V100) offered per-link bandwidths in the 22–26.5 GB/s range, with 4–6 links per GPU and observed round-trip latency of 4.4–5.1 µs (Li et al., 2019). The Superchip mesh topology enables every GPU slice (“MIG instance”) to access the full host DRAM bandwidth through shared C2C routers.
2. Heterogeneous Memory Hierarchy and Data Movement
In traditional GPU systems, models must be loaded into HBM before execution, with CPU DRAM connected via PCIe and not directly accessible at sufficient throughput for real-time ML inference. Superchips with NVLink-C2C treat CPU DRAM as a fast, remotely addressable memory tier. LLM weights can remain in CPU memory (using cudaHostAllocMapped for pinned, mapped memory), while activations and KV-caches utilize precious HBM space.
The C2C data-movement workflow is:
- Weights: Fetched on demand from CPU DRAM over C2C via TMA.Load, directly into GPU L2/shared memory and subsequently to tensor cores.
- Streaming throughput lower bound for weight transfer:
- For classic setups, the analogous throughput is limited by HBM bandwidth, which is typically on the order of TB/s per GPU but is partitioned between MIG slices.
This architecture shifts the memory bottleneck away from scarce HBM and allows elastic model residency across a single Superchip (Luo et al., 19 May 2026).
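To make the bandwidth argument concrete, the following back-of-envelope sketch estimates the minimum time needed to stream a dense model's weights across a link once per decoded token. It assumes FP16 weights (2 bytes per parameter), that every weight is read exactly once per token, and that the link sustains its peak rate; the function name and these assumptions are illustrative and are not taken from C2CServe. The 900 GB/s and 64 GB/s figures are the C2C and PCIe Gen5 ×16 peaks quoted in this article.

```python
def min_stream_time_s(num_params: float, link_gb_per_s: float,
                      bytes_per_param: float = 2.0) -> float:
    """Lower bound on the time to move all weights across a link once."""
    weight_bytes = num_params * bytes_per_param
    return weight_bytes / (link_gb_per_s * 1e9)


C2C_GBPS = 900.0   # GH200 C2C figure quoted in this article
PCIE_GBPS = 64.0   # PCIe Gen5 x16 peak quoted in this article

# Parameter counts are round numbers chosen for illustration.
for name, params in [("8B", 8e9), ("70B", 70e9)]:
    t_c2c = min_stream_time_s(params, C2C_GBPS)
    t_pcie = min_stream_time_s(params, PCIE_GBPS)
    print(f"{name}: C2C >= {t_c2c * 1e3:.1f} ms/token, "
          f"PCIe >= {t_pcie * 1e3:.1f} ms/token")
```

Under these assumptions an 8B-parameter model needs at least about 17.8 ms per token over C2C versus 250 ms over PCIe, and the ratio between the two links (900/64, roughly 14×) is independent of model size. Real kernels overlap transfers with compute and may reuse weights across a batch, so this is only a floor on the streaming cost, not a latency prediction.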
3. Communication Kernel Design: HybridGEMM
C2CServe replaces standard cuBLAS GEMM with a HybridGEMM kernel featuring dual memory paths and a tunable partition parameter (written α here). Work allocation divides the matrix columns into two regions whose sizes are set by α:
- Symmetric (output-stationary): iterates over both input operands via C2C while keeping the output tile in registers. Favors low HBM usage but triggers repeated C2C fetches.
- Asymmetric (weight-stationary): pins weight tiles in shared memory and accumulates partial outputs in HBM via TMA.Reduction, reducing C2C traffic for weights but increasing output-related HBM traffic.
Runtime feedback control adjusts α according to the measured HBM and C2C utilizations: an imbalance between the two triggers an update to α, with a learning rate that responds to SLO violations in GEMM latency.
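The feedback loop described above can be sketched as a simple proportional controller. This is a minimal illustration under stated assumptions, not the C2CServe update rule: `alpha` is taken to be the fraction of columns assigned to the symmetric region (which leans on C2C and spares HBM), the update is linear in the utilization gap, and the `deadband`, `grow`, and `decay` parameters as well as both function names are hypothetical.

```python
def update_partition(alpha: float, hbm_util: float, c2c_util: float,
                     lr: float, deadband: float = 0.05) -> float:
    """One feedback step on the symmetric-region fraction alpha in [0, 1].

    Assumption: a larger alpha shifts traffic from HBM to C2C.
    """
    imbalance = hbm_util - c2c_util
    if abs(imbalance) <= deadband:
        return alpha  # paths are balanced enough; keep the current split
    # HBM busier than C2C: move work to the C2C-heavy symmetric region.
    # C2C busier than HBM: move work to the HBM-heavy asymmetric region.
    return min(1.0, max(0.0, alpha + lr * imbalance))


def adapt_lr(lr: float, slo_violated: bool,
             grow: float = 2.0, decay: float = 0.9,
             lr_min: float = 0.01, lr_max: float = 0.5) -> float:
    """Hypothetical rule: react faster while the GEMM latency SLO is missed."""
    lr = lr * grow if slo_violated else lr * decay
    return min(lr_max, max(lr_min, lr))


alpha, lr = 0.5, 0.1
# (hbm_util, c2c_util, slo_violated) samples, made up for illustration.
samples = [(0.9, 0.4, True), (0.8, 0.5, False), (0.6, 0.58, False)]
for hbm, c2c, violated in samples:
    lr = adapt_lr(lr, violated)
    alpha = update_partition(alpha, hbm, c2c, lr)
    print(f"alpha={alpha:.3f} lr={lr:.3f}")
```

Clamping `alpha` to [0, 1] keeps the split valid, and the deadband avoids oscillating when the two paths are already close to balanced. A production controller would also need to smooth the utilization measurements over time.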
4. Contention, NUMA Effects, and Scheduling
Because all MIG slices of a GH200/GB200 Superchip share the aggregate C2C bandwidth, multi-tenant scenarios induce contention. Scheduling in C2CServe follows a three-stage hierarchy:
- Bandwidth-aware model placement: For each model, the scheduler estimates a C2C bandwidth demand from the model's weight size and per-token latency target, and admits models only while the total estimated demand stays within the available C2C bandwidth.
- MIG-aware chunk sizing: Prefill chunks are sized to balance HBM demand against per-slice HBM bandwidth and the required time-to-first-token (TTFT).
- Kernel/chunk tuning: The HybridGEMM partition parameter is adapted online as described above, using per-layer feedback.
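The bandwidth-aware placement stage can be illustrated with a small admission-control sketch. The demand model used here (weight size divided by the per-token latency target, i.e. all weights streamed once per token) and the greedy arrival-order policy are assumptions made for illustration; the `Model` class, the `admit` function, and the example workloads are hypothetical and do not reproduce C2CServe's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    weight_gb: float       # weights resident in CPU DRAM, in GB
    tpot_target_s: float   # per-token latency target, in seconds

    @property
    def c2c_demand_gbps(self) -> float:
        # Assumed demand model: every weight crosses C2C once per token.
        return self.weight_gb / self.tpot_target_s


def admit(models, c2c_budget_gbps):
    """Greedily admit models in arrival order while total demand fits."""
    admitted, used = [], 0.0
    for m in models:
        if used + m.c2c_demand_gbps <= c2c_budget_gbps:
            admitted.append(m.name)
            used += m.c2c_demand_gbps
    return admitted, used


# Hypothetical workloads; demand in GB/s is noted next to each entry.
candidates = [
    Model("dense-8b", weight_gb=16.0, tpot_target_s=0.05),    # 320 GB/s
    Model("dense-70b", weight_gb=140.0, tpot_target_s=0.25),  # 560 GB/s
    Model("dense-13b", weight_gb=26.0, tpot_target_s=0.10),   # 260 GB/s
]
names, used = admit(candidates, c2c_budget_gbps=900.0)
print(names, round(used, 1))
```

With a 900 GB/s budget, the first two models are admitted (880 GB/s in total) and the third is rejected because it would over-subscribe the link. A real placement policy would likely leave headroom below the peak rate and could reorder or migrate models rather than admitting strictly in arrival order.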
Earlier work identified NVLink-specific NUMA effects in multi-GPU DGX-1 systems, classifying direct, dual-link, and one-/two-hop paths with up to 2× differences in bandwidth and several µs of additional latency per hop (Li et al., 2019). In C2C multi-tenant scenarios, resource isolation and NUMA-aware workload mapping are essential for maintaining predictable service quality.
5. Empirical Performance and Comparative Analysis
On a GH200 Superchip with 96 GB HBM, 480 GB CPU memory, and 900 GB/s C2C, C2CServe was benchmarked against ServerlessLLM, Aegaeon, MoE-Infinity, and FineMoE. Results (Luo et al., 19 May 2026):
| Model | ServerlessLLM | Aegaeon | C2CServe | Speed-up vs. ServerlessLLM |
|---|---|---|---|---|
| Llama-8B (dense) | 1.15 s | 1.7 s | 0.32 s | 3.6× |
| Llama-70B | OOM | OOM | 0.45 s | – |
| Mixtral-8×7B (MoE) | 5.0 s | 0.105 s | 1.1 s | 4.6× |
- C2CServe achieves cold-start latency reductions of up to 7.1× (dense) and 4.6× (MoE), sustains ≥95% SLO attainment for TTFT and time-per-output-token (TPOT) under contention, and enables serving 70B+ parameter models on small MIG slices without quantization or tensor-parallel sharding.
- Dynamic replay traces (>3M requests, 3 weeks) show C2CServe sustaining 95th-percentile TTFT ≈0.7 s, while baselines spike to several seconds.
- Ablation: Bandwidth-aware placement alone halves p99 TTFT; chunk control and online tuning deliver further multiplicative reductions.
These results establish NVLink-C2C as a critical enabler for elastic, memory-disaggregated LLM serving in serverless, multi-tenant GPU settings.
6. Comparative Context and Best Practices
Compared to prior NVLink-V1 and NVLink-V2 interconnects, NVLink-C2C offers an order of magnitude higher bandwidth (≈900 GB/s vs. ≈50 GB/s bidirectional per link) and enables a host-GPU unified address space, eliminating traditional DMA staging steps (Li et al., 2019). Early NVLink deployments faced NUMA complexities, requiring explicit mapping to maximize bandwidth and minimize latency. Not all GPU-to-GPU paths were equivalent: routing-NUMA and neighbor-NUMA effects could halve or double link throughput, complicating MPI/NCCL topology mappings.
For C2C deployments, best practices include:
- Assigning communication-heavy workloads to direct (or dual-link) C2C links.
- Dynamic, NUMA- and bandwidth-aware model placement to avoid over-subscribing per-chip links.
- Multi-stage scheduling, chunk size adaptation, and kernel autotuning to optimize for per-request SLOs and global bandwidth allocation.
- Favoring unified memory management models enabled by the NVLink-C2C unified address space.
A plausible implication is increased reliance on host memory in large-model inference, with a corresponding reduction in HBM pressure and improvement in resource utilization per GPU, especially for serverless or highly fragmented workloads.
7. Conclusion and Outlook
NVLink-Chip-to-Chip (C2C) transforms heterogeneous server design by enabling direct, high-throughput memory access between CPUs and GPUs. In multi-GPU or spatially partitioned (MIG) settings, C2C permits keeping models in host DRAM and streaming weights on demand, directly addressing bottlenecks in serverless LLM serving. The C2CServe system demonstrates that with an appropriate kernel architecture (HybridGEMM), chunking, and bandwidth-aware scheduling, C2C delivers significant latency and resource utilization gains under contended, multi-tenant conditions. NVLink-C2C thus represents both an architectural and a systems software inflection point for future large-scale AI/ML infrastructure (Luo et al., 19 May 2026, Li et al., 2019).