GH200 Superchips: Exascale CPU–GPU Integration
- The GH200 Superchip is a heterogeneous CPU–GPU package that integrates a Hopper GPU and a Grace Arm CPU with a unified memory space spanning HBM3 and LPDDR5X, targeting exascale AI and HPC workloads.
- It delivers substantial gains in compute density, energy efficiency, and memory bandwidth, and its low-precision matrix engines can be repurposed for high-precision arithmetic via mixed-precision and CRT-based emulation techniques.
- The platform enables breakthrough performance in scientific computing, deep learning, and simulation by leveraging unified memory, cache-coherent interconnects, and dynamic power management.
The GH200 Superchip is a tightly integrated heterogeneous CPU–GPU package developed by NVIDIA, primarily targeting exascale-class AI, high-performance computing (HPC), and high-throughput scientific applications. The platform combines a Hopper-architecture GPU and a Grace Arm-based CPU within a single package, exposing a unified address space over GPU-attached HBM3 and CPU-attached LPDDR5X, with the two dies interconnected via cache-coherent NVLink-C2C links that deliver exceptionally high intra-node bandwidth and memory coherence. The GH200 platform sets new standards for compute density, energy efficiency, memory bandwidth, interconnect latency, and the granularity of hardware–software co-design in tightly coupled CPU–GPU architectures.
1. Hardware Architecture and Memory/Interconnect Hierarchy
The GH200 Superchip consists of:
- A Hopper (H100-class) GPU with up to 132 Streaming Multiprocessors (SMs), fourth-generation Tensor Cores supporting INT8 MMAs, and 96 GB of on-package HBM3 (up to 144 GB of HBM3e in later configurations).
- A 72-core Grace ARM Neoverse V2 CPU, with 64 KB L1 I + 64 KB L1 D and 1 MB L2 per core, a 114 MB shared L3, and 480 GB LPDDR5X.
- NVLink-C2C interconnect providing up to 900 GB/s bidirectional cache-coherent bandwidth between CPU and GPU within a superchip, and up to 450 GB/s between GPUs or to peer CPUs.
- Unified virtual address space (system-wide 64 KB pages): Address Translation Services (ATS) over NVLink-C2C plus a dedicated ATS-TBU unit implement unified page tables and hardware cache coherence across CPU and GPU caches.
- On-package TDP (thermal design power) sharing: a unified power budget (nominally 600–1,000 W) partitioned dynamically according to workload requirements.
Measured intra-node memory bandwidth and latency are summarized below:
| Path | Bandwidth (GB/s, % of theoretical peak where reported) | Latency (ns) |
|---|---|---|
| GPU→local HBM3 | ~3,700 (93%) | ~130 |
| CPU→local DDR | ~465 (93%) | ~80 |
| CPU→HBM3 via NVLink | ~240–288 (53–64%) | ~140 |
| GPU→DDR via NVLink | ~378–418 (84–93%) | ~130 |
| GPU→peer HBM3 (NVLink) | ~80–100 | ~175 |
| CPU↔CPU (chiplet) | ~260 (ping-pong) | ~260 |
Additional features include AMBA CHI-based coherence, atomic read-modify-write (RMW) operations across CPU and GPU, 64 B (cache-line) coherence granularity, and GPUDirect RDMA for zero-copy intra- and inter-node transfers.
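To make the coherence model concrete, the following minimal sketch (illustrative usage assumed for a GH200 node with ATS enabled, not code from any cited work) has a GPU kernel dereference ordinary malloc()'d system memory; ATS over NVLink-C2C resolves the CPU page tables on the GPU's behalf, so no explicit copy or managed allocation is needed.

```cuda
// Minimal sketch (assumes a GH200-class node with ATS enabled): the GPU
// dereferences plain malloc()'d system memory directly, because ATS over
// NVLink-C2C gives CPU and GPU a shared page table and coherent access.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double* x, double a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;          // touches CPU-allocated pages over C2C
}

int main() {
    const size_t n = 1 << 20;
    // Plain system allocation: pages start out in CPU-attached LPDDR5X.
    double* x = static_cast<double*>(malloc(n * sizeof(double)));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0;

    scale<<<(n + 255) / 256, 256>>>(x, 2.0, n);   // no cudaMemcpy, no cudaMallocManaged
    cudaDeviceSynchronize();

    printf("x[0] = %f (expected 2.0)\n", x[0]);   // CPU observes the GPU's update
    free(x);
    return 0;
}
```

On hardware without this coherent path the same kernel would fault or require managed memory; achieved bandwidth follows the CPU-memory rows of the table above, since the pages remain in LPDDR5X unless migrated.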
2. Mixed-Precision and Low-Precision Compute Engines
The GH200's Hopper GPU incorporates fourth-generation Tensor Cores with high-throughput INT8×INT8→INT32 MMA units that process wide INT8 tiles per SM each cycle. This enables peak INT8 throughput on the order of two peta-operations per second (≈2,000 TOPS).
The architecture supports efficient emulation of high-precision GEMM operations (SGEMM/DGEMM) using CRT-based schemes (Ozaki Scheme II). By decomposing large high-precision matrix multiplications into a sequence of INT8 GEMMs (with pairwise-coprime moduli), blocking them into per-SM shared memory tiles, and reconstructing the result via fast modular arithmetic and floating-point FMAs, the GH200 enables:
- 1.44× speedup for DGEMM emulation (FP64) and 43% better power efficiency versus native DGEMM at large matrix sizes (n ≥ 16,384).
- 3.0× speedup and up to 154% greater power efficiency for SGEMM (FP32) versus native SGEMM.
- Over 2× performance compared to prior emulation schemes (Ozaki Scheme I).
The CRT-based approach chooses pairwise-coprime moduli, scales and quantizes rows and columns, maximizes throughput by blocking into maximal MMA-sized tiles, and streams per-tile partial results into a single FP accumulation kernel that hides memory and modular-reconstruction costs. Multiple scaling modes are available (fast/approximate using Cauchy–Schwarz bounds; accurate using an INT8 bounding GEMM). All modular operations leverage precomputed division tables and fused multiply-add instructions, minimizing hardware division or fmod latency (Uchino et al., 6 Aug 2025).
This implementation allows the GH200’s AI-specialized hardware to be efficiently repurposed for full-precision HPC workloads, scientific simulation codes, and arbitrary-precision tensor operations.
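As a concrete illustration of the residue-arithmetic idea, here is a minimal sketch under simplifying assumptions (it is not the Uchino et al. implementation: the floating-point splitting/scaling step is omitted, and the per-modulus GEMMs use a plain kernel rather than INT8 Tensor Core MMAs via cuBLAS/CUTLASS). It multiplies small integer matrices modulo three pairwise-coprime moduli and reconstructs the exact product entry-wise with the Chinese Remainder Theorem.

```cuda
// Illustrative CRT-based GEMM emulation: one narrow-integer GEMM per modulus,
// followed by entry-wise CRT reconstruction of the exact product.
#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// One residue GEMM: C = (A * B) mod m, with A, B already reduced into [0, m).
__global__ void gemm_mod(const int8_t* A, const int8_t* B, int32_t* C,
                         int n, int32_t m) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    int64_t acc = 0;
    for (int k = 0; k < n; ++k)
        acc += (int64_t)A[row * n + k] * B[k * n + col];
    C[row * n + col] = (int32_t)(acc % m);
}

// Modular inverse by brute force (the moduli here are tiny).
static int64_t mod_inverse(int64_t a, int64_t m) {
    a %= m;
    for (int64_t t = 1; t < m; ++t)
        if (a * t % m == 1) return t;
    return -1;  // unreachable for coprime a, m
}

int main() {
    const int n = 4;
    const int32_t mods[3] = {101, 103, 107};   // pairwise coprime, residues fit in int8
    const int64_t M = 101LL * 103 * 107;       // dynamic range of the CRT

    // Reference: exact integer GEMM on the host.
    std::vector<int64_t> A(n * n), B(n * n), C_ref(n * n, 0), C_crt(n * n, 0);
    for (int i = 0; i < n * n; ++i) { A[i] = i + 1; B[i] = 2 * i - 3; }
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C_ref[i * n + j] += A[i * n + k] * B[k * n + j];

    int8_t *dA, *dB; int32_t *dC;
    cudaMalloc(&dA, n * n); cudaMalloc(&dB, n * n);
    cudaMalloc(&dC, n * n * sizeof(int32_t));
    std::vector<int8_t> hA(n * n), hB(n * n);
    std::vector<int32_t> hC(n * n);

    for (int32_t m : mods) {
        // Reduce operands into [0, m).
        for (int i = 0; i < n * n; ++i) {
            hA[i] = (int8_t)(((A[i] % m) + m) % m);
            hB[i] = (int8_t)(((B[i] % m) + m) % m);
        }
        cudaMemcpy(dA, hA.data(), n * n, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), n * n, cudaMemcpyHostToDevice);
        dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
        gemm_mod<<<grid, block>>>(dA, dB, dC, n, m);
        cudaMemcpy(hC.data(), dC, n * n * sizeof(int32_t), cudaMemcpyDeviceToHost);

        // CRT accumulation: x += r_k * (M/m_k) * ((M/m_k)^-1 mod m_k)   (mod M)
        int64_t Mk = M / m, inv = mod_inverse(Mk, m);
        for (int i = 0; i < n * n; ++i)
            C_crt[i] = (C_crt[i] + (int64_t)hC[i] * Mk % M * inv) % M;
    }
    // Map back from [0, M) to signed results and check against the reference.
    int errors = 0;
    for (int i = 0; i < n * n; ++i) {
        int64_t v = (C_crt[i] > M / 2) ? C_crt[i] - M : C_crt[i];
        if (v != C_ref[i]) ++errors;
    }
    printf("CRT-reconstructed GEMM: %d mismatches\n", errors);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

In the full scheme the floating-point operands are first scaled and split so that their integer images stay within the moduli's combined dynamic range, and the reconstruction is fused into a single FP accumulation kernel rather than done on the host.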
3. Power Efficiency, Energy Management, and Thermal Behavior
The GH200’s integrated power and thermal management supports:
- A single, programmable CPU–GPU power cap for the entire superchip.
- Dynamic power steering: unused CPU/GPU headroom is automatically transferred to the other domain as needed.
- Fine-grained per-kernel or per-task power optimization, with metrics such as Speedup-Energy-Delay (SED) and normalized runtime/energy (Euclidean-distance to ideal) used to tune per-codelet power caps.
For example, compute-bound GPU kernels require near-maximal caps for best time-to-solution, while memory-bound or idle kernels can see up to 40–47% reductions in energy with only ~10–30% increases in wall-clock time at lower caps. These strategies yield large aggregate gains in energy-efficiency metrics (~150–200% in idealized settings) and improved system-level, MW-scale energy efficiency at exascale (Patrou et al., 27 May 2025).
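The cited study's exact tooling is not reproduced here; as one plausible mechanism, the sketch below uses NVML (a real NVIDIA API, though the 450 W value and the phase structure are illustrative assumptions) to lower the GPU power cap around a memory-bound phase and restore it for a compute-bound one. Module-level power steering between the Grace CPU and the Hopper GPU is handled by the platform itself; NVML addresses only the GPU-side cap.

```cuda
// Hedged sketch of per-phase GPU power capping via NVML (assumed mechanism,
// illustrative cap values). Setting limits requires administrative privileges;
// link with -lnvidia-ml.
#include <cstdio>
#include <nvml.h>

static void set_gpu_power_cap(nvmlDevice_t dev, unsigned int watts) {
    // NVML expresses power limits in milliwatts.
    nvmlReturn_t rc = nvmlDeviceSetPowerManagementLimit(dev, watts * 1000);
    if (rc != NVML_SUCCESS)
        fprintf(stderr, "power cap failed: %s\n", nvmlErrorString(rc));
}

int main() {
    nvmlInit_v2();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    unsigned int min_mw = 0, max_mw = 0;
    nvmlDeviceGetPowerManagementLimitConstraints(dev, &min_mw, &max_mw);
    printf("GPU cap range: %u-%u W\n", min_mw / 1000, max_mw / 1000);

    set_gpu_power_cap(dev, 450);            // memory-bound phase: lower cap, small slowdown
    // ... launch memory-bound kernels here ...

    set_gpu_power_cap(dev, max_mw / 1000);  // compute-bound phase: restore full headroom
    // ... launch compute-bound kernels here ...

    nvmlShutdown();
    return 0;
}
```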
4. Memory Placement, Data Movement, and NUMA Optimization
GH200 exposes a unified address space, but physical locality remains the dominant factor for achieved bandwidth and latency:
- Placing all operands in on-package HBM3 enables single-GEMM FP64 throughput of ~67 TFLOPS; moving operands to DDR (system memory) or across the C2C link cuts performance to roughly one half or one third.
- CUDA allocation APIs (cudaMalloc, cudaMallocHost, cudaMallocManaged) trade off locality, ease of migration, and visibility for different use cases. Intra-package data placement should preferentially use cudaMalloc (HBM3) for GPU-resident compute, CPU-local DDR for host-resident data, and pinned host memory for explicit large CPU→GPU DMA transfers, as sketched after this list.
- Each additional NVLink or inter-node hop incurs ~0.08 µs extra latency and reduces max attainable bandwidth by 3–5×.
- On multi-node installations, optimal process/rank binding, sufficient rank-per-NIC to saturate inter-node bandwidth, and careful mapping to DDR/HBM domains are required for optimal performance (Fusco et al., 21 Aug 2024).
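The allocation trade-offs above can be summarized in a short sketch (typical usage assumed; buffer sizes and the prefetch target are illustrative, not taken from the cited study):

```cuda
// Allocator choices on a GH200 node: HBM3 for GPU-resident operands, CPU-local
// system memory for host-resident data, pinned memory for large explicit DMA
// transfers, and managed memory when migration is left to the driver.
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;   // 1 GiB, illustrative

    // 1. GPU-resident working set: lands in on-package HBM3.
    float* d_hbm = nullptr;
    cudaMalloc(&d_hbm, bytes);

    // 2. Host-resident data: ordinary allocation in CPU-local LPDDR5X ("DDR");
    //    the GPU can still reach it over NVLink-C2C at reduced bandwidth.
    float* h_ddr = static_cast<float*>(malloc(bytes));

    // 3. Pinned host memory: best for large, explicit CPU->GPU DMA copies.
    float* h_pinned = nullptr;
    cudaMallocHost(&h_pinned, bytes);
    cudaMemcpy(d_hbm, h_pinned, bytes, cudaMemcpyHostToDevice);

    // 4. Managed memory: one pointer whose pages the driver migrates between
    //    DDR and HBM3 on demand; a prefetch steers the pages to the GPU.
    float* um = nullptr;
    cudaMallocManaged(&um, bytes);
    cudaMemPrefetchAsync(um, bytes, /*dstDevice=*/0, /*stream=*/0);

    cudaDeviceSynchronize();
    cudaFree(d_hbm); cudaFreeHost(h_pinned); cudaFree(um); free(h_ddr);
    return 0;
}
```

On the CPU side, the DDR and HBM3 regions are typically exposed as separate NUMA nodes, so standard NUMA tooling (e.g., numactl binding) complements these CUDA-level choices for host-resident data.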
5. Real-World Application Benchmarks and System Scaling
Dense Linear Algebra and Scientific Computing
- Out-of-core left-looking Cholesky factorization with mixed-precision scheduling (FP64/FP32/FP16/FP8): single-GPU throughput of 58.9 TFLOPS in FP64 (20% faster than cuSOLVER), scaling up to 185.5 TFLOPS on 4 GPUs at 80–90% ideal efficiency. Mixed-precision reaches 136 TFLOPS single-GPU, with up to 3× speedup at acceptable Kullback–Leibler divergence for tolerance down to 1e-5 (Ren et al., 13 Oct 2024).
- The GH200 achieves weak and strong scaling in first-principles global earth-system simulations (1.25 km ICON model), reaching temporal compression ratios of up to 145.7 simulated days/day at 20,480 superchips, with per-node dynamic power balancing and domain-specific CPU/GPU functional partitioning (Klocke et al., 3 Nov 2025).
- Bayesian inference in spatio-temporal Gaussian processes achieves two–three orders of magnitude improvement in time-to-solution by exploiting GH200’s HBM3 bandwidth, NVLink/NCCL collectives, and hierarchical parallel scheme across hundreds of superchips (Gaedke-Merzhäuser et al., 9 Jul 2025).
Large-Scale AI, Deep Learning, and LLMs
- In LLM training, closely coupled CPU–GPU integration enables the SuperOffload system to deliver up to 2.5× the throughput of ZeRO-Offload while allowing single-node training with 25B parameter models (and 200B across 16 superchips), supporting context length up to 1M tokens at >50% MFU. Key techniques include: adaptive weight-offloading, fine-grained bucketization (tuned to C2C bandwidth), speculative execution, and ARM-optimized vectorized Adam (Lian et al., 25 Sep 2025).
- LLM inference workloads achieve 1.9–2.7× faster large-batch prefill than PCIe-based alternatives thanks to high NVLink-C2C bandwidth; at low batch sizes, however, GH200 inference remains CPU-bound over a batch-size range up to 4× wider than on loosely coupled (PCIe-attached) systems, owing to single-thread kernel-launch overhead on the Arm Neoverse cores. Techniques such as kernel fusion guided by TKLQT profiling are needed to reduce this launch tax (Vellaisamy et al., 16 Apr 2025); a generic fusion sketch follows.
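The TKLQT-guided fusion of the cited work is not reproduced here; the sketch below (hypothetical kernel names and shapes) shows only the generic mechanism: merging two small elementwise kernels into one halves the number of CPU-side launches, which is what dominates small-batch latency on the Grace cores.

```cuda
// Illustrative kernel fusion (not the TKLQT method): two small elementwise
// kernels are replaced by one fused kernel, eliminating a launch per call.
#include <cuda_runtime.h>

__global__ void bias_kernel(float* x, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += b[i];
}
__global__ void relu_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}
// Fused version: one launch, one pass over the data.
__global__ void bias_relu_kernel(float* x, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i] + b[i], 0.0f);
}

void unfused(float* x, const float* b, int n, cudaStream_t s) {
    int grid = (n + 255) / 256;
    bias_kernel<<<grid, 256, 0, s>>>(x, b, n);       // launch #1
    relu_kernel<<<grid, 256, 0, s>>>(x, n);          // launch #2
}
void fused(float* x, const float* b, int n, cudaStream_t s) {
    int grid = (n + 255) / 256;
    bias_relu_kernel<<<grid, 256, 0, s>>>(x, b, n);  // single launch
}

int main() {
    const int n = 1 << 16;
    float *x, *b;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));
    unfused(x, b, n, 0);
    fused(x, b, n, 0);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(b);
    return 0;
}
```

CUDA Graphs are another common way to amortize launch overhead across sequences of many small kernels.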
Specialized and Heterogeneous Workloads
- On the National Research Platform, GH200 delivers 35–55% faster training than A100 for ViTs and transformers, and can even outperform multi-GPU A100 setups on large-scale transformer detection when model size and on-chip memory are the primary bottlenecks (Hurt et al., 21 Oct 2024).
- Biomolecular simulation codes benefit from the 61 TFLOPS FP32/30 TFLOPS FP64 peak of the H100, with up to 102 GFLOP/W and 480 GB main memory per node, enabling high-throughput molecular dynamics while highlighting ARM-compatibility caveats with legacy scientific packages (Welch et al., 18 Jun 2025).
- Quantum circuit simulation (JUQCS-50) leverages GH200's unified HBM3 + LPDDR5X memory for 50-qubit emulation, novel adaptive data encoding, and on-the-fly network/MPI optimization to achieve an 11.4× speedup versus earlier records (48 qubits) (Raedt et al., 5 Nov 2025).
- CPU-only workloads (e.g., ordered parallel MOS algorithms) can effectively exploit all 72 ARM cores, with speedups up to 23× versus single-thread, provided label-extraction latency and update serialization bottlenecks are mitigated (Gold et al., 25 Nov 2024).
6. Theoretical Limits and Performance Ceilings
The GH200, when situated within the continuum model of supercomputing, approaches the classical performance ceiling set by physical resource densities and the speed of light. Even with arbitrarily high compute and memory density, communication and latency costs bottleneck exascale workloads such as CG, FFT, and matrix multiplication at levels set by on-die/intra-system signal-propagation velocities and the machine's physical extent. For the DGX GH200, the model places the theoretical ceiling on per-iteration CG performance far beyond what even 1,000× denser classical designs could deliver, unless physical distance or signal velocity is fundamentally changed (Karp et al., 9 May 2024).
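A schematic form of this bound (notation assumed here for illustration rather than taken verbatim from Karp et al.): if each CG iteration requires at least one global reduction across a machine of physical extent L, with signals propagating at velocity v ≤ c and W_iter floating-point operations of work per iteration, then

```latex
% Schematic latency bound: L = physical extent, v <= c = signal velocity,
% W_iter = FLOPs per CG iteration.
T_{\mathrm{iter}} \;\ge\; \frac{2L}{v},
\qquad\Longrightarrow\qquad
P_{\mathrm{CG}} \;=\; \frac{W_{\mathrm{iter}}}{T_{\mathrm{iter}}}
\;\le\; \frac{W_{\mathrm{iter}}\, v}{2L}.
```

For fixed per-iteration work, only shrinking L or raising v moves this ceiling, independent of how much denser the compute becomes.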
Pushing further requires photonic integration, locality-optimized layout, tight algorithm–hardware co-design (communication-avoiding and approximate schemes), and system-wide balancing of compute-, memory-, and communication-densities.
7. Implications and Future Directions
The GH200’s platform characteristics enable:
- Acceleration of classical HPC workloads through re-purposed AI hardware (e.g., CRT-emulated DGEMM/SGEMM for scientific codes).
- Unified, program-accessible memory across CPU–GPU, facilitating highly heterogeneous, concurrent application execution (Earth system, quantum simulation).
- Order-of-magnitude increases in throughput and energy efficiency for deep learning, spatio-temporal inference, and molecular modeling.
- The possibility of “any-precision” supercomputing through modular, arbitrarily high-accuracy matrix engines and CRT frameworks.
- New system-level constraints and opportunities defined by large-scale cache-coherence, power steering, and TDP co-management.
A key caveat is that while the unified memory and interconnect flatten traditional host–device boundaries, ultimate performance still depends overwhelmingly on aligning workload data placement and access patterns with the physical location (on-chip HBM3 versus DDR, intra-superchip versus peer link). Automated task-aware data mapping and NUMA/affinity strategies are thus required for real-world deployment.
The GH200 Superchip exemplifies the direction of exascale supercomputing towards tightly coupled, high-density, heterogeneous nodes, but it also exposes the need for cross-disciplinary algorithm–system co-design to realize its full potential within the physical and architectural limits of classical computation.