GH200 Grace Hopper Superchip
- GH200 is a heterogeneous computing device that integrates NVIDIA’s Hopper GPU and Grace CPU to deliver high computational throughput and a unified memory architecture.
- It features advanced microarchitectural innovations such as fourth-generation tensor cores with FP8 support and dedicated DPX units that accelerate AI and dynamic programming tasks.
- The chip’s unified memory system and NVLink-C2C interconnect enable low-latency, cache-coherent data sharing, optimizing performance for HPC, AI inference, and scientific simulations.
The GH200 Grace Hopper Superchip is a tightly integrated heterogeneous computing device that combines an NVIDIA Hopper GPU and an NVIDIA Grace CPU within a single package. Designed to deliver high computational throughput, reduced memory latency, and advanced programmability for large-scale AI, high-performance computing (HPC), and scientific applications, the GH200 leverages key microarchitectural and system-level innovations including fourth-generation tensor cores with FP8 support, dynamic programming (DPX) instruction units, unified memory with hardware cache coherence, and an ultra-high-bandwidth NVLink Chip-to-Chip (C2C) interconnect. Its architecture enables new capabilities in software optimization, code porting, and large-scale data-intensive workloads, redefining the CPU–GPU integration paradigm.
1. Microarchitecture and Instruction Set Innovations
The GH200’s Hopper GPU introduces multiple novel hardware features not present in earlier NVIDIA architectures. The most significant is the enhancement of the tensor cores (now in their fourth generation) to support the FP8 format (E5M2 and E4M3 variants), dramatically increasing raw throughput for LLM training and inference while maintaining acceptable accuracy in deep learning. Dedicated DPX instruction units provide native hardware acceleration for dynamic programming (DP) problems, such as the minimum/maximum recurrences in Smith-Waterman, delivering up to 13× speedups over software emulation for some 16-bit DPX operations.
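To make the recurrence concrete, the following minimal CUDA sketch updates one anti-diagonal of a Smith-Waterman-style scoring matrix; the fused add/max chains are exactly the pattern DPX targets (CUDA 12 also exposes DPX intrinsics, e.g. __viaddmax_s32, for them). The scoring constants and array layout are illustrative rather than taken from the benchmarks cited here.

```cuda
// Minimal sketch: one anti-diagonal update of a Smith-Waterman-style
// dynamic-programming recurrence. The add+max chains below are the pattern
// Hopper's DPX units accelerate. Scoring constants and the diagonal layout
// are illustrative, not from the source.
__global__ void sw_antidiagonal(const int* __restrict__ diag2,   // H[i-1][j-1]
                                const int* __restrict__ diag1,   // H[i-1][j] / H[i][j-1]
                                const int* __restrict__ match,   // +s or -s per cell
                                int* __restrict__ out, int n)
{
    constexpr int GAP = -2;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;

    int h_diag = diag2[j]     + match[j];   // match/mismatch
    int h_up   = diag1[j]     + GAP;        // gap in one sequence
    int h_left = diag1[j + 1] + GAP;        // gap in the other sequence

    // max(0, h_diag, h_up, h_left): the classic DP cell update.
    out[j] = max(0, max(h_diag, max(h_up, h_left)));
}
```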
The GH200 further incorporates distributed shared memory (DSM), allowing threads across different thread blocks (even on different streaming multiprocessors within a cluster) to directly access or map each other’s shared memory. This feature relieves pressure on on-chip memory hierarchies and can reduce intra-kernel communication overheads by up to 7×.
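A minimal sketch of how DSM is reached from CUDA (thread block clusters plus cooperative groups, compiled for sm_90); the cluster size and exchanged values are illustrative:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block writes a value into its own shared memory, then reads the
// neighboring block's shared memory directly via map_shared_rank(), without
// a round trip through global memory.
__global__ void __cluster_dims__(2, 1, 1) dsm_exchange(int* out)
{
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();

    if (threadIdx.x == 0) smem[0] = 100 + rank;   // publish a per-block value
    cluster.sync();                               // make it visible cluster-wide

    // Map the shared memory of the other block in this 2-block cluster.
    int* remote = cluster.map_shared_rank(smem, (rank + 1) % cluster.num_blocks());
    if (threadIdx.x == 0) out[blockIdx.x] = remote[0];
    cluster.sync();                               // keep smem alive until peers finish
}
```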
At the instruction set level, Hopper extends beyond the legacy “mma” and “mma.sp” (sparse) PTX instructions by introducing asynchronous warp-group-level “wgmma” and “wgmma.sp” operations. These allow a group of four warps to issue matrix multiply-accumulate operations asynchronously, overlapping data movement with computation and yielding improved device occupancy and throughput, especially in computationally dense scenarios. SASS analysis confirms that FP8 operations map to optimized GMMA (QGMMA) instructions, in contrast to FP16’s HMMA mapping.
2. Unified Memory and Cache-Coherent System Architecture
A defining feature of the GH200 is its unified memory architecture, which presents a single virtual address space to both the Grace CPU and the Hopper GPU, eliminating the classical separation between host and accelerator memory. This unified memory architecture (UMA) is realized via a shared, system-wide page table held in CPU DRAM and a System Memory Management Unit (SMMU) compatible with Arm’s SMMUv3 specification. Virtual-to-physical translation thus works across both processing units, supported by hardware page walkers for efficient access.
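In practice this means a pointer obtained from plain malloc can be handed to a kernel unchanged; the sketch below assumes a GH200-class system with this system-wide address translation and is illustrative rather than taken from the cited studies:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double* x, int n, double a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    // Plain system allocation: no cudaMalloc / cudaMallocManaged and no
    // explicit copies. On GH200 the shared page table and NVLink-C2C
    // coherence let the GPU translate and access this memory directly.
    double* x = static_cast<double*>(malloc(n * sizeof(double)));
    for (int i = 0; i < n; ++i) x[i] = 1.0;      // first touch on the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                  // CPU reads the GPU's result
    free(x);
    return 0;
}
```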
NVLink-C2C, a cache-coherent interconnect with 450 GB/s peak bandwidth per direction, physically links the Grace CPU (with 72 Neoverse V2 cores and 480 GB of LPDDR5X memory) and the Hopper GPU (with 132 SMs and 96 GB of HBM3), facilitating fine-grained cache-line transfers (down to 64 or 128 bytes) across devices. This architecture allows both devices to transparently share pinned, managed, or system-allocated memory, with automatic hardware-level migration policies (e.g., first-touch, access-counter based) and prefetch capabilities.
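The automatic policies can be steered explicitly with the standard CUDA managed-memory hint APIs; a minimal sketch, assuming `a` was allocated with cudaMallocManaged (or as system memory on a coherent platform) and `gpu` is the local Hopper device:

```cuda
#include <cuda_runtime.h>

// Sketch: steering placement and migration with standard CUDA hints. Pages
// that are not prefetched explicitly are still handled by the automatic
// policies described above (first touch, access counters).
void place_operands(float* a, size_t bytes, int gpu, cudaStream_t stream)
{
    // Prefer keeping this buffer resident in the Hopper GPU's HBM3...
    cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, gpu);
    // ...but let the Grace CPU map it directly instead of faulting pages back.
    cudaMemAdvise(a, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
    // Migrate the pages ahead of the kernels that will consume them.
    cudaMemPrefetchAsync(a, bytes, gpu, stream);
}
```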
Larger system page sizes (e.g., 64 KB) further accelerate allocation and migration operations; in quantum simulation workloads, such sizing reduced initialization overhead by 5× compared to 4 KB pages.
3. Memory Subsystem, Data Movement, and Placement
Memory hierarchy benchmarks show Hopper matching Ada and Ampere on L1/shared-memory latency (~32–33 cycles) while delivering distinct improvements in global and L2 cache throughput. The large, partitioned L2 cache (e.g., 50 MB on the H800, split into two partitions) creates distinct “near hit” and “far hit” access latencies (roughly 258–743 cycles), which must be modeled for accurate performance prediction and allocation decisions. Global memory access on the H800 reached up to 2039 GB/s, and L2 throughput exceeded 4472 bytes/clk, often 4× higher than in previous generations.
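Latency figures of this kind are typically obtained with a pointer-chasing microbenchmark; a minimal single-thread sketch (stride, footprint, and launch configuration are up to the experimenter):

```cuda
// Pointer-chasing sketch of the kind used to measure the latencies quoted
// above: a single thread (launch with <<<1, 1>>>) follows a dependent chain
// of loads, so each access pays the full latency of whichever level of the
// hierarchy the working set fits in. The host initializes chain[] as a
// circular list, e.g. chain[i] = (i + stride) % len.
__global__ void p_chase(const unsigned int* chain, int iters,
                        long long* cycles, unsigned int* sink)
{
    unsigned int idx = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        idx = chain[idx];             // dependent load: no overlap possible
    long long stop = clock64();

    *cycles = (stop - start) / iters; // average cycles per access
    *sink = idx;                      // keep the chain from being optimized away
}
```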
On multi-GH200 (e.g., Quad GH200 node in Alps), system memory forms physically heterogeneous NUMA domains. Benchmarking reveals locality as a dominant performance factor: compute-bound workloads such as GEMM achieve optimal throughput only when data is resident in the Hopper’s local HBM, falling off rapidly if data is placed elsewhere (e.g., DDR next to Grace or across the NVLink fabric). For peer-to-peer and inter-node transfers, achievable copy bandwidth drops due to additional traversals (e.g., 450 GB/s for DDR-to-DDR, halved for two traversals). Developers must explicitly manage memory placement or rely on adaptive allocation/managed memory policies, weighing ease of use against maximum performance.
4. AI and Linear Algebra Acceleration
Fourth-generation tensor cores with FP8 support, combined with asynchronous wgmma instructions, deliver up to 1500 TFLOPS/TOPS for dense MMA kernels (zero-initialized matrices, idealized conditions). Even under random data—which better mimics real-world training—throughput for wgmma instructions remains substantial, with constant kernel latencies (e.g., 128 cycles for dense wgmma) and observable benefits from DSM and TMA (Tensor Memory Accelerator) asynchronous copy APIs.
Dedicated benchmarks on libraries like the Transformer Engine demonstrate that FP8 implementations nearly double throughput vs. FP16 for large hidden sizes. Application-level metrics show substantive gains in LLM inference and training measured in tokens/s, particularly for computationally intensive kernels. Sparse wgmma instructions on Hopper deliver up to 1.42× speedup over dense counterparts, although not always meeting theoretical limits due to factors such as shared memory access contention.
In scientific computing and BLAS-heavy codes, the UMA/NVLink-C2C architecture enables performant automatic offload (e.g., SCILIB-Accel with its Device First-Use policy): matrix operands are migrated on first touch and reused hundreds of times, amortizing the migration penalty and yielding nearly 3× speedup over CPU baselines while outperforming native CUDA implementations for codes like LSMS (from the MuST suite).
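The underlying pattern, independent of the SCILIB-Accel tooling itself, can be sketched as calling cuBLAS directly on CPU-allocated operands and letting the coherent UMA migrate pages on first GPU touch; the sizes and column-major layout below are illustrative:

```cuda
#include <cstdlib>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch of the pattern automatic BLAS offload relies on: the matrices live
// in ordinary CPU allocations, cuBLAS is pointed at them directly, and the
// UMA / NVLink-C2C hardware migrates the pages on first GPU touch. Repeated
// calls then hit HBM-resident data.
void dgemm_on_host_buffers(int n, int reps)
{
    size_t bytes = size_t(n) * n * sizeof(double);
    double* A = static_cast<double*>(malloc(bytes));
    double* B = static_cast<double*>(malloc(bytes));
    double* C = static_cast<double*>(malloc(bytes));
    for (size_t i = 0; i < size_t(n) * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    cublasHandle_t h;
    cublasCreate(&h);
    const double alpha = 1.0, beta = 1.0;
    for (int r = 0; r < reps; ++r)   // first call migrates, later calls reuse
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cublasDestroy(h);
    free(A); free(B); free(C);
}
```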
5. Data Movement, Out-of-Core, and Mixed-Precision Strategies
The GH200’s high-bandwidth interconnect and memory coherence minimize the historic bottleneck of explicit host-device copying. For out-of-core algorithms whose working sets exceed device HBM, fine-grained task-based schedulers (e.g., the PaRSEC runtime combined with static task scheduling and hardware-cached migration) allow seamless data streaming and precise overlap of transfers with computation. For mixed-precision Cholesky factorization (using tile-wise adaptive precision among FP64, FP32, FP16, and FP8), the GH200 achieves near-theoretical GEMM performance and a 3× speedup versus FP64-only execution, delivering application-accurate results in exascale climate emulation and statistical modeling benchmarks.
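Conceptually, the tile-wise precision choice can be pictured as a threshold rule comparing each tile’s relative magnitude with the target accuracy; the rule and constants below are an illustrative sketch, not the published criterion:

```cuda
enum class TilePrec { FP64, FP32, FP16, FP8 };

// Approximate unit roundoffs: FP8 (E4M3) ~ 2^-4, FP16 ~ 2^-11, FP32 ~ 2^-24.
constexpr double U_FP8  = 6.25e-2;
constexpr double U_FP16 = 4.88e-4;
constexpr double U_FP32 = 5.96e-8;

// Conceptual sketch of tile-wise adaptive precision: tiles whose relative
// contribution is small enough that their rounding error stays below the
// target accuracy eps can be stored and multiplied in lower precision.
TilePrec choose_precision(double tile_norm, double global_norm, double eps)
{
    double rel = tile_norm / global_norm;   // relative weight of this tile
    if (rel * U_FP8  <= eps) return TilePrec::FP8;
    if (rel * U_FP16 <= eps) return TilePrec::FP16;
    if (rel * U_FP32 <= eps) return TilePrec::FP32;
    return TilePrec::FP64;
}
```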
6. Performance Modeling, Optimization, and Workload Characterization
Performance modeling for the Grace CPU uses a port-based in-core timing model integrated with memory-side Roofline or ECM models for node-level runtime prediction. Grace’s write-allocate evasion mechanism automatically eliminates excess traffic when initializing arrays, achieving a nearly optimal traffic ratio of 1.0 and significantly outperforming AMD Zen 4 and Intel Sapphire Rapids under their default store policies. Grace also maintains a steady 3.4 GHz under vector-intensive loads, avoiding the throttling seen in peer architectures.
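To make the traffic ratio concrete: an initialization loop only needs to write each cache line once (ratio 1.0), whereas a conventional write-allocate store first reads the line it is about to overwrite (ratio 2.0); Grace detects the full-line overwrite and skips the read. A host-side sketch of such a loop (OpenMP pragma assumed, purely illustrative):

```cuda
// Host-side initialization loop used to probe write-allocate behavior:
// optimal traffic is 8 bytes written per element (ratio 1.0). A classic
// write-allocate cache also reads each line before overwriting it, doubling
// the traffic (ratio 2.0). Grace avoids the read automatically.
void init_array(double* a, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        a[i] = 0.0;                 // full-line overwrite, no data reuse
}
```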
For LLM inference workloads, the closely coupled (CC) GH200 outperforms PCIe-attached A100/H100 configurations by 1.9×–2.7× in large-batch (GPU-bound) regimes. However, Grace’s single-thread performance can make kernel launch and queuing overhead (measured as Total Kernel Launch and Queuing Time, TKLQT) the bottleneck, exposing latency penalties in low-batch, CPU-bound operation. Kernel fusion, guided by proximity scoring, can cut launch counts and deliver up to 6.8× speedup for small-batch inference.
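A rough way to observe the overhead that TKLQT captures is to time a burst of tiny kernel launches against a single fused kernel doing equivalent work; the sketch below uses CUDA events and stand-in kernels, not the instrumentation from the cited study:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tiny_step(float* x) { x[threadIdx.x] += 1.0f; }   // stand-in for a small op
__global__ void fused_steps(float* x, int steps)
{
    for (int s = 0; s < steps; ++s) x[threadIdx.x] += 1.0f;        // same work, one launch
}

// Rough comparison of many small launches vs. one fused launch. With tiny
// kernels the measured gap is dominated by CPU-side launch and queuing
// overhead, which is what kernel fusion removes in the low-batch regime.
// d_x must point to at least 32 device-accessible floats.
float compare_launch_overhead(float* d_x, int steps)
{
    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    for (int s = 0; s < steps; ++s) tiny_step<<<1, 32>>>(d_x);
    cudaEventRecord(t1);
    fused_steps<<<1, 32>>>(d_x, steps);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float ms_many = 0.f, ms_fused = 0.f;
    cudaEventElapsedTime(&ms_many, t0, t1);
    cudaEventElapsedTime(&ms_fused, t1, t2);
    printf("%d launches: %.3f ms, fused: %.3f ms\n", steps, ms_many, ms_fused);
    return ms_many / ms_fused;
}
```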
7. Practical Deployment, Application Porting, and Use Cases
The GH200’s unified programming model simplifies porting legacy and BLAS-centric applications: dynamic binary instrumentation tools such as SCILIB-Accel intercept CPU BLAS calls and offload directly to GPU cuBLAS, leveraging UMA and Device First-Use to minimize redundant migrations.
For AI, transformer-based architectures (e.g., DINO, SegFormer) see best-in-class single-chip performance gains (33–72% over A100), but older, convolutional workloads may still favor multi-GPU A100 systems. The architecture’s DMA, DSM, and TMA features extend programming flexibility, allowing for asynchronous copy and fine-grained memory layout optimizations.
Performance-transparent swapping frameworks (e.g., Pie) use the GH200 to extend effective memory for LLM inference by offloading key-value caches among CPU and GPU memory, maintaining throughput while reducing per-token latency and outperforming alternatives (e.g., vLLM, FlexGen) in both memory footprint and responsiveness. In matrix multiplication, INT8 matrix engines—combined with modular integer emulation techniques (e.g., Ozaki scheme II leveraging the Chinese Remainder Theorem)—achieve up to 1.4× (DGEMM) and 3× (SGEMM) speedups and >40–150% improvements in power efficiency, with unique scaling and accumulation strategies tailored to the GH200 architecture.
The GH200 Grace Hopper Superchip integrates advanced hardware and system features to maximize both raw computational density and workload portability. Its architectural design prioritizes low-latency inter-unit communication, flexible and unified memory addressing, and specialized acceleration (tensor, DPX, DMA/DSM) for large-scale AI and scientific applications, establishing a new reference point for tightly coupled heterogeneous computing in HPC and AI domains.