
NVIDIA GB10 Grace Blackwell Architecture

Updated 29 January 2026
  • NVIDIA GB10 (Grace Blackwell) is a unified CPU–GPU architecture featuring a dual-die design and fifth-generation tensor cores with low-precision support, targeting HPC and ML workloads.
  • The architecture integrates a high-bandwidth 192 GB HBM3e memory system with optimized L2 and TMEM subsystems to reduce latency and maximize throughput.
  • Empirical benchmarks demonstrate up to 1.56× mixed-precision throughput and 1.42× energy-efficiency improvements over previous Hopper-generation GPUs.

NVIDIA GB10, codenamed "Grace Blackwell," is a unified CPU-GPU architecture noted for its dual-die design, advanced low-precision tensor core support, high-bandwidth HBM3e memory system, and novel execution and memory subsystem optimizations. GB10 targets large-scale machine learning, high-performance computing (HPC), and exascale workloads by introducing significant microarchitectural improvements over previous NVIDIA Hopper (H100) generation hardware, as substantiated through empirical microbenchmarking and performance modeling (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025, Zhu et al., 22 Jan 2026).

1. Microarchitectural Overview

NVIDIA GB10 features a dual-chip (dual-die) design: two identical GPU “slices,” each integrating 74 SMs (148 SMs total, organized into 8 GPCs), are interconnected via NV-HBI within a unified address space spanning 192 GB of HBM3e, served by 8 memory stacks (4 per die). Each die contains four L2 partitions, and the design comprises 208 billion transistors. GB10 is fabricated on TSMC's 3 nm process, placing it at the forefront of density scaling and process-node efficiency (Jarmusch et al., 1 Dec 2025, Zhu et al., 22 Jan 2026).
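
The dual-die totals above follow directly from the per-die figures; a minimal arithmetic sketch (all constants taken from the text, no hardware modeling):

```python
# Deriving GB10 headline topology figures from the per-die numbers quoted above.
SMS_PER_DIE = 74
DIES = 2
HBM_STACKS_PER_DIE = 4
L2_PARTITIONS_PER_DIE = 4
HBM_TOTAL_GB = 192

total_sms = SMS_PER_DIE * DIES                      # 148 SMs across the package
total_stacks = HBM_STACKS_PER_DIE * DIES            # 8 HBM3e stacks
gb_per_stack = HBM_TOTAL_GB / total_stacks          # 24 GB per stack
total_l2_partitions = L2_PARTITIONS_PER_DIE * DIES  # 8 L2 partitions

print(total_sms, total_stacks, gb_per_stack, total_l2_partitions)
```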

Critical hardware components and their architectural context include:

  • 5th-Generation Tensor Cores: Support scalar and matrix operations in FP64, FP32, FP16, BF16, FP8 (E4M3), FP6 (E3M2/E2M3), FP4 (E2M1), INT8, and INT4. The tcgen05.mma instruction provides per-thread matrix-multiply-accumulate (MMA), decoupled from warp-level synchronization. Single-instruction latency for all supported tile sizes is approximately 11 cycles, compared to up to 128 cycles for prior Hopper wgmma, significantly increasing throughput for both dense and sparse GEMM patterns (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025).
  • Tensor Memory (TMEM): Each SM integrates 256 KB TMEM, implemented as a 2D array (512 columns × 128 lanes of 32-bit cells), providing 16 TB/s read and 8 TB/s write bandwidth per SM, additive to L1/shared memory performance.
  • Decompression Engine (DE): Hardware supports multiple codecs (LZ4, Snappy, Zstandard, GZIP, Cascaded, Bitcomp, ANS), achieving input throughput of 173 GB/s (uncompressed) and output throughput of 539 GB/s (<1.3 ms for 100 MB workloads) (Jarmusch et al., 1 Dec 2025).

A typical dataflow involves HBM3e global memory → L2 cache → L1/shared memory → SM, with accelerated transfers to and from TMEM (tcgen05.cp, tcgen05.ld/st). The DE allows on-the-fly tensor decompression with minimal latency overhead.
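
A back-of-envelope check makes the "minimal latency overhead" claim concrete. The sketch below is plain arithmetic on the quoted DE throughputs (173 GB/s compressed input, 539 GB/s decompressed output); the compression ratios are illustrative assumptions, not measured values:

```python
# Timing model for the Decompression Engine (DE): whichever side (compressed
# input or decompressed output) is slower bounds the transfer.
DE_INPUT_GBPS = 173.0    # compressed bytes consumed per second
DE_OUTPUT_GBPS = 539.0   # decompressed bytes produced per second

def de_time_ms(decompressed_mb: float, ratio: float) -> float:
    """Time to produce `decompressed_mb` MB of output at a given compression
    ratio (decompressed/compressed); the slower side dominates."""
    out_s = decompressed_mb / 1000.0 / DE_OUTPUT_GBPS
    in_s = (decompressed_mb / ratio) / 1000.0 / DE_INPUT_GBPS
    return max(out_s, in_s) * 1000.0

# A 100 MB tensor at an assumed 4x ratio: the output side dominates, and the
# result sits well under the <1.3 ms figure quoted in the text.
print(round(de_time_ms(100.0, 4.0), 3))
```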

2. Memory Hierarchy and Subsystem Innovations

The GB10 memory hierarchy is optimized to mitigate bottlenecks in deep learning and scientific applications:

  • Global Memory: 192 GB HBM3e, with measured peak bandwidth up to 8.2 TB/s read and 1.6 TB/s write in GDDR7 configurations, and ~301 GB/s raw LPDDR5X bandwidth in SoC variants (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025, Zhu et al., 22 Jan 2026).
  • L2 Cache: Unified 65 MB L2 for GPU, 24 MiB for SoC variant. L2 hit latency is measured at 358 cycles (light load) rising to 508 cycles under saturation (e.g., 45 MB active footprint). L2 behavior dominates streaming-attention and tile-based matrix workloads (Jarmusch et al., 14 Jul 2025, Zhu et al., 22 Jan 2026).
  • L1/Texture and Shared Memory: 128 KB per SM (GPU), up to 192 KiB per SM (SoC variant), with hit latency ≈30–40 cycles. GB10 makes the L1/shared-memory split configurable via CUDA attributes, with a tunable default shared-memory carve-out (Jarmusch et al., 14 Jul 2025).
  • Tensor Memory (TMEM): TMEM misses complete in 420 cycles versus ≈1000 cycles for global memory, a 58% latency reduction; TMEM hits, arbitrated through the L1 path, complete in ≈50 cycles (Jarmusch et al., 1 Dec 2025).
  • Impact on Algorithm Design: The enlarged TMEM and L2 raise the optimal matrix tile size from 32×32 (Hopper) to 64×64 (GB10), maximizing TMEM utilization and reducing global-memory write traffic; attention-style kernels can then sustain up to 12 TB/s of per-SM TMEM traffic (Jarmusch et al., 1 Dec 2025, Zhu et al., 22 Jan 2026).
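
The shift from 32×32 to 64×64 tiles can be sanity-checked against the 256 KB TMEM budget. The sketch below assumes FP8 operand tiles, FP32 accumulators, and double buffering; these byte counts are illustrative assumptions, not a documented layout:

```python
# Does a double-buffered 64x64 tile set fit in GB10's 256 KB per-SM TMEM?
TMEM_BYTES = 256 * 1024  # per-SM TMEM from the text (512 cols x 128 lanes x 4 B)

def tile_footprint(tile: int, acc_bytes: int = 4, op_bytes: int = 1,
                   buffers: int = 2) -> int:
    """Bytes for `buffers` copies of one FP32 accumulator tile plus two FP8
    operand tiles of shape tile x tile."""
    acc = tile * tile * acc_bytes
    ops = 2 * tile * tile * op_bytes
    return buffers * (acc + ops)

print(tile_footprint(32))                  # Hopper-era tile: small fraction of TMEM
print(tile_footprint(64))                  # GB10 tile: still fits comfortably
print(tile_footprint(64) <= TMEM_BYTES)
```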

3. Tensor Core Throughput and Mixed-Precision Capabilities

GB10’s 5th-generation tensor cores drive unprecedented mixed-precision throughput and energy efficiency:

Precision | Theoretical Peak Throughput (TFLOPS) | Relative to H200 (where supported)
FP4       | 7702.5                               | N/A (new on Blackwell)
FP6       | 5134.8                               | N/A (new on Blackwell)
FP8       | 3851.4                               | 1.27×
FP16      | 1929.2                               | 1.27×
BF16      | 1926.8                               | 1.27×
FP32      | 481.2                                | 1.27×
FP64      | 44.8                                 | —
  • Mixed-precision throughput gains: 1.56× over H200 (Hopper) (Jarmusch et al., 1 Dec 2025).
  • Energy efficiency gain: 42% (η_B200 / η_H200 = 1.42) (Jarmusch et al., 1 Dec 2025).
  • Microbenchmark results: FP4/FP6/FP8 microbenchmark throughput exceeds 11 TFLOP/s per SM; FP4 and FP6 provide ≈20% lower power and ≈15% higher throughput vs. FP8 for inference-oriented kernels (Jarmusch et al., 14 Jul 2025).
  • Latency: Single-instruction latency of 11 cycles for m64×64×16 FP16 operations versus 32 cycles (Hopper wgmma) (Jarmusch et al., 1 Dec 2025).
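
The latency figures above determine how much instruction-level parallelism a kernel must expose. A Little's-law sketch (concurrency = latency × throughput); the one-MMA-issued-per-cycle rate is an illustrative assumption, not a documented GB10 figure:

```python
import math

def mmas_in_flight(latency_cycles: int, issue_per_cycle: float = 1.0) -> int:
    """Independent MMAs needed in flight to keep the tensor core busy
    (Little's law), rounded up."""
    return math.ceil(latency_cycles * issue_per_cycle)

print(mmas_in_flight(11))    # GB10 tcgen05.mma (~11-cycle latency)
print(mmas_in_flight(32))    # Hopper wgmma, m64x64x16 case from the text
print(mmas_in_flight(128))   # Hopper wgmma, worst case from the text
```

Lower MMA latency thus shrinks the in-flight work a warp scheduler must find, which is consistent with the smoother ILP scaling noted later in this article.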

FP4 and FP6 are natively supported in hardware via QMMA or OMMA kernel mappings and are recommended for inference where modest losses in model accuracy or perplexity are acceptable. The architecture supports flexible PTX decomposition, decoupling data movement and computation at fine granularity.
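
To illustrate the accuracy trade-off behind this guidance, the sketch below snaps values onto the standard FP4 (E2M1) magnitude grid with round-to-nearest; the per-tensor scale factor is an illustrative assumption, not GB10's quantization pipeline:

```python
# FP4 (E2M1) has 1 sign bit, 2 exponent bits, 1 mantissa bit; its
# representable magnitudes are the eight values below.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * m for m in E2M1 for s in (1.0, -1.0)})

def quantize_fp4(x: float, scale: float = 1.0) -> float:
    """Scale into E2M1 range, snap to the nearest representable value, rescale."""
    v = x / scale
    q = min(GRID, key=lambda g: abs(g - v))
    return q * scale

print(quantize_fp4(0.7))            # snaps to 0.5
print(quantize_fp4(5.2))            # snaps to 6.0
print(quantize_fp4(5.2, scale=2))   # 2.6 -> 3.0, rescaled to 6.0
```

With only 15 distinct values, the rounding error visible here is the source of the perplexity drift that makes FP4 inference-only in practice.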

4. Empirical Workload Evaluations

Microbenchmark and end-to-end workload studies across GB10 yield the following key findings (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025, Zhu et al., 22 Jan 2026):

Workload                                  | B200 (GB10)  | H200 (Hopper) | Speedup
Dense GEMM (FP64, ≥16k²)                  | 36.3 TFLOPS  | 18.9 TFLOPS   | 1.92×
STREAM Triad (Mem BW, >64 GB)             | 7.48 TB/s    | 4.38 TB/s     | 1.71×
Sparse GEMV + DE                          | 5.04 GFLOPS  | 3.20 GFLOPS   | 1.58×
Transformer Inference (FP8, B=32, S=2048) | 78,400 tok/s | 49,200 tok/s  | 1.59×
Mixed-Precision Training (GPT-1.3B)       | 14,397 tok/s | 9,240 tok/s   | 1.56×
Energy Efficiency (GPT-1.3B)              | 22.2 tok/s/W | 15.6 tok/s/W  | 1.42×

Transformer models (e.g., Mistral-7B, Mixtral-8×7B) realize 1.52–1.59× lower inference latency at batch sizes of 1 and 32. B200's FP4 support yields up to 2.5× inference speedup at the cost of a perplexity increase of +8.2% relative to FP8 under extreme quantization (Jarmusch et al., 1 Dec 2025). A plausible implication is that, in deployment scenarios where some model-quality loss is tolerable, FP4 can be applied preferentially in non-critical layers to maximize speed.

5. Software, Programming, and Kernel Optimization

GB10 exposes new PTX instructions for TMEM (tcgen05.cp, tcgen05.ld/st), enabling asynchronous global↔TMEM copies and flexible tile decomposition, as well as DE primitives for data decompression. Tile sizes optimized at 64×64 (FP8) or T=80 (CuTile) maximize TMEM and shared memory use (Jarmusch et al., 1 Dec 2025, Zhu et al., 22 Jan 2026).

For attention kernels and GEMM:

  • Tile-based Decomposition: Optimal for both dense and sparse matrix multiplication. Developers are advised to double-buffer TMEM tiles and aggressively overlap data movement with computation to approach 95–99% of peak throughput.
  • DE Kernel Integration: Enables asynchronous decompression of large weight tensors during compute, with >85% pipeline efficiency for appropriate chunk sizes (Jarmusch et al., 1 Dec 2025).
  • Precision Selection: Use FP8 in critical paths (e.g., QKV projection in transformers), apply FP4 otherwise, while limiting FP4 deployment to layers with acceptable perplexity drift (ΔPPL < 3%) (Jarmusch et al., 1 Dec 2025).
  • Warp Scheduler Tuning: Expose at least 9 independent operations per thread and use ~32 warps per block to saturate tensor core throughput (Jarmusch et al., 14 Jul 2025).

Compiler recommendations include scheduling for steady ILP, generating mma.sync instructions for FP4/FP6, and avoiding shared-memory bank conflicts (stride ≥ 4) that can induce high latency (Jarmusch et al., 14 Jul 2025).
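
The bank-conflict concern can be made concrete with a small model. The sketch below assumes the usual NVIDIA layout of 32 banks of 4-byte words (an assumption here, not a GB10-specific specification) and computes the worst-case conflict degree for a warp:

```python
from collections import Counter

BANKS = 32  # assumed: 32 banks of 4-byte words, one word per bank per cycle

def conflict_degree(stride_words: int, warp: int = 32) -> int:
    """Max lanes hitting one bank when lane i accesses word i * stride_words;
    a degree of k means the access is serialized into k transactions."""
    hits = Counter((lane * stride_words) % BANKS for lane in range(warp))
    return max(hits.values())

print(conflict_degree(1))    # unit stride: conflict-free
print(conflict_degree(2))    # 2-way conflict
print(conflict_degree(32))   # all lanes on one bank: fully serialized
print(conflict_degree(33))   # odd stride avoids conflicts entirely
```

Padding a tile's leading dimension to an odd word count is the classic fix this model suggests.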

6. Memory Access Optimization: Attention Kernels and Sawtooth Wavefront

High-performance attention (e.g., FlashAttention) kernels on GB10 are dominated by L2 cache behavior due to the large KV working set characteristic of LLMs (Zhu et al., 22 Jan 2026). Standard cyclic KV tile access patterns result in high reuse distances and L2 conflict/capacity misses when working sets exceed L2 capacity (e.g., S ≈ 80k, KV size ≈ 20 MiB for D=64), sharply degrading throughput.

Sawtooth Wavefront Reordering, an access-scheduling technique that alternates forward and backward KV tile iteration across cooperative thread arrays (CTAs), halves the mean reuse distance. Empirically, this pattern reduces L2 misses by 50–67% and boosts throughput by up to 60% on GB10 for both CUDA and CuTile implementations.
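
The halving of reuse distance can be reproduced in a toy model. Below, each "pass" stands for one query block scanning all KV tiles; cyclic ordering always scans forward, sawtooth alternates direction. This models the access pattern only, in a single serialized stream, not GB10's actual L2 or the multi-CTA interleaving:

```python
def mean_reuse_distance(accesses):
    """Mean number of distinct tiles touched between consecutive accesses
    to the same tile (stack reuse distance)."""
    last_pos, dists = {}, []
    for pos, tile in enumerate(accesses):
        if tile in last_pos:
            dists.append(len(set(accesses[last_pos[tile] + 1 : pos])))
        last_pos[tile] = pos
    return sum(dists) / len(dists)

def trace(n_tiles, n_passes, sawtooth):
    """KV tile access stream: forward every pass (cyclic), or alternating
    forward/backward (sawtooth)."""
    order = []
    for p in range(n_passes):
        tiles = range(n_tiles)
        if sawtooth and p % 2 == 1:
            tiles = reversed(tiles)
        order.extend(tiles)
    return order

N = 64
print(mean_reuse_distance(trace(N, 8, sawtooth=False)))  # cyclic: ~N-1
print(mean_reuse_distance(trace(N, 8, sawtooth=True)))   # sawtooth: roughly half
```

In the sawtooth stream the tile touched last in one pass is touched first in the next, so reuse distances range from 0 to N-1 instead of sitting at N-1 for every access.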

Table: GB10 FlashAttention Sawtooth Optimization Experimental Results (S=32k, D=64)

Configuration      | Cyclic L2 Misses (×10⁶) | Sawtooth L2 Misses (×10⁶) | Reduction | Orig. TFLOPS | Opt. TFLOPS | Speedup
CUDA, B=1          | 108.0                   | 54.0                      | 50%       | 1.3          | 2.4         | 1.85×
CuTile, non-causal | 370                     | 120                       | 67%       | 61           | 69          | 1.13×
CuTile, causal     | 380                     | 150                       | 60%       | 41           | 66          | 1.61×

The benefit is maximized when sequence length is large and working set size exceeds L2 capacity. Slight divergence overhead (~5%) may occur when working set fits L2. The sawtooth ordering approach is architecture-targeted, generalizing to other tiled kernels and potentially to multi-GPU/NUMA schemes (Zhu et al., 22 Jan 2026).

7. Comparative Analysis and Architectural Significance

Relative to prior Hopper architecture (H100/H200), Grace Blackwell (GB10) demonstrates:

  • Substantially increased mixed-precision throughput (up to 1.56×) and energy efficiency (up to 1.42×).
  • Lower memory miss latency due to enlarged TMEM and L2; optimal tile sizes shift upward, and kernel-level data movement patterns must be retuned for maximum performance (Jarmusch et al., 1 Dec 2025).
  • A fundamental trade-off: reduced L1 capacity and external GDDR7 bandwidth in some SKUs, but larger L2 cache, higher TMEM bandwidth, and support for new ultra-low-precision datatypes (FP4, FP6) (Jarmusch et al., 14 Jul 2025).
  • Microbenchmark evidence that GB10’s more conservative and consistent warp scheduling, larger register file, and aggressive support for low-precision datatypes translate to smoother throughput scaling as ILP increases (Jarmusch et al., 14 Jul 2025).
  • System-level innovations (coherent GPU/CPU SoC, unified memory, improved cooperative thread array control) support a range of workload patterns not efficiently handled on previous architectures (Zhu et al., 22 Jan 2026).

A plausible implication is that future algorithmic and software optimizations for GB10-class hardware will need to operate at larger tile-and-batch granularities, leverage dynamic precision selection, and exploit advanced memory scheduling.

