
NVIDIA Grace Blackwell (GB10)

Updated 30 March 2026
  • NVIDIA Grace Blackwell (GB10) is a GPU-accelerated system combining advanced Grace CPU cores with Blackwell GPU architecture for deep learning and scientific computing.
  • It features a unified 24 MiB L2 cache, optimized INT8/INT4 Tensor Cores, and expanded 80 GB HBM3 memory per GPU to mitigate memory-bandwidth bottlenecks.
  • Innovations like Sawtooth Wavefront Reordering and fixed-point emulation achieve significant throughput gains and L2 miss reductions in mixed-precision workloads.

NVIDIA Grace Blackwell (GB10) refers to a generation of GPU-accelerated systems combining advanced Grace CPU cores with Blackwell GPU architecture. Key innovations in GB10 include large unified L2 cache, highly optimized INT8/INT4 Tensor Cores, expanded HBM3 memory subsystem (80 GB/GPU), and NVLink-4 interconnect. These features target workloads in large-scale deep learning (e.g., FlashAttention in LLMs), high-performance tensor network contractions, and mixed-precision quantum chemistry. The architectural advances of GB10 facilitate substantial increases in throughput for attention mechanisms and enable end-to-end mixed-precision computational workflows for scientific and engineering tasks.

1. Memory Architecture and L2 Reuse in GB10

GB10 implements a unified L2 cache of 24 MiB shared across 48 Streaming Multiprocessors (SMs), with critical implications for memory-bound workloads. The cache is organized in 256 B lines and 32 B sectors, providing a working-set window for high-throughput access from parallel CTAs (Cooperative Thread Arrays). Each CTA can process a single tile of dimension T × d in float16, occupying T·d·2 bytes. The tile-to-L2 mapping, along with the streaming nature of Q/K/V/O tensor access patterns, is particularly sensitive to cache size and access order. On GB10, the L1 cache is effectively bypassed for Q/K/V/O streaming, with L2 servicing nearly all requests: Nsight metrics report L1Tex hits consistently below 0.03% of L2 sector traffic for typical sequence lengths (S = 32K to 128K), confirming that L2 is the relevant memory bottleneck for large-scale attention operations (Zhu et al., 22 Jan 2026).
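As a rough illustration, the working-set arithmetic above can be sketched in Python (constants taken from the text: 24 MiB unified L2, 2-byte float16 elements, d = 64; the helper names are ours):

```python
# Back-of-envelope L2 working-set arithmetic for GB10 attention tiles.
# Assumptions from the text: 24 MiB unified L2, float16 (2 B) elements, d = 64.
L2_BYTES = 24 * 1024 * 1024
BYTES_FP16 = 2

def tile_bytes(T, d=64):
    """Footprint of one T x d float16 tile: T * d * 2 bytes."""
    return T * d * BYTES_FP16

def kv_tile_pairs_resident(T, d=64):
    """Upper bound on how many (K, V) tile pairs fit in L2 simultaneously,
    ignoring Q/O traffic and cache-indexing conflicts."""
    return L2_BYTES // (2 * tile_bytes(T, d))

if __name__ == "__main__":
    for T in (64, 128, 256):
        print(f"T={T}: tile={tile_bytes(T)} B, "
              f"resident K/V pairs <= {kv_tile_pairs_resident(T)}")
```

This is an upper bound only; in practice Q/O streaming and conflict misses shrink the effective window well below this figure.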

2. CuTile FlashAttention Kernel Design and Analysis

The GB10 architecture supports high-performance split-Q, fused multi-head attention kernels, both in CUDA and via the high-level CuTile abstraction. Each CTA handles a "Q" tile, loading it into shared memory, and then streams through all N_KV K and V tiles. For each K/V tile, the kernel loads the data, computes attention scores S_ij via WMMA, updates online-softmax statistics, performs the value-weighted accumulation P_ij·V_j via WMMA, and accumulates into the output O_i. At the completion of processing over all K/V, O_i is written back to global memory. When the working set (S/T)·M_tile exceeds the L2 cache capacity, repeated cold misses dominate and constrain attainable throughput. The precise L2 access model for non-causal masking is analytically described as M(S, T) = 8S(1 + S/T) for the float16, d = 64, E = 2, C = 32 configuration, with empirical results matching within 0.5–2.5% MAPE (Zhu et al., 22 Jan 2026).

A key empirical result is that L2 "cold-miss" behavior (16S sectors for the 4 tensors, each accessed once) persists up to a threshold (S ≈ 80K), above which measured L2 misses diverge due to cache-capacity and thrashing effects.

3. Sawtooth Wavefront Reordering: Reducing L2 Misses

Sawtooth Wavefront Reordering is a loop-schedule transformation introduced for GB10 to mitigate capacity and thrashing misses in attention-kernel memory behavior. The conventional cyclic K/V scanning pattern results in reuse distances as large as N_KV tiles, which can significantly exceed L2 capacity and cause severe thrashing. Sawtooth Wavefront Reordering alternates the scan direction between ascending and descending on each Q-tile iteration, folding the reuse window for most accesses from N_KV down to N_KV/2. This reduction translates analytically to M_saw(S, T) ≈ 8S(1 + S/(2T)), offering an estimated halving of L2 miss counts in regimes where S/T ≫ 1.
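The two schedules' miss models can be compared directly; a minimal sketch (function names are ours, constants as given in the text):

```python
# The two L2-sector models from the text (float16, d = 64 assumed):
#   cyclic:   M(S, T)      = 8*S*(1 + S/T)
#   sawtooth: M_saw(S, T) ~= 8*S*(1 + S/(2*T))
def m_cyclic(S, T):
    return 8 * S * (1 + S / T)

def m_sawtooth(S, T):
    return 8 * S * (1 + S / (2 * T))

def quadratic_term_reduction(S, T):
    """Fraction of the capacity (quadratic) term removed by sawtooth: 0.5."""
    return 1 - (m_sawtooth(S, T) - 8 * S) / (m_cyclic(S, T) - 8 * S)
```

In the S/T ≫ 1 regime the linear term is negligible, so total predicted misses drop by close to the full 50%.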

The approach is illustrated as follows: for Q index i_local, alternate between

  • j = 0 … N_KV−1 ("ascending") for even i_local
  • j = N_KV−1 … 0 ("descending") for odd i_local

This reordering is applied transparently in both the CUDA and CuTile kernels. Its primary effect is to halve the reuse distance, making it far more likely that the relevant K/V blocks are still cache-resident when reused by other CTAs progressing in synchronized wavefront fashion.
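A minimal sketch of the visit order (pure Python; the generator name is ours, not from the paper):

```python
# Sawtooth K/V visit order: ascending scan for even local Q-tile indices,
# descending for odd ones, folding the worst-case reuse distance from
# N_KV tiles down to roughly N_KV / 2.
def sawtooth_order(n_q_tiles, n_kv_tiles):
    """Yield (i_local, j) pairs in sawtooth schedule order."""
    for i_local in range(n_q_tiles):
        if i_local % 2 == 0:
            js = range(n_kv_tiles)              # ascending
        else:
            js = range(n_kv_tiles - 1, -1, -1)  # descending
        for j in js:
            yield i_local, j
```

For two Q tiles and three K/V tiles this yields (0,0), (0,1), (0,2), (1,2), (1,1), (1,0): the last K/V tile touched for one Q tile is the first one reused for the next.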

4. Quantitative Performance and Benchmarking

The impact of Sawtooth Wavefront Reordering is evident across multiple experimental regimes:

  • For CUDA-only kernels, L2 non-compulsory misses are reduced by approximately 50% across batch sizes of 1, 2, 4, and 8, with throughput boosted from 1.3 TFLOPS to 2.4 TFLOPS (+85%).
  • For high-level CuTile implementations (batch = 8, S = 128K, d = 64, T = 64), the static cyclic miss count (~370M sectors) is reduced to ~120M (−67%) under the sawtooth schedule.
    • Throughput rises from 61 to 69 TFLOPS (+13%) without causal masking.
    • With causal masking, throughput increases from 41 to 66 TFLOPS (+60%).
  • Scalability studies show similar 50–67% miss reductions and 10–60% throughput gains for S ∈ [32K, 128K] and SM counts in [8, 48].

A summary of quantitative improvements is shown below:

Kernel/Config        L2 Miss Reduction   Throughput Gain
CUDA, batch = 1–8    ~50%                +85% (1.3 → 2.4 TFLOPS)
CuTile, no masking   −67%                +13% (61 → 69 TFLOPS)
CuTile, masking      –                   +60% (41 → 66 TFLOPS)

These gains exploit both the shared L2 size and the wavefront scheduling of GB10, which allows CTAs to reuse cache-resident K/V blocks synchronously (Zhu et al., 22 Jan 2026).

5. Fixed-Point Emulation and Mixed-Precision Tensor Algebra

GB10’s advanced INT8/INT4 Tensor Cores enable high-throughput, mixed-precision algebraic approaches, prominently the Ozaki fixed-point emulation scheme for FP64. In this framework, each IEEE FP64 number A is decomposed into

A = Σ_{s=1}^{S} A^(s) · 2^(e_A − 8(s−1))

where the A^(s) are INT8 slices, e_A is a shared exponent, and S is determined by the desired mantissa fidelity (S = 7 for full FP64; S = 4 or 6 suffices for chemical accuracy). Tensor contractions, Davidson/Lanczos solvers, and SVD steps in DMRG workflows are mapped to GB10's INT8 Tensor Cores via this emulation, with per-slice reduction and accumulation in INT32, final scaling, and casting to FP64 (Brower et al., 6 Oct 2025).
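A scalar sketch of the decomposition and an emulated multiply (pure Python; the helper names are ours, the sign is folded into 8-bit magnitude slices rather than packed signed INT8, and this is illustrative only, not the cuBLAS-X code path):

```python
import math

def slice_fp64(x, S=7):
    """Decompose x into S 8-bit magnitude slices a[s] (sign folded in) with a
    shared exponent e, so that x ~= sum_s a[s] * 2**(e - 8*s) for s = 0..S-1,
    mirroring A = sum_{s=1..S} A^(s) * 2**(e_A - 8*(s-1)) from the text."""
    if x == 0.0:
        return [0] * S, 0
    sgn = 1 if x > 0 else -1
    e = math.floor(math.log2(abs(x))) - 7      # align the top slice to 8 bits
    r, slices = abs(x), []
    for s in range(S):
        scale = 2.0 ** (e - 8 * s)             # power of two: exact in FP64
        a = int(r / scale)                     # next 8 bits of the mantissa
        slices.append(sgn * a)
        r -= a * scale
    return slices, e

def emulated_mul(x, y, S=4):
    """Toy scalar product via slice cross-terms: exact integer accumulation
    (standing in for INT32), then one final scaling and cast back to FP64."""
    ax, ex = slice_fp64(x, S)
    ay, ey = slice_fp64(y, S)
    top = 8 * (2 * S - 2)                      # common shift for exact int sums
    acc = sum((xi * yj) << (top - 8 * (i + j))
              for i, xi in enumerate(ax) for j, yj in enumerate(ay))
    return acc * 2.0 ** (ex + ey - top)        # final scaling to FP64
```

With S = 7 the 56 slice bits cover the full 52-bit FP64 mantissa, so reconstruction is essentially exact; S = 4 leaves roughly 2⁻³¹ relative error per operand, in line with the chemical-accuracy setting described above.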

Practical kernel strategies include:

  • Tiling for 128×128 or 256×256 blocks aligned with Tensor Core warp sizes.
  • The cuBLAS-X API enabling FP64 emulation, with controls for the slice count or an automatic PERFORMANT mode (which adapts slicing to compute-bound subproblems).
  • Overriding all main DGEMM calls with INT8-emulated kernels in DMRG workflows.

6. Application to Quantum Chemistry and Best-Practice Guidelines

Mixed-precision DMRG on GB10 achieves chemical accuracy on challenging benchmarks including FeMoco [CAS(113,76)] and CYP [CAS(63,58)], with wall-clock times and arithmetic throughput in line with or surpassing native FP64. Key results include:

  • CYP: native FP64, 3.2 h (DGX B200, 6×GB10); emulated S = 4, 3.0 h (−6%); S = 6, 3.1 h (−3%)
  • FeMoco: native, 7.8 h; S = 4, 7.5 h (−4%); S = 6, 7.7 h (−1%)
  • Peak DGEMM arithmetic throughput: 38 TFLOPS (H100), 42 TFLOPS (GB10 native), 43 TFLOPS (S = 4 emulated), 41 TFLOPS (S = 6 emulated)
  • Accuracy: S = 6 slicing delivers ΔE ≤ 1×10⁻⁶ Ha for all bond dimensions, saturating the variance; S = 4 achieves ΔE ≤ 1×10⁻⁵ Ha, stable for practical chemical applications.

Best-practice guidelines (Brower et al., 6 Oct 2025):

  • Use S = 4 or 6 slices for tensor contractions and Krylov solvers.
  • For SVD and density-matrix diagonalization, prefer CPU FP64 for S ≤ 3 and GPU emulation for S ≥ 4.
  • Set residual thresholds ε = 10⁻⁵ for S ≥ 3; relax to 10⁻⁴ at S = 2.
  • Use dynamic block-state selection (DBSS) for the bond dimension.
  • Leverage GB10's 1.44 TB aggregate HBM3 memory for large active spaces; slice packing reduces the memory footprint by >20%.
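The residual-threshold guideline above can be captured in a trivial helper (a hypothetical function of ours, not from the paper's code):

```python
# Residual tolerance per the slice-count guideline: eps = 1e-5 for S >= 3,
# relaxed to 1e-4 at S = 2 (hypothetical helper encoding the stated rule).
def residual_threshold(S):
    """Davidson/Lanczos residual tolerance for slice count S."""
    return 1e-5 if S >= 3 else 1e-4
```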

Scaling improves for larger active spaces and higher bond dimensions, although at very large bond dimension (D ≫ 10⁴) SVD and inter-GPU communication reduce the speedup from fixed-point emulation. Anticipated optimizations include INT16 Tensor Cores for SVD, hierarchical slicing (e.g., S = 4 for DGEMM, S = 7 for correction), asynchronous exponent/slice operations, and AI-driven auto-tuning of ε and S during sweeps.

7. Architectural Implications and Limitations

The effectiveness of cache and tensor-contraction optimizations on GB10 is fundamentally tied to its unified L2 cache and highly synchronized SM wavefront scheduling. Cache-centric reordering such as Sawtooth Wavefront Reordering leverages this determinism for near-optimal data reuse, achieving significant reductions in L2 misses and corresponding throughput gains entirely through software-level scheduling, with no hardware modification required (Zhu et al., 22 Jan 2026). However, should tile sizes increase or architectures shift toward larger or NUMA-distributed L2, cache-reuse patterns and the benefits of current loop-reordering schemes may change. Greater L1 prefetching or less predictable SM scheduling in future GPU generations would likewise require these strategies to be adapted.

For tensor-network workloads, the Ozaki emulation's gain is bounded by the proportion of the workload dominated by DGEMM; as memory or communication bottlenecks grow with problem scale, the overall speedup is limited accordingly. The energy error plateaus at the sum of the block-state truncation and fixed-point rounding errors; achieving micro-Hartree precision on intractably large active spaces could require additional hardware support or a fallback to full FP64 in precision-sensitive routines (Brower et al., 6 Oct 2025).

References

  • "Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10" (Zhu et al., 22 Jan 2026)
  • "Mixed-precision ab initio tensor network state methods adapted for NVIDIA Blackwell technology via emulated FP64 arithmetic" (Brower et al., 6 Oct 2025)
