NVIDIA Grace Blackwell (GB10)
- NVIDIA Grace Blackwell (GB10) is a GPU-accelerated system combining advanced Grace CPU cores with Blackwell GPU architecture for deep learning and scientific computing.
- It features a unified 24 MiB L2 cache, optimized INT8/INT4 Tensor Cores, and expanded 80 GB HBM3 memory per GPU to mitigate bottlenecks.
- Innovations such as Sawtooth Wavefront Reordering and Ozaki-style fixed-point emulation deliver substantial L2-miss reductions and throughput gains in attention and mixed-precision tensor workloads.
NVIDIA Grace Blackwell (GB10) refers to a generation of GPU-accelerated systems combining advanced Grace CPU cores with Blackwell GPU architecture. Key innovations in GB10 include a large unified L2 cache, highly optimized INT8/INT4 Tensor Cores, an expanded HBM3 memory subsystem (80 GB per GPU), and the NVLink-4 interconnect. These features target workloads in large-scale deep learning (e.g., FlashAttention in LLMs), high-performance tensor network contractions, and mixed-precision quantum chemistry. The architectural advances of GB10 facilitate substantial increases in throughput for attention mechanisms and enable end-to-end mixed-precision computational workflows for scientific and engineering tasks.
1. Memory Architecture and L2 Reuse in GB10
GB10 implements a unified L2 cache of 24 MiB shared across 48 Streaming Multiprocessors (SMs), with critical implications for memory-bound workloads. The cache is organized in 256 B lines and 32 B sectors, providing a bounded working-set window for high-throughput access from parallel CTAs (Cooperative Thread Arrays). Each CTA processes a single float16 Q tile, and the tile-to-L2 mapping, together with the streaming nature of the Q/K/V/O tensor access patterns, is particularly sensitive to cache size and access order. On GB10, the L1 cache is effectively bypassed for Q/K/V/O streaming, with L2 servicing nearly all requests: Nsight metrics report L1Tex hits that remain a negligible fraction of L2 sector traffic across typical sequence lengths, confirming that L2 is the relevant memory bottleneck for large-scale attention operations (Zhu et al., 22 Jan 2026).
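As a rough illustration of why the L2 working set matters, the following sketch compares the per-head Q/K/V/O footprint against the 24 MiB L2 for a few sequence lengths; the head dimension of 128 and the specific lengths are assumptions for illustration, not figures from the paper.

```python
# Minimal sketch: estimate the Q/K/V/O working set of one attention head
# against GB10's 24 MiB unified L2 (256 B lines, 32 B sectors), assuming
# float16 storage. Head dimension and sequence lengths are illustrative.

L2_BYTES = 24 * 1024 * 1024   # unified L2 capacity
SECTOR = 32                    # bytes per sector
BYTES_FP16 = 2

def working_set_bytes(seq_len: int, head_dim: int) -> int:
    """Q, K, V, O are each (seq_len x head_dim) float16 tensors streamed per head."""
    return 4 * seq_len * head_dim * BYTES_FP16

for s in (1024, 4096, 16384, 65536):
    ws = working_set_bytes(s, head_dim=128)   # head_dim=128 is an assumption
    sectors = ws // SECTOR
    print(f"S={s:6d}: working set = {ws / 2**20:8.2f} MiB "
          f"({sectors} sectors), fits in L2: {ws <= L2_BYTES}")
```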
2. CuTile FlashAttention Kernel Design and Analysis
The GB10 architecture supports high-performance split-Q, fused multi-head attention kernels, both in CUDA and via the high-level CuTile abstraction. Each CTA handles one Q tile, loading it into shared memory, and then streams through all K and V tiles. For each K/V tile, the kernel loads the data, computes attention scores via WMMA, updates the online softmax statistics, performs the value-weighted accumulation via WMMA, and accumulates into the output tile O. After all K/V tiles have been processed, O is written back to global memory. When the working set exceeds the L2 cache capacity, repeated cold misses dominate and constrain attainable throughput. For non-causal masking, the L2 access volume is described by an analytical model for float16 configurations, with measured L2 traffic closely matching the model (Zhu et al., 22 Jan 2026).
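The split-Q streaming structure just described can be sketched in a few lines. The following is a schematic NumPy rendering of the per-CTA loop with online softmax, not the CUDA or CuTile kernels themselves; the tile size is an arbitrary choice.

```python
# Schematic sketch of the split-Q streaming structure: one "CTA" owns a Q tile
# and streams over all K/V tiles, maintaining online softmax statistics. The
# real kernels use WMMA/tensor-core fragments and shared memory, not NumPy.

import numpy as np

def attention_q_tile(q_tile, K, V, tile_n=128):
    """q_tile: (Bq, d); K, V: (S, d). Returns the (Bq, d) output for this Q tile."""
    Bq, d = q_tile.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(Bq, -np.inf)            # running row max
    l = np.zeros(Bq)                    # running softmax denominator
    acc = np.zeros((Bq, d))             # running value-weighted accumulator

    for start in range(0, K.shape[0], tile_n):     # stream K/V tiles
        k = K[start:start + tile_n]
        v = V[start:start + tile_n]
        s = (q_tile @ k.T) * scale                 # attention scores (first WMMA step)
        m_new = np.maximum(m, s.max(axis=1))
        correction = np.exp(m - m_new)             # rescale previous partial results
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=1)
        acc = acc * correction[:, None] + p @ v    # value accumulation (second WMMA step)
        m = m_new

    return acc / l[:, None]                        # final normalization, then write-back
```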
A key empirical result is that pure L2 cold-miss behavior ($16S$ sectors for the four tensors, each accessed once) persists only up to a sequence-length threshold, above which measured L2 misses diverge due to capacity and thrashing effects.
3. Sawtooth Wavefront Reordering: Reducing L2 Misses
Sawtooth Wavefront Reordering is a loop-schedule transformation introduced for GB10 to mitigate capacity and thrashing misses in the attention kernel's memory behavior. The conventional cyclic K/V scanning pattern produces reuse distances on the order of the full K/V tile range, which can significantly exceed L2 capacity and cause severe thrashing. Sawtooth Wavefront Reordering alternates the scan direction between ascending and descending on successive Q tile iterations, roughly halving the reuse window for most accesses. Analytically, this translates to an estimated halving of L2 miss counts in the regime where the K/V working set exceeds L2 capacity.
The approach is illustrated as follows: for Q tile index $i$, the K/V tile index $j$ is scanned
- in ascending order ($j = 0, 1, \dots$) when $i$ is even
- in descending order ($j = \dots, 1, 0$) when $i$ is odd
This reordering is applied transparently in both CUDA and CuTile kernels. The primary effect is to halve reuse distance, thereby doubling the chance that relevant K/V blocks remain in cache when reused by other CTAs progressing in synchronized wavefront fashion.
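The effect can be reproduced with a toy model. The sketch below is an illustration, not the paper's kernels or measurements: it applies the alternating scan order to a simple LRU tile cache standing in for the shared L2, with arbitrary tile counts, and shows the miss count dropping substantially (roughly halving for the chosen sizes) relative to cyclic scanning.

```python
# Toy illustration of why sawtooth ordering shrinks reuse distances: consecutive
# Q-tile passes visit K/V tiles in alternating directions, so the tiles touched
# last by one pass are touched first by the next. The LRU cache is a crude
# stand-in for the shared L2; all sizes are arbitrary.

from collections import OrderedDict

def kv_order(q_idx: int, n_kv: int, sawtooth: bool):
    if sawtooth and q_idx % 2 == 1:
        return reversed(range(n_kv))          # descending pass for odd Q tiles
    return range(n_kv)                        # ascending pass otherwise

def simulate_misses(n_q: int, n_kv: int, cache_tiles: int, sawtooth: bool) -> int:
    cache, misses = OrderedDict(), 0
    for qi in range(n_q):                     # Q tiles processed in wavefront order
        for kj in kv_order(qi, n_kv, sawtooth):
            if kj in cache:
                cache.move_to_end(kj)         # LRU hit: refresh recency
            else:
                misses += 1
                cache[kj] = True
                if len(cache) > cache_tiles:
                    cache.popitem(last=False) # evict least recently used tile
    return misses

for mode in (False, True):
    label = "sawtooth" if mode else "cyclic  "
    print(label, simulate_misses(n_q=64, n_kv=256, cache_tiles=128, sawtooth=mode))
```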
4. Quantitative Performance and Benchmarking
The impact of Sawtooth Wavefront Reordering is evident across multiple experimental regimes:
- For CUDA-only kernels, L2 non-compulsory misses are reduced by approximately $50\%$ across batch sizes of $1, 2, 4, 8$, with throughput boosted from $1.3$ TFLOPS to $2.4$ TFLOPS ($+85\%$).
- For high-level CuTile implementations, the static cyclic schedule's miss count (370M sectors) is reduced to 120M ($-67\%$) under the sawtooth schedule.
- Throughput rises from $61$ to $69$ TFLOPS ($+13\%$) without causal masking.
- With causal masking, throughput increases from $41$ to $66$ TFLOPS ($+60\%$).
- Scalability studies show similar miss reductions (on the order of $50\%$) and throughput gains (on the order of $10\%$ or more) across a range of sequence lengths and SM counts.
A summary of quantitative improvements is shown below:
| Kernel/Config | L2 Miss Reduction | Throughput Gain |
|---|---|---|
| CUDA, batch=1–8 | ~50% | +85% (1.3→2.4 TFLOPS) |
| CuTile, no masking | 67% | +13% (61→69 TFLOPS) |
| CuTile, masking | – | +60% (41→66 TFLOPS) |
These gains exploit both the shared L2 size and the wavefront scheduling of GB10, which allows CTAs to reuse cache-resident K/V blocks synchronously (Zhu et al., 22 Jan 2026).
5. Fixed-Point Emulation and Mixed-Precision Tensor Algebra
GB10’s advanced INT8/INT4 Tensor Cores enable high-throughput, mixed-precision algebraic approaches, prominently the Ozaki fixed-point emulation scheme for FP64. In this framework, each IEEE FP64 number $x$ is decomposed into a shared-exponent sum of integer slices,
$$x \approx 2^{e} \sum_{i=0}^{k-1} s_i \, 2^{-b\,i},$$
where the $s_i$ are INT8 slices, $e$ is a shared exponent, $b$ is the per-slice bit width, and the slice count $k$ is determined by the desired mantissa fidelity (larger $k$ for full FP64, while $6$ slices suffice for chemical accuracy). Tensor contractions, Davidson/Lanczos solvers, and SVD steps in DMRG workflows are mapped to GB10’s INT8 tensor cores via this emulation, with per-slice reduction and accumulation in INT32, final scaling, and casting to FP64 (Brower et al., 6 Oct 2025).
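As a concrete, deliberately simplified illustration of the slicing idea, the following NumPy sketch decomposes FP64 matrices into signed 8-bit slices under a shared power-of-two scale and rebuilds a GEMM from integer slice products. The 7-bit effective slice width, the choice of $k=7$, and the int64 accumulators are assumptions of this sketch, not details from the paper; the production path uses GB10 INT8 tensor cores with INT32 accumulation.

```python
# Minimal sketch of Ozaki-style fixed-point emulation of an FP64 GEMM:
# slice each operand into k int8 slices under a shared power-of-two scale,
# multiply slice pairs in integer arithmetic, rescale, and sum in FP64.

import numpy as np

def slice_int8(a: np.ndarray, k: int):
    """Decompose an FP64 matrix into k int8 slices plus a shared power-of-two scale."""
    scale = 2.0 ** (np.ceil(np.log2(np.abs(a).max())) + 1)  # shared exponent; |a/scale| <= 0.5
    r = a / scale
    slices = []
    for _ in range(k):
        s = np.rint(r * 128.0)            # next 7 bits of mantissa, always within int8 range
        slices.append(s.astype(np.int8))
        r = r * 128.0 - s                 # residual carried to the next slice
    return slices, scale

def gemm_emulated(a: np.ndarray, b: np.ndarray, k: int = 7) -> np.ndarray:
    """Emulated FP64 GEMM built from int8 slice products with integer accumulation."""
    sa, scale_a = slice_int8(a, k)
    sb, scale_b = slice_int8(b, k)
    acc = np.zeros((a.shape[0], b.shape[1]))
    for i, ai in enumerate(sa):
        for j, bj in enumerate(sb):
            # int64 stands in for the INT32 tensor-core accumulators of the real kernels
            prod = ai.astype(np.int64) @ bj.astype(np.int64)
            acc += prod * (128.0 ** -(i + j + 2))
    return acc * scale_a * scale_b

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
err = np.abs(gemm_emulated(A, B, k=7) - A @ B).max()
print(f"max abs error vs native FP64 GEMM: {err:.2e}")
```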
Practical kernel strategies include:
- Tiling into blocks aligned with Tensor Core warp-level matrix sizes.
- cuBLAS-X API enabling FP64 emulation, with controls for slice count or automatic PERFORMANT mode (adapts slicing to compute-bound subproblems).
- Overriding all main DGEMM calls with INT8-emulated kernels in DMRG workflows.
6. Application to Quantum Chemistry and Best-Practice Guidelines
Mixed-precision DMRG on GB10 achieves chemical accuracy on challenging benchmarks including FeMoco [CAS(113,76)] and CYP [CAS(63,58)], with wall-clock times and arithmetic throughput in line with or surpassing native FP64. Key results include:
- CYP: native FP64, 3.2 h (DGX B200, 6×GB10); the emulated runs complete in 3.0 h and 3.1 h at the two tested slice counts.
- FeMoco: native FP64, 7.8 h; the emulated runs complete in 7.5 h and 7.7 h.
- Arithmetic DGEMM throughput peaks at $38$ TFLOPS (H100), $42$ TFLOPS (GB10 native FP64), and $43$ and $41$ TFLOPS for the two emulated slice settings.
- Accuracy: the higher slice count matches native FP64 energies for all bond dimensions, with the residual error saturated by block-state truncation; the lower slice count remains stable and sufficiently accurate for practical chemical applications.
Best-practice guidelines (Brower et al., 6 Oct 2025):
- Use approximately $6$ INT8 slices for tensor contractions and Krylov solvers.
- For SVD and density-matrix diagonalization, prefer CPU FP64 below a problem-size threshold and GPU-emulated kernels above it.
- Tune solver residual thresholds with the bond dimension, relaxing them at larger bond dimensions.
- Use dynamic block-state selection (DBSS) to adapt the bond dimension.
- Leverage GB10's $1.44$ TB aggregate HBM3 memory for large active spaces; slice packing further reduces the memory footprint.
Scaling improves for larger active spaces and higher bond dimensions, though at very large bond dimension SVD and inter-GPU communication reduce the speedup obtained from fixed-point emulation. Anticipated optimizations include INT16 Tensor Cores for SVD, hierarchical slicing (a coarser slice count for the bulk DGEMM plus a finer correction pass), asynchronous exponent/slice operations, and AI-driven auto-tuning of the slice count during sweeps.
7. Architectural Implications and Limitations
The effectiveness of cache and tensor contraction optimizations on GB10 is fundamentally tied to its unified L2 cache and highly synchronized SM wavefront scheduling. Cache-centric reordering such as Sawtooth Wavefront Reordering leverages this determinism for near-optimal data reuse, achieving significant reductions in L2 misses and corresponding throughput gains purely through software-level scheduling, with no hardware modification (Zhu et al., 22 Jan 2026). However, should tile sizes increase or architectures shift towards larger or NUMA-distributed L2, cache reuse patterns and the benefits of current loop-reordering schemes may change. Greater L1 prefetching or less predictable SM scheduling in future GPU generations will likely require adaptation of these strategies.
For tensor network workloads, the Ozaki emulation’s gain is bounded by the proportion of workload dominated by DGEMM; as memory or communication bottlenecks increase with problem scale, overall speedup is limited. Energy error plateaus at the sum of block-state truncation and fixed-point rounding; achieving micro-Hartree precision on intractably large active spaces could require additional hardware support or a fallback to full FP64 in precision-sensitive routines (Brower et al., 6 Oct 2025).
References
- "Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10" (Zhu et al., 22 Jan 2026)
- "Mixed-precision ab initio tensor network state methods adapted for NVIDIA Blackwell technology via emulated FP64 arithmetic" (Brower et al., 6 Oct 2025)