
NVIDIA Blackwell Tensor Cores

Updated 14 June 2026
  • Blackwell Tensor Cores are NVIDIA's fifth-generation MMA units supporting ultra-low-precision formats (FP4, FP6) to boost throughput in AI and HPC.
  • They integrate dedicated Tensor Memory and a hardware Decompression Engine to significantly reduce memory bottlenecks and latency.
  • Innovative per-warp scheduling and advanced FP64 emulation techniques enable both high-fidelity scientific computations and energy-efficient performance.

NVIDIA Blackwell Tensor Cores are the fifth generation of hardware matrix-multiply–accumulate (MMA) units, integrated within the NVIDIA Blackwell GPU architecture. Built to address compute and memory bottlenecks in exascale AI and high-performance computing (HPC) workloads, they add native ultra-low-precision floating-point support (FP4, FP6), a significantly revised per-warp scheduling model, dedicated Tensor Memory (TMEM), and a hardware Decompression Engine (DE). Together these features raise throughput and energy efficiency on dense and sparse GEMM, transformer inference, and scientific workloads, including high-fidelity double-precision (FP64) emulation through integer-based and Ozaki-style algorithms.

1. Architectural Design and Microarchitecture

Blackwell’s 5th-generation Tensor Cores are a substantial microarchitectural overhaul relative to prior architectures such as Hopper. Each Streaming Multiprocessor (SM) contains four tensor-core sub-cores. Every sub-core executes one warp-level MMA per cycle, reading tiles from per-warp register files into a pipelined multiply–accumulate datapath and accumulating results in dedicated register files. The datapath supports a wide range of low-precision inputs (FP4, FP6, FP8, FP16/BF16, and INT8), with accumulation in FP16 or FP32 depending on the precision mode.

A core innovation in Blackwell is the transition from Hopper’s 128-thread warp-group execution model ("wgmma") to a 32-thread warp-level operation, which eliminates cross-warp synchronization bottlenecks and enables finer-grained, per-warp tensor-core scheduling. The new tcgen05.mma instruction unifies these operations and encodes low-precision tile shapes such as m16n8k32. High-level instructions map to distinct SASS micro-ops (DMMA, HMMA, QMMA, OMMA, IMMA), giving broad coverage across the precision spectrum (Jarmusch et al., 1 Dec 2025, Jarmusch et al., 14 Jul 2025).

2. Supported Numeric Formats and Throughput

Blackwell Tensor Cores introduce native support for ultra-low-precision data types. The table below lists sustained throughput on B200, with the fraction of theoretical peak in parentheses, alongside H200 figures for comparison:

Precision   Throughput (B200, TFLOPS)   Throughput (H200, TFLOPS)   Speedup
FP64        44.8 (99.6% peak)           34.0                        1.32×
FP32        481.2 (96.2%)               378.4                       1.27×
FP16        1929.2 (96.5%)              1515.2                      1.27×
FP8         3851.4 (96.3%)              3026.9                      1.27×
FP6         5134.8 (95.8%)              n/a                         New
FP4         7702.5 (96.3%)              n/a                         New
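
The ratios in the table can be recomputed directly from the throughput columns. The short Python sketch below does so; the dictionary and function names are illustrative and not part of any NVIDIA tool, and the numbers are simply copied from the table.

```python
# Sustained throughput in TFLOPS, copied from the table above.
B200 = {"FP64": 44.8, "FP32": 481.2, "FP16": 1929.2, "FP8": 3851.4,
        "FP6": 5134.8, "FP4": 7702.5}
H200 = {"FP64": 34.0, "FP32": 378.4, "FP16": 1515.2, "FP8": 3026.9}


def generation_speedup(precision):
    """B200-over-H200 ratio for a precision that both GPUs support."""
    return B200[precision] / H200[precision]


def relative_to(precision, baseline):
    """Ratio between two B200 rates, e.g. FP4 relative to FP8."""
    return B200[precision] / B200[baseline]


for p in H200:
    print(f"{p}: {generation_speedup(p):.2f}x over H200")
for p in ("FP6", "FP4"):
    print(f"{p}: {relative_to(p, 'FP8'):.2f}x the B200 FP8 rate")
```

Under these figures the generational gain is fairly uniform across FP32, FP16, and FP8, while the new formats scale with their bit width relative to FP8.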

Support for FP4 (e2m1) and FP6 (e3m2/e2m3) is new in this generation. In the table above, FP4 throughput is about twice that of FP8, and FP6 throughput is about 1.33× that of FP8. INT8 is also supported, reflecting its continued utility in AI and quantized inference. FP8 GEMM supports only one input transposed, and throughput is highest when accumulating in FP16 rather than FP32. Emulation methods and algorithmic techniques now exploit these low-precision pathways to achieve high-speed scientific computations that would traditionally require FP64 units (Jarmusch et al., 1 Dec 2025, Mukunoki, 1 Aug 2025).
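
To make the e2m1, e3m2, and e2m3 labels concrete, the following Python sketch enumerates the values such formats can represent. It assumes the conventions of the OCP Microscaling formats (exponent biases of 1, 3, and 1 respectively, subnormals, and no Inf/NaN encodings); the text above names only the bit splits, so these details and all function names are assumptions for illustration.

```python
def decode_minifloat(code, exp_bits, man_bits, bias):
    """Decode a bit pattern laid out as sign | exponent | mantissa.

    Assumes there are no Inf/NaN encodings, so every pattern is finite.
    """
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exponent = (code >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = code & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:  # subnormal: no implicit leading 1
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)


# (exp_bits, man_bits, bias); the bias values are assumptions (see lead-in).
FORMATS = {"FP4 e2m1": (2, 1, 1), "FP6 e3m2": (3, 2, 3), "FP6 e2m3": (2, 3, 1)}


def positive_values(name):
    """All values representable with the sign bit clear, in ascending order."""
    exp_bits, man_bits, bias = FORMATS[name]
    count = 1 << (exp_bits + man_bits)
    return sorted({decode_minifloat(c, exp_bits, man_bits, bias)
                   for c in range(count)})


for name in FORMATS:
    vals = positive_values(name)
    print(f"{name}: {len(vals)} non-negative values, max {vals[-1]}")
print(positive_values("FP4 e2m1"))
```

The tiny value sets illustrate why these formats are normally paired with scaling factors and reserved for workloads whose dynamic range allows it.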

3. Memory Subsystems: TMEM and Decompression Engine

Blackwell introduces two dedicated hardware subsystems for feeding Tensor Cores:

  • Tensor Memory (TMEM): A 256 KiB per-SM, highly parallel buffer with 16 TB/s read and 8 TB/s write bandwidth. TMEM is accessed through the tcgen05.cp and tcgen05.ld/st instructions (not through cp.async), which allows rapid, double-buffered, direct tile staging for Tensor Core instructions. TMEM cuts cache-miss latency to 420 cycles, compared with ~1000 cycles for a global-memory access on H200 (a 58% reduction), and delivers up to 8 TB/s for FP8 tile transfers. This enables new algorithmic decompositions, such as 64×64 tiling, to saturate throughput (Jarmusch et al., 1 Dec 2025).
  • Decompression Engine (DE): A hardware block that sustains up to ~462 GB/s of decompressed output for formats such as LZ4, Zstandard, and ANS. The DE operates at sub-millisecond latency and supports decompression of highly compressed weight formats in DNN inference and training. It is limited by output bandwidth; typical optimal settings are 128–256 KiB chunk sizes and 4–8 concurrent streams (Jarmusch et al., 1 Dec 2025).
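
The capacity and latency figures above lend themselves to a quick back-of-the-envelope check. The Python sketch below estimates how many double-buffered 64×64 tiles fit in TMEM and recomputes the quoted latency reduction. It assumes elements are densely bit-packed and that the whole 256 KiB is available for tile staging, neither of which is stated in the text; the helper names are hypothetical.

```python
TMEM_BYTES = 256 * 1024  # per-SM TMEM capacity quoted above
BITS_PER_ELEMENT = {"FP16": 16, "FP8": 8, "FP6": 6, "FP4": 4}


def tile_bytes(rows, cols, precision):
    """Bytes for one tile, assuming densely bit-packed elements."""
    return rows * cols * BITS_PER_ELEMENT[precision] // 8


def resident_tiles(rows, cols, precision, buffers=2):
    """Tile slots that fit when each slot is replicated `buffers` times."""
    return TMEM_BYTES // (buffers * tile_bytes(rows, cols, precision))


def latency_reduction(new_cycles, old_cycles):
    """Fractional reduction in latency when going from old to new."""
    return 1.0 - new_cycles / old_cycles


for precision in BITS_PER_ELEMENT:
    size = tile_bytes(64, 64, precision)
    slots = resident_tiles(64, 64, precision)
    print(f"{precision}: {size} B per 64x64 tile, {slots} double-buffered slots")
print(f"latency reduction: {latency_reduction(420, 1000):.0%}")
```

Halving the element width doubles the number of resident tiles, which is one reason lower precisions leave more room for deeper staging pipelines.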

Together, these enhancements shift the optimal data-movement and scheduling strategies for AI and HPC workloads: they favor TMEM-staged, on-the-fly decompression and reduce reliance on shared memory and on schemes that amortize global-memory latency.

4. Algorithmic and FP64 Emulation Advances

Because of the large throughput gap between low-precision and FP64 paths in modern GPUs (in the B200 figures above, native FP64 reaches roughly 9% of FP32 throughput and about 1% of FP8 throughput), scientific computing algorithms have migrated towards FP64 emulation. The dominant technique is the Ozaki decomposition (and its extensions), which splits each FP64 operand or matrix into several low-precision slices (typically INT8 or FP8/BF16), computes all pairwise slice products independently on Tensor Cores, and reconstructs the FP64 result from them. Integer-based emulation further boosts performance on hardware lacking efficient FP64 units.
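
The slice-multiply-reconstruct idea can be illustrated on a plain dot product. The Python sketch below is a simplified, Ozaki-style integer scheme and not the algorithm of any cited paper: it uses one shared exponent per vector and sign-magnitude 7-bit digits so that each digit fits a signed INT8, and it does not implement unsigned-slice encoding, ESC, or ADP. All names are illustrative.

```python
import math


def to_slices(values, num_slices=8, bits=7):
    """Split floats into `num_slices` signed digits of `bits` bits each.

    All elements share one exponent (taken from the largest magnitude), so
    smaller elements lose low-order bits to truncation. Returns
    (slices, exponent); slices[i][k] is digit i of element k, most
    significant digit first.
    """
    total_bits = num_slices * bits
    max_abs = max(abs(v) for v in values)
    exponent = math.frexp(max_abs)[1] if max_abs else 0  # max_abs < 2**exponent
    mask = (1 << bits) - 1
    slices = [[0] * len(values) for _ in range(num_slices)]
    for k, v in enumerate(values):
        q = int(math.ldexp(abs(v), total_bits - exponent))  # truncates toward 0
        sign = -1 if v < 0 else 1
        for i in range(num_slices):
            shift = bits * (num_slices - 1 - i)
            slices[i][k] = sign * ((q >> shift) & mask)
    return slices, exponent


def emulated_dot(a, b, num_slices=8, bits=7):
    """Dot product rebuilt from exact integer dot products of all slice pairs."""
    sa, ea = to_slices(a, num_slices, bits)
    sb, eb = to_slices(b, num_slices, bits)
    total = 0
    for i, ai in enumerate(sa):
        for j, bj in enumerate(sb):
            # This integer dot product is the part a Tensor Core would run.
            partial = sum(x * y for x, y in zip(ai, bj))
            total += partial << (bits * (2 * (num_slices - 1) - i - j))
    return math.ldexp(float(total), ea + eb - 2 * num_slices * bits)


a = [0.1, -2.5, 3.75, 1e-3]
b = [4.0, 0.3, -1.25, 7.0]
reference = math.fsum(x * y for x, y in zip(a, b))
print(emulated_dot(a, b), reference)
```

With 8 slices of 7 bits the digits cover 56 bits, enough for a 53-bit FP64 mantissa; fewer slices trade accuracy for fewer pairwise products.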

Recent research introduces Exponent Span Capacity (ESC) to determine padding and slice count for robust emulation, and unsigned-slice encoding to maximize representational efficiency. The Automatic Dynamic Precision (ADP) framework adds device-side heuristics, Inf/NaN detection, and cost analyses, yielding speedups of up to 2.3× on Blackwell GB200 and 13.2× on RTX Pro 6000 Server for emulated DGEMM over native FP64, while guaranteeing BLAS-grade FP64 accuracy (Schwarz et al., 16 Nov 2025, Mukunoki, 1 Aug 2025). Slice minimization (e.g., 7 slices for a 53-bit mantissa with unsigned encoding) lowers compute and memory pressure.

For applications demanding high-accuracy linear algebra, DMRG, and quantum chemistry, this approach achieves chemical-accuracy results on practical problem sizes while saturating Tensor Core throughput (Brower et al., 6 Oct 2025). The variational nature of certain algorithms (e.g., DMRG) makes them especially suitable for adaptive mixed-precision execution.

5. Mixed-Precision Training, Inference, and Scientific Workloads

Blackwell Tensor Cores enable flexible precision choice at the hardware level:

  • Training: Mixed-precision (e.g., FP16 input, FP16 accumulation) maximizes throughput. Empirical studies show a 1.56× end-to-end throughput improvement and 1.42× energy efficiency gain versus H200 in transformer (GPT-1.3B) and image (ResNet-50) training (Jarmusch et al., 1 Dec 2025).
  • Inference: Using lower-precision (FP8, FP6, FP4) formats allows 1.73×–2.5× speedup at only minor model-perplexity degradation (≤+9%), suggesting these formats are suitable for large model deployment where power and latency are critical.

For FP32 emulation, Blackwell’s BFloat16 ("BF16×9") scheme represents each operand with three BF16 digits and combines per-lane scaling with integrated tcgen05.mma instruction features to achieve GEMM performance 2–3× higher than native FP32, at lower power consumption and with equivalent or better accuracy, including in the presence of subnormals, NaNs, and Infs (Bayraktar et al., 15 May 2026). This supports high-performance numerical linear algebra (SGEMM) and scientific applications such as CCSD, DMRG, and spectral transforms essential in physics and quantum chemistry.
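
The "three BF16 digits" idea can be shown on a single scalar product. The Python sketch below splits an FP32 value into three BF16 terms and rebuilds a product from the nine pairwise term products, which is the usual reading of the "×9" in the name. It is a minimal illustration under stated assumptions: it accumulates in Python's double precision rather than FP32, and it ignores per-lane scaling, NaN/Inf handling, and the actual tcgen05.mma mechanics. Function names are hypothetical.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to BF16 by round-to-nearest-even on the top 16 bits of FP32.

    Ignores NaN/Inf and overflow, which this sketch does not exercise.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_bf16x3(x):
    """Split an FP32 value into three BF16 terms hi + mid + lo."""
    hi = to_bf16(x)
    mid = to_bf16(x - hi)
    lo = to_bf16(x - hi - mid)
    return hi, mid, lo


def mul_bf16x9(x, y):
    """Product assembled from the nine pairwise BF16 term products."""
    xs, ys = split_bf16x3(x), split_bf16x3(y)
    return sum(a * b for a in xs for b in ys)


x, y = to_fp32(3.14159265), to_fp32(-2.71828183)
print("exact product   :", x * y)
print("BF16x9 emulation:", mul_bf16x9(x, y))
print("single BF16 MMA :", to_bf16(x) * to_bf16(y))
```

Three 8-bit significands together cover the 24-bit FP32 significand, which is why the nine-product result tracks the FP32 product while a single BF16 product does not.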

6. Developer Optimization Strategies

Efficiently utilizing Blackwell Tensor Cores requires explicit attention to tile sizes, thread scheduling, and memory/data transfer orchestration:

  • Tile size: Use 64×64 for TMEM, 16×8×32 or 8×8×16 for instruction-level optimization.
  • ILP and warp count: Target ILP≈6 per thread, ~32 warps per SM to saturate pipelines; for latency-sensitive kernels, tune ILP downward if fewer warps are present (Jarmusch et al., 14 Jul 2025).
  • Memory staging: Preload matrix tiles into TMEM using tcgen05.cp instructions with double-buffering for maximum pipeline utilization.
  • Precision strategy: Favor lowest feasible precision for inference; use ADP or ESC heuristics to auto-select slice/precision configuration for emulated high-precision workloads.
  • Energy efficiency: Select FP4/FP6 where dynamic range suffices (FP4: ~16.7 W, FP6: ~39 W per GB203). Compile mixed-precision MMAs with explicit PTX suffixes for optimal power–performance ratios.
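
The tile-size and occupancy guidance above can be turned into simple counts. The Python sketch below computes how many MMA instructions of a given shape cover one 64×64 output tile, how many such tiles a matrix needs, and how many independent instruction streams the ILP and warp targets imply per SM. It assumes MMA shapes are written as (m, n, k) and that the ILP target multiplies directly with the warp count; the helper names are illustrative.

```python
def mma_count(tile_m, tile_n, depth_k, shape=(16, 8, 32)):
    """MMA instructions needed for one tile_m x tile_n output tile over depth_k."""
    m, n, k = shape
    if tile_m % m or tile_n % n or depth_k % k:
        raise ValueError("tile dimensions must be multiples of the MMA shape")
    return (tile_m // m) * (tile_n // n) * (depth_k // k)


def output_tiles(rows, cols, tile=64):
    """Number of tile x tile output tiles covering a rows x cols matrix."""
    return -(-rows // tile) * -(-cols // tile)  # ceiling division per axis


def streams_in_flight(warps_per_sm=32, ilp=6):
    """Independent instruction streams per SM under the quoted targets."""
    return warps_per_sm * ilp


print(mma_count(64, 64, 64))                   # m16n8k32 instructions per tile
print(mma_count(64, 64, 64, shape=(8, 8, 16)))
print(output_tiles(4096, 4096))
print(streams_in_flight())
```

Counts like these are only a starting point for tuning; measured throughput on the target kernel should decide the final tile shape and ILP.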

Automatic, hardware-agnostic selection frameworks such as ADP enable high-fidelity emulation with minimal run-time overhead and remove the need for user calibration or host synchronization, which makes the high-throughput pathways accessible to contemporary scientific codes (Schwarz et al., 16 Nov 2025).

7. Outlook and Impact

The architectural innovations of Blackwell Tensor Cores (per-warp scheduling, ultra-low-precision support, fast TMEM, and hardware decompression) shift the landscape in both AI and scientific computing. As future GPUs further de-prioritize native FP64 and push sub-8-bit FP performance, software methods like Ozaki decomposition, adaptive precision emulation, and dynamic slice selection will become integral to bridging AI-specific and HPC workloads.

Research trends point to expanded use in complex tensor networks, mixed-precision eigensolvers, and robust dense linear algebra libraries, with integration and fallback logic ensuring both performance and numerical reliability. The Blackwell approach demonstrates that exascale-class low-precision accelerators, judiciously combined with provably-accurate emulation and dynamic hardware abstraction, establish a foundation for both AI and high-fidelity traditional scientific workloads (Jarmusch et al., 1 Dec 2025, Schwarz et al., 16 Nov 2025, Mukunoki, 1 Aug 2025).


Key References: (Jarmusch et al., 1 Dec 2025, Mukunoki, 1 Aug 2025, Brower et al., 6 Oct 2025, Jarmusch et al., 14 Jul 2025, Schwarz et al., 16 Nov 2025, Bayraktar et al., 15 May 2026)
