Fifth-Gen Tensor Cores Overview

Updated 12 December 2025
  • Fifth-generation tensor cores are specialized GPU units that accelerate low- and mixed-precision matrix arithmetic through architecture-specific fused multiply-add and fused dot-product algorithms.
  • They support a broad range of data types, including FP64, FP32, TF32, FP16, BF16, and FP8, each mapped to a distinct hardware execution path to serve deep learning and scientific computing.
  • Architectural enhancements such as wider internal accumulators, extra guard bits, and single-step fused dot-product accumulation improve numerical reproducibility and stability across varied workloads.

Fifth-generation tensor cores denote the class of matrix-multiplication accelerators introduced with NVIDIA’s Hopper architecture (H100) and Blackwell architecture (B200), serving as the principal computational units for low- and mixed-precision linear algebra in modern GPUs. These cores implement non-IEEE 754–compliant, architecture-specific algorithms for fused multiply-add (FMA) and dot-product computation in a variety of floating-point formats, providing critical throughput and efficiency gains for both deep learning and scientific computing tasks. The fifth generation advances arithmetic fidelity, extends supported data types, and introduces nuanced accumulator and rounding semantics, resulting in measurable consequences for numerical reproducibility, algorithm design, and stability across workloads (Khattak et al., 7 Dec 2025, Xie et al., 14 Nov 2025).

1. Supported Data Types and Instruction Variants

Fifth-generation tensor cores support multiple data representations, each mapped to distinct hardware execution paths and matrix tile sizes. On Hopper (H100), the following configurations are realized:

  • FP64 (double precision): DMMA instructions (e.g., DMMA.16x8x16) providing IEEE-754 compliant sequential-FMA with a 52-bit mantissa and round-to-nearest-even.
  • FP32 (single precision): HMMA instructions (e.g., HMMA.16816.F32, HMMA.1688.F32) for standard and TensorFloat32 formats, leveraging 25-bit internal accumulators.
  • TF32 (TensorFloat-32): HMMA/HGMMA variants offering larger block-FMA tiles with 25-bit internal precision.
  • FP16 & bfloat16 (bf16): HMMA instructions for half-precision and brain float, with tiling at N=16 and wide internal accumulation.
  • FP8: QGMMA paths in Hopper (E4M3 and E5M2) with 13-bit internal precision, available for both fp32 and fp16 output targets.
  • Warp-group-level HGMMA instructions: Tiling larger matrix subregions for higher aggregate throughput, retaining the same arithmetic model.

The B200 (Blackwell) broadens input support to all of the low-precision CUDA floating-point types: FP8-E4M3, FP8-E5M2, IEEE binary16 (fp16), bfloat16 (bf16), and TensorFloat32 (tf19); FP64 is not supported. For tf19 the block-FMA size is four; for fp8, two interleaved 16-term partial sums are used to compute a 32-term inner product, as sketched below (Khattak et al., 7 Dec 2025).
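
As a rough sketch of this decomposition, a 32-term inner product can be expressed as two 16-term partial sums that are combined at the end. The even/odd interleaving pattern and the use of plain Python floats in place of the fixed-point FP8 datapath are assumptions made purely for illustration:

```python
def fp8_inner_product_32(a, b):
    """Sketch of the B200 FP8 decomposition described above: a 32-term
    inner product evaluated as two interleaved 16-term partial sums.
    The even/odd split is an assumed interleaving; the real hardware
    evaluates each partial sum in its fixed-point FP8 path."""
    assert len(a) == len(b) == 32
    partial_even = sum(a[i] * b[i] for i in range(0, 32, 2))  # terms 0, 2, 4, ...
    partial_odd  = sum(a[i] * b[i] for i in range(1, 32, 2))  # terms 1, 3, 5, ...
    return partial_even + partial_odd
```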

2. Internal Accumulator Width and Alignment

Internal accumulator width in fifth-generation tensor cores is format-dependent and dictates the attainable arithmetic fidelity in large dot products.

  • Hopper (H100): FP32/TF32/FP16/BF16 modes utilize a fused-dot-add (FDA) algorithm with a 25-bit fractional accumulator, an increment over previous 24-bit designs. For FP8 QGMMA instructions, the accumulator is restricted to 13 bits.
  • Blackwell (B200): For fp16/bf16 inputs and fp32 output, alignment occurs in 27 bits (2 integer + 23 fraction + 2 guard bits); summing 16 such terms implies a 32-bit total accumulator width. Block-FMA sizes and accumulator widths for the different input formats are listed below:
  Input Format   Accumulator Width   Block-FMA Size
  fp8            32 bits (2 × 16)    16 (×2)
  fp16/bf16      32 bits             16
  tf19           30 bits             4

In all generations, overflow saturates the result to ±Inf if the running sum exceeds the internal exponent maximum (~2³⁸ for 27-bit alignment on B200) at the final normalization step. Subnormal (denormal) handling is active throughout; underflow to zero occurs only if the unrounded exponent falls below Emin − (p−1) (Khattak et al., 7 Dec 2025, Xie et al., 14 Nov 2025).
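
A minimal sketch of these final-normalization rules, assuming a binary32 output (p = 24, Emin = −126) and using 38 as a stand-in for the internal exponent ceiling of the 27-bit alignment path; this is an illustrative model, not a bit-exact reproduction of the hardware:

```python
from math import copysign, frexp, inf

P, E_MIN, E_INT_MAX = 24, -126, 38   # binary32 parameters; E_INT_MAX mimics the ~2^38 ceiling

def finalize(value):
    """Illustrative model of the overflow/underflow checks described above
    (precision loss inside the subnormal range is not modeled)."""
    if value == 0.0:
        return 0.0
    _, e = frexp(value)          # value = m * 2**e with 0.5 <= |m| < 1
    e -= 1                       # IEEE-style exponent with 1 <= |m| < 2
    if e > E_INT_MAX:
        return copysign(inf, value)      # running sum exceeded the internal max: saturate
    if e >= E_MIN:
        return value                     # normal result
    if e >= E_MIN - (P - 1):
        return value                     # subnormal range: kept, not flushed to zero
    return copysign(0.0, value)          # below Emin - (p-1): underflow to zero
```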

3. Rounding Modes, Normalization, and Extra Guard Bits

Fifth-generation tensor cores implement mixed rounding and normalization strategies:

  • fp32 Output: All formats employ truncation (round-toward-zero) at output reduction.
  • fp16 Output: Output is rounded-to-nearest-even (RNE).
  • Guard Bits: Hopper and Blackwell use two extra guard bits (n_g = 2) in low-precision alignments. These improve worst-case error bounds by lowering the per-term alignment error from 2^(-p) (the classic machine epsilon) to 2^(-p-n_g). On B200, all bits shifted beyond the 27-bit-plus-guard alignment width are truncated rather than accumulated into a sticky bit; sticky bits are not implemented.
  • Normalization: If the exponent after accumulation satisfies E ≥ Emin, the result is normalized; if Emin − (p−1) ≤ E < Emin, a subnormal is produced and preserved rather than flushed; below Emin − (p−1), the output is zero.

These choices depart from IEEE 754 in both rounding and sticky-bit semantics, particularly affecting mixed-precision error modeling (Khattak et al., 7 Dec 2025, Xie et al., 14 Nov 2025).
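
A minimal sketch of the alignment-and-truncation step under these rules, with the significand width p and guard-bit count n_g as parameters (no sticky bit is modeled, matching the behavior described above):

```python
def align_and_truncate(term, e_max, p=24, n_g=2):
    """Shift a term onto the fixed-point grid defined by the common exponent
    e_max and discard everything beyond p + n_g fractional bits."""
    grid = 2.0 ** (e_max - (p + n_g))    # weight of the last retained bit
    return int(term / grid) * grid       # int() truncates toward zero (no sticky bit)

# Two guard bits shrink the worst-case per-term alignment error from
# 2**(e_max - p) to 2**(e_max - p - n_g), i.e. by a factor of four.
x = 1.0 + 2.0 ** -25                     # tail sits just below the p-bit grid
print(align_and_truncate(x, e_max=0, p=24, n_g=0))   # 1.0: tail lost
print(align_and_truncate(x, e_max=0, p=24, n_g=2))   # 1.0000000298...: tail kept
```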

4. Arithmetic Algorithms and Microarchitectural Innovations

Core arithmetic operations in fifth-generation tensor cores are executed using hardware-encoded, format-specific algorithms:

  • Sequential-FMA (SFMA): Used for FP64 DMMA, each dot-product term progresses via classic FMA, fully normalized and rounded after each operation.
  • Fused-Dot-Add (FDA): For all other formats:

    1. Each product is computed exactly in fixed-point.
    2. All products and the accumulator are aligned to the maximum exponent.
    3. Each aligned value is truncated (not rounded) to F fractional bits (F=25 for TF32/FP32/FP16/BF16; F=13 for FP8).
    4. The fixed-point sum is accumulated.
    5. The sum is normalized and reduced to output format using final rounding.

A key microarchitectural shift in Hopper is the unification of dot-products into one FDA step, eliminating earlier chain-of-FDA requirements for large tiles and improving latency. The increase in internal width (F=25) halves the worst-case accumulation error for TF32/FP16/BF16/FP32 relative to Ampere architectures (F=24). In contrast, FP8 remains limited to F=13 and thus can be a source of large absolute error in LLM workloads (Xie et al., 14 Nov 2025).
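
The five FDA steps can be sketched in Python as below. The model is deliberately simplified: guard-bit bookkeeping, overflow handling, and output-format rounding are omitted, and products are exact only because Python floats are binary64, which holds the exact product of any two fp16/bf16/fp8 values:

```python
from math import frexp

def fda_dot(a, b, c, f_bits=25):
    """Illustrative fused-dot-add: exact products, alignment to the largest
    exponent, truncation to f_bits fractional bits, fixed-point summation,
    then conversion back to a float (f_bits = 25 for TF32/FP32/FP16/BF16,
    13 for FP8, per the text above)."""
    terms = [x * y for x, y in zip(a, b)] + [c]        # step 1: exact products
    nonzero = [t for t in terms if t != 0.0]
    if not nonzero:
        return 0.0
    e_max = max(frexp(t)[1] for t in nonzero)          # step 2: common (maximum) exponent
    scale = 2 ** f_bits
    acc = 0
    for t in terms:
        acc += int(t / 2.0 ** e_max * scale)           # step 3: truncate; step 4: fixed-point sum
    return acc / scale * 2.0 ** e_max                  # step 5: back to floating point
                                                       # (final output rounding omitted here)

# The 13-bit FP8-style path silently drops a small term that the 25-bit path keeps:
a, b = [1.0, 2.0 ** -20, -1.0], [1.0, 1.0, 1.0]
print(fda_dot(a, b, 0.0, f_bits=25))   # ~9.54e-07
print(fda_dot(a, b, 0.0, f_bits=13))   # 0.0
```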

5. Validation Methodologies

Bit-exact validation has been performed using MMA-Sim for Hopper and dedicated MATLAB models for Blackwell. Validation methodology includes:

  1. Generalized Numerical Feature Testing (GNFT): Format-parameterized vectors probe subnormal support, FMA tiling, and guard bits.
  2. Input Space Search Method (ISSM): High-dimensional numerical tests with 10^5 random (a, b, c) triples compare hardware and model outputs bit-for-bit; mismatches prompt iterative model refinement.
  3. Corner-case Testing: Comprehensive sweeps of NaN/infinity propagation, subnormal boundaries, and overflow, together with PTX-level instruction disassembly, validate correspondence to real hardware.

Bitwise agreement is achieved on all tested vectors, including pathological edge cases. These studies reveal undocumented behavior that can yield significant numerical errors, such as the pronounced truncation bias in fp32 output mode (Khattak et al., 7 Dec 2025, Xie et al., 14 Nov 2025).
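
The loop structure of an ISSM-style search can be sketched as follows. The names are placeholders, and the reference callable stands in for the hardware result that MMA-Sim and the MATLAB models are checked against in the cited studies; this toy harness only shows the shape of the comparison:

```python
import random
import struct

def to_fp32(x):
    """Round a Python float to binary32, mimicking an fp32 output register."""
    return struct.unpack("f", struct.pack("f", x))[0]

def issm_style_search(model, reference, trials=10 ** 5, n=16, seed=0):
    """Compare a candidate arithmetic model against a reference on random
    (a, b, c) triples and collect mismatches for iterative refinement."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        a = [to_fp32(rng.uniform(-2.0, 2.0)) for _ in range(n)]
        b = [to_fp32(rng.uniform(-2.0, 2.0)) for _ in range(n)]
        c = to_fp32(rng.uniform(-2.0, 2.0))
        if to_fp32(model(a, b, c)) != to_fp32(reference(a, b, c)):
            mismatches.append((a, b, c))
    return mismatches
```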

6. Consequences for Mixed-Precision Algorithms and Application Stability

The arithmetic and microarchitectural features of fifth-generation tensor cores directly impact algorithmic error bounds, application reproducibility, and stability:

  • Block-FMA Tiles and Accumulation: Single-round large tile accumulations reduce rounding error relative to sequential FMAs but can increase risk of catastrophic cancellation if intermediates become denormalized.
  • Guard Bits: The addition of two guard bits improves the effective machine epsilon and tightens backward and forward error bounds.
  • Rounding Behavior: Truncation in fp32 output induces a downward bias (approaching one ulp), which must be explicitly modeled in forward and backward error analysis, especially in iterative refinement and mixed-precision solvers.
  • FP8 Accumulation Limitation: Use of FP8 with only 13 accumulator bits can result in error up to 2^(-12) per term; this is sufficient to cause instability in certain DNN and LLM training scenarios. Accumulating FP8 inputs in fp16 or bf16 substantially mitigates this error at a throughput cost.
  • Denormal Handling: Full subnormal support ensures that small partial sums contribute rather than being silently flushed, avoiding numerical discontinuity but incurring the possibility of slow denormal arithmetic.
  • Algorithm Design Adjustments: Mixed-precision GEMM and solver designers must account for non-IEEE-compliant rounding, guard-width, and accumulator configuration in both split-word length selection and error recurrences.

In Blackwell, the two guard bits and larger tile size enable deeper one-pass accumulations with lower relative rounding error, provided truncation-related bias is managed appropriately to ensure algorithmic stability (Khattak et al., 7 Dec 2025, Xie et al., 14 Nov 2025).
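
The downward bias induced by truncation can be seen in a toy experiment; the 13-fractional-bit grid below loosely mimics the FP8 accumulation path, and the magnitudes are purely illustrative:

```python
import random

G = 2.0 ** -13                         # grid spacing of a 13-fractional-bit accumulator

def chop(x):
    return int(x / G) * G              # truncate toward zero

def rne(x):
    return round(x / G) * G            # round half to even (Python's default rounding)

rng = random.Random(1)
exact = acc_chop = acc_rne = 0.0
for _ in range(10_000):
    t = 1.0 + rng.random() * G         # each term carries a random sub-grid tail
    exact += t
    acc_chop = chop(acc_chop + t)
    acc_rne  = rne(acc_rne + t)

print(exact - acc_chop)                # systematic downward drift, roughly n * G / 2
print(exact - acc_rne)                 # errors largely cancel
```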

7. Architectural Comparison and Evolution

A comparative outline of accumulator and alignment width, block size, and guard-bit usage across recent tensor core generations:

  Architecture   Guard Bits (n_g)   Block-FMA Size (N)         Alignment Width (bits)
  V100           0                  4                          24
  A100           1                  8                          26
  H100/H200      2                  16 (fp16/bf16), 4 (tf19)   27
  B200           2                  16 (fp16/bf16), 4 (tf19)   27

The increase in guard bits and alignment width is a persistent trend, leading to improved rounding fidelity and reduced error bounds for fused-dot-products. A plausible implication is that future architectures may continue to expand hardware alignment width and algorithmic transparency to meet the growing stability requirements of massive DNN and scientific workloads (Khattak et al., 7 Dec 2025, Xie et al., 14 Nov 2025).
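
A back-of-the-envelope reading of the table, assuming the 2^(-p-n_g) per-term alignment-error bound quoted in Section 3 with p = 24:

```python
# Guard-bit counts from the table above; the bound 2**-(p + n_g) follows the
# expression quoted in Section 3 and is illustrative only.
GUARD_BITS = {"V100": 0, "A100": 1, "H100/H200": 2, "B200": 2}
P = 24
for arch, n_g in GUARD_BITS.items():
    print(f"{arch:>10}: n_g = {n_g}, per-term alignment error <= 2^-{P + n_g} "
          f"~ {2.0 ** -(P + n_g):.2e}")
```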
