
Accurate Models of NVIDIA Tensor Cores (2512.07004v1)

Published 7 Dec 2025 in cs.MS, cs.AR, and math.NA

Abstract: Matrix multiplication is a fundamental operation for both training of neural networks and inference. To accelerate matrix multiplication, Graphics Processing Units (GPUs) implement it in hardware. Owing to their increased throughput over software-based matrix multiplication, these hardware multipliers are increasingly used outside of AI to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to non-reproducible results across different generations of GPU architectures at the level of the matrix multiply-accumulate instruction. To study the numerical characteristics of matrix multipliers, such as rounding behaviour, accumulator width, normalization points, and extra carry bits, test vectors are typically constructed. Yet these vectors may or may not distinguish between different hardware models, and due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models for emulating the inner-product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100, and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.

Summary

  • The paper introduces a two-stage testing process (GNFT and ISSM) to accurately model mixed-precision behavior of NVIDIA tensor cores.
  • It validates detailed MATLAB-based models across architectures like V100, A100, H100, and B200 to ensure bit-exact simulations.
  • The study demonstrates how numerical nuances in FMA size and alignment affect error propagation, enhancing algorithm design and hardware portability.

Accurate Modeling of NVIDIA Tensor Cores: Methodology, Validation, and Implications

Introduction

NVIDIA tensor cores implement mixed-precision matrix multiplication to accelerate both DNN training and scientific computing workloads. However, their arithmetic behavior diverges from IEEE 754, is not standardized across vendors or even GPU generations, and often lacks proper documentation, hindering portability and reproducibility for both algorithm and system designers. This paper addresses the need for high-fidelity, vendor-specific software models of tensor core behavior, particularly for low-precision and mixed-precision arithmetic, by establishing a methodology and delivering comprehensive MATLAB-based models validated against actual NVIDIA hardware spanning V100 through B200 architectures.

Methodology for Numerical Feature Determination

The authors present a two-stage process for accurately characterizing and emulating tensor core arithmetic:

  1. Generalized Numerical Feature Testing (GNFT): This relies on algorithmically constructed test vectors, inspired by exhaustive and targeted vector strategies, to identify features such as support for subnormal values, rounding and truncation modes, normalization points, accumulator width, extra alignment bits, and block FMA sizes for all input datatype combinations (FP8, FP16, BF16, TF19, FP32).
  2. Input Space Search Method (ISSM): The preliminary model obtained from GNFT is refined via systematic randomized testing: large ensembles of matrix inputs are generated, and discrepancies in output between the model and actual hardware are used to further correct the software model. This iterative cycle is crucial for identifying nuances such as denormalized accumulation, interleaved partitioning of inner products (notably in H100/H200/B200), and precise accumulator characteristics.

This methodology generalizes, improves, and in several aspects corrects prior work [hibr19, fhmp21, llfs24, vlpg25, khmi25].
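
To make the ISSM loop concrete, the following Python sketch shows how a candidate inner-product model can be compared bit-for-bit against reference results on randomized inputs. The names model_inner_product and run_on_hardware are placeholder callables for a software model and a hardware (or oracle) back end; they are not part of the paper's MATLAB toolbox, and the input-generation strategy here is only one plausible choice.

```python
# Hypothetical sketch of an ISSM-style refinement loop; model_inner_product and
# run_on_hardware are placeholder callables, not names from the paper's toolbox.
import numpy as np

def issm_search(model_inner_product, run_on_hardware, k=16, trials=10_000, seed=0):
    """Collect FP16 inner-product inputs on which model and hardware disagree.

    model_inner_product(a, b) -> float32-like : candidate software model
    run_on_hardware(a, b)     -> float32-like : reference tensor-core result
    """
    rng = np.random.default_rng(seed)
    mismatches = []
    for _ in range(trials):
        # Spread exponents widely so alignment and truncation inside the
        # accumulator are actually exercised, not just the final rounding.
        a = (rng.standard_normal(k) * 2.0 ** rng.integers(-14, 11, k)).astype(np.float16)
        b = (rng.standard_normal(k) * 2.0 ** rng.integers(-14, 11, k)).astype(np.float16)
        ref = np.float32(run_on_hardware(a, b))
        got = np.float32(model_inner_product(a, b))
        # Bit-exact comparison: any difference flags a mismodelled feature and
        # feeds the next round of model correction.
        if ref.tobytes() != got.tobytes():
            mismatches.append((a, b))
    return mismatches
```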

Empirical Model Refinement Across Major Architectures

The process was applied to the tensor cores of V100, A100, H100, H200, B200, L40S, A2, A30, and Ada RTX 1000 GPUs, and all supported input/output format variants:

  • V100: The model reveals a 4-element block FMA with 25-bit alignment; products are kept denormalized during alignment and low-order bits are truncated, diverging in detail from previously published claims.
  • A100/A2/A30: FMA size of 8 in FP16/BF16 modes with 1 extra alignment bit (26-bit alignment) and truncation after normalization. FP64 on A100 is IEEE-compliant, single-block FMA.
  • H100/H200/B200: For FP16/BF16, exactly two extra alignment bits (27-bit alignment), block FMA size of 16, denormalized accumulation, truncations, and interleaved accumulation for FP8 inner products (partitioned, not contiguous, affecting error properties and requiring special attention in modeling).
  • L40S/Ada RTX 1000: Similar to A100 for FP16/BF16/TF19, but in FP8 modes, only 13 fraction bits are used in accumulation (instead of 25), and there is no interleaved pattern as found in H100-class hardware.
  • For all, accumulator widths, alignment schemes, and normalization/truncation points are rigorously established and differ meaningfully by generation.

Randomized and feature-specific tests with 10^5 matrix products per configuration ensured that every model instance produced bit-exact output, and each discovered mismatch identified previously undocumented hardware behaviors.
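
The architectural parameters above (block FMA size, alignment width, truncation point) can be captured in a small, parameterizable inner-product model. The sketch below, in Python rather than the authors' MATLAB, illustrates the general block-FMA-with-fixed-width-alignment technique; the truncation rule and fixed-point grid are assumptions for exposition, not a bit-exact reproduction of any particular GPU.

```python
# Simplified block-FMA inner-product model in Python (the released models are
# MATLAB); fma_size and align_bits are illustrative knobs, and the truncation
# rule is an assumption, not the exact bit-level pipeline of any specific GPU.
import math
import numpy as np

def block_fma_dot(a, b, fma_size=8, align_bits=26):
    """Emulate sum(a[i]*b[i]) with block-wise alignment to a fixed width.

    a, b       : float16 vectors of equal length
    fma_size   : number of products accumulated per block FMA
    align_bits : fraction bits kept after aligning products to the block maximum
    """
    acc = 0.0  # running accumulator; blocks are added one at a time
    for start in range(0, len(a), fma_size):
        # FP16 x FP16 products are exact in float64, modelling exact multiplication.
        prods = [float(x) * float(y)
                 for x, y in zip(a[start:start + fma_size], b[start:start + fma_size])]
        block_max = max((abs(p) for p in prods if p != 0.0), default=0.0)
        if block_max == 0.0:
            continue
        # Fixed-point grid with align_bits fraction bits below the largest product.
        ulp = math.ldexp(1.0, math.frexp(block_max)[1] - align_bits)
        # Truncate each product onto the grid (discard low-order bits), sum exactly.
        block_sum = sum(math.trunc(p / ulp) for p in prods) * ulp
        acc += block_sum
    return np.float32(acc)  # final rounding of the accumulator to FP32
```

Varying fma_size and align_bits in such a model is enough to produce qualitatively different error behaviour between, for example, a 4-element/25-bit and a 16-element/27-bit configuration, which is the kind of difference the validated models capture exactly.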

Software Model Implementation and Utility

The resulting models are released as MATLAB code (MATLAB Tensor Core v0.1), highly parameterizable to reflect discovered architectural subtleties. Models can be instantiated for specific hardware or for arbitrary custom configurations (block size, extra/guard bits, rounding mode, interleaving, etc.), making them valuable for:

  • Validating targeted test vectors for feature-specific detection.
  • Supporting algorithm designers with hardware-accurate, vendor-specific simulation.
  • Providing a foundation for multi-word arithmetic emulation, error analysis, and mixed-precision research.
  • Decoupling algorithm development from hardware access, critical for portability and verification under future evolving hardware and standards.

An explicit interface is given for calling models corresponding to each GPU generation and for composing custom variants. Exceptional-value handling (NaN/Inf/subnormals) is specifically matched to the hardware.
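
As an illustration of this kind of parameterization, the hypothetical configuration layer below restates the per-generation FP16/BF16 parameters summarized above as presets. The field names, preset structure, and entry point are invented for exposition and do not reflect the actual MATLAB Tensor Core v0.1 interface.

```python
# Hypothetical per-generation presets; the field names and this structure are
# invented for exposition and are not the MATLAB Tensor Core v0.1 interface.
# The values restate the FP16/BF16 findings summarized earlier in this article.
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorCoreConfig:
    fma_size: int          # products accumulated per block FMA
    align_bits: int        # fraction bits kept when aligning products
    interleaved_fp8: bool  # interleaved partitioning of FP8 inner products

PRESETS = {
    "V100": TensorCoreConfig(fma_size=4,  align_bits=25, interleaved_fp8=False),
    "A100": TensorCoreConfig(fma_size=8,  align_bits=26, interleaved_fp8=False),
    "H100": TensorCoreConfig(fma_size=16, align_bits=27, interleaved_fp8=True),
}

# A preset can then be fed to an inner-product model such as the block_fma_dot
# sketch above, or replaced by a custom configuration with different guard bits.
if __name__ == "__main__":
    for name, cfg in PRESETS.items():
        print(name, cfg)
```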

Implications for Algorithm Analysis and GEMM Emulation

As an application, the authors demonstrate high-precision GEMM emulation using multi-word splitting over low-precision tensor cores, validating the respective numerical behaviors over generations:

  • For single-block computation, error is stable across cores for moderate problem sizes but diverges for large n.
  • With multi-word splitting, error is reduced, notably more so for V100 and B200 with round-to-nearest (compared to truncation).
  • The H100, H200, and B200 demonstrate higher error than the V100 for large n if bit truncation is used, owing to their accumulator implementation; this is mitigated with careful accumulator rounding.
  • L40S (and Ada RTX 1000) exhibit the highest error due to their restricted accumulator width.

These results provide strong evidence that differences in FMA size, alignment bit count, rounding/truncation positions, and interleaving materially affect error propagation and must be incorporated in any reliable numerical analysis or precision emulation pipeline using modern GPUs.
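
For readers unfamiliar with multi-word splitting, the sketch below shows the basic two-word scheme in Python: each FP32 operand is split into a high and a low FP16 word, and the product is reassembled from four low-precision GEMMs accumulated in FP32. It illustrates only the general technique; the paper's splitting scheme and the tensor-core accumulation details it models are not reproduced here.

```python
# Minimal sketch of two-word splitting for higher-precision GEMM emulation on
# low-precision multipliers; this shows the general technique and is not the
# paper's exact splitting or accumulation scheme.
import numpy as np

def split_fp32_to_fp16(x):
    """Split an FP32 matrix into high/low FP16 words with x ~ hi + lo.

    Assumes entries are within FP16 range (no overflow handling here).
    """
    hi = x.astype(np.float16)
    lo = (x - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def gemm_two_word(a, b):
    """Reassemble a higher-accuracy product from four FP16 x FP16 GEMMs."""
    a_hi, a_lo = split_fp32_to_fp16(a)
    b_hi, b_lo = split_fp32_to_fp16(b)
    f32 = np.float32
    # Each matmul stands in for one tensor-core GEMM with FP32 accumulation.
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32)
            + a_lo.astype(f32) @ b_lo.astype(f32))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 64)).astype(np.float32)
    B = rng.standard_normal((64, 64)).astype(np.float32)
    exact = A.astype(np.float64) @ B.astype(np.float64)
    plain = A.astype(np.float16).astype(np.float32) @ B.astype(np.float16).astype(np.float32)
    print("single FP16 product error:", np.abs(plain - exact).max())
    print("two-word split error:     ", np.abs(gemm_two_word(A, B) - exact).max())
```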

Theoretical and Practical Implications

This work fills the methodological and tooling gap left by hardware vendors regarding cross-architecture, low-precision matrix arithmetic. It establishes an extensible methodology for mapping hardware-to-software models and exposes "hidden" numerical details that prior studies missed or mischaracterized.

For practitioners, the immediate outcome is robust mixed-precision simulation without full hardware access—a necessity as hardware cycles through generations with non-standardized changes. For theorists and standards bodies (noting IEEE P3109 and OCP), the work underscores the incompatibility of “portable” numerical software with current non-standardized mixed-precision reduction operators and motivates concrete standardization efforts.

Finally, the modular, parameterized model provides a valuable platform for future research in matrix engine design, new GEMM algorithms, error analysis, auto-tuning, and the impact assessment of hardware features on both AI and HPC codes.

Conclusion

This paper presents technically rigorous, empirically validated, and openly accessible models for simulating NVIDIA’s tensor core matrix multiplication across major architectures and data format regimes. The presented methodology and toolbox mark a significant advance for mixed-precision algorithm development, error analysis, and cross-architecture portability in high-performance and AI computing research. They also provide concrete evidence for the necessity of detailed, standardized descriptions of hardware arithmetic behavior for future reproducibility and numerical analysis. The authors outline future plans to extend language support, backend performance, model coverage, and model validation strategies, ensuring ongoing utility and adaptability of their contributions.
