MATLAB Tensor Core v0.1 Emulator
- MATLAB Tensor Core v0.1 is a pure-MATLAB MMA emulator that accurately models NVIDIA Tensor Core units across V100, A100, H100, and B200 GPUs.
- It emulates mixed-precision matrix multiplication by replicating hardware-specific rounding behavior, IEEE-754 deviations, and special value handling.
- The emulator supports various NVIDIA data formats and parameterized block-wise fused-multiply-add workflows, enabling detailed investigations of hardware behavior.
MATLAB Tensor Core v0.1 is a pure-MATLAB mixed-precision matrix multiply-accumulate (MMA) emulator designed to replicate the numerical behavior of NVIDIA Tensor Core units across major data center GPU generations, specifically the V100, A100, H100, and B200. The emulator provides bit-faithful software models for the inner product operations in these low- and mixed-precision hardware units, including all observed deviations from IEEE 754 floating-point semantics, rounding behavior, and special value handling. It is parameterized to support the block-wise fused-multiply-add workflow and numerical properties of each target architecture, enabling algorithm developers to investigate and validate the full spectrum of effects associated with NVIDIA's hardware-level matrix multiplication (Khattak et al., 7 Dec 2025).
1. Supported Formats and Bit Layouts
Tensor Core v0.1 supports the following NVIDIA-related formats, each represented as an unsigned integer bit-pattern of the specified width:
| Format | Bit Layout | Exponent Bias |
|---|---|---|
| fp8-E4M3 | [S \| E3…E0 \| M2…M0] | 7 |
| fp8-E5M2 | [S \| E4…E0 \| M1 M0] | 15 |
| fp16 | [S \| E14…E10 \| M9…M0] | 15 |
| bf16 | [S \| E22…E15 \| M14…M8] | 127 |
| tf19 | [S \| E22…E15 \| M14…M5] | 127 |
| fp32 | [S \| E30…E23 \| M22…M0] | 127 |
Special value handling for each format follows IEEE-754 encoding conventions, accommodating subnormals, zeros, infinities, and NaN patterns per format-specific rules. Subnormals are neither flushed to zero nor eliminated during intermediate calculation, consistent across all supported hardware models (Khattak et al., 7 Dec 2025).
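As a concrete illustration of these format rules, the sketch below decodes an fp8-E4M3 bit pattern in Python (the emulator itself is pure MATLAB). It handles subnormals without flushing and the E4M3 NaN encoding; `decode_e4m3` is a hypothetical helper, not part of the emulator's API.

```python
import math

def decode_e4m3(bits: int) -> float:
    """Illustrative decoder for an fp8-E4M3 bit pattern (bias 7, 3 mantissa
    bits). Per the OCP FP8 convention, E4M3 has no infinities: the all-ones
    exponent+mantissa pattern encodes NaN. Subnormals are decoded, not
    flushed to zero. (Hypothetical helper, not the emulator's API.)"""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 0xF and mant == 0x7:
        return math.nan                             # S.1111.111 -> NaN
    if exp == 0:
        return sign * mant * 2.0 ** (1 - 7 - 3)     # subnormal: no implicit 1
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

# Largest normal E4M3 value is (1 + 6/8) * 2^8 = 448:
assert decode_e4m3(0x7E) == 448.0
```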
2. Inner-Product Arithmetic Model
The MMA operation calculates a block-wise inner product between operand vectors $a$ and $b$, with optional accumulation via a third input $c$. For block size $N$, the computation proceeds as follows:
- Partitioning: The products $a_i b_i$ are split into blocks of $N$ terms.
- Product Calculation (per block):
- Inputs are cast into an internal product precision: exponent range matches the input, but mantissa width is doubled.
- Products are calculated exactly, then rounded to "product-alignment" precision: $2$ integer bits and $2m + n_{eab}$ mantissa bits, where $m$ is the input mantissa width and $n_{eab}$ the architecture-specific number of extra alignment bits.
- Alignment and Addition:
- All products within a block are bit-shifted to a common maximum exponent.
- Fixed-point integer addition aggregates the aligned terms, plus $c$ if in "c_early" mode.
- Normalization and Block Rounding:
- The block sum is normalized and rounded to accumulator precision: $2$ integer bits and $m_{out} + n_{eab}$ mantissa bits, where $m_{out}$ is the output mantissa width.
- The result is used as accumulator input for the next block or, if in "c_late" mode, is added after all blocks.
- Final Casting:
- After all blocks, the sum is rounded to the output format (fp16, fp32, etc.) using the designated rounding mode.
Mathematically, for two blocks of size $N$ with late $c$-addition:

$$d = \mathrm{cast}\left( R\!\left( R\!\left( \sum_{i=1}^{N} a_i b_i \right) + \sum_{i=N+1}^{2N} a_i b_i \right) + c \right),$$

where $R$ denotes rounding to accumulator precision, and $\mathrm{cast}$ is the final format cast (Khattak et al., 7 Dec 2025).
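The per-block flow can be sketched as follows. This is an illustrative Python reduction of the idea (the emulator is MATLAB), assuming the inputs are low-precision values whose products are exact in a double; `acc_bits` stands in for the architecture's accumulator width.

```python
import math

def block_mma_fixed(a, b, c=0.0, acc_bits=25, c_early=True):
    """Illustrative sketch of one block of the MMA flow: products are taken
    exactly, shifted to the block's maximum exponent, truncated toward zero
    (RTZ) onto a fixed-point grid of `acc_bits` bits, summed as integers,
    and converted back to a float."""
    terms = [x * y for x, y in zip(a, b)]
    if c_early:
        terms.append(c)                  # "c_early": c joins the block sum
    nonzero = [t for t in terms if t != 0.0]
    if not nonzero:
        return 0.0 if c_early else c
    e_max = max(math.frexp(t)[1] for t in nonzero)   # common block exponent
    # Align every term to e_max and truncate (RTZ) to acc_bits of precision.
    total = sum(math.trunc(t * 2.0 ** (acc_bits - e_max)) for t in terms)
    result = total * 2.0 ** (e_max - acc_bits)
    return result if c_early else result + c         # "c_late": add at the end

# Everything representable within acc_bits survives exactly:
assert block_mma_fixed([1.0, 2.0, 3.0, 4.0], [1.0] * 4) == 10.0
# A term 2^-30 below the block's leading exponent is truncated away:
assert block_mma_fixed([1.0, 2.0 ** -30], [1.0, 1.0]) == 1.0
```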
3. Rounding Modes and Error Bounds
Tensor Core v0.1 supports multiple rounding modes:
- RNE (Round to Nearest, ties to Even)
- RTZ (Round toward Zero)
- RU (Round Up, toward $+\infty$)
- RD (Round Down, toward $-\infty$)
Internal block-sum normalization uses either RTZ or RNE, depending on the hardware mapping. Output rounding is typically RNE for fp16 outputs and RTZ for fp32, with exceptions such as V100 using RTZ internally and RNE only when writing back to fp16. The relative rounding error is bounded by $2^{-t}$ for RNE, or $2^{1-t}$ for RTZ. Here, $t$ is the mantissa (plus implicit bit) count, so $t = 11$ for fp16 and $t = 24$ for fp32 (Khattak et al., 7 Dec 2025).
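The two dominant modes can be illustrated with a small helper that rounds a double to $t$ significant bits; `round_to_precision` is a hypothetical name, and Python's built-in `round` supplies the ties-to-even behavior.

```python
import math

def round_to_precision(x: float, t: int, mode: str = "rne") -> float:
    """Round x to t significant bits (t counts the implicit leading bit).
    Hypothetical helper: RNE leans on Python's round-half-to-even; RTZ
    truncates the scaled significand toward zero."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * 2.0 ** t                 # significand on an integer grid
    scaled = round(scaled) if mode == "rne" else math.trunc(scaled)
    return scaled * 2.0 ** (e - t)

# 1 + 3*2^-12 at t = 11 (fp16 precision): RNE rounds up, RTZ truncates.
assert round_to_precision(1 + 3 * 2**-12, 11, "rne") == 1 + 2**-10
assert round_to_precision(1 + 3 * 2**-12, 11, "rtz") == 1.0
```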
4. Accumulator Width and Normalization Points
The internal accumulator is structured to provide exact alignment and minimal overflow risk using

$$w_{\mathrm{acc}} = 2 + m_{\mathrm{out}} + n_{eab},$$

where $m_{\mathrm{out}}$ is 23 for fp32 output or 10 for fp16 output, and $n_{eab}$ is a hardware-specific number of extra alignment bits.
| GPU Model (mode) | Block Size $N$ | $n_{eab}$ | Accumulator Width | Final Rounding |
|---|---|---|---|---|
| V100 (fp16→fp32) | 4 | 0 | 25 (28 incl. adder) | RTZ |
| A100/A2/A30 (fp16/bf16→fp32) | 8 | 1 | 26 (30) | RTZ (fp32), RNE (fp16) |
| H100/H200/B200 (fp16/bf16→fp32) | 16 | 2 | 27 (31/32) | RTZ (fp32), RNE (fp16) |
| fp8→fp32/16 (H100 family) | 16 per interleaved block | 2 | 12 | As above |
A plausible implication is that tuning the $n_{eab}$ parameter and block sizes in MATLAB Tensor Core v0.1 enables developers to exactly match the rounding and normalization quirks of each physical hardware unit (Khattak et al., 7 Dec 2025).
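The width formula can be checked directly against the table; the helper below simply evaluates $w = 2 + m_{out} + n_{eab}$ for the fp32-accumulation rows (the function name is illustrative).

```python
# Hypothetical helper; evaluates the accumulator-width formula
# w = 2 + m_out + n_eab from the table above.
def acc_width(m_out: int, n_eab: int) -> int:
    return 2 + m_out + n_eab

# fp32 accumulation (m_out = 23) reproduces the table's widths:
widths = {"V100": acc_width(23, 0),   # 25
          "A100": acc_width(23, 1),   # 26
          "H100": acc_width(23, 2)}   # 27
```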
5. Special-Value and Exception Handling
Tensor Core v0.1 correctly models IEEE-754 special-value cases:
- Subnormals: Supported as both inputs and intermediate results; there is no flush-to-zero.
- NaN and Infinity: Any operation involving NaN yields NaN; invalid operations such as $0 \times \infty$ produce NaN, and summing infinities of opposite sign yields NaN.
- Signed Zeros: Maintained through alignment and rounding according to IEEE-754; positive and negative zeros are preserved.
All hardware models conform to these rules for every supported format and block configuration (Khattak et al., 7 Dec 2025).
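Since Python doubles are themselves IEEE-754, the special-value rules the emulator models can be demonstrated directly (an illustration of the semantics, not of the emulator's code):

```python
import math

inf = float("inf")
nan = float("nan")

# NaN propagation: any operation with a NaN operand yields NaN
assert math.isnan(nan * 1.0 + 2.0)

# Invalid operations produce NaN (opposite-signed infinities, 0 x inf)
assert math.isnan(inf - inf)
assert math.isnan(inf * 0.0)

# Signed zeros are preserved: -0.0 equals 0.0 but keeps its sign bit
assert math.copysign(1.0, -0.0) == -1.0
```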
6. MATLAB Reference Implementation and Pseudocode
A generic MMA operation over a single block can be implemented using only basic integer, floating-point, and bit-manipulation primitives in MATLAB. The core functions are:
- `decode_fp` (bit-field extraction)
- `round_fixed` (fixed-point mantissa rounding)
- `normalize_integer` (normalization)
- `encode_fp` (bit-field packing)
- `two_op_align_and_add` (final two-term FMA logic)
- `cast_fp` (final output packing)
Sample pseudocode for one block's MMA operation, parameterized for all core hardware models, is provided in the design specification. This implementation supports both early and late $c$-addition, all standard rounding modes, and the interleaving/fused patterns observed on H100/B200. The block-based FMA is internally replicated by `GEMM.m` to realize full matrix multiplication (Khattak et al., 7 Dec 2025).
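How block FMAs chain into a full-length inner product (the "c_early" path, where each block's rounded sum becomes the next block's accumulator) can be sketched as follows; `block_fma` is a placeholder for the per-block routine of Section 2, not an emulator function.

```python
def chained_inner_product(a, b, c, N, block_fma):
    """Chain block FMAs over a full-length inner product: each block's
    (rounded) sum feeds the next block as the running accumulator.
    `block_fma(a_blk, b_blk, acc)` is a placeholder for one block's MMA."""
    acc = c
    for i in range(0, len(a), N):
        acc = block_fma(a[i:i + N], b[i:i + N], acc)
    return acc

# With an exact block routine this reduces to a plain dot product:
exact_block = lambda x, y, acc: acc + sum(p * q for p, q in zip(x, y))
assert chained_inner_product([1.0, 2.0, 3.0, 4.0], [1.0] * 4,
                             0.0, 2, exact_block) == 10.0
```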
7. Hardware Quirks and IEEE-754 Deviations
The emulator preserves several key deviations of NVIDIA Tensor Core hardware from strict IEEE-754 FMAs:
- Denormalized Accumulation: Partial products are not individually normalized; the largest exponent only governs the alignment within each block. Smaller exponent results are truncated, rather than sticky rounded, which can induce denormalization in internal sums.
- Late vs. Early $c$-Addition: On V100/A100, the timing of the addend $c$ significantly affects rounding; both modes are replicated.
- Interleaved Blocks for fp8 on H100/B200: With 32 fp8 products per MMA, the products are grouped into two interleaved blocks of 16, each summed independently and combined at the end.
- Final Rounding Anomalies: Some devices use truncation consistently except for final fp16 output (RNE); others switch rounding styles depending on output format or writeback step.
All anomalies are encoded in the parameter struct:
```matlab
params = struct( ...
    'N',             <block size>, ...
    'n_eab',         <extra align bits>, ...
    'fr_mode',       'rtz' | 'rne', ...
    'fr_mode_final', 'rtz' | 'rne', ...
    'c_placement',   'early' | 'late', ...
    'inter_pattern', true | false ...
);
```
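For experimentation outside MATLAB, analogous presets can be written as plain data. The mapping below is hypothetical: `N` and `n_eab` follow the table in Section 4, but the rounding-mode, placement, and interleave fields are illustrative defaults rather than confirmed per-GPU settings.

```python
# Hypothetical presets: 'N' and 'n_eab' follow the Section 4 table; the
# remaining fields are illustrative defaults, NOT confirmed per-GPU values.
PARAMS = {
    "V100": dict(N=4,  n_eab=0, fr_mode="rtz", fr_mode_final="rne",
                 c_placement="late",  inter_pattern=False),
    "A100": dict(N=8,  n_eab=1, fr_mode="rtz", fr_mode_final="rne",
                 c_placement="early", inter_pattern=False),
    "H100": dict(N=16, n_eab=2, fr_mode="rtz", fr_mode_final="rne",
                 c_placement="early", inter_pattern=True),
}
```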