
MATLAB Tensor Core v0.1 Emulator

Updated 7 January 2026
  • MATLAB Tensor Core v0.1 is a pure-MATLAB MMA emulator that accurately models NVIDIA Tensor Core units across V100, A100, H100, and B200 GPUs.
  • It emulates mixed-precision matrix multiplication by replicating hardware-specific rounding behavior, IEEE-754 deviations, and special value handling.
  • The emulator supports various NVIDIA data formats and parameterized block-wise fused-multiply-add workflows, enabling detailed investigations of hardware behavior.

MATLAB Tensor Core v0.1 is a pure-MATLAB mixed-precision matrix multiply-accumulate (MMA) emulator designed to replicate the numerical behavior of NVIDIA Tensor Core units across major data center GPU generations, specifically the V100, A100, H100, and B200. The emulator provides bit-faithful software models for the inner product operations in these low- and mixed-precision hardware units, including all observed deviations from IEEE 754 floating-point semantics, rounding behavior, and special value handling. It is parameterized to support the block-wise fused-multiply-add workflow and numerical properties of each target architecture, enabling algorithm developers to investigate and validate the full spectrum of effects associated with NVIDIA's hardware-level matrix multiplication (Khattak et al., 7 Dec 2025).

1. Supported Formats and Bit Layouts

Tensor Core v0.1 supports the following NVIDIA-related formats, each represented as an unsigned integer bit-pattern of the specified width:

Format      Bit Layout            Exponent Bias
fp8-E4M3    [S E6…E3 M2…M0]       7
fp8-E5M2    [S E6…E2 M1…M0]       15
fp16        [S E14…E10 M9…M0]     15
bf16        [S E14…E7 M6…M0]      127
tf19        [S E17…E10 M9…M0]     127
fp32        [S E30…E23 M22…M0]    127

Special value handling for each format follows IEEE-754 encoding conventions, accommodating subnormals, zeros, infinities, and NaN patterns per format-specific rules. Subnormals are neither flushed to zero nor eliminated during intermediate calculation, consistent across all supported hardware models (Khattak et al., 7 Dec 2025).
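As a concrete illustration of these conventions, an fp8-E4M3 bit pattern can be decoded with a few bit operations. The following Python sketch is illustrative only (the emulator itself is pure MATLAB, and this is not its API); it assumes the OCP E4M3 convention, in which only S.1111.111 encodes NaN and there are no infinities:

```python
def decode_e4m3(byte):
    """Decode an fp8-E4M3 bit pattern [S E6..E3 M2..M0] with exponent bias 7.

    Subnormals (E == 0) are decoded, not flushed to zero; under the OCP
    convention E4M3 has no infinities and only S.1111.111 is NaN.
    """
    s = (byte >> 7) & 0x1          # sign bit
    e = (byte >> 3) & 0xF          # 4 exponent bits
    m = byte & 0x7                 # 3 mantissa bits
    sign = -1.0 if s else 1.0
    if e == 0xF and m == 0x7:      # the single NaN encoding
        return float("nan")
    if e == 0:                     # subnormal: no implicit leading 1
        return sign * (m / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + m / 8.0) * 2.0 ** (e - 7)
```

For example, 0x38 decodes to 1.0 and 0x7E to 448.0, the E4M3 maximum normal value.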

2. Inner-Product Arithmetic Model

The MMA operation computes a block-wise inner product of $k$ terms between operand vectors $a$ and $b$, with optional accumulation via a third input $c$. The computation proceeds as follows, for block size $N_{\text{block}}$:

  1. Partitioning: The $k$ products are split into $\lceil k / N_{\text{block}} \rceil$ blocks.
  2. Product Calculation (per block):
    • Inputs are cast into an internal product precision: the exponent range matches the input, but the mantissa width is doubled.
    • Products are computed exactly, then rounded to "product-alignment" precision: 2 integer bits and $m_{in} + n_{eab}$ mantissa bits, where $m_{in}$ is the input mantissa width and $n_{eab}$ the architecture-specific number of extra alignment bits.
  3. Alignment and Addition:
    • All products within a block are bit-shifted to a common maximum exponent.
    • Fixed-point integer addition aggregates the aligned terms, plus $c$ if in "c_early" mode.
  4. Normalization and Block Rounding:
    • The block sum is normalized and rounded to accumulator precision: 2 integer bits, the output mantissa bits, plus $n_{eab}$.
    • The result is used as the accumulator input for the next block; in "c_late" mode, $c$ is added only after all blocks.
  5. Final Casting:
    • After all blocks, the sum is rounded to the output format (fp16, fp32, etc.) using the designated rounding mode.
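Steps 2–4 above can be sketched in a few lines. The following Python fragment is a simplified, hypothetical rendering (not the emulator's actual code): it computes exact products, aligns them to the block's maximum exponent, accumulates in fixed point, and truncates each aligned term, assuming inputs whose products are exactly representable in double precision:

```python
import math

def block_inner_product(a, b, n_eab=1, p_out=24):
    """One block of steps 2-4: exact products, alignment to the common
    maximum exponent, fixed-point integer accumulation with truncation (RTZ).

    Simplified sketch; p_out and n_eab follow the notation of the text."""
    prods = [x * y for x, y in zip(a, b)]              # step 2: exact products
    if all(p == 0.0 for p in prods):
        return 0.0
    e_max = max(math.frexp(p)[1] for p in prods if p)  # step 3: common max exponent
    frac_bits = p_out + n_eab                          # bits kept below 2**e_max
    scale = 2.0 ** (frac_bits - e_max)
    acc = 0                                            # fixed-point accumulator
    for p in prods:
        acc += math.trunc(p * scale)                   # shift, truncate, integer add
    return acc / scale                                 # step 4: value of block sum
```

With a wide accumulator the result is exact (e.g. [1.0, 2.0]·[3.0, 4.0] gives 11.0), while a deliberately narrow `p_out` silently truncates terms that fall below the alignment window.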

Mathematically, for two blocks:

$$S_1 = \mathrm{Fl}_A\!\left(\sum_{\ell=1}^{N_1} \mathrm{RoundProduct}(a_\ell, b_\ell)\right)$$
$$S_2 = \mathrm{Fl}_A\!\left(S_1 + \sum_{\ell=N_1+1}^{k} \mathrm{RoundProduct}(a_\ell, b_\ell)\right)$$
$$d = \mathrm{Round}_O(S_2 + c)$$

where $\mathrm{Fl}_A$ denotes rounding to accumulator precision $A$, and $\mathrm{Round}_O$ is the final cast to the output format (Khattak et al., 7 Dec 2025).

3. Rounding Modes and Error Bounds

Tensor Core v0.1 supports multiple rounding modes:

  • RNE (Round to Nearest, ties to Even)
  • RTZ (Round toward Zero)
  • RU (Round toward $+\infty$)
  • RD (Round toward $-\infty$)

Internal block-sum normalization uses either RTZ or RNE, depending on the hardware mapping. Output rounding is typically RNE for fp16 outputs and RTZ for fp32, with exceptions such as the V100 using RTZ internally and RNE only when writing back to fp16. The rounding error is bounded by $|\hat{x} - x| \leq \frac{1}{2}\,\mathrm{ulp}(\text{target format})$ for RNE, and by $|\hat{x} - x| \leq \mathrm{ulp}(\text{target format})$ for RTZ. Here $p$ is the mantissa width including the implicit bit, so the unit roundoff is $u_n = 2^{-(p-1)}$ (Khattak et al., 7 Dec 2025).
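A minimal Python sketch of rounding to a $p$-bit significand under RNE and RTZ, which can be used to check the stated error bounds (an illustrative helper, not the emulator's `round_fixed`):

```python
import math

def round_to_p_bits(x, p, mode="rne"):
    """Round x to a significand of p bits (p counts the implicit bit).

    'rne' rounds to nearest, ties to even; 'rtz' truncates toward zero.
    Assumes |x| is in the normal range (no subnormal handling here)."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)           # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * 2.0 ** p          # significand scaled to an integer grid
    # Python's round() implements round-half-to-even, matching RNE ties
    q = math.trunc(scaled) if mode == "rtz" else round(scaled)
    return math.ldexp(q, e - p)
```

At $p = 11$ (fp16-like), $\mathrm{ulp}(1.0) = 2^{-10}$, so the RNE error near 1.0 never exceeds $2^{-11}$, consistent with the bound above.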

4. Accumulator Width and Normalization Points

The internal accumulator is structured to provide exact alignment and minimal overflow risk using:

$$W_a = 2\ (\text{integer bits}) + p_{\text{out}} + n_{eab}$$

where $p_{\text{out}}$ is 23 for fp32 output and 10 for fp16 output, and $n_{eab}$ is a hardware-specific number of extra alignment bits.

GPU Model (mode)                   Block Size N_1    n_eab   W_a                    Final Rounding
V100 (fp16→fp32)                   4                 0       25 (28 incl. adder)    RTZ
A100/A2/A30 (fp16/bf16→fp32)       8                 1       26 (30)                RTZ (fp32), RNE (fp16)
H100/H200/B200 (fp16/bf16→fp32)    16                2       27 (31/32)             RTZ (fp32), RNE (fp16)
fp8→fp32/16 (H100 family)          16 (for k=32)     2       12                     As above
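The $W_a$ values for the fp16/bf16→fp32 rows follow directly from the formula above with $p_{\text{out}} = 23$; a quick arithmetic check:

```python
def accumulator_width(p_out, n_eab):
    """W_a = 2 (integer bits) + p_out + n_eab, per the formula above."""
    return 2 + p_out + n_eab

# fp16/bf16 -> fp32 paths: p_out = 23, n_eab per generation
widths = {
    "V100": accumulator_width(23, 0),        # 25 bits
    "A100": accumulator_width(23, 1),        # 26 bits
    "H100/B200": accumulator_width(23, 2),   # 27 bits
}
```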

A plausible implication is that tuning the $n_{eab}$ parameter and block sizes in MATLAB Tensor Core v0.1 enables developers to exactly match the rounding and normalization quirks of each physical hardware unit (Khattak et al., 7 Dec 2025).

5. Special-Value and Exception Handling

Tensor Core v0.1 correctly models IEEE-754 special-value cases:

  • Subnormals: Supported as both inputs and intermediate results; there is no flush-to-zero.
  • NaN and Infinity: Any operation involving NaN yields NaN; operations such as $\pm\infty \times 0$ produce NaN. Summing infinities of opposite sign yields NaN.
  • Signed Zeros: Maintained through alignment and rounding according to IEEE-754; positive and negative zeros are preserved.

All hardware models conform to these rules for every supported format and block configuration (Khattak et al., 7 Dec 2025).
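Because standard IEEE-754 doubles obey the same special-value rules, plain Python floats can serve as a reference for the behavior listed above:

```python
import math

# inf * 0 is an invalid operation and yields NaN
assert math.isnan(math.inf * 0.0)
# infinities of opposite sign cancel to NaN, not zero
assert math.isnan(math.inf + (-math.inf))
# NaN propagates through multiply-add chains
assert math.isnan(math.nan * 2.0 + 1.0)
# signed zeros are preserved; copysign exposes the sign of -0.0
assert math.copysign(1.0, -0.0) == -1.0
```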

6. MATLAB Reference Implementation and Pseudocode

A generic MMA operation over a single block can be implemented using only basic integer, floating-point, and bit-manipulation primitives in MATLAB. The core functions are:

  • decode_fp (bit extraction)
  • round_fixed (fixed-point mantissa rounding)
  • normalize_integer (normalization)
  • encode_fp (packing bit-fields)
  • two_op_align_and_add (final two-term FMA logic)
  • cast_fp (final output packing)

Sample pseudocode for one block's MMA operation, parameterized for all core hardware models, is provided in the design specification. This implementation supports both early and late $c$-addition, all standard rounding modes, and the interleaving/fused patterns observed on H100/B200. The block-based FMA is internally replicated by GEMM.m to realize full matrix multiplication (Khattak et al., 7 Dec 2025).
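The relationship between the single-block FMA and GEMM.m can be pictured as a tiling loop. The Python sketch below is an assumption about that structure (function and variable names are illustrative, not the emulator's API), chaining the accumulator through consecutive blocks with early $c$-placement:

```python
def gemm_blockwise(A, B, C, block_fma, n_block):
    """D = A*B + C, built by repeating a single-block FMA.

    For each output element, the k products are split into ceil(k / n_block)
    blocks; block_fma(a_blk, b_blk, acc) models one hardware MMA block and
    returns the updated accumulator ('c_early' chaining). Sketch only."""
    m, k, n = len(A), len(A[0]), len(B[0])
    D = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = C[i][j]                     # early c-placement
            for s in range(0, k, n_block):
                a_blk = A[i][s:s + n_block]
                b_blk = [B[t][j] for t in range(s, min(s + n_block, k))]
                acc = block_fma(a_blk, b_blk, acc)
            D[i][j] = acc
    return D
```

An exact stand-in such as `block_fma = lambda a, b, c: c + sum(x * y for x, y in zip(a, b))` reproduces ordinary GEMM; substituting a rounding-faithful block model yields the hardware-accurate result.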

7. Hardware Quirks and IEEE-754 Deviations

The emulator preserves several key deviations of NVIDIA Tensor Core hardware from strict IEEE-754 FMAs:

  • Denormalized Accumulation: Partial products are not individually normalized; only the largest exponent governs alignment within each block. Smaller-exponent terms are truncated rather than sticky-rounded, which can leave internal sums denormalized.
  • Late vs. Early $c$-Addition: On V100/A100, the timing of the addend $c$ significantly affects rounding; both modes are replicated.
  • Interleaved Blocks for fp8 on H100/B200: With $k=32$, products are grouped into two interleaved blocks of 16, each summed independently and combined at the end.
  • Final Rounding Anomalies: Some devices use truncation consistently except for the final fp16 output (RNE); others switch rounding styles depending on the output format or writeback step.
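One plausible reading of the fp8 interleaving can be sketched as follows; the even/odd split is an assumption for illustration, since the source states only that the 32 products form two interleaved blocks of 16, summed independently and combined at the end:

```python
def interleaved_fp8_sum(prods):
    """Sum 32 products as two interleaved 16-term blocks, then combine.

    The even/odd interleave pattern is a hypothetical illustration; in the
    emulator, each block sum would additionally pass through alignment,
    fixed-point addition, and block rounding as described above."""
    assert len(prods) == 32
    block0 = sum(prods[0::2])   # indices 0, 2, ..., 30
    block1 = sum(prods[1::2])   # indices 1, 3, ..., 31
    return block0 + block1
```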

All anomalies are encoded in the parameter struct:

params = struct( ...
  'N', <block size>, ...
  'n_eab', <extra align bits>, ...
  'fr_mode', 'rtz' | 'rne', ...
  'fr_mode_final','rtz'|'rne', ...
  'c_placement','early'|'late', ...
  'inter_pattern', true|false ...
);
Correct reproduction of hardware behavior, including denormalization and rounding bugs, has been empirically verified using $10^5$ random and pathological (Inf/NaN/subnormal) input vectors (Khattak et al., 7 Dec 2025).
