
MXINT8: Block-Based Integer Format

Updated 19 February 2026
  • MXINT8 is a fine-grained, block-based integer format that employs per-block scaling with a shared exponent and 8-bit per-element integers to optimize memory and computation.
  • The method achieves near-lossless inference and training accuracy with less than 0.3% deviation from FP32 while reducing memory usage by approximately 3.9×.
  • MXINT8 integrates seamlessly into integer dataflow pipelines, enabling efficient hardware implementations with up to 3.7× speedup and significant energy savings on edge devices.

MXINT8 is a fine-grained, block-based integer data format designed for high-efficiency deep learning inference and training. It belongs to the Microscaling (MX) family, characterized by per-block scaling, integer mantissas, and exponent sharing, providing a superior trade-off between algorithmic accuracy, hardware simplicity, dynamic range, and memory efficiency compared to conventional per-tensor quantization and narrow floating-point (FP) alternatives. MXINT8 underpins both state-of-the-art hardware implementation and algorithmic methods for low-bitwidth neural network representation on resource-constrained and accelerator platforms.

1. Definition and Data Representation

MXINT8 encodes blocks (typically 32 elements) of real values $x_i$ as pairs $(X, P_i)$:

  • $X$: a shared block scale, typically stored as an 8-bit exponent in E8M0 format (i.e., an exact power of two with no mantissa).
  • $P_i$: a signed 8-bit integer per element, $P_i \in [-127, 127]$ (using the symmetric range for training).
  • Reconstruction: $x_i = X \times P_i$.

Formally: $x_i = X \cdot P_i$ with $X = 2^{\mathrm{shared\_exp}}$, where $\mathrm{shared\_exp}$ is chosen so that $|x_j|/X$ fits in $[-127, 127]$ over the block (Rouhani et al., 2023; Chen et al., 29 Oct 2025).

Each block thus requires $8 + 8 \times 32 = 264$ bits for 32 values, an average of $\approx 8.25$ bits/value, achieving a $\sim 3.9\times$ memory reduction compared to FP32. The scaling enables full use of the INT8 dynamic range for every block, minimizing representational error for high-variance data.
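The storage arithmetic above can be checked with a short sketch (pure Python; block size and bit widths as stated in the text):

```python
# MXINT8 block storage: one shared 8-bit E8M0 exponent plus 32 signed
# 8-bit elements (constants as stated in the text).
BLOCK_SIZE = 32
SCALE_BITS = 8   # shared E8M0 exponent
ELEM_BITS = 8    # per-element signed integer

bits_per_block = SCALE_BITS + ELEM_BITS * BLOCK_SIZE  # 264 bits
bits_per_value = bits_per_block / BLOCK_SIZE          # 8.25 bits/value
fp32_reduction = 32 / bits_per_value                  # ~3.9x vs 32-bit floats

def reconstruct(shared_exp: int, p: int) -> float:
    """Reconstruct one element: x_i = 2**shared_exp * P_i."""
    return (2.0 ** shared_exp) * p
```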

2. Quantization, Dequantization, and Conversion Pipelines

Analytic Conversion (Direct-Cast)

The canonical quantization pipeline for float-to-MXINT8 conversion is as follows (Rouhani et al., 2023; Gorodecky et al., 2024; Chen et al., 29 Oct 2025):

  1. Shared Scale Selection: Compute $s = \max_j |x_j| / 127$ over each block of 32, round up to the nearest power of two, $s' = 2^{\lceil \log_2 s \rceil}$, and use $X = s'$ (E8M0 encoding).
  2. Quantization: $P_j = \mathrm{clip}(\mathrm{round}(x_j/X), -127, +127)$.
  3. Block Packing: Store $X$, then each $P_j$.

Dequantization is simply $x_j \approx X \times P_j$.

This approach requires no calibration or quantization-aware retraining and enables near-lossless inference for standard tasks (Rouhani et al., 2023, Chen et al., 29 Oct 2025).
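The three steps above can be sketched in pure Python (a minimal direct-cast implementation, assuming symmetric clipping to $[-127, 127]$; Python's built-in `round` implements round-to-nearest-even):

```python
import math

QMAX = 127  # symmetric INT8 range [-127, 127]

def quantize_block(xs):
    """Direct-cast a block of floats to MXINT8: (shared_exp, int list)."""
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 0, [0] * len(xs)
    # Step 1: smallest power of two >= amax / 127 (E8M0 scale).
    shared_exp = math.ceil(math.log2(amax / QMAX))
    scale = 2.0 ** shared_exp
    # Step 2: round-to-nearest-even, then clip to the symmetric range.
    qs = [max(-QMAX, min(QMAX, round(x / scale))) for x in xs]
    return shared_exp, qs

def dequantize_block(shared_exp, qs):
    """Step 3 inverse: x_j ~= 2**shared_exp * P_j."""
    scale = 2.0 ** shared_exp
    return [scale * q for q in qs]
```

Values that are powers of two within range round-trip exactly; for example, `quantize_block([1.0, -0.5, 0.25, 0.0])` yields shared exponent −6 with integers [64, −32, 16, 0], which dequantize back losslessly.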

Hardware Pipelines

Efficient hardware realizations use combinational datapaths:

  • Max-exponent finder: A comparator tree computes $E_{\max} = \max_i (E_{V_i})$.
  • Scale generation: Computes $X$ from $E_{\max}$ and handles NaN/Inf inputs.
  • Per-lane quantization: Each lane forms $P_i$ from the sign, local exponent, and mantissa bits with round-to-nearest-even (Gorodecky et al., 2024).

An FPGA implementation of this flow reaches 19.8M vectors/sec with purely combinational logic (no BRAM/DSP), requiring 1,614 LUTs for a 32-lane converter (Gorodecky et al., 2024).
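A behavioral sketch of this datapath (pure Python, not RTL; it models exponent extraction with `math.frexp` rather than bit slicing, and all names are illustrative):

```python
import math

def max_exponent(values):
    """Comparator-tree equivalent: largest binary exponent over all lanes."""
    exps = [math.frexp(v)[1] for v in values if v != 0.0]
    return max(exps) if exps else 0

def quantize_lanes(values):
    """Per-lane quantization against the shared scale derived from E_max."""
    # The shared exponent leaves 7 magnitude bits per signed 8-bit lane.
    shared_exp = max_exponent(values) - 7
    scale = 2.0 ** shared_exp
    # round() is round-half-to-even, matching the round-to-nearest-even mode.
    return shared_exp, [max(-127, min(127, round(v / scale))) for v in values]
```

Because `math.frexp` normalizes the mantissa to $[0.5, 1)$, any input magnitude divided by the shared scale lands strictly below 128, and the clip only fires on the rounding edge case.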

3. Training and Symmetric Clipping

MXINT8 supports both static and quantization-aware training (QAT), provided symmetric quantization is enforced (Chen et al., 29 Oct 2025):

  • Two's-complement asymmetry (INT8's $[-128, 127]$ range) introduces gradient bias; a symmetric clamp to $[-127, 127]$ eliminates this and ensures unbiased updates.
  • Six per-layer quantizations are typical in a GEMM: weights, activations, backward-activations, weights$^\top$, backward-weights, and activations$^\top$.
  • The straight-through estimator is used for gradients; accumulations remain in FP32 during training.

Near-lossless accuracy (within $0.1\%$) is achievable for both inference and training across a range of model scales and tasks, e.g., LLMs and vision models (Chen et al., 29 Oct 2025; Wu, 2020).
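The effect of the symmetric clamp can be illustrated with a minimal fake-quantization sketch (pure Python; the STE backward pass is omitted and names are illustrative):

```python
def fake_quant(x, scale, lo, hi):
    """Quantize-dequantize with the given integer clamp range [lo, hi]."""
    q = max(lo, min(hi, round(x / scale)))
    return q * scale

scale = 2.0 ** -6

# Symmetric clamp [-127, 127]: the quantizer is an odd function, so
# saturating positive and negative inputs are treated alike.
assert fake_quant(-3.0, scale, -127, 127) == -fake_quant(3.0, scale, -127, 127)

# Asymmetric two's-complement clamp [-128, 127]: saturation differs by sign
# (-3.0 reaches -128 while +3.0 stops at +127), a systematic negative bias
# that the symmetric clamp removes.
neg = fake_quant(-3.0, scale, -128, 127)
pos = fake_quant(3.0, scale, -128, 127)
assert neg != -pos
```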

4. Algorithmic Accuracy, Task Performance, and Comparisons

Empirical studies on >20 benchmarks (ImageNet, LLaMA, GPT-3, transformer tasks) show:

  • MXINT8 matches FP32 in direct-cast inference within a $0.1$–$0.3\%$ accuracy margin (Rouhani et al., 2023; Chen et al., 29 Oct 2025).
  • It wins over blockwise FP8 (E4M3, E5M2) for 8-bit block-32 configurations; at 4 bits, FP4 can be more robust without Hadamard rotation (Chen et al., 29 Oct 2025).
  • In training of Llama-style models, MXINT8, BF16, and MXFP8 track closely in loss and accuracy, with MXINT8 slightly outperforming on most tasks (Chen et al., 29 Oct 2025).
  • On edge hardware, INT8 pipelines (e.g., IntAttention) achieve up to $3.7\times$ speedup and $61\%$ energy savings with negligible accuracy drop, due to complete avoidance of dequantize–requantize overheads (Zhong et al., 26 Nov 2025).

Key comparative points:

| Format | Block Size | Accuracy (Δ vs. FP32) | Area/Energy rel. FP8 | Best Use Case |
|--------|------------|------------------------|----------------------|---------------|
| MXINT8 | 32 | <0.1–0.3% | 0.79× / 0.63× (Chen et al., 29 Oct 2025) | General, LLMs |
| MXFP8 | 32 | <0.3% | 1.0× / 1.0× | FP-dominated |
| NVINT4 | 16 | <0.4% (needs rotation) | — | Extreme low-bit |

MXINT8 has a fundamental advantage for moderate crest-factor distributions ($\kappa < 7.6$), typical of deep learning layers at block size 32, and suffers less from outlier-induced overflow than per-tensor INT8 (Chen et al., 29 Oct 2025).
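Assuming the crest factor $\kappa$ is defined as peak magnitude over RMS, the block-size argument can be checked numerically (illustrative data, not taken from the cited paper):

```python
import math

def crest_factor(block):
    """kappa = peak magnitude / RMS over one block."""
    peak = max(abs(x) for x in block)
    rms = math.sqrt(sum(x * x for x in block) / len(block))
    return peak / rms if rms else 0.0

# A smooth, roughly sinusoidal block has a low crest factor. Even a single
# extreme outlier in a 32-element block cannot push kappa above
# sqrt(32) ~ 5.66 (since rms >= peak / sqrt(N)), which stays under the 7.6
# threshold quoted above.
smooth = [math.sin(0.37 * i) for i in range(32)]
spiky = [0.01] * 31 + [10.0]
k_smooth = crest_factor(smooth)
k_spiky = crest_factor(spiky)
```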

5. Hardware Implementation and Architectural Integration

MXINT8's structure is well-suited for accelerator integration.

Compared to fixed-point and per-tensor INT8, MXINT8 avoids precision loss under high local dynamic range, with only $\sim 1.2\times$ area overhead versus INT8 and negligible energy increase (Cheng et al., 2023).

6. Integration with Integer Dataflow and Transformations

MXINT8 enables end-to-end integer execution. Recent developments extend integer dataflow throughout the transformer block:

  • Attention pipelines (such as IntAttention) keep all major matrix-multiplies, softmax surrogates (IndexSoftmax), and normalization in integer, with integer LUTs and normalization, supporting full plug-and-play deployment without retraining (Zhong et al., 26 Nov 2025).
  • Integer transformers employ integer-friendly nonlinearities (e.g., polynomial attention, L1 norm layernorm) and propagate scales directly, entirely within INT8/INT32 domains except on rare overflow (Lin et al., 2020).

This enables deployment of large models on edge devices, yielding $3$–$4\times$ speedups and $4\times$ model compression while retaining essentially all baseline accuracy.
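The dequantize–requantize avoidance rests on scale propagation through integer operators, sketched below for a single matmul (a simplified illustration, not the IntAttention pipeline itself):

```python
def int_matmul_with_scales(a_q, a_scale, b_q, b_scale):
    """INT8 x INT8 -> INT32 accumulate. The output scale is simply the
    product of the input scales, so no dequantize-requantize step is
    needed between consecutive integer operators."""
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    y_q = [[sum(a_q[i][k] * b_q[k][j] for k in range(inner))
            for j in range(cols)] for i in range(rows)]
    return y_q, a_scale * b_scale

# 2x2 example: real values are q * scale on the inputs and on the output.
a_q, a_s = [[64, -32], [16, 0]], 2.0 ** -6
b_q, b_s = [[1, 2], [3, 4]], 2.0 ** -3
y_q, y_s = int_matmul_with_scales(a_q, a_s, b_q, b_s)
```

The accumulator stays in INT32 while the scale `y_s` carries the real-valued magnitude forward to the next operator.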

7. Practical Programming, Inference Engines, and Compiler Orchestration

  • Programming mixed-precision MXINT8 inference on RISC-V or ARM CPUs leverages status-based SIMD instructions, allowing per-layer bitwidth selection from a status register, supporting run-time reconfigurability without ISA expansion (Ottavi et al., 2020).
  • Dataflow compilers (e.g., MASE) optimize per-tensor mantissa widths and MXINT8 block shapes at the compiler IR level, trading accuracy against area and throughput, with automated hardware RTL emission for MXINT8 operations (Cheng et al., 2023).
  • For inference libraries (e.g., CUDA MX library), quantization, block-wise dot-product, and dequantization are handled in optimized kernels, easily integrated as direct drop-ins to standard inference stacks (Rouhani et al., 2023).
