
MXINT8 Microscaling Integer Format

Updated 17 February 2026
  • MXINT8 is a blockwise quantization integer format that uses per-block adaptive power-of-two scaling to achieve high memory efficiency and strong empirical fidelity.
  • It supports diverse neural architectures, including large language models, vision transformers, FFT pipelines, and robotics, with minimal accuracy drop compared to higher precision formats.
  • Adopted in leading hardware accelerators, MXINT8 optimizes multiply-accumulate operations and memory layout, facilitating significant energy and area savings in deep learning systems.

MXINT8, or “Microscaling Integer 8-bit,” is a blockwise quantization data format that encodes tensors for efficient deep learning training and inference. By combining INT8 precision with per-block adaptive power-of-two scaling, MXINT8 achieves high memory/compute efficiency, wide dynamic range, and strong empirical fidelity across a diverse set of neural architectures including LLMs, vision transformers, FFT pipelines, and robotics learners. MXINT8 is now supported in leading hardware accelerator architectures (e.g., Nvidia Blackwell, SNAX, TriGen, MASE, OPAL) and is the principal integer-based member of the Microscaling format family, complementing floating-point variants like MXFP8 and MXFP4.

1. Core Format Definition and Quantization Procedure

In MXINT8, each contiguous block of $k$ elements (typically $k = 32$, though larger groupings such as 64 or 128 appear) shares a single block scale: a power-of-two factor encoded in the 8-bit E8M0 format (8-bit exponent, no mantissa). Each tensor element in the block is stored as a signed INT8 code $q_i \in [-127, 127]$, with the exact range depending on the variant (some use $[-128, 127]$, but most recent works adopt symmetric clipping for unbiased gradients).

Quantization of a real-valued block $\{x_i\}$ proceeds by

$$\text{shared\_exp} = \left\lceil \log_2 \frac{\max_i |x_i|}{127} \right\rceil, \qquad s = 2^{\text{shared\_exp}},$$

$$q_i = \mathrm{clip}(\mathrm{round}(x_i / s),\, -127,\, 127).$$

Recovery (dequantization) in downstream ops is

$$\hat{x}_i = q_i \cdot s.$$

The block scale $s$ is recorded as an E8M0 exponent or integer offset per block, incurring negligible storage overhead (one extra byte per 32-byte block, i.e. $1/32 \approx 3.1\%$).

This format adapts the representable dynamic range per block, aligning the full INT8 code space to local block maxima; it is conceptually a “block-floating-point” integer format (Rouhani et al., 2023, Chen et al., 29 Oct 2025).
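As a toy numeric illustration (values chosen here so the round trip happens to be exact; they are not drawn from the cited papers), the procedure above can be traced on a four-element block:

```python
import math

def quantize_block(block):
    # Shared power-of-two scale derived from the block maximum.
    vmax = max(abs(x) for x in block)
    shared_exp = math.ceil(math.log2(vmax / 127))
    scale = 2.0 ** shared_exp
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return q, scale

q, scale = quantize_block([0.5, -1.0, 0.25, 3.0])
# shared_exp = ceil(log2(3.0 / 127)) = -5, so scale = 2**-5 = 0.03125
# q = [16, -32, 8, 96]; here dequantization q * scale recovers the inputs exactly
```

The inputs were chosen as exact multiples of the resulting scale; arbitrary values would incur rounding error of at most half a step.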

2. Numerical Characteristics and Dynamic Range

MXINT8 achieves markedly higher numerical dynamic range per block compared to per-tensor INT8 or fixed-point quantization. For each block,

  • Dynamic range: $[-127 \cdot s,\ 127 \cdot s]$.
  • Step size (resolution): $s$; the minimum representable nonzero magnitude is $s$.
  • Exponent span: with the 8-bit scale, each block can adapt across a span comparable to that of FP32 ($2^{-126} \dots 2^{127}$).

Compared with standard INT8 formats (global or per-channel scales), MXINT8’s per-block scaling minimizes quantization error on tensors with high intra-tensor dynamic range. In practice, this approach achieves sub-0.5% drop in accuracy relative to FP32/BF16 across LLMs (e.g., LLaMA-7B: MXINT8 PPL 5.68 vs. FP16 5.67) and vision models (DeiT-Base MXINT8 Top-1 81.84% vs. FP32 81.80%) (Chen et al., 29 Oct 2025, Xiao et al., 28 May 2025, Sharify et al., 2024, Rouhani et al., 2023).
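The benefit of per-block over per-tensor scaling can be sketched directly. The snippet below is illustrative, with synthetic data and our own helper names (`p2_scale`, `quant_error`): it quantizes one block of values near 1.0 and one near 0.001, first with a single shared scale and then with per-block scales.

```python
import math

def p2_scale(vals):
    # Power-of-two scale aligning the max magnitude to the INT8 code 127.
    return 2.0 ** math.ceil(math.log2(max(abs(v) for v in vals) / 127))

def quant_error(vals, scale):
    # Mean squared error after symmetric INT8 quantize/dequantize at `scale`.
    deq = [max(-127, min(127, round(v / scale))) * scale for v in vals]
    return sum((v - d) ** 2 for v, d in zip(vals, deq)) / len(vals)

big = [1.0 + 0.01 * i for i in range(32)]      # block with values near 1.0
small = [0.001 * (i - 16) for i in range(32)]  # block with values near 0.001

per_tensor = quant_error(big + small, p2_scale(big + small))
per_block = (quant_error(big, p2_scale(big))
             + quant_error(small, p2_scale(small))) / 2
```

Under the global scale, the small block collapses onto a handful of codes near zero; with its own scale it keeps the full INT8 code space, so `per_block` comes out well below `per_tensor`.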

Theoretical quantization SNR scaling for a block of size $g$ and crest factor $\kappa$ is

$$\mathrm{QSNR}_{\mathrm{MXINT8}} \approx 4.78 + 6.02 \cdot 8 - 20\log_{10}(\rho) - 20\log_{10}(\kappa),$$

where $\rho \in [1, 2)$ is the scale overhead factor; in practice, empirical SNR consistently exceeds that of MXFP8 when the crest factor $\kappa < 7.55$ (true for most LLM/ViT blocks) (Chen et al., 29 Oct 2025).
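The QSNR expression can be wrapped as a small helper (a sketch; `qsnr_mxint8_db` is our own name, with the 8-bit term left as a parameter):

```python
import math

def qsnr_mxint8_db(rho, kappa, bits=8):
    # Theoretical QSNR (dB): rho is the scale-overhead factor in [1, 2),
    # kappa the block crest factor (peak-to-RMS ratio).
    return 4.78 + 6.02 * bits - 20 * math.log10(rho) - 20 * math.log10(kappa)

# With rho = kappa = 1 the estimate reduces to 4.78 + 6.02 * 8 = 52.94 dB;
# larger crest factors lower it by 20*log10(kappa).
```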

3. Algorithmic Implementation and Hardware Mapping

Quantization/Dequantization

A canonical high-level pseudocode for quantization:

from math import ceil, log2

def mxint8_quantize(block):
    # Shared power-of-two scale chosen so the block maximum maps into [-127, 127].
    vmax = max(abs(x) for x in block)
    shared_exp = ceil(log2(vmax / 127)) if vmax > 0 else 0
    scale = 2.0 ** shared_exp
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return q, shared_exp
Dequantization is the reverse: $x \approx q \cdot 2^{\text{shared\_exp}}$.
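A matching dequantization routine might look like the following (a sketch; the name `mxint8_dequantize` is ours):

```python
def mxint8_dequantize(q, shared_exp):
    # Multiply each INT8 code by the shared power-of-two scale.
    scale = 2.0 ** shared_exp
    return [qi * scale for qi in q]
```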

Block Structure and Memory Layout

Each block of $k$ values (typically $k = 32$) comprises:

  • $k \times 8$ bits of INT8 mantissas,
  • 8 bits per block for the shared exponent (scale),
  • optionally, a metadata field (see below).

Scales are packed either at the start of each block, in separate arrays, or per hardware alignment requirements for coalesced memory access. MXINT8 naturally aligns with 256-bit SIMD lanes (32 × 8 bits) and vectorized dot-product units (Rouhani et al., 2023, Xiao et al., 28 May 2025, Cuyckens et al., 9 Nov 2025).
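One plausible packed layout, assuming a leading E8M0 scale byte (bias 127) followed by the 32 INT8 codes, can be sketched with `struct`; actual accelerator layouts differ as noted above:

```python
import struct

def pack_mxint8_block(q, shared_exp, bias=127):
    # Assumed layout: one biased-exponent scale byte, then the 32 INT8 codes.
    assert len(q) == 32
    return struct.pack("B32b", shared_exp + bias, *q)

blob = pack_mxint8_block(list(range(-16, 16)), -5)
# 33 bytes per block: 8 scale bits + 32 x 8 mantissa bits (~3.1% overhead)
```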

Table: Representative Physical Layouts

| Format Variant | Elements per Block | Exponent Bits | Mantissa Bits/Element | Metadata |
|---|---|---|---|---|
| MXINT8 (standard) | 32 | 8 | 8 | None |
| MXINT8 (metadata-augmented) | 32 | 8 | 8 | 24 bits/group |
| MXINT8 (square-group) | 64 | 8 | 8 | None |

(Rouhani et al., 2023, Hu et al., 27 Jan 2026, Cuyckens et al., 28 May 2025)

4. Hardware Pipelines and Accelerator Design

MXINT8 is engineered for maximal hardware efficiency:

Metadata-Augmented Variants

Recent research augments MXINT8 with lightweight per-block metadata (e.g., 3 bits per subgroup for extra mantissa precision), gaining up to 60% accuracy restoration over plain 4-bit MXFP4 at minimal (<5%) area/power cost (Hu et al., 27 Jan 2026).

5. Empirical Performance and Use Cases

Across large-scale LLMs, ViTs, and edge robotics:

  • LLMs: No accuracy cliff as seen in uniform INT8; PPL within 1% of FP16. Hadamard-rotated MXINT8 further improves SNR where crest factor is high (Chen et al., 29 Oct 2025, Sharify et al., 2024).
  • ViTs: MXINT8 enables full-model mapping, including Softmax/LayerNorm, yielding <1% Top-1 drop and 93–1024× speedup versus FP16 flows (Xiao et al., 28 May 2025).
  • Robotics & continual learning: Area/memory halved and 4× throughput at negligible accuracy cost on control/reinforcement tasks (Cuyckens et al., 28 May 2025).
  • FFT Pipelines: End-to-end normalized MSE scales as $\text{NMSE} \approx L\,\epsilon_m^2$, with quantization error set by the block mantissa width (worst case $\epsilon_m = 1/(2 M_{\max}) \approx 0.004$ for $M_{\max} = 127$) (Deveshwar et al., 3 Dec 2025).
  • Post-Training Quantization (PTQ): Works synergistically with SmoothQuant, GPTQ, AWQ; MXINT8 alone achieves near-baseline perplexity (Sharify et al., 2024).
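The FFT error model in the list above can be evaluated numerically; a minimal sketch, assuming $L$ independent pipeline stages (function name ours):

```python
def fft_nmse_estimate(num_stages, m_max=127):
    # Worst-case per-stage relative step: eps_m = 1 / (2 * M_max), ~0.004 for
    # M_max = 127; end-to-end NMSE then grows roughly linearly with depth L.
    eps_m = 1.0 / (2 * m_max)
    return num_stages * eps_m ** 2
```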

6. Limitations, Accuracy-Performance Trade-offs, and Hybrid Remedies

Quantization bias and instability: In full-precision LLM training, pure blockwise quantization of all weights/activations and LayerNorm parameters can induce instabilities (“loss spikes”) due to catastrophic clamping of clustered LayerNorm $\gamma$ parameters (Su et al., 25 Jun 2025). Symmetric clipping (the [−127, 127] code range) is essential to eliminate gradient bias in STE-based training (Chen et al., 29 Oct 2025).

Block size trade-offs: Larger blocks reduce exponent overhead but expose the format to skew-induced coarse quantization error. Empirically, $k = 32$ is a standard choice, balancing metadata cost and quantization noise. For FFT and ViT hardware, $k$ scales with tile sizes and operator-specific SNR targets (Deveshwar et al., 3 Dec 2025, Xiao et al., 28 May 2025).

Instability mitigation: Two robust hybrid schemes restore BF16-equivalent scaling laws and stability:

  1. Quantize only weights to MXINT8; keep activations/LayerNorm in BF16.
  2. Quantize only the forward GEMMs/matmul ops to MXINT8; accumulation/grads in higher precision (Su et al., 25 Jun 2025).
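Scheme 1 (weight-only quantization) is often realized as fake quantization: weights pass through an MXINT8 quantize-dequantize round trip while activations stay in BF16. A minimal sketch under that assumption (name and structure ours):

```python
import math

def fake_quant_mxint8(weights, k=32):
    # Quantize-dequantize weights blockwise; the caller keeps activations,
    # LayerNorm parameters, and gradients in higher precision.
    out = []
    for i in range(0, len(weights), k):
        block = weights[i:i + k]
        vmax = max(abs(x) for x in block)
        if vmax == 0:
            out.extend(block)
            continue
        scale = 2.0 ** math.ceil(math.log2(vmax / 127))
        out.extend(max(-127, min(127, round(x / scale))) * scale for x in block)
    return out
```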

Outlier handling: MXINT8 with outlier-preserved extensions (e.g., OPAL) stores a fixed number of largest activations per block in BF16, preserving accuracy under high skew with minor area/latency overhead (Koo et al., 2024).
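The outlier-preserving idea can be sketched as follows; the details here (outlier count, zeroing outlier slots before rescaling) are illustrative assumptions, not OPAL's exact scheme:

```python
import math

def split_outliers(block, n_outliers=2):
    # Keep the n largest-magnitude values in high precision; quantize the rest
    # to MXINT8 with a scale set by the remaining (non-outlier) maximum.
    idx = sorted(range(len(block)), key=lambda i: -abs(block[i]))[:n_outliers]
    outliers = {i: block[i] for i in idx}
    rest = [0.0 if i in idx else x for i, x in enumerate(block)]
    vmax = max(abs(x) for x in rest) or 1.0
    scale = 2.0 ** math.ceil(math.log2(vmax / 127))
    q = [max(-127, min(127, round(x / scale))) for x in rest]
    return q, scale, outliers
```

Removing the two large entries lets the shared scale track the bulk of the block, so the remaining values keep fine resolution.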

7. Extensions and Recent Developments

  • Metadata augmentation: Incorporating subgroup extra-mantissa (Sg-EM) and element-level (Elem-EM) correction increases effective precision, with typical effective bit width (EBW) rising to 9 bits/element (vs. 8) and accuracy approaching bfloat16 (Hu et al., 27 Jan 2026).
  • Rotation-based outlier mitigation: Pre-block Hadamard rotation spreads energy, enhances SNR, and enables even 4-bit integer MX formats (NVINT4) to outperform floating-point at matched block sizes (Chen et al., 29 Oct 2025).
  • End-to-end hardware datapath optimization: Full-FPGA/NPU accelerator designs now systematically pipeline all attention and normalization ops in MXINT8, eliminating CPU fallback and maximizing throughput on edge/embedded deployments (Xiao et al., 28 May 2025, Lee et al., 13 Feb 2026).
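The rotation idea can be illustrated with a plain Sylvester-construction Hadamard transform (a sketch; the cited works additionally use randomized sign flips and fused kernels):

```python
import math

def hadamard_rotate(x):
    # In-place butterfly Hadamard transform (len(x) must be a power of two),
    # normalized so the rotation is orthonormal; a spike in one element is
    # spread evenly across the block, lowering the crest factor.
    n = len(x)
    h = list(x)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            for j in range(i, i + step):
                a, b = h[j], h[j + step]
                h[j], h[j + step] = a + b, a - b
        step *= 2
    return [v / math.sqrt(n) for v in h]
```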

MXINT8, by fusing block-shared power-of-two scaling, symmetric integer representation, and highly efficient hardware mapping, defines a new standard for low-bit quantization in both inference and training. Its practical tractability, theoretical transparency, and universal adoption across industrial and academic accelerator designs have established it as the leading integer-centric microscaling format for scalable deep neural computation (Rouhani et al., 2023, Su et al., 25 Jun 2025, Chen et al., 29 Oct 2025, Sharify et al., 2024, Cuyckens et al., 28 May 2025, Cuyckens et al., 9 Nov 2025, Deveshwar et al., 3 Dec 2025, Hu et al., 27 Jan 2026, Lee et al., 13 Feb 2026, Xiao et al., 28 May 2025, Cheng et al., 2023, Gorodecky et al., 2024).
