MXINT8 Microscaling Integer Format
- MXINT8 is a blockwise quantization integer format that uses per-block adaptive power-of-two scaling to achieve high memory efficiency and strong empirical fidelity.
- It supports diverse neural architectures, including large language models, vision transformers, FFT pipelines, and robotics, with minimal accuracy drop compared to higher precision formats.
- Adopted in leading hardware accelerators, MXINT8 optimizes multiply-accumulate operations and memory layout, facilitating significant energy and area savings in deep learning systems.
MXINT8, or “Microscaling Integer 8-bit,” is a blockwise quantization data format that encodes tensors for efficient deep learning training and inference. By combining INT8 precision with per-block adaptive power-of-two scaling, MXINT8 achieves high memory/compute efficiency, wide dynamic range, and strong empirical fidelity across a diverse set of neural architectures including LLMs, vision transformers, FFT pipelines, and robotics learners. MXINT8 is now supported in leading hardware accelerator architectures (e.g., Nvidia Blackwell, SNAX, TriGen, MASE, OPAL) and is the principal integer-based member of the Microscaling format family, complementing floating-point variants like MXFP8 and MXFP4.
1. Core Format Definition and Quantization Procedure
In MXINT8, each contiguous block of $k$ elements (typically $k = 32$, though larger groupings such as 64 or 128 appear) shares a single block scale—a power-of-two factor encoded in the 8-bit E8M0 format (8-bit exponent, no mantissa). Each tensor element in the block is stored as a signed INT8 code $q_i$, with exact range depending on the variant (some use $[-128, 127]$, but most recent works adopt symmetric clipping to $[-127, 127]$ for unbiased gradients).
Quantization of a real-valued block $\{x_i\}_{i=1}^{k}$ proceeds by

$$e = \left\lceil \log_2\!\frac{\max_i |x_i|}{127} \right\rceil, \qquad q_i = \operatorname{clip}\!\left(\operatorname{round}\!\left(x_i / 2^{e}\right),\, -127,\, 127\right).$$

Recovery (dequantization) in downstream ops is

$$\hat{x}_i = 2^{e}\, q_i.$$
The block scale is recorded as an E8M0 exponent or integer offset per block, incurring a negligible storage overhead (e.g., 1/32 = 3.1% extra).
This format adapts the representable dynamic range per block, aligning the full INT8 code space to local block maxima; it is conceptually a “block-floating-point” integer format (Rouhani et al., 2023, Chen et al., 29 Oct 2025).
2. Numerical Characteristics and Dynamic Range
MXINT8 achieves markedly higher numerical dynamic range per block compared to per-tensor INT8 or fixed-point quantization. For each block,
- Dynamic range: representable values within a block span $[-127 \cdot 2^{e},\ 127 \cdot 2^{e}]$, where $e$ is the shared block exponent.
- Step size (resolution): $\Delta = 2^{e} \approx \max_i |x_i| / 127$; the minimum representable nonzero magnitude is $2^{e}$.
- Exponent span: With the 8-bit E8M0 scale, each block can place its range anywhere from $2^{-127}$ to $2^{127}$, a span exceeding that of FP32 normal numbers.
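To make the per-block adaptation concrete, a small sketch (pure Python, with an illustrative block size of 4 rather than 32) showing how two blocks of the same tensor receive very different shared exponents:

```python
import math

def shared_exponent(block):
    # Smallest power-of-two scale whose 127-step grid covers the block maximum
    vmax = max(abs(x) for x in block)
    return math.ceil(math.log2(vmax / 127)) if vmax > 0 else 0

# Two blocks of the same tensor with very different local magnitudes
# (block size 4 for readability; standard MXINT8 uses k = 32):
big = [400.0, -37.0, 12.5, 250.0]
small = [0.003, -0.0021, 0.0007, 0.0011]

e_big, e_small = shared_exponent(big), shared_exponent(small)
# Each block's step size 2**e adapts to its own maximum, so the
# small-magnitude block keeps fine resolution instead of collapsing to zero.
```

A single per-tensor scale would have to cover the large block's maximum, quantizing the small block's values to zero; per-block exponents avoid this.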
Compared with standard INT8 formats (global or per-channel scales), MXINT8’s per-block scaling minimizes quantization error on tensors with high intra-tensor dynamic range. In practice, this approach achieves sub-0.5% drop in accuracy relative to FP32/BF16 across LLMs (e.g., LLaMA-7B: MXINT8 PPL 5.68 vs. FP16 5.67) and vision models (DeiT-Base MXINT8 Top-1 81.84% vs. FP32 81.80%) (Chen et al., 29 Oct 2025, Xiao et al., 28 May 2025, Sharify et al., 2024, Rouhani et al., 2023).
Theoretical quantization SNR for a block of size $k$ with crest factor $\kappa = \max_i |x_i| / x_{\mathrm{rms}}$ follows the standard uniform-quantizer scaling

$$\mathrm{SNR}_{\mathrm{dB}} \approx 6.02\,b + 4.77 - 20 \log_{10} \kappa - \delta,$$

where $b = 8$ is the element bit width and $\delta$ accounts for scale overhead; in practice, empirical SNR consistently outperforms MXFP8 when the crest factor is moderate (true for most LLM/ViT blocks) (Chen et al., 29 Oct 2025).
3. Algorithmic Implementation and Hardware Mapping
Quantization/Dequantization
A canonical high-level pseudocode for quantization:
```python
import math

def mxint8_quantize(block):
    vmax = max(abs(x) for x in block)
    # Smallest power-of-two scale whose 127-step grid covers the block maximum
    shared_exp = math.ceil(math.log2(vmax / 127)) if vmax > 0 else 0
    scale = 2.0 ** shared_exp
    q = [min(max(round(x / scale), -127), 127) for x in block]
    return q, shared_exp
```
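A matching dequantizer and round-trip check (a sketch; `mxint8_dequantize` is an illustrative name, not an API from the cited works):

```python
import math

def mxint8_quantize(block):
    """Quantize a block to signed INT8 codes plus a shared power-of-two exponent."""
    vmax = max(abs(x) for x in block)
    shared_exp = math.ceil(math.log2(vmax / 127)) if vmax > 0 else 0
    scale = 2.0 ** shared_exp
    q = [min(max(round(x / scale), -127), 127) for x in block]
    return q, shared_exp

def mxint8_dequantize(q, shared_exp):
    """Recover approximate real values: x_hat = q * 2**e."""
    scale = 2.0 ** shared_exp
    return [qi * scale for qi in q]

block = [0.5, -3.2, 0.01, 7.9]
q, e = mxint8_quantize(block)
xhat = mxint8_dequantize(q, e)
# Per-element round-trip error is bounded by half a step, 2**(e - 1)
```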
Block Structure and Memory Layout
Blocks of k = 32 values comprise:
- k × 8-bit INT8 mantissas,
- 8 bits per block for the shared exponent (scale),
- optionally, a metadata field (see below).
Scales are packed either at the start of each block, in separate arrays, or as dictated by hardware alignment requirements for coalesced memory access. MXINT8 naturally aligns with 256-bit SIMD lanes (32 × 8 bits) and vectorized dot-product units (Rouhani et al., 2023, Xiao et al., 28 May 2025, Cuyckens et al., 9 Nov 2025).
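One plausible byte-level packing, assuming scales live in a separate array and E8M0 is encoded as the biased exponent $e + 127$ (assumptions for illustration; real accelerators may interleave scales differently):

```python
import math

K = 32  # block size

def pack_mxint8(tensor):
    """Pack a flat list (length a multiple of K) into INT8 codes plus E8M0 scale bytes."""
    codes, scales = bytearray(), bytearray()
    for i in range(0, len(tensor), K):
        block = tensor[i:i + K]
        vmax = max(abs(x) for x in block)
        e = math.ceil(math.log2(vmax / 127)) if vmax > 0 else 0
        scale = 2.0 ** e
        for x in block:
            q = min(max(round(x / scale), -127), 127)
            codes.append(q & 0xFF)  # store the signed code as a raw byte
        scales.append(e + 127)      # E8M0: biased power-of-two exponent
    return bytes(codes), bytes(scales)

tensor = [float(i - 16) for i in range(64)]  # two blocks of 32
codes, scales = pack_mxint8(tensor)
# Overhead: 1 scale byte per 32 element bytes = 1/32, about 3.1%
```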
Table: Representative Physical Layouts
| Format Variant | # Elements per Block | Exponent Bits | Mantissa Bits/Element | Metadata |
|---|---|---|---|---|
| MXINT8 (standard) | 32 | 8 | 8 | None |
| MXINT8 (metadata-augmented) | 32 | 8 | 8 | 24 bits/group |
| MXINT8 (square-group) | 64 | 8 | 8 | None |
(Rouhani et al., 2023, Hu et al., 27 Jan 2026, Cuyckens et al., 28 May 2025)
4. Hardware Pipelines and Accelerator Design
MXINT8 is engineered for maximal hardware efficiency:
- Multiply-accumulate datapath: Integer-only dot products with wide integer accumulation and exponent alignment only at block boundaries—no per-element normalization, no floating-point mantissa arithmetic (Cuyckens et al., 28 May 2025, Cuyckens et al., 9 Nov 2025, Xiao et al., 28 May 2025).
- Block exponent sharing: One scale per block; broadcast once per vector lane or SRAM tile.
- Precision-scalable MACs: Units support mixed INT8/FPx modes using sub-word parallelism, delivering 2–4× higher arithmetic throughput than FP32 (Cuyckens et al., 9 Nov 2025, Cuyckens et al., 28 May 2025).
- Energy and area: Multiplier energy reduced by up to 2× vs. FP32, with additional savings vs. uniform INT8. Area reductions of 25–50% observed in robotics and NPU implementations (Cuyckens et al., 28 May 2025, Cuyckens et al., 9 Nov 2025).
- System integration: MXINT8 is adopted in Blackwell (Nvidia), SNAX, TriGen, MASE, and OPAL; supports tile streaming/fused-transform in FPGAs/NPUs and direct mapping of all ViT/LLM kernels (Xiao et al., 28 May 2025, Lee et al., 13 Feb 2026, Cuyckens et al., 28 May 2025, Koo et al., 2024, Cheng et al., 2023).
Metadata-Augmented Variants
Recent research augments MXINT8 with lightweight per-block metadata (e.g., 3 bits per subgroup for extra mantissa precision), gaining up to 60% accuracy restoration over plain 4-bit MXFP4 at minimal (<5%) area/power cost (Hu et al., 27 Jan 2026).
5. Empirical Performance and Use Cases
Across large-scale LLMs, ViTs, and edge robotics:
- LLMs: No accuracy cliff as seen in uniform INT8; PPL within 1% of FP16. Hadamard-rotated MXINT8 further improves SNR where crest factor is high (Chen et al., 29 Oct 2025, Sharify et al., 2024).
- ViTs: MXINT8 enables full-model mapping, including Softmax/LayerNorm, yielding <1% Top-1 drop and 93–1024× speedup versus FP16 flows (Xiao et al., 28 May 2025).
- Robotics & continual learning: Area/memory halved and 4× throughput at negligible accuracy cost on control/reinforcement tasks (Cuyckens et al., 28 May 2025).
- FFT pipelines: End-to-end normalized MSE scales predictably with transform size, with quantization error set by the block mantissa width (Deveshwar et al., 3 Dec 2025).
- Post-Training Quantization (PTQ): Works synergistically with SmoothQuant, GPTQ, AWQ; MXINT8 alone achieves near-baseline perplexity (Sharify et al., 2024).
6. Limitations, Accuracy-Performance Trade-offs, and Hybrid Remedies
Quantization bias and instability: In fully quantized LLM training, blockwise quantization of all weights/activations and LayerNorm parameters can induce instabilities (“loss spikes”) due to catastrophic clamping of clustered LayerNorm parameters (Su et al., 25 Jun 2025). Symmetric clipping (a $[-127, 127]$ code range) is essential to eliminate gradient bias in STE-based training (Chen et al., 29 Oct 2025).
Block size trade-offs: Larger blocks reduce exponent overhead but expose the format to skew-induced coarse quantization error. Empirically, $k = 32$ is a standard choice, balancing metadata cost and quantization noise. For FFT and ViT hardware, $k$ scales with tile sizes and operator-specific SNR targets (Deveshwar et al., 3 Dec 2025, Xiao et al., 28 May 2025).
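The skew sensitivity of larger blocks can be demonstrated directly; the sketch below (illustrative block sizes and data, not from the cited papers) quantizes a tensor containing one outlier at two block sizes and compares the error:

```python
import math

def block_quant_rmse(x, k):
    """RMSE of MXINT8-style blockwise quantization at block size k."""
    err2, n = 0.0, 0
    for i in range(0, len(x), k):
        block = x[i:i + k]
        vmax = max(abs(v) for v in block)
        e = math.ceil(math.log2(vmax / 127)) if vmax > 0 else 0
        s = 2.0 ** e
        for v in block:
            q = min(max(round(v / s), -127), 127)
            err2 += (v - q * s) ** 2
            n += 1
    return math.sqrt(err2 / n)

# Skewed tensor: one large spike amid small values
x = [0.01] * 63 + [100.0]
small_block_err = block_quant_rmse(x, 8)   # spike isolated in one block
large_block_err = block_quant_rmse(x, 64)  # spike coarsens all 64 values
```

With small blocks the spike only degrades its own block; with one large block the shared exponent is dominated by the spike and every small value is quantized coarsely.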
Instability mitigation: Two robust hybrid schemes restore BF16-equivalent scaling laws and stability:
- Quantize only weights to MXINT8; keep activations/LayerNorm in BF16.
- Quantize only the forward GEMMs/matmul ops to MXINT8; accumulation/grads in higher precision (Su et al., 25 Jun 2025).
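The first hybrid scheme can be sketched as a weight-only MXINT8 dot product, with activations left in full precision (function names are illustrative; a tiny block size k = 4 is used for readability):

```python
import math

def quantize_weights_mxint8(w, k=4):
    """Blockwise-quantize a weight vector; activations stay in full precision."""
    qs, es = [], []
    for i in range(0, len(w), k):
        block = w[i:i + k]
        vmax = max(abs(x) for x in block)
        e = math.ceil(math.log2(vmax / 127)) if vmax > 0 else 0
        s = 2.0 ** e
        qs.append([min(max(round(x / s), -127), 127) for x in block])
        es.append(e)
    return qs, es

def dot_weightonly(qs, es, act):
    """Dot product with MXINT8 weights and full-precision activations."""
    total, idx = 0.0, 0
    for q_block, e in zip(qs, es):
        s = 2.0 ** e
        # Integer codes are scaled once per block; activations are untouched
        total += s * sum(qi * act[idx + j] for j, qi in enumerate(q_block))
        idx += len(q_block)
    return total

w = [1.0, 2.0, 3.0, 4.0]
qs, es = quantize_weights_mxint8(w)
result = dot_weightonly(qs, es, [1.0, 1.0, 1.0, 1.0])
```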
Outlier handling: MXINT8 with outlier-preserved extensions (e.g., OPAL) stores a fixed number of largest activations per block in BF16, preserving accuracy under high skew with minor area/latency overhead (Koo et al., 2024).
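A minimal sketch of the outlier-preservation idea (not OPAL's actual implementation): the largest-magnitude elements per block are stored separately in full precision and excluded before choosing the shared exponent:

```python
import math

def quantize_with_outlier(block, n_outliers=1):
    """Keep the n largest-magnitude values exact; MXINT8-quantize the rest."""
    order = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
    outliers = {i: block[i] for i in order[:n_outliers]}
    rest = [0.0 if i in outliers else x for i, x in enumerate(block)]
    vmax = max(abs(x) for x in rest)
    e = math.ceil(math.log2(vmax / 127)) if vmax > 0 else 0
    s = 2.0 ** e
    q = [min(max(round(x / s), -127), 127) for x in rest]
    return q, e, outliers

def dequantize_with_outlier(q, e, outliers):
    s = 2.0 ** e
    return [outliers.get(i, qi * s) for i, qi in enumerate(q)]

# A block with one large outlier that would otherwise dominate the scale:
block = [0.1, -0.3, 120.0, 0.2]
q, e, out = quantize_with_outlier(block)
xhat = dequantize_with_outlier(q, e, out)
```

Without the exclusion, the 120.0 spike would force a step size near 1.0 and flush the small values to zero; with it, they retain fine resolution.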
7. Extensions and Recent Developments
- Metadata augmentation: Incorporating subgroup extra-mantissa (Sg-EM) and element-level (Elem-EM) correction increases effective precision, with typical effective bit width (EBW) rising to 9 bits/element (vs. 8) and accuracy approaching bfloat16 (Hu et al., 27 Jan 2026).
- Rotation-based outlier mitigation: Pre-block Hadamard rotation spreads energy, enhances SNR, and enables even 4-bit integer MX formats (NVINT4) to outperform floating-point at matched block sizes (Chen et al., 29 Oct 2025).
- End-to-end hardware datapath optimization: Full-FPGA/NPU accelerator designs now systematically pipeline all attention and normalization ops in MXINT8, eliminating CPU fallback and maximizing throughput on edge/embedded deployments (Xiao et al., 28 May 2025, Lee et al., 13 Feb 2026).
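The rotation-based mitigation above can be illustrated with a Sylvester-constructed Hadamard transform; rotating a spiky block spreads its energy evenly, lowering the crest factor before quantization (a sketch, not the cited method's exact pipeline):

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def rotate(block):
    """Apply the orthonormal Hadamard rotation H / sqrt(n) to a block."""
    n = len(block)
    H = hadamard(n)
    inv_sqrt = 1.0 / math.sqrt(n)
    return [inv_sqrt * sum(H[i][j] * block[j] for j in range(n)) for i in range(n)]

# A block dominated by one spike has a high crest factor...
block = [100.0, 0.0, 0.0, 0.0]
rot = rotate(block)
# ...after rotation the energy is spread evenly (crest factor 1),
# while the orthogonal transform preserves total energy.
```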
MXINT8, by fusing block-shared power-of-two scaling, symmetric integer representation, and highly efficient hardware mapping, defines a new standard for low-bit quantization in both inference and training. Its practical tractability, theoretical transparency, and universal adoption across industrial and academic accelerator designs have established it as the leading integer-centric microscaling format for scalable deep neural computation (Rouhani et al., 2023, Su et al., 25 Jun 2025, Chen et al., 29 Oct 2025, Sharify et al., 2024, Cuyckens et al., 28 May 2025, Cuyckens et al., 9 Nov 2025, Deveshwar et al., 3 Dec 2025, Hu et al., 27 Jan 2026, Lee et al., 13 Feb 2026, Xiao et al., 28 May 2025, Cheng et al., 2023, Gorodecky et al., 2024).