Microscaling (MX) Standard
- The Microscaling (MX) standard is a data representation method that uses block-shared exponents to enable efficient low-precision computation in deep learning.
- It employs block quantization, encoding groups of elements with a shared scale and using formats such as MXFP8, MXFP6, MXFP4, and MXINT8 to reduce memory footprint and compute cost.
- The standard underpins advanced hardware acceleration on FPGAs, GPUs, and NPUs and supports mixed-precision techniques to balance performance with minimal accuracy loss.
Microscaling (MX) is a widely adopted industry standard for efficient low-precision data representation and computation in deep learning, neural network inference, and training. It formalizes a block-floating-point approach where groups of elements (typically 32 or 64) share a high-precision exponent, while each element encodes its value using a compact integer or narrow floating-point representation. The MX standard, now integrated into multiple hardware and software stacks, enables significant reductions in memory footprint, memory bandwidth, and compute complexity across a range of platforms and applications.
1. Mathematical Foundations and Format Definition
MX formats encode contiguous blocks of elements with a single shared scale, typically a power of two stored in E8M0 format (8 exponent bits, no mantissa), and per-element codewords in either integer (MXINT) or floating-point (MXFP) formats. The concrete layouts, as specified in the OCP MX specification and adopted in leading research (Liu et al., 4 Aug 2025, Rouhani et al., 2023, Lee et al., 16 Oct 2025), are summarized below.
| Format | Elem. Width | Elem. Type | Exponent/Mantissa | Shared Scale | Range/Precision |
|---|---|---|---|---|---|
| MXFP8 | 8 bits | FP8 | E5M2 / E4M3 | E8M0 (8 bit) | Wide range (E5M2) / higher precision (E4M3) |
| MXFP6 | 6 bits | FP6 | E3M2 / E2M3 | E8M0 | Intermediate |
| MXFP4 | 4 bits | FP4 | E2M1 | E8M0 | Minimal range, coarse precision |
| MXINT8 | 8 bits | Integer | 2's comp. | E8M0 | −128…127 codes, uniform steps |
Each MX block stores k=32 elements (or k=64 in square/block-tiled variants), with one shared 8-bit scale per block.
Quantization and dequantization proceed as follows:
- Quantization: For a block $\{x_1, \dots, x_k\}$, compute the shared scale $s = 2^{\lfloor \log_2 \max_i |x_i| \rfloor - e_{\max}}$, where $e_{\max}$ is the largest exponent representable by the selected MXFP element format. Each element quantizes as $\hat{x}_i = \mathrm{Q}(x_i / s)$, rounding to the nearest representable element value and clamping to the format's range.
- Dequantization: $x_i \approx s\,\hat{x}_i$.
- Dot Product: For two blocks with scales $s_a$, $s_b$ and quantized elements $\hat{a}_i$, $\hat{b}_i$, the block-wise computation is $s_a s_b \sum_{i=1}^{k} \hat{a}_i \hat{b}_i$, with the elementwise sum accumulated at higher precision.
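This block arithmetic can be made concrete in a few lines of NumPy. The following is a minimal sketch (not the normative OCP reference implementation) using the FP4 E2M1 element grid; the function names are illustrative, and subnormal/NaN handling is omitted.

```python
import numpy as np

# Positive magnitudes representable by an FP4 (E2M1) element; the full
# codebook is symmetric around zero.
_FP4_E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_CODEBOOK = np.concatenate([-_FP4_E2M1_POS[:0:-1], _FP4_E2M1_POS])

def mx_quantize(block, codebook=FP4_CODEBOOK, emax=2):
    """Quantize one block to (shared power-of-two exponent, element values).

    Minimal sketch: the shared E8M0 scale places the block maximum near the
    top of the element range; rounding is nearest-value against the element
    codebook, which also clamps out-of-range values.
    """
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return 0, np.zeros_like(block)
    shared_exp = int(np.floor(np.log2(amax))) - emax      # E8M0 exponent
    scaled = block / (2.0 ** shared_exp)
    idx = np.abs(scaled[:, None] - codebook[None, :]).argmin(axis=1)
    return shared_exp, codebook[idx]

def mx_dequantize(shared_exp, q):
    """x_hat = 2**shared_exp * q."""
    return (2.0 ** shared_exp) * q

def mx_block_dot(exp_a, qa, exp_b, qb):
    """Block dot product: both scales factor out of the elementwise sum."""
    return (2.0 ** (exp_a + exp_b)) * np.dot(qa, qb)

# One k = 32 block, compared against the FP32 reference.
rng = np.random.default_rng(0)
a, b = rng.normal(size=32), rng.normal(size=32)
ea, qa = mx_quantize(a)
eb, qb = mx_quantize(b)
print("fp32:", np.dot(a, b), " mxfp4:", mx_block_dot(ea, qa, eb, qb))
```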
MXFP variants (E5M2, E4M3, etc.) allow designers to trade dynamic range vs. precision by allocating bit-widths for exponent/mantissa. Lower precision formats (e.g., MXFP4) yield higher density and performance; wider formats improve accuracy and robustness.
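To make the range/precision trade-off tangible, the snippet below compares the largest representable element magnitude and the code spacing near 1.0 for each element format. The numeric values follow the standard FP8/FP6/FP4 element definitions and are listed here for illustration.

```python
# Rough range/precision comparison of MX element formats. Values follow the
# standard FP8/FP6/FP4 element definitions (E4M3 reserves its top mantissa
# code for NaN, hence 448 rather than 480); treat as illustrative.
ELEMENT_FORMATS = {
    # name: (largest normal magnitude, mantissa bits)
    "FP8 E5M2": (57344.0, 2),
    "FP8 E4M3": (448.0, 3),
    "FP6 E3M2": (28.0, 2),
    "FP6 E2M3": (7.5, 3),
    "FP4 E2M1": (6.0, 1),
}

for name, (max_mag, mant_bits) in ELEMENT_FORMATS.items():
    rel_step = 2.0 ** (-mant_bits)   # spacing between adjacent codes near 1.0
    print(f"{name}: max |x| = {max_mag:>8}, relative step near 1.0 = {rel_step}")
```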
2. Quantization, Block Structure, and Thresholding
MX quantization is performed per block of k elements:
- Scale selection: Shared scale for each block is determined so that the block's largest magnitude fits the representable range after scaling.
- Element encoding: After scaling and rounding, each value is encoded as per the format table (above).
- Quantization thresholding and upcasting (Liu et al., 4 Aug 2025): To bound error, MicroMix derives per-format quantization thresholds; if an element's magnitude exceeds the threshold (so that its MXFP4 quantization error would be too large), the value is upcast to a higher precision (e.g., MXFP6 or MXFP8), enabling a mixed-precision flow per layer or channel.
Block size ($k$) is typically set to 32 for weights/activations for memory-bandwidth efficiency. For training or transposed access, square 8×8 ($k=64$) tiles may be used to facilitate both forward and backward data layouts without redundant storage (Cuyckens et al., 28 May 2025).
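As a hedged illustration of the square-tile layout: because one scale covers the whole 8×8 tile, the same stored payload serves both the forward (row-major) and backward (transposed) access patterns, so no second quantized copy is needed. The element encoding below is a uniform-rounding stand-in, not the exact MXFP grid.

```python
import numpy as np

def quantize_square_tile(tile, emax=2):
    """Quantize an 8x8 tile with a single shared power-of-two scale.

    Sketch only: integer rounding stands in for the real MXFP element
    codebook. One scale per square tile means the transposed view reuses
    the same quantized payload.
    """
    amax = max(np.max(np.abs(tile)), 1e-12)
    shared_exp = int(np.floor(np.log2(amax))) - emax
    q = np.round(tile / 2.0 ** shared_exp)       # stand-in element encoding
    return shared_exp, q

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8)).astype(np.float32)
e, q = quantize_square_tile(w)

w_fwd = (2.0 ** e) * q        # forward-pass (row-major) dequantization
w_bwd = (2.0 ** e) * q.T      # backward-pass / transposed access, same data
assert np.allclose(w_fwd.T, w_bwd)
```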
Calibration, as realized in MicroMix and MASE (Cheng et al., 2023), typically uses a small calibration set to compute per-channel statistics. Channels are then partitioned into MXFP4, MXFP6, and MXFP8 sets, ordered by quantization error or channel importance, enabling maximally efficient mixed-precision inference kernels.
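A hedged sketch of this calibration-driven channel partitioning follows: per-channel quantization error on a small calibration set determines which channels stay in MXFP4 and which are promoted to MXFP6/MXFP8. The budgets, error metric, and function names are illustrative rather than MicroMix's or MASE's exact procedure.

```python
import numpy as np

def mxfp4_fake_quant(x, mant_bits=1, emax=2, block=32):
    """Stand-in per-block MX fake-quantization (uniform grid, k=32 blocks).

    Assumes len(x) is a multiple of the block size.
    """
    xb = x.reshape(-1, block)
    amax = np.maximum(np.max(np.abs(xb), axis=1, keepdims=True), 1e-12)
    scale = 2.0 ** (np.floor(np.log2(amax)) - emax)
    step = 2.0 ** (-mant_bits)
    max_code = (2 ** emax) * (2 - step)
    q = np.clip(np.round(xb / scale / step) * step, -max_code, max_code)
    return (q * scale).reshape(-1)

def assign_channel_formats(calib_acts, budgets=(0.5, 0.3, 0.2)):
    """Partition channels into MXFP4/6/8 buckets by measured MXFP4 error.

    `calib_acts` is (samples, channels) from a small calibration set;
    `budgets` are illustrative fractions of channels per format.
    """
    n_ch = calib_acts.shape[1]
    err = np.array([
        np.mean((calib_acts[:, c] - mxfp4_fake_quant(calib_acts[:, c])) ** 2)
        for c in range(n_ch)
    ])
    order = np.argsort(err)                          # least error first
    n4, n6 = int(budgets[0] * n_ch), int(budgets[1] * n_ch)
    return {int(c): ("MXFP4" if i < n4 else "MXFP6" if i < n4 + n6 else "MXFP8")
            for i, c in enumerate(order)}

calib = np.random.default_rng(2).normal(size=(256, 8)).astype(np.float32)
calib[:, 3] *= 20.0          # an outlier-heavy channel should land in MXFP8
print(assign_channel_formats(calib))
```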
3. Precision-Range Tradeoffs, Emerging Variants, and Outlier Handling
Accuracy and compression trade directly with mantissa/exponent bit allocations:
| Format | Element layout | Typical accuracy drop (w/o finetuning, GPT/Llama-class) |
|---|---|---|
| MXFP8 | E5M2 / E4M3 | <0.1–0.5% |
| MXFP6 | E3M2 / E2M3 | 0.5–2% |
| MXFP4 | E2M1 | 5–30%* |
| MXINT8 | 8-bit two's complement | <0.2% |
* MXFP4 degrades strongly without outlier mitigation or format extension.
Element outliers (activation values with much larger magnitudes than their neighbors) create significant error in low-bit MXFP. They force either large shared scales (causing nearly all other block values to quantize to zero) or severe clipping. MX+ (Lee et al., 16 Oct 2025) and AMXFP4 (Lee et al., 15 Nov 2024) address this limitation. MX+ repurposes the exponent bits of the outlier (block max, BM) element as an extended mantissa, dramatically increasing representational precision for the lone outlier and improving zero-shot LLM accuracy by, for example, 42% relative to vanilla MXFP4 at only +0.25 bits/element of storage. AMXFP4 introduces asymmetric scales, using distinct shared scales for positive and negative values within each block, directly reducing asymmetry error without calibration.
Outlier-aware block quantization also arises in OPAL (Koo et al., 6 Sep 2024), which tags the largest n values per block as "exceptions" and encodes them in higher precision, quantizing only the non-outliers via block scaling.
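The outlier-exception idea can be sketched in a few lines: the largest-magnitude elements of a block are stored separately at higher precision, and only the remainder is block-scaled at low precision. This is an illustrative sketch in the spirit of OPAL/MX+, not their exact encodings; the storage layout and uniform element grid are simplifications.

```python
import numpy as np

def quantize_block_with_exceptions(block, n_outliers=1, emax=2, mant_bits=1):
    """Keep the largest-|x| elements of a block in FP16, block-scale the rest.

    Illustrative only: positions and FP16 values of the top-n magnitudes are
    stored alongside the low-precision payload.
    """
    block = np.asarray(block, dtype=np.float32)
    idx = np.argsort(np.abs(block))[-n_outliers:]      # exception positions
    exceptions = block[idx].astype(np.float16)
    rest = block.copy()
    rest[idx] = 0.0                                    # excluded from scaling
    amax = max(np.max(np.abs(rest)), 1e-12)
    shared_exp = int(np.floor(np.log2(amax))) - emax
    step = 2.0 ** (-mant_bits)
    max_code = (2 ** emax) * (2 - step)
    q = np.clip(np.round(rest / 2.0 ** shared_exp / step) * step,
                -max_code, max_code)
    return shared_exp, q, idx, exceptions

def dequantize_block_with_exceptions(shared_exp, q, idx, exceptions):
    out = (2.0 ** shared_exp) * q
    out[idx] = exceptions.astype(np.float32)           # restore outliers
    return out

x = np.array([0.03, -0.11, 0.07, 9.5] + [0.05] * 28, dtype=np.float32)
print(dequantize_block_with_exceptions(*quantize_block_with_exceptions(x))[:5])
```

Without the exception path, the lone 9.5 would force a shared scale so large that the remaining small values collapse toward zero.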
4. Hardware Acceleration and Implementation Strategies
The MX standard underpins co-designed hardware/software inference and training flows across FPGAs, GPUs (notably NVIDIA Blackwell), NPUs, and RISC-V-based accelerators.
- FPGA implementations (Samson et al., 1 Jul 2024, Gorodecky et al., 5 Nov 2024): Efficient hardware blocks exploit memory-free three-step conversions (max search, shared scale, quantizer bank) for a rate-optimal pipeline. Element formats are parameterized, supporting all six MX variants, with LUT costs scaling linearly in exponent/mantissa width.
- RISC-V/ISA extensions (İslamoğlu et al., 19 May 2025): The MXDOTP ISA primitive implements the MX block dot product as a single new instruction, fusing scale alignment, the dot product, and accumulation. This approach yields a 25× speedup and a 12.5× energy-efficiency improvement over FP8→FP32 conversion baselines with negligible area overhead.
- Precision-scalable MACs and NPU datapaths (Cuyckens et al., 9 Nov 2025, Cuyckens et al., 28 May 2025): Recent work demonstrates hybrid integer/floating accumulation in MX MAC units: integer-style reduction for most products, with a single FP32 accumulation for partial sums. This enables high-throughput (e.g., MXFP4 at 512 GOPS) and energy efficiency (4065 GOPS/W for MXFP4, 1438–1675 GOPS/W for MXFP6/8).
- GPU kernels (Liu et al., 4 Aug 2025): Blackwell Tensor Cores natively support MXFP4, MXFP6, and MXFP8, allowing fused block-scaled loading and on-the-fly dequantization in tensor memory. MicroMix leverages CUTLASS-based fused kernels (reorder, quantize, and map) for maximum throughput.
Implementation typically arranges shared scale and element data in compact memory layouts (packed subwords), facilitating SIMD/SIMT execution. Systems targeting end-to-end MX quantization must ensure support for special value propagation, error-free (Kulisch-style) integer accumulation, multi-block cross-scale dot products, and per-block instruction-level parallelism.
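A hedged sketch of such a packed layout: one MXFP4 block can be stored as a single E8M0 scale byte followed by 32 four-bit element codes packed two per byte (17 bytes per 32 elements, i.e., 4.25 bits/element). The code-to-value table and nibble ordering here are illustrative, not the normative OCP bit layout.

```python
import numpy as np

# 16-entry FP4 E2M1 value table indexed by the 4-bit code (sign-magnitude
# ordering used here for clarity; not the normative OCP bit encoding).
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                       -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
                      dtype=np.float32)

def pack_mxfp4_block(block, emax=2):
    """Pack a k=32 block as 17 bytes: 1 E8M0 scale byte + 16 packed nibbles."""
    amax = max(np.max(np.abs(block)), 2.0 ** -126)
    shared_exp = int(np.floor(np.log2(amax))) - emax
    scaled = block / 2.0 ** shared_exp
    codes = np.abs(scaled[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    packed = (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)
    scale_byte = np.uint8(shared_exp + 127)            # E8M0 biased exponent
    return np.concatenate([[scale_byte], packed]).astype(np.uint8)

def unpack_mxfp4_block(buf):
    shared_exp = int(buf[0]) - 127
    codes = np.empty(32, dtype=np.uint8)
    codes[0::2] = buf[1:] & 0x0F
    codes[1::2] = buf[1:] >> 4
    return (2.0 ** shared_exp) * FP4_VALUES[codes]

x = np.random.default_rng(3).normal(size=32).astype(np.float32)
buf = pack_mxfp4_block(x)
print(buf.nbytes, "bytes; max abs error:",
      np.max(np.abs(unpack_mxfp4_block(buf) - x)))
```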
5. Mixed-Precision, Compiler Support, and Software Ecosystem
State-of-the-art MX-aware compilers and quantizers such as MASE (Cheng et al., 2023) automate the search for per-tensor mantissa sizes, optimizing the allocation of exponent/mantissa bits to balance accuracy, memory, and area. The mixed-precision regime (e.g., allocating more bits to critical channels or outlier blocks) is critical in LLM and ViT workloads. MicroMix (Liu et al., 4 Aug 2025) and MixDiT (Kim et al., 11 Apr 2025) exemplify algorithm-hardware codesign enabling per-channel or per-head adaptation in transformer architectures.
MX is integrated into open-source quantization libraries (PyTorch-Brevitas (Samson et al., 1 Jul 2024), HuggingFace, LLM serving infrastructures), and is supported by major model serving engines and post-training quantization frameworks (e.g., GPTQ, AWQ, SmoothQuant extended to MX format (Sharify et al., 12 May 2024)).
For training, best practices (Su et al., 25 Jun 2025) advise using MX quantization on weights only (with BF16 activations) or forward-only quantization to avoid LayerNorm-induced instabilities. Monitoring gradient noise and activating higher-precision fallbacks when divergence is detected is standard.
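As a hedged illustration of the weights-only recipe (quantized weights, BF16 activations), the sketch below fake-quantizes the weight tensor per k=32 block on each forward pass while leaving activations in BF16. The layer and helper names are illustrative; a straight-through estimator (omitted here) would be needed to pass gradients through the rounding during training.

```python
import torch

def mx_fake_quant_weight(w, block=32, emax=2, mant_bits=1):
    """Per-block MX fake-quantization of a weight tensor (illustrative).

    Uniform rounding stands in for the real MXFP element grid; the weight's
    element count is assumed to be a multiple of the block size.
    """
    shape = w.shape
    wb = w.reshape(-1, block).float()
    amax = wb.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - emax)
    step = 2.0 ** (-mant_bits)
    max_code = (2 ** emax) * (2 - step)
    q = torch.clamp(torch.round(wb / scale / step) * step, -max_code, max_code)
    return (q * scale).reshape(shape)

class MXLinear(torch.nn.Linear):
    """Linear layer with weights fake-quantized to MX on the forward pass,
    while activations stay in BF16 (weights-only quantization)."""
    def forward(self, x):
        w_q = mx_fake_quant_weight(self.weight).to(torch.bfloat16)
        b = self.bias.to(torch.bfloat16) if self.bias is not None else None
        return torch.nn.functional.linear(x.to(torch.bfloat16), w_q, b)

layer = MXLinear(64, 32)
y = layer(torch.randn(4, 64))
print(y.dtype, y.shape)
```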
6. Application Results, Impact, and Design Guidelines
The MX standard enables extensive compression and acceleration with minor accuracy loss:
- LLM Inference: MicroMix achieves ≥95% FP16 zero-shot accuracy for Llama models, 20–46% kernel-level speedup, and 20% memory savings vs. TRT-FP8 baselines (Liu et al., 4 Aug 2025).
- Vision Transformers: MXInt (M=6–8) achieves <0.1% top-1 loss at 4.99× memory reduction and ≥93× inference speedup vs. Float16 (Xiao et al., 28 May 2025).
- Hardware efficiency: FPGA and NPU implementations demonstrate area and power efficiencies unattainable with fixed-point/FP8-only designs, reaching Pareto-optimal area-accuracy on ImageNet and supporting real-time edge workloads in robotics and continual learning (Cuyckens et al., 28 May 2025, Samson et al., 1 Jul 2024).
- KV-cache and sequence quantization: MXFP6 (E3M2) reduces LLM model/prompt footprints by 30–40% at <0.05 perplexity loss; MXFP4 achieves even higher density (up to 70%) but at the cost of perceptible PPL drop (Lo et al., 15 Dec 2024).
Designers are advised to select the lowest mantissa width and exponent configuration consistent with task accuracy, use k=32 block size for inference, and adopt square block grouping for training or transposed access (Lee et al., 16 Oct 2025). For extreme compression, outlier-preserving variants (MX+, AMXFP4, OPAL) or asymmetric quantization can dramatically improve 4-bit MX accuracy.
7. Limitations, Ongoing Developments, and Related Standards
Core limitations of vanilla MX formats arise under ultra-low-bit quantization (MXFP4/E2M1): outlier-induced error, group-wise asymmetry, and quantization level sparsity. Protocol-compliant extensions such as MX+ (Lee et al., 16 Oct 2025) and AMXFP4 (Lee et al., 15 Nov 2024) have proven indispensable in practical LLM serving and low-resource inference, achieving near-MXFP8 quality at 4.25 bits/element.
Training remains challenging under full forward + backward MX quantization. Gradient bias from block-overflow, particularly in layer-norm affine blocks, requires hybrid precision and runtime error monitoring. For applications with severe outlier prevalence, per-block or per-head dynamic precision assignment and index tagging (as in MX+, OPAL, MixDiT) are now mainstream.
The MX standard continues to evolve under the OCP and major hardware vendors, forming the basis for future developments such as nanoscaling (NxFP) (Lo et al., 15 Dec 2024), which further refines per-element scaling and code efficiency for next-generation LLM architectures.
In summary, the Microscaling (MX) standard formalizes efficient, hardware-friendly low-precision formats essential to modern AI acceleration. Its block-shared exponent protocols, flexible exponent/mantissa ratios, and broad ecosystem integration make it a foundational component of scalable deep neural network deployment, spanning LLMs, ViTs, edge robotics, and beyond.