Microscaling (MX) Formats

Updated 3 July 2026

Microscaling (MX) is a family of block-structured numeric formats that share a common block exponent to enable efficient sub-8-bit arithmetic with high memory density.
MX formats, such as MXFP8, MXFP6, and MXFP4, use block-wise quantization to balance dynamic range and precision, optimizing hardware performance for deep learning workloads.
Adopted in AI accelerators and embedded systems, MX also incorporates advanced extensions and mixed-precision strategies for robust outlier handling and energy-efficient computing.

Microscaling (MX) is a family of block-structured numeric formats standardized by the Open Compute Project (OCP) and adopted broadly for machine learning inference, training, and signal processing accelerators. MX formats achieve substantial compression and arithmetic efficiency by sharing a scale or exponent across small groups of values, enabling hardware-efficient implementation of sub-8-bit floating-point and integer arithmetic while maintaining sufficient dynamic range and accuracy for deep learning workloads. MX is now a foundational format for nearly all next-generation AI accelerators, LLM inference engines, and emerging embedded AI solutions.

1. Formal Definition, Bit-Level Layout, and Core MX Variants

Microscaling formats partition a vector or tensor into blocks of size $B$ (typically $B=32$ ), encoding each as a tuple comprising:

A single block exponent ("micro-exponent", 8 bits in E8M0 format) shared by all elements in the block.
$B$ element codes, where each element is a low-bit floating-point or integer mini-format carrying a sign, truncated exponent, and/or mantissa.

Bit Layout Example Table (MXFP4, MXFP6, MXFP8)

Format	Element Structure	Bits per Element	Block Header
MXFP8 E4M3	S1 E4 M3	8	8 bits (E8M0)
MXFP6 E2M3	S1 E2 M3	6	8 bits (E8M0)
MXFP4 E2M1	S1 E2 M1	4	8 bits (E8M0)
MXINT8	S1 M7 (int8)	8	8 bits (E8M0)

The block exponent encodes a shared floating-point scale, while each element is quantized relative to this scale. For MXFP formats:

$x_i = (-1)^{s_i} \cdot 2^{(B - b_e) + (e_i - b_e)} \cdot m_i, \quad\text{with}\quad m_i = (1 + f_i/2),\ (e_i \ne 0)$

where $s_i$ is the sign, $e_i$ the element exponent, $f_i$ the mantissa, $B$ the block exponent, $b_e$ bias.

MXINT ("Microscaling Integer") encodes each element as a signed integer mantissa; the block scale ensures correct dynamic range. General formula:

$x_i = (-1)^{s_i} \cdot m_i \cdot 2^{e_{block}}$

By amortizing the exponent cost, MX formats provide high memory density (e.g., MXFP4 achieves $B=32$ 0 bits/element) with almost no overhead.

2. Quantization, Dequantization, and Hardware Efficiency

Quantization Procedure

Block Division: Partition tensor into blocks of $B=32$ 1 elements.
Exponent Selection: For each block, compute $B=32$ 2, where $B=32$ 3 is related to mantissa bits.
Element Quantization: Compute $B=32$ 4, clamp to permissible interval.
Encoding: Store $B=32$ 5.

Dequantization is a single block read of the scale followed by integer (shift-and-add) multiply on each element:

$B=32$ 6

The hardware-friendly structure aligns with dense matrix-multiplication dataflows, enabling fused multiply-accumulate (MAC) pipelines on small-mantissa operands, with a single exponent broadcasting logic per block. Contemporary FPGA and ASIC implementations typically support all six OCP-standard MX types (e.g., E5M2, E4M3, E3M2, E2M3, E2M1, INT8) with sub-word parallelism (Cuyckens et al., 28 May 2025, Gorodecky et al., 2024).

On supported hardware (e.g., NVIDIA Blackwell Tensor Cores), the MX block exponent enables shift-only dequantization, eliminating the need for wide multipliers.

3. Dynamic Range, Precision, and Error-Bound Analysis

MX achieves a dynamic range comparable to wide floating-point formats by shifting the exponent on a per-block basis, rather than per-element. For low-bitwidth formats:

Precision: Determined chiefly by the element's mantissa width ( $B=32$ 7 bits); maximum relative error within a block is $B=32$ 8.
Dynamic Range: Determined by the block exponent (8 bits by standard), allowing per-block exponent adjustment.
Error Model: Quantization error is bounded by

$B=32$ 9

where $B$ 0 is the block scale.

For instance, in MXFP6 (E2M3), which is widely used in LLM inference, the 3-bit mantissa yields a blockwise relative error $B$ 1, but the block exponent supports a DR of $B$ 2 when using 8 bits (E8M0) (Cheng et al., 2023, Rouhani et al., 2023).

4. Advanced MX Extensions and Outlier Handling

Outlier-Induced Precision Loss and Remedies

Blocks containing activation or weight outliers can suffer from excessive scale inflation, compressing non-outlier elements and increasing quantization error. MX4/6 (4/6-bit formats) are particularly sensitive.

To mitigate this:

Rotation-based smoothing: DuQuant++ applies a learned, block-diagonal orthogonal rotation tailored to each MX block, redistributing outlier magnitude and minimizing block maxima (Lin et al., 20 Apr 2026).
Metadata-Augmented MX: M²XFP and MX+ insert minimal metadata per block (e.g., extra mantissa bits for outliers) to locally boost precision with negligible storage or computational overhead (Hu et al., 27 Jan 2026, Lee et al., 16 Oct 2025).
Nanoscaling: NxFP introduces nanoprecision mantissa bits for the block scale and per-block adaptivity (block can select pure BFP or MxFP encoding based on statistics), reducing error for extreme low-bit regimes (Lo et al., 2024).

These methods yield significant gains—e.g., MX+ reduces MXFP4's perplexity deficit by over 40% at a cost of $B$ 30.25 bits/element (Lee et al., 16 Oct 2025); M²XFP similarly closes most of the 4-bit MX-accuracy gap with $B$ 4 hardware area increase (Hu et al., 27 Jan 2026).

5. Mixed-Precision Assignment and Application-Specific MX Pipelines

Uniform precision assignment is sub-optimal for large models. Recent work demonstrates that allocating precision adaptively, per-layer or per-channel, achieves Pareto-optimal quality vs. bit-width (Franco et al., 2 Jun 2026, Liu et al., 4 Aug 2025, Kim et al., 11 Apr 2025).

Differentiable Mixed-Precision (dMX): Per-layer exponents/mantissas parameterized as learnable offsets, annealed to hardware-realizable formats during calibration, yield consistently better perplexity and reasoning accuracy at a given average bit-width (Franco et al., 2 Jun 2026).
Mixed-precision heuristics: MicroMix and MixDiT perform outlier-aware assignment at the channel or head level, allocating MXFP8 to high-magnitude channels and MXFP4/6 to the rest, preserving accuracy while maximizing hardware throughput (Liu et al., 4 Aug 2025, Kim et al., 11 Apr 2025).
Compiler-Driven Orchestration: MASE statically assigns bit-widths via a hardware-software joint optimization loop, balancing performance, memory, and area usage for custom MX accelerators (Cheng et al., 2023).

6. Algorithmic, Training, and Application Frontiers

MX supports both post-training quantization and quantization-aware training. For LLM and ViT inference, direct-cast MXINT8 or MXFP6 achieves FP32-equivalent accuracy with no retraining, while training can proceed in all-MX using standard optimizers (Adam, LAMB) (Rouhani et al., 2023, Su et al., 25 Jun 2025, Park et al., 23 May 2026).

Key capabilities and empirical results:

MXINT8, MXFP8, and MXFP6/4 sustain $B$ 50.5% accuracy drop on major CV (ImageNet, ResNet), NLP (WMT, GLUE), and speech benchmarks (Rouhani et al., 2023).
On large LLMs, MXFP8/6 (W8A8/W6A6) quantization with SmoothQuant, GPTQ, or DuQuant++ gives virtually lossless compression to 4–8 bits/element; MXINT6-8 achieves perfect scaling-law compliance up to 1.5B parameters (Sharify et al., 2024, Lin et al., 20 Apr 2026).
In computer vision, LayerNorm, Softmax, and GELU can be implemented as integer-plus-LUT datapaths under MX, producing up to $B$ 6 speedups on FPGA (DeiT-based ViT accelerator) (Xiao et al., 28 May 2025).
Robotics hardware exploiting square-microblock grouping and sub-word FP units (INT8, FP8, FP6, FP4) achieves a $B$ 7 memory reduction and $B$ 8 throughput improvement at <1% accuracy drop (Cuyckens et al., 28 May 2025).

For training, all-MX can introduce gradient bias and instabilities, especially if LayerNorm parameters are quantized. It is sufficient to use hybrid recipes—MX for weights, higher precision (BF16) for activations/LayerNorm—recovering stability and full precision accuracy (validation loss difference $B$ 9 vs. BF16) (Su et al., 25 Jun 2025).

7. Hardware Support and Standardization

The OCP MX v1.0 standard and a vendor-neutral, bit-exact conformance suite now underpin cross-vendor implementations (Vasilev, 8 Jun 2026). MX is supported in:

Modern GPU tensor cores (NVIDIA Blackwell: FP4, FP6, MX block formats).
RISC-V ISA extensions: MXDOTP allows single-instruction MXFP8 dot products (fusing exponent, multiply, and accumulation) at 356 GFLOPS/W, with $x_i = (-1)^{s_i} \cdot 2^{(B - b_e) + (e_i - b_e)} \cdot m_i, \quad\text{with}\quad m_i = (1 + f_i/2),\ (e_i \ne 0)$ 025\times $speedup over software baselines (<a href="/papers/2505.13159" title="" rel="nofollow" data-turbo="false" class="assistant-link" x-data x-tooltip.raw="">İslamoğlu et al., 19 May 2025</a>).</li> <li>Vector ISA extensions (VMXDOTP): Fuses MXFP4/8 dot products and block-exponent logic into pipeline-parallel vector instructions, sustaining$ x_i = (-1)^{s_i} \cdot 2^{(B - b_e) + (e_i - b_e)} \cdot m_i, \quad\text{with}\quad m_i = (1 + f_i/2),\ (e_i \ne 0) $197\%$ x_i = (-1)^{s_i} \cdot 2^{(B - b_e) + (e_i - b_e)} \cdot m_i, \quad\text{with}\quad m_i = (1 + f_i/2),\ (e_i \ne 0)$21.6$ TFLOPS/W (Wipfli et al., 5 Mar 2026).
Precision-scalable MAC arrays with optimized reduction trees: MXFP8/6/4 at $x_i = (-1)^{s_i} \cdot 2^{(B - b_e) + (e_i - b_e)} \cdot m_i, \quad\text{with}\quad m_i = (1 + f_i/2),\ (e_i \ne 0)$ 34000 GOPS/W, with only 0.26% power/area overhead for dynamic metadata augmentations (Cuyckens et al., 9 Nov 2025, Hu et al., 27 Jan 2026).

Conformance and validation: The 84-format MX numeric catalog with SHA256-fingerprinted conformance vectors ensures precise format mapping across vendor toolchains and hardware (Vasilev, 8 Jun 2026).

In summary, microscaling (MX) formats realize a unifying hardware-software substrate for sub-8-bit AI, combining block-exponent scaling with flexible low-bit minifloats/integers and advanced outlier-handling extensions. MX achieves a balance of dynamic range, precision, and low system cost previously unattainable in fixed-point or per-value floating-point schemes. Applications encompass LLM and ViT inference/training, continual learning in robotics, high-fidelity scientific transforms, and next-generation embedded edge AI (Rouhani et al., 2023, Xiao et al., 28 May 2025, Cuyckens et al., 28 May 2025, Hu et al., 27 Jan 2026).