MXFP Formats: Low-Bitwidth Floating Point Design

Updated 14 March 2026
  • MXFP Formats are low-bitwidth, block-oriented floating point encodings that use shared per-block scaling to preserve a dynamic range comparable to FP32.
  • They employ a quantization algorithm that groups tensor elements into fixed-size blocks, applying power-of-two scaling and clipping to reduce memory footprints in neural network computations.
  • Innovations such as Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS) improve MXFP precision, enabling MXFP4 to achieve accuracy close to higher-bit formats with minimal hardware overhead.

Microscaling Floating Point (MXFP) formats comprise a family of block-oriented, low-bitwidth number encodings standardized under the Open Compute Project (OCP) Microscaling (MX) initiative. MXFP formats have emerged as leading solutions for reducing memory and hardware requirements in large-scale neural network inference and training, without sacrificing the dynamic range needed for stable operation. Central to MXFP’s adoption in industry and academia is the trade-off between aggressive bitwidth reduction (as low as 4 bits per value) and preservation of numerical fidelity through shared per-block scaling.

1. Structure and Mathematical Definition of MXFP Formats

MXFP representations decompose tensors into fixed-size blocks, each associated with a single scaling factor and multiple narrow-width floating-point (FP) element codes. A block of size $N$ (default $N = 32$) encodes real values $\{x_i\}$ via:

  • A block scale $s_b \in \mathrm{E8M0}$: an 8-bit exponent-only encoding (no mantissa), representing $s_b = 2^{e_b}$.
  • Element data: each $q_i$ is a narrow FP code ($\mathrm{FP}_{n_e,n_m}$, i.e., $n_e$ exponent and $n_m$ mantissa bits).

The dequantization formula is

$$\hat{x}_i = s_b \cdot \mathrm{FP}_{n_e,n_m}(q_i)$$

where the interpretation of $q_i$ (sign, exponent, mantissa, bias) follows the standard IEEE convention, apart from the small bit-widths and the role of the shared block scale (Rouhani et al., 2023). A widely deployed parameterization is MXFP4 (E2M1: 1 sign, 2 exponent, 1 mantissa bit), occupying 4 bits per element, and MXFP8 (E4M3: 1 sign, 4 exponent, 3 mantissa bits), at 8 bits.
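As a concrete illustration, the E2M1 element encoding above can be decoded with a short sketch. The function name and variable names here are illustrative, not from any official API; the bit layout (1 sign, 2 exponent, 1 mantissa, bias 1, subnormal at exponent field 0) follows the standard interpretation described above.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit, bias 1)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11   # 2-bit exponent field
    man = code & 0b1           # 1-bit mantissa field
    if exp == 0:               # exponent field 0: subnormal, no implicit leading 1
        mag = man * 0.5
    else:                      # normal: implicit 1, exponent = field - bias
        mag = (1.0 + man * 0.5) * 2.0 ** (exp - 1)
    return sign * mag

# Enumerating all 16 codes yields the positive magnitudes
# {0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}
mags = sorted({abs(decode_e2m1(c)) for c in range(16)})
```

The eight representable magnitudes make the coarseness of E2M1 concrete: the largest element is $6$, and the smallest nonzero element is $0.5$, before the shared block scale is applied.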

2. Quantization and Dequantization Algorithms

Given an input tensor, elements are grouped into disjoint blocks. For each block $b$:

  1. Compute the maximum magnitude: $\alpha_{\max} = \max_{i \in b} |x_i|$.
  2. Set the power-of-two scale: $s_b = 2^{\lfloor \log_2(\alpha_{\max}) \rfloor}$.
  3. Normalize and quantize: $q_i = \mathrm{clip}_{[-q_{\max}, q_{\max}]}(\mathrm{round}(x_i / s_b))$, where $q_{\max}$ is determined by the element FP format.
  4. At inference, dequantize as $\hat{x}_i = s_b \cdot q_i$ (Chhugani et al., 30 Jan 2026, Zhang et al., 14 Jan 2026, Rouhani et al., 2023).
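The four steps above can be sketched for a single block with E2M1 elements. This is a minimal illustration, not a reference implementation: it rounds to the nearest representable E2M1 magnitude by grid search (ties toward the smaller value) rather than exact round-to-nearest-even, and all helper names are made up for this example.

```python
import numpy as np

# Positive E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp_quantize(x: np.ndarray):
    """Quantize one block: power-of-two scale + nearest E2M1 element values."""
    amax = np.max(np.abs(x))                         # step 1: block maximum
    s_b = 2.0 ** np.floor(np.log2(amax))             # step 2: power-of-two scale
    y = np.clip(np.abs(x) / s_b, 0.0, E2M1_GRID[-1]) # step 3: normalize and clip
    idx = np.argmin(np.abs(y[:, None] - E2M1_GRID[None, :]), axis=1)
    q = np.sign(x) * E2M1_GRID[idx]                  # nearest representable value
    return s_b, q

def mxfp_dequantize(s_b, q):
    return s_b * q                                   # step 4

x = np.array([0.3, -1.1, 2.5, 0.05])
s_b, q = mxfp_quantize(x)
x_hat = mxfp_dequantize(s_b, q)
```

Note how the smallest element (0.05) is flushed to zero once scaled by the block maximum, which previews the error behavior discussed in Section 3.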

Block-level scaling drastically extends the representable dynamic range (matching FP32 range for the block) even for 4- or 6-bit elements. Rounding is to nearest-even. Overflow or subnormals are handled via clamping or zeroing, respectively, in most FPGA/ASIC implementations (Samson et al., 2024).

3. Precision, Dynamic Range, and Format Trade-offs

Precision is set by the per-element mantissa width ($n_m$). For MXFP4 (E2M1, $n_m = 1$), relative precision is $2^{-1} = 0.5$ ULP; MXFP8 (E4M3, $n_m = 3$) provides $0.125$ ULP. The element's exponent width (e.g., 2 bits in E2M1 for MXFP4) covers only a modest range within the block, but the shared per-block scale enables dynamic ranges on the order of $10^{\pm 38}$, matching FP32 (Rouhani et al., 2023, Zhang et al., 14 Jan 2026, Samson et al., 2024).

Key design trade-offs:

Format   Bits/element   Mantissa bits   ULP     Usable dynamic range (per block)
MXFP8    8              3               0.125   $2^{-6} \dots 2^{7}$
MXFP6    6              3               0.125   $2^{0} \dots 2^{1}$
MXFP4    4              1               0.5     $2^{0} \dots 2^{1}$

Reducing to 4 bits achieves aggressive model compression and hardware area savings (up to $8\times$ reduction vs. FP32, and 12% area savings vs. NVFP4 on NVIDIA B200 tensor cores), but coarse quantization and the limited element exponent increase quantization error, especially for values far from the block maximum (Chhugani et al., 30 Jan 2026, Rouhani et al., 2023, Samson et al., 2024).
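The claim that error concentrates on values far from the block maximum can be checked numerically. The sketch below (illustrative names, E2M1 grid, nearest-value rounding) compares the relative error of a value near the block maximum against one far below it:

```python
import numpy as np

# Positive E2M1 magnitudes
grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def q_error(x, amax):
    """Relative quantization error of x in a block whose maximum is amax."""
    s = 2.0 ** np.floor(np.log2(amax))           # shared power-of-two scale
    q = grid[np.argmin(np.abs(x / s - grid))]    # nearest representable value
    return abs(x - s * q) / abs(x)

big = q_error(5.5, amax=6.0)    # near the block max: modest relative error
tiny = q_error(0.07, amax=6.0)  # far below the max: flushed to zero, 100% error
```

With the block maximum at 6.0, the value 0.07 normalizes to 0.0175, below half the smallest nonzero E2M1 code, so it rounds to zero and its entire magnitude is lost.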

4. Quantization Error Mitigation: OAS and MBS

The accuracy gap of MXFP4 relative to higher-precision formats has motivated specific techniques:

Overflow-Aware Scaling (OAS)

OAS opportunistically increases the block scale by one bit (doubling sbs_b) if the block maximum lies in the lower half of the current scale's bin. This acts to expand the effective representable range for the block tail without exacerbating overflow for the block maximum. In effect:

  • If $\alpha_{\max} \in [2^{e_b^0},\, 2^{e_b^0} + 0.5 \cdot 2^{e_b^0})$, set $s_b = 2^{e_b^0 + 1}$.
  • Empirically, OAS raises quantization SNR by $0.5$ dB and narrows the accuracy gap to NVFP4 by 2-4% (Chhugani et al., 30 Jan 2026).
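The OAS rule above amounts to a one-line adjustment of the scale computation. A minimal sketch, with an illustrative function name:

```python
import math

def oas_scale(amax: float) -> float:
    """OAS scale rule: double the power-of-two scale when the block maximum
    falls in the lower half of its exponent bin [2^e0, 2^(e0+1))."""
    e0 = math.floor(math.log2(amax))   # baseline exponent, as in step 2 above
    if amax < 1.5 * 2.0 ** e0:         # lower half of the bin
        return 2.0 ** (e0 + 1)         # opportunistically doubled scale
    return 2.0 ** e0
```

For example, $\alpha_{\max} = 2.4$ lies in the lower half of $[2, 4)$ and receives the doubled scale $4$, while $\alpha_{\max} = 3.5$ sits in the upper half and keeps the baseline scale $2$.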

Macro Block Scaling (MBS)

Outliers often dominate quantization error due to the coarse E8M0 exponent without a block mantissa. MBS stores an additional 8-bit mantissa per macro-block (e.g., 128 elements), significantly improving dynamic range for the most challenging blocks. Static MBS (MBS-S) directly extracts the top 8 bits of the optimal mantissa, while Dynamic MBS (MBS-D) uses a LUT-driven search to minimize macro-block quantization error. MBS-D yields an extra $\sim 0.5$ dB QSNR over MBS-S (Chhugani et al., 30 Jan 2026).
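The static variant can be sketched under an assumed encoding: here the macro-block scale is taken to be $(1 + m/256) \cdot 2^{e}$, with $m$ the stored 8-bit mantissa extracted from the macro-block maximum. This form and the function name are assumptions for illustration; the paper's exact encoding may differ.

```python
import math

def mbs_s_scale(amax: float):
    """MBS-S sketch: augment the E8M0 exponent with a shared 8-bit mantissa.
    Assumed encoding: scale = (1 + m/256) * 2^e  (illustrative, not the
    paper's exact format)."""
    e = math.floor(math.log2(amax))
    frac = amax / 2.0 ** e - 1.0   # fractional part of the mantissa, in [0, 1)
    m = int(frac * 256)            # top 8 bits of the optimal mantissa
    return m, (1.0 + m / 256.0) * 2.0 ** e
```

With this encoding, a macro-block maximum of $5.0 = 1.25 \cdot 2^2$ yields $m = 64$ and an exact scale of $5.0$, whereas a mantissa-free E8M0 scale would have to round down to $4$.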

A hybrid OAS+MBS-D pipeline achieves accuracy within 1% of NVFP4 at only 6.2% GEMM overhead, with no hardware modifications (Chhugani et al., 30 Jan 2026).

5. Comparison with Competing Low-Bit Formats

MXFP is best understood against the backdrop of prior low-precision quantization approaches:

  • INT8: A fixed-point encoding with no per-element exponent; its representable dynamic range is therefore limited.
  • NVFP4 (NVIDIA): Pairs 4-bit elements with an E4M3 (FP8) block scale, whose mantissa bits improve scale precision at a higher hardware cost.
  • MXFP8 and MXFP6: Lossless/near-lossless at 8/6-bit, respectively. MXFP4, unless mitigated, introduces substantial loss (perplexity increase, accuracy drop) for both weights and activations.

MX+ (MXFP4+): An extension that detects the maximum element (block-max, BM) and reclaims its exponent bits as extra mantissa, sharply reducing BM quantization error with only a marginal storage increase (+0.25 bits/element) (Lee et al., 16 Oct 2025). This yields perplexity and accuracy competitive with FP16 or MXFP6 for LLM inference, while retaining MXFP4's computational and area advantages.

6. Practical Recommendations and Empirical Observations

Empirical results and deployment advice across MXFP studies:

  • MXFP8 (E4M3): Empirically lossless for both weights and activations on LLMs and vision models, even with naïve rounding (Zhang et al., 14 Jan 2026, Rouhani et al., 2023).
  • MXFP4 (E2M1): Unusable for activations unless mitigated (MX+, OAS, MBS); with mitigation, LLM test accuracy is within 1% of NVFP4, with an inference speed penalty below 10% (Chhugani et al., 30 Jan 2026, Lee et al., 16 Oct 2025).
  • For training, block normalization of layer-norm parameters can trigger divergence due to cluster saturation (all elements hitting the maximal code). Hybrids—such as only quantizing weights or using BF16 for activations/layer-norm—fully stabilize stochastic gradient descent and recover expected scaling laws (Su et al., 25 Jun 2025).
  • FFT/MRI imaging: Mantissa precision is the limiting factor for MXFP format fidelity; three mantissa bits (as in E4M3/E2M3) are strongly favored over single-bit (E2M1) (Deveshwar et al., 3 Dec 2025).

7. Hardware and Software Integration

The MXFP family’s hardware efficiency stems from block-aligned, power-of-two scaling and element-wise narrow multipliers. FPGA implementations exploit the absence of a mantissa in the block scale and the tiny mantissa multipliers, mapping these to LUTs rather than DSPs (DSP-free floating point for $n_m \leq 3$). ASIC support requires only a small fraction of extra SRAM and adder width for block sizes $N = 16, 32$; the combination of features enables area and energy scaling not possible with INT or prior FP formats (Samson et al., 2024).

MXFP quantization is now supported in open-source CUDA/PyTorch (OCP MX reference, Brevitas integration) as drop-ins for model quantization pipelines, with scale computation and block grouping matching hardware-execution patterns (Rouhani et al., 2023, Samson et al., 2024).

