MXFP Formats: Low-Bitwidth Floating Point Design

Updated 14 March 2026
  • MXFP Formats are low-bitwidth, block-oriented floating point encodings that use shared per-block scaling to preserve a dynamic range comparable to FP32.
  • They employ a quantization algorithm that groups tensor elements into fixed-size blocks, applying power-of-two scaling and clipping to reduce memory footprints in neural network computations.
  • Innovations such as Overflow-Aware Scaling (OAS) and Macro Block Scaling (MBS) improve MXFP precision, enabling MXFP4 to achieve accuracy close to higher-bit formats with minimal hardware overhead.

Microscaling Floating Point (MXFP) formats comprise a family of block-oriented, low-bitwidth number encodings standardized under the Open Compute Project (OCP) Microscaling (MX) initiative. MXFP formats have emerged as leading solutions for reducing memory and hardware requirements in large-scale neural network inference and training, without sacrificing the dynamic range needed for stable operation. Central to MXFP’s adoption in industry and academia is the trade-off between aggressive bitwidth reduction (as low as 4 bits per value) and preservation of numerical fidelity through shared per-block scaling.

1. Structure and Mathematical Definition of MXFP Formats

MXFP representations decompose tensors into fixed-size blocks, each associated with a single scaling factor and multiple narrow-width floating-point (FP) element codes. A block of size $N$ (default $N = 32$) encodes real values $\{x_i\}$ via:

  • A block scale $s_b \in \mathrm{E8M0}$: an 8-bit exponent-only encoding (no mantissa), representing $s_b = 2^{e_b}$.
  • Element data: each $q_i$ is a narrow FP code ($\mathrm{FP}_{n_e,n_m}$, i.e., $n_e$ exponent and $n_m$ mantissa bits).

The dequantization formula is

$$\hat{x}_i = s_b \cdot \mathrm{FP}_{n_e,n_m}(q_i)$$

where the interpretation of $q_i$ (sign, exponent, mantissa, bias) follows the standard IEEE convention, apart from the small bit-widths and the role of the shared block scale (Rouhani et al., 2023). A widely deployed parameterization is MXFP4 (E2M1: 1 sign, 2 exponent, 1 mantissa bit), occupying 4 bits per element, and MXFP8 (E4M3: 1 sign, 4 exponent, 3 mantissa bits), at 8 bits.
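As a concrete illustration, the E2M1 element encoding above can be decoded with a short sketch. The function name and variable names here are illustrative, not from any official API; the bit layout (1 sign, 2 exponent, 1 mantissa, bias 1, subnormal at exponent field 0) follows the standard interpretation described above.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit, bias 1)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11   # 2-bit exponent field
    man = code & 0b1           # 1-bit mantissa field
    if exp == 0:               # exponent field 0: subnormal, no implicit leading 1
        mag = man * 0.5
    else:                      # normal: implicit 1, exponent = field - bias
        mag = (1.0 + man * 0.5) * 2.0 ** (exp - 1)
    return sign * mag

# Enumerating all 16 codes yields the positive magnitudes
# {0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}
mags = sorted({abs(decode_e2m1(c)) for c in range(16)})
```

The eight representable magnitudes make the coarseness of E2M1 concrete: the largest element is $6$, and the smallest nonzero element is $0.5$, before the shared block scale is applied.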

2. Quantization and Dequantization Algorithms

Given an input tensor, elements are grouped into disjoint blocks. For each block $b$:

  1. Compute the maximum magnitude: $\alpha_{\max} = \max_{i \in b} |x_i|$.
  2. Set the power-of-two scale: $s_b = 2^{\lfloor \log_2(\alpha_{\max}) \rfloor}$.
  3. Normalize and quantize: $q_i = \mathrm{clip}_{[-q_{\max}, q_{\max}]}(\mathrm{round}(x_i / s_b))$, where $q_{\max}$ is determined by the element FP format.
  4. At inference, dequantize as $\hat{x}_i = s_b \cdot q_i$ (Chhugani et al., 30 Jan 2026, Zhang et al., 14 Jan 2026, Rouhani et al., 2023).
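The four steps above can be sketched for a single block with E2M1 elements. This is a minimal illustration, not a reference implementation: it rounds to the nearest representable E2M1 magnitude by grid search (ties toward the smaller value) rather than exact round-to-nearest-even, and all helper names are made up for this example.

```python
import numpy as np

# Positive E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp_quantize(x: np.ndarray):
    """Quantize one block: power-of-two scale + nearest E2M1 element values."""
    amax = np.max(np.abs(x))                         # step 1: block maximum
    s_b = 2.0 ** np.floor(np.log2(amax))             # step 2: power-of-two scale
    y = np.clip(np.abs(x) / s_b, 0.0, E2M1_GRID[-1]) # step 3: normalize and clip
    idx = np.argmin(np.abs(y[:, None] - E2M1_GRID[None, :]), axis=1)
    q = np.sign(x) * E2M1_GRID[idx]                  # nearest representable value
    return s_b, q

def mxfp_dequantize(s_b, q):
    return s_b * q                                   # step 4

x = np.array([0.3, -1.1, 2.5, 0.05])
s_b, q = mxfp_quantize(x)
x_hat = mxfp_dequantize(s_b, q)
```

Note how the smallest element (0.05) is flushed to zero once scaled by the block maximum, which previews the error behavior discussed in Section 3.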

Block-level scaling drastically extends the representable dynamic range (matching FP32 range for the block) even for 4- or 6-bit elements. Rounding is to nearest-even. Overflow or subnormals are handled via clamping or zeroing, respectively, in most FPGA/ASIC implementations (Samson et al., 2024).

3. Precision, Dynamic Range, and Format Trade-offs

Precision is set by the per-element mantissa width ($n_m$). For MXFP4 (E2M1, $n_m = 1$), relative precision is $2^{-1} = 0.5$ ULP; MXFP8 (E4M3, $n_m = 3$) provides $0.125$ ULP. The element's exponent width (e.g., 2 bits in E2M1 for MXFP4) covers only a modest range within the block, but the shared per-block scale enables dynamic ranges on the order of $10^{\pm 38}$, matching FP32 (Rouhani et al., 2023, Zhang et al., 14 Jan 2026, Samson et al., 2024).

Key design trade-offs:

Format   Bits/element   Mantissa bits   ULP     Usable dynamic range (per block)
MXFP8    8              3               0.125   $2^{-6} \dots 2^{7}$
MXFP6    6              3               0.125   $2^{0} \dots 2^{1}$
MXFP4    4              1               0.5     $2^{0} \dots 2^{1}$

Reducing to 4 bits achieves aggressive model compression and hardware area savings (up to $8\times$ reduction vs. FP32, and 12% area savings vs. NVFP4 on NVIDIA B200 tensor cores), but coarse quantization and the limited element exponent increase quantization error, especially for values far from the block maximum (Chhugani et al., 30 Jan 2026, Rouhani et al., 2023, Samson et al., 2024).
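The claim that error concentrates on values far from the block maximum can be checked numerically. The sketch below (illustrative names, E2M1 grid, nearest-value rounding) compares the relative error of a value near the block maximum against one far below it:

```python
import numpy as np

# Positive E2M1 magnitudes
grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def q_error(x, amax):
    """Relative quantization error of x in a block whose maximum is amax."""
    s = 2.0 ** np.floor(np.log2(amax))           # shared power-of-two scale
    q = grid[np.argmin(np.abs(x / s - grid))]    # nearest representable value
    return abs(x - s * q) / abs(x)

big = q_error(5.5, amax=6.0)    # near the block max: modest relative error
tiny = q_error(0.07, amax=6.0)  # far below the max: flushed to zero, 100% error
```

With the block maximum at 6.0, the value 0.07 normalizes to 0.0175, below half the smallest nonzero E2M1 code, so it rounds to zero and its entire magnitude is lost.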

4. Quantization Error Mitigation: OAS and MBS

The accuracy gap of MXFP4 relative to higher-precision formats has motivated specific techniques:

Overflow-Aware Scaling (OAS)

OAS opportunistically increases the block scale by one bit (doubling sbs_b) if the block maximum lies in the lower half of the current scale's bin. This acts to expand the effective representable range for the block tail without exacerbating overflow for the block maximum. In effect:

  • If $\alpha_{\max} \in [2^{e_b^0},\, 2^{e_b^0} + 0.5 \cdot 2^{e_b^0})$, set $s_b = 2^{e_b^0 + 1}$.
  • Empirically, OAS raises quantization SNR by $0.5$ dB and narrows the accuracy gap to NVFP4 by 2-4% (Chhugani et al., 30 Jan 2026).
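The OAS rule above amounts to a one-line adjustment of the scale computation. A minimal sketch, with an illustrative function name:

```python
import math

def oas_scale(amax: float) -> float:
    """OAS scale rule: double the power-of-two scale when the block maximum
    falls in the lower half of its exponent bin [2^e0, 2^(e0+1))."""
    e0 = math.floor(math.log2(amax))   # baseline exponent, as in step 2 above
    if amax < 1.5 * 2.0 ** e0:         # lower half of the bin
        return 2.0 ** (e0 + 1)         # opportunistically doubled scale
    return 2.0 ** e0
```

For example, $\alpha_{\max} = 2.4$ lies in the lower half of $[2, 4)$ and receives the doubled scale $4$, while $\alpha_{\max} = 3.5$ sits in the upper half and keeps the baseline scale $2$.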

Macro Block Scaling (MBS)

Outliers often dominate quantization error due to the coarse E8M0 exponent without a block mantissa. MBS stores an additional 8-bit mantissa per macro-block (e.g., 128 elements), significantly improving dynamic range for the most challenging blocks. Static MBS (MBS-S) directly extracts the top 8 bits of the optimal mantissa, while Dynamic MBS (MBS-D) uses a LUT-driven search to minimize macro-block quantization error. MBS-D yields an extra $\sim 0.5$ dB QSNR over MBS-S (Chhugani et al., 30 Jan 2026).
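The static variant can be sketched under an assumed encoding: here the macro-block scale is taken to be $(1 + m/256) \cdot 2^{e}$, with $m$ the stored 8-bit mantissa extracted from the macro-block maximum. This form and the function name are assumptions for illustration; the paper's exact encoding may differ.

```python
import math

def mbs_s_scale(amax: float):
    """MBS-S sketch: augment the E8M0 exponent with a shared 8-bit mantissa.
    Assumed encoding: scale = (1 + m/256) * 2^e  (illustrative, not the
    paper's exact format)."""
    e = math.floor(math.log2(amax))
    frac = amax / 2.0 ** e - 1.0   # fractional part of the mantissa, in [0, 1)
    m = int(frac * 256)            # top 8 bits of the optimal mantissa
    return m, (1.0 + m / 256.0) * 2.0 ** e
```

With this encoding, a macro-block maximum of $5.0 = 1.25 \cdot 2^2$ yields $m = 64$ and an exact scale of $5.0$, whereas a mantissa-free E8M0 scale would have to round down to $4$.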

A hybrid OAS+MBS-D pipeline achieves accuracy within 1% of NVFP4 at only 6.2% GEMM overhead, with no hardware modifications (Chhugani et al., 30 Jan 2026).

5. Comparison with Competing Low-Bit Formats

MXFP is best understood against the backdrop of prior low-precision quantization approaches:

  • INT8: A fixed-point encoding with no per-element exponent; its representable dynamic range is therefore limited.
  • NVFP4 (NVIDIA): Pairs 4-bit elements with an E4M3 (FP8) block scale, whose mantissa bits improve scale precision at a higher hardware cost.
  • MXFP8 and MXFP6: Lossless/near-lossless at 8/6-bit, respectively. MXFP4, unless mitigated, introduces substantial loss (perplexity increase, accuracy drop) for both weights and activations.

MX+ (MXFP4+): An extension that detects the maximum element (block-max, BM) and reclaims its exponent bits as extra mantissa, sharply reducing BM quantization error with only a marginal storage increase (+0.25 bits/element) (Lee et al., 16 Oct 2025). This yields perplexity and accuracy competitive with FP16 or MXFP6 for LLM inference, while retaining MXFP4's computational and area advantages.

6. Practical Recommendations and Empirical Observations

Empirical results and deployment advice across MXFP studies:

  • MXFP8 (E4M3): Empirically lossless for both weights and activations on LLMs and vision models, even with naïve rounding (Zhang et al., 14 Jan 2026, Rouhani et al., 2023).
  • MXFP4 (E2M1): Unusable for activations unless mitigated (MX+, OAS, MBS); with mitigation, LLM test accuracy is within 1% of NVFP4, with an inference speed penalty below 10% (Chhugani et al., 30 Jan 2026, Lee et al., 16 Oct 2025).
  • For training, block normalization of layer-norm parameters can trigger divergence due to cluster saturation (all elements hitting the maximal code). Hybrids—such as only quantizing weights or using BF16 for activations/layer-norm—fully stabilize stochastic gradient descent and recover expected scaling laws (Su et al., 25 Jun 2025).
  • FFT/MRI imaging: Mantissa precision is the limiting factor for MXFP format fidelity; three mantissa bits (as in E4M3/E2M3) are strongly favored over single-bit (E2M1) (Deveshwar et al., 3 Dec 2025).

7. Hardware and Software Integration

The MXFP family’s hardware efficiency stems from block-aligned, power-of-two scaling and element-wise narrow multipliers. FPGA implementations exploit the absence of a mantissa in the block scale and the tiny mantissa multipliers, mapping these to LUTs rather than DSPs (DSP-free floating point for $n_m \leq 3$). ASIC support requires only a small fraction of extra SRAM and adder width for block sizes $N = 16, 32$; the combination of features enables area and energy scaling not possible with INT or prior FP formats (Samson et al., 2024).

MXFP quantization is now supported in open-source CUDA/PyTorch (OCP MX reference, Brevitas integration) as drop-ins for model quantization pipelines, with scale computation and block grouping matching hardware-execution patterns (Rouhani et al., 2023, Samson et al., 2024).

