MXFP4: 4-Bit Microscaling FP Format

Updated 3 July 2026
  • MXFP4 is a 4-bit microscaling floating-point format that employs a block-based structure, where 32 values share an 8-bit exponent, optimizing precision and efficiency.
  • It compresses data to 4.25 bits per element, significantly lowering memory and computation costs for AI inference in large language models.
  • Its quantization method balances scale bias, deadzone truncation, and grid noise, while leveraging hardware accelerators like NVIDIA Blackwell Tensor Cores for high performance.

MXFP4 is a 4-bit “microscaling” floating-point data format, standardized by the OCP Microscaling (MX) v1.0 specification, featuring a block-based structure in which each group of 32 values shares a single 8-bit exponent scale (E8M0), and each element is encoded in 4 bits using a sign (S=1), exponent (E=2), and mantissa (M=1) configuration. MXFP4 is widely adopted for efficient AI inference, especially LLMs, due to its ability to reduce model memory and computation costs while retaining floating-point semantics and maximizing hardware throughput on native MX-supporting accelerators such as NVIDIA Blackwell Tensor Cores and vendor-agnostic RISC-V or analog compute-in-memory backends.

1. MXFP4 Numerical Structure and Quantization

MXFP4's per-element encoding uses a 4-bit E2M1 floating-point layout:

  • Sign bit (σ): 1 bit
  • Exponent bits (E): 2 bits
  • Mantissa bit (M): 1 bit

The per-block scaling factor is an 8-bit E8M0 value (a power-of-two block exponent, often stored with bias 127). Each 32-element block $X_b$ is quantized with its own shared scaling factor $s_b$:

$$s_b = 2^{\left\lfloor \log_2 \left( \max_{x \in X_b} |x| \right) \right\rfloor - \beta}$$

where $\beta$ is the element exponent bias (1 for E2M1).
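The scale rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the all-zero-block handling and the clamping range are assumptions, and the offset of 1 simply follows the bias quoted in the text.

```python
import math

E8M0_BIAS = 127  # bias commonly used for the stored 8-bit block exponent
ELEM_BIAS = 1    # exponent offset quoted in the text for E2M1

def shared_scale_exponent(block):
    """Unbiased power-of-two exponent of the shared block scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        # Assumption: an all-zero block maps to the smallest scale.
        return -E8M0_BIAS
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    return math.frexp(amax)[1] - 1 - ELEM_BIAS

def encode_e8m0(exponent):
    """Bias and clamp the exponent into an 8-bit code (255 is left unused here)."""
    return max(0, min(254, exponent + E8M0_BIAS))

def decode_e8m0(code):
    """Recover the power-of-two scale from its 8-bit code."""
    return 2.0 ** (code - E8M0_BIAS)

block = [0.1, -6.0, 2.0]
code = encode_e8m0(shared_scale_exponent(block))
scale = decode_e8m0(code)  # largest magnitude divided by scale lands in [2, 4)
```

With this offset the block maximum always falls between 2 and 4 in scaled units, which is inside the E2M1 range, so no element saturates.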

Element dequantization (normal codes, $e \ge 1$):

$$x_i = (-1)^{\sigma} \cdot s_b \cdot 2^{e - 1} \cdot \left(1 + \frac{m}{2}\right)$$

For $e = 0$ the code is subnormal and decodes to $(-1)^{\sigma} \cdot s_b \cdot \frac{m}{2}$. The E2M1 codebook therefore yields the set $\{0, \pm 0.5, \pm 1, \pm 1.5, \pm 2, \pm 3, \pm 4, \pm 6\}$ (exact finite values), with the subnormal code supplying the smallest positive magnitude, 0.5. Infinity and NaN values are not supported at the element level; overflows saturate to the largest normal code ($\pm 6$).
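As a quick check of the codebook, the sketch below decodes all sixteen 4-bit patterns. It assumes the usual sign, exponent, mantissa bit order from most to least significant bit; `decode_e2m1` is a hypothetical helper name, and the block scale is left out so the values are unscaled.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (0..15) into its unscaled value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    e = (code >> 1) & 0b11   # 2 exponent bits
    m = code & 1             # 1 mantissa bit
    if e == 0:
        magnitude = 0.5 * m  # subnormal: 0 or 0.5
    else:
        magnitude = 2.0 ** (e - 1) * (1 + m / 2)  # normal, exponent bias 1
    return sign * magnitude

# +0 and -0 compare equal, so the 16 codes collapse to 15 distinct values.
codebook = sorted({decode_e2m1(c) for c in range(16)})
```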

Block quantization decomposes a tensor into non-overlapping blocks of 32 elements, computes $s_b$ per block, and rounds/scales each block's values onto the E2M1 floating-point lattice. This enables extremely compact representation: 4.25 bits/element (128 bits for 32 values plus 8 bits for the shared scale).
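A minimal quantize-then-dequantize sketch of one block follows. It is illustrative only: `quantize_block` is a hypothetical name, ties are broken toward the smaller magnitude rather than by round-to-nearest-even, and the scale offset of 1 follows the bias stated earlier in this section.

```python
import math

GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block, elem_bias=1):
    """Return (scale, dequantized values) for one block."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # floor(log2(amax)) via frexp, minus the element exponent offset
    scale = 2.0 ** (math.frexp(amax)[1] - 1 - elem_bias)
    dequantized = []
    for x in block:
        target = min(abs(x) / scale, GRID[-1])           # saturate at 6
        q = min(GRID, key=lambda g: abs(g - target))     # nearest grid point
        dequantized.append(math.copysign(q, x) * scale)
    return scale, dequantized

# Storage cost: 32 four-bit codes plus one 8-bit scale per block.
BITS_PER_ELEMENT = (32 * 4 + 8) / 32

scale, values = quantize_block([6.0, 0.7, -0.2] + [0.0] * 29)
```

In this example the small entry -0.2 falls below a quarter of the scale and is flushed to zero, which is the deadzone effect discussed in the next section.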

2. Error Structure and Theoretical Properties

Quantization error in MXFP4 decomposes into:

  1. Scale bias: Rounding the block scale to a power-of-two grid (E8M0) introduces multiplicative bias, causing the quantization grid to misalign with block maxima. Expected scale error can approach 44% RMSE for large $L$ in deep networks, with layerwise errors accumulating multiplicatively through backpropagation and harming SGD convergence in training, or inflating numerical errors during inference (Li et al., 19 May 2026).
  2. Deadzone truncation: Values with $|x_{b,i}|/s_b^* < 0.25$ are quantized to zero (deadzone). The deadzone probability is high in Laplace-like weight distributions; empirical studies estimate 9% of weights may vanish per block.
  3. Grid noise: The coarse 4-bit grid means per-element rounding error is $\mathcal{O}(s_b/2)$ for the smallest normal, increasing in blocks containing outliers due to larger $s_b$. Empirical analyses confirm that total MSE is shaped primarily by scale quantization error and deadzone truncation, with grid noise forming an irreducible floor (Li et al., 19 May 2026, Chhugani et al., 30 Jan 2026).
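To make the deadzone and grid-noise terms concrete, the sketch below splits a block's squared error into the part caused by values flushed to zero and the part caused by rounding on the E2M1 grid. It is a rough illustration under assumptions: `error_split` is a hypothetical helper, the power-of-two scale uses the offset of 1 from the first section, and the Laplace-like sample is synthetic rather than taken from any cited study.

```python
import math
import random

GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def error_split(block, elem_bias=1):
    """Return (deadzone fraction, deadzone squared error, grid squared error)."""
    amax = max(abs(x) for x in block)
    scale = 2.0 ** (math.frexp(amax)[1] - 1 - elem_bias)
    dead = 0
    dead_se = 0.0
    grid_se = 0.0
    for x in block:
        t = abs(x) / scale
        if x != 0.0 and t < 0.25:
            dead += 1            # nonzero value rounded to zero
            dead_se += x * x
        else:
            q = min(GRID, key=lambda g: abs(g - min(t, GRID[-1])))
            grid_se += (abs(x) - q * scale) ** 2
    return dead / len(block), dead_se, grid_se

# Synthetic Laplace-like block: random sign times an exponential magnitude.
random.seed(0)
laplace_block = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(32)]
fraction, deadzone_error, grid_error = error_split(laplace_block)
```

Scale bias is not separated out here; measuring it would require comparing against a real-valued (non-power-of-two) scale.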

3. Outlier Sensitivity and Block-Level Dynamics

A central challenge for MXFP4 is that a single extreme “outlier” within a block can inflate the shared scale $s_b$ by several orders of magnitude, which leaves the other 31 elements with an extremely coarse effective resolution. The worst-case per-element error is bounded by a term proportional to $s_b$, so a single activation spike can force all normal activations in the block onto a coarse grid (Lin et al., 20 Apr 2026, Shao et al., 6 Nov 2025). This is particularly problematic in transformer LLMs, where down-projection and up-projection layers are highly sensitive: diagnostic studies show a $\Delta$PPL of +8 when comparing FP16 protection on these layers against full MXFP4 quantization (Cim et al., 5 Mar 2026).
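The effect of a single spike can be reproduced with a toy block. The sketch below is illustrative only: the values 0.3 and 100.0 are made up, `quantize_block` is a hypothetical helper, and the scale rule uses the offset of 1 given in the first section.

```python
import math

GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def quantize_block(block, elem_bias=1):
    """Quantize one block with a shared power-of-two scale and dequantize it."""
    amax = max(abs(x) for x in block)
    scale = 2.0 ** (math.frexp(amax)[1] - 1 - elem_bias)
    return [
        math.copysign(min(GRID, key=lambda g: abs(g - min(abs(x) / scale, 6.0))), x) * scale
        for x in block
    ]

normal = [0.3] * 31
calm = quantize_block(normal + [0.3])     # no outlier in the block
spiky = quantize_block(normal + [100.0])  # one activation spike

# Worst error on the 31 ordinary elements in each case.
calm_err = max(abs(q - x) for q, x in zip(calm, normal))
spiky_err = max(abs(q - x) for q, x in zip(spiky, normal))
```

In the spiky block the scale jumps to 32, so every ordinary element falls into the deadzone and is reconstructed as zero, while the calm block keeps them on a fine grid.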

Empirically, block-size effects are critical. MXFP4 with group size 32 yields worse deadzone and scale error than NVFP4 (group size 16, higher scale mantissa), explaining the larger average performance drop of MXFP4 relative to competing block-FP4 formats (Egiazarian et al., 27 Sep 2025, Hu et al., 27 Jan 2026).
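The group-size part of this comparison can be illustrated in isolation. The sketch below is a simplification under stated assumptions: both group sizes use the same power-of-two scale rule, so it does not model NVFP4's higher-precision scale, and any per-tensor scale is ignored in the storage figure. The helper names are hypothetical.

```python
import math

def pot_scale(values, elem_bias=1):
    """Power-of-two shared scale for one group of values."""
    amax = max(abs(x) for x in values)
    return 2.0 ** (math.frexp(amax)[1] - 1 - elem_bias)

def bits_per_element(group_size, elem_bits=4, scale_bits=8):
    """Average storage cost when each group carries one scale."""
    return (group_size * elem_bits + scale_bits) / group_size

# Toy block: ordinary values in the first half, one spike in the second half.
block = [0.3] * 16 + [0.2] * 15 + [100.0]

scale_32 = pot_scale(block)                             # one scale for all 32
scales_16 = [pot_scale(block[:16]), pot_scale(block[16:])]  # one scale per 16

cost_32 = bits_per_element(32)
cost_16 = bits_per_element(16)
```

Halving the group confines the spike to one group, so the other 16 values keep a small scale, at the price of a higher per-element storage cost.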

4. Post-Training Quantization, Block-Wise Transformations, and Error Mitigation

MXFP4 quantization presents unique challenges; methods developed for int4 or tensor-level floating-point quantization often collapse in accuracy due to format mismatch. Leading findings include:

  • Global orthogonal rotations (e.g., QuaRot, SpinQuant): These propagate outlier energy across blocks, inflating regular-block scales and resulting in severe codebook underutilization (“bimodal clusters”), with accuracy recovery often below 90% (Li et al., 17 Mar 2026, Zhang et al., 14 Jan 2026, Shao et al., 6 Nov 2025).
  • Block-only transformations: Blockwise rotations (BRQ), outlier-aware greedy blockwise Givens or Householder constructions (DuQuant++), and block-diagonal learnable affine transforms (BATQuant) operate strictly inside each block. These prevent cross-block outlier propagation and permit both smoother value distributions and finer codebook coverage, yielding state-of-the-art accuracy in aggressive W4A4 quantization regimes (Lin et al., 20 Apr 2026, Shao et al., 6 Nov 2025, Li et al., 17 Mar 2026).
  • Learnable blockwise clipping: Fine-grained learnable clipping within each 32-element block suppresses residual extreme values without biasing the rest of the codebook, further reducing the effective quantization error (Li et al., 17 Mar 2026).
  • Affine histogram shaping: Relaxing the orthogonality constraint (as in BATQuant) favors compact, unimodal block-distributions to maximize codebook utilization and minimize deadzone truncation (Li et al., 17 Mar 2026, Xu et al., 19 May 2026).
  • Macro-block scaling and metadata augmentation: Techniques such as Overflow-Aware Scaling (OAS), Macro Block Scaling (MBS), and element/subgroup-level metadata (M$^2$XFP) allow a selective increase of block-scale precision or mantissa width at negligible hardware cost, shrinking the accuracy gap to NVFP4 or BF16 by a factor of 2–5 or more (Chhugani et al., 30 Jan 2026, Hu et al., 27 Jan 2026, Lee et al., 16 Oct 2025).

| Method | Outlier Mitigation | Block Coupling | Max. Recovery (W4A4) |
|---|---|---|---|
| Global Rotation (QuaRot) | Spreads outliers | Cross-block | <90% |
| Block Rotation (BRQ, DuQuant++, BATQuant) | Localizes | None | 95–99% |
| Macro Block Scaling (OAS/MBS) | Selective improve | None | <1% from NVFP4 |
| Metadata Augment (M$^2$XFP) | Top-1 correction | None | 70% loss reduction |

Blockwise strategies dominate global-rotation ones when using power-of-two (PoT) block scaling and 4-bit codebooks.
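The idea of a transform that stays inside one block can be sketched with a fixed, normalized Hadamard rotation. This is a stand-in chosen for simplicity, not the greedy or learned constructions used by BRQ, DuQuant++, or BATQuant; `block_hadamard` and `crest_factor` are hypothetical names.

```python
import math

def block_hadamard(block):
    """Normalized fast Walsh-Hadamard transform of one block (length a power of two)."""
    n = len(block)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of two"
    out = list(block)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]  # orthonormal, so energy is preserved

def crest_factor(values):
    """Peak magnitude divided by RMS; lower means a flatter block."""
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return max(abs(v) for v in values) / rms

spike = [8.0] + [0.0] * 31        # one outlier, 31 zeros
rotated = block_hadamard(spike)   # outlier energy spread over the same 32 slots
restored = block_hadamard(rotated)  # the normalized transform is its own inverse
```

Because the rotation touches only the 32 elements that share a scale, the outlier's energy never leaks into neighboring blocks, and the flatter block uses more of the 4-bit codebook.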

5. Hardware and Software Support

MXFP4 adoption is driven by its extremely efficient hardware mapping:

  • NVIDIA Blackwell Tensor Cores, AMD Ryzen, Intel AMX, Apple M-series, and RISC-V VMXDOTP extensions all support native MXFP4 GEMM, with per-block E8M0 exponent and 32×4-bit packed data (Wipfli et al., 5 Mar 2026, Lin et al., 20 Apr 2026, Liu et al., 4 Aug 2025).
  • Compact data layout: Storage is 4.25 bits/element (32×4 bit values plus an 8-bit scale per block).
  • Software frameworks: PyTorch TorchAO, DeepSpeed, vLLM, ML-SpecQD, and custom CUDA or AVX2 microkernels offer flexible kernel deployment and runtime quantization, often using tensor subclassing to represent MXFP4 natively within computational graphs (Or et al., 21 Jul 2025, Georganas et al., 17 Mar 2025, Liu et al., 4 Aug 2025).
  • Compute-in-memory: MXFormer demonstrates analog acceleration with CTT arrays, full digital/analog pipelines with per-block exponent alignment, and 10-bit ADC sampling, yielding 3–4× area and energy efficiency improvements at <1% accuracy loss (Karfakis et al., 12 Feb 2026).
  • ISA extensions: RVV 1.0 VMXDOTP supports block-FP dot products with software-definable block sizes at near-peak vector utilization; ~4.5× energy efficiency improvement vs. software emulation (Wipfli et al., 5 Mar 2026).
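The 4.25 bits/element figure corresponds to 17 bytes per block, which a short packing sketch makes visible. The layout here is an assumption for illustration: the scale byte is stored first and the low nibble holds the earlier element, whereas real kernels often keep scales in a separate tensor and may order nibbles differently. `pack_block` and `unpack_block` are hypothetical names.

```python
def pack_block(codes, scale_code):
    """Pack 32 four-bit element codes and one 8-bit scale code into 17 bytes."""
    assert len(codes) == 32 and all(0 <= c < 16 for c in codes)
    assert 0 <= scale_code < 256
    # Two codes per byte: even index in the low nibble, odd index in the high nibble.
    payload = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, 32, 2))
    return bytes([scale_code]) + payload

def unpack_block(buf):
    """Inverse of pack_block: return (codes, scale_code)."""
    assert len(buf) == 17
    scale_code, payload = buf[0], buf[1:]
    codes = []
    for byte in payload:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes, scale_code

codes = [i % 16 for i in range(32)]
packed = pack_block(codes, 128)
bits_per_element = len(packed) * 8 / 32
```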

6. Training and Deployment in LLMs

MXFP4 is used both for inference-only scenarios and, with significant algorithmic innovation, for efficient training.

7. Limitations and Format Evolution

Despite hardware and bandwidth advantages, MXFP4 with naive quantization displays large accuracy deficits (5–15% average accuracy loss, or PPL gaps >1.5) versus FP16/BF16, and is consistently outperformed by formats with fine-grained scaling or more metadata (e.g., NVFP4, SMX4, MX+). The main limiting factors are:

  • Irreducible error floor: Even after macro-block scaling and outlier correction, grid noise and deadzone truncation set a lower bound on quantization error (Li et al., 19 May 2026).
  • Block sensitivity: Small transformer blocks (MLP up/down, early/late blocks) are disproportionately sensitive and often require full-precision fallback or mixed-precision assignment to avoid output collapse (Cim et al., 5 Mar 2026).
  • Scale misalignment: Power-of-two scale grids introduce systematic bias relative to the ideal real-valued maxima (Hu et al., 27 Jan 2026).

Promising directions include low-overhead metadata (e.g., M$^2$XFP's 0.25 bits/element), hybrid block sizes, and codebook-aligned intra-block transformations (e.g., TORQ), which can close the downstream quality gap to within 1–2% on major LLM tasks (Xu et al., 19 May 2026, Hu et al., 27 Jan 2026, Chhugani et al., 30 Jan 2026).
