MXFP4: 4-Bit Microscaling FP Format

Updated 3 July 2026

MXFP4 is a 4-bit microscaling floating-point format that employs a block-based structure, where 32 values share an 8-bit exponent, optimizing precision and efficiency.
It compresses data to 4.25 bits per element, significantly lowering memory and computation costs for AI inference in large language models.
Its quantization method balances scale bias, deadzone truncation, and grid noise, while leveraging hardware accelerators like NVIDIA Blackwell Tensor Cores for high performance.

MXFP4 is a 4-bit “microscaling” floating-point data format standardized by the OCP Microscaling (MX) v1.0 specification. It uses a block-based structure in which each group of 32 values shares a single 8-bit exponent scale (E8M0), and each element is encoded in 4 bits with one sign bit (S=1), two exponent bits (E=2), and one mantissa bit (M=1). MXFP4 is widely adopted for efficient AI inference, especially for LLMs, because it reduces model memory and computation costs while retaining floating-point semantics and maximizing hardware throughput on accelerators with native MX support, such as NVIDIA Blackwell Tensor Cores, as well as on RISC-V vector extensions and analog compute-in-memory backends.

1. MXFP4 Numerical Structure and Quantization

MXFP4's per-element encoding uses a 4-bit E2M1 floating-point layout:

Sign bit (σ): 1 bit
Exponent bits (E): 2 bits
Mantissa bit (M): 1 bit

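The per-block scaling factor is an 8-bit E8M0 value (a pure power-of-two exponent, stored with bias 127). Each 32-element block $X_b$ is quantized with its own shared scaling factor $s_b$: $s_b = 2^{\left\lfloor \log_2 (\max_{x \in X_b} |x| ) \right\rfloor - e_{\max}}$ where $e_{\max}$ is the exponent of the largest E2M1 magnitude ($e_{\max} = 2$, since $6 = 1.5 \cdot 2^{2}$), following the OCP MX reference conversion. The E2M1 element exponent bias itself is 1.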

Element dequantization (normal codes, $e \ge 1$): $x_i = (-1)^{\sigma} \cdot s_b \cdot 2^{e - 1}\cdot \left(1 + \frac{m}{2}\right)$. For the subnormal codes ($e = 0$) the implicit leading 1 is dropped, giving $x_i = (-1)^{\sigma} \cdot s_b \cdot \frac{m}{2}$. The E2M1 codebook therefore yields the set $\{0, \pm 0.5, \pm 1, \pm 1.5, \pm 2, \pm 3, \pm 4, \pm 6\}$ (exact finite values, before scaling); the single subnormal magnitude, 0.5, is the smallest representable positive value. Infinity and NaN are not supported at the element level; overflows saturate to the largest normal code.
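The decoding rule above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `decode_e2m1` is hypothetical, and the bit order (sign in the most significant bit, then two exponent bits, then the mantissa bit) is an assumption about how the 4-bit code is laid out.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (assumed bit order: S EE M) to its unscaled value."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = 0.5 * man
    else:
        # Normal: implicit leading 1 and exponent bias 1.
        magnitude = 2.0 ** (exp - 1) * (1.0 + man / 2.0)
    return sign * magnitude


# Enumerating the eight non-negative codes reproduces the codebook.
E2M1_POSITIVE = [decode_e2m1(c) for c in range(8)]
print(E2M1_POSITIVE)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Multiplying a decoded value by the block scale $s_b$ gives the dequantized element.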

Block quantization decomposes a tensor into non-overlapping blocks of 32 elements, computes $s_b$ per block, and rounds/scales each block's values onto the E2M1 floating-point lattice. This enables extremely compact representation: 4.25 bits/element (128 bits for 32 values plus 8 bits for the shared scale).
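A minimal sketch of the per-block procedure, assuming round-to-nearest onto the E2M1 grid with saturation. The helper names are hypothetical. The offset subtracted from $\lfloor \log_2 \max |x| \rfloor$ is left as a parameter; the default of 2 (the exponent of the largest E2M1 magnitude) is an assumption taken from the OCP MX reference conversion. The handling of all-zero blocks and the tie-breaking rule are also simplifications of this sketch.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32


def scale_exponent(block, offset=2):
    """Return k such that the shared block scale is 2**k."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0  # assumption: an all-zero block uses scale 2**0
    _, e = math.frexp(amax)  # amax = m * 2**e with 0.5 <= m < 1
    return (e - 1) - offset  # e - 1 equals floor(log2(amax))


def encode_e8m0(k):
    """Store the exponent with bias 127; byte 255 is reserved for NaN."""
    return min(max(k + 127, 0), 254)


def quantize_block(block, offset=2):
    """Quantize one block; returns (E8M0 scale byte, dequantized values)."""
    k = scale_exponent(block, offset)
    scale = 2.0 ** k
    dequantized = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_GRID[-1])  # saturate at 6
        # Nearest grid point; exact ties go to the smaller magnitude here.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        dequantized.append(math.copysign(q, x) * scale)
    return encode_e8m0(k), dequantized


block = [12.0, 0.3, -5.0] + [0.0] * 29
scale_byte, values = quantize_block(block)
print(scale_byte, values[:3])  # 128 [12.0, 0.0, -4.0]

bits_per_element = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE
print(bits_per_element)  # 4.25
```

In the example the block maximum 12 gives a scale of 2, so 0.3 falls below the smallest nonzero code and becomes 0, while -5 lands on the tie between 4 and 6 in scaled units (2 and 3 on the grid) and rounds to -4.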

2. Error Structure and Theoretical Properties

Quantization error in MXFP4 decomposes into:

Scale bias: Rounding the block scale to a power-of-two grid (E8M0) introduces multiplicative bias, causing the quantization grid to misalign with block maxima. Expected scale error can approach 44% RMSE for large $L$ in deep networks, with layerwise errors accumulating multiplicatively through backpropagation and harming SGD convergence in training, or inflating numerical errors during inference (Li et al., 19 May 2026).
Deadzone truncation: Values with $|x_{b,i}|/s_b^* < 0.25$ are quantized to zero (deadzone). The deadzone probability is high in Laplace-like weight distributions; empirical studies estimate 9% of weights may vanish per block.
Grid noise: The coarse 4-bit grid means per-element rounding error is $\mathcal{O}(s_b/2)$ for the smallest normal, increasing in blocks containing outliers due to larger $s_b$. Empirical analyses confirm that total MSE is shaped primarily by scale quantization error and deadzone truncation, with grid noise forming an irreducible floor (Li et al., 19 May 2026, Chhugani et al., 30 Jan 2026).
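The three error sources can be observed numerically. The sketch below is an illustration under stated assumptions, not the analysis of the cited papers: it draws synthetic Laplace-distributed blocks, takes the "ideal" scale to be the one that maps the block maximum exactly onto 6, and uses a power-of-two scale with offset 2. All function names are hypothetical, and the numbers it prints are not meant to reproduce the figures quoted above.

```python
import math
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def pot_scale(amax, offset=2):
    """Power-of-two block scale 2**(floor(log2(amax)) - offset)."""
    _, e = math.frexp(amax)  # amax = m * 2**e with 0.5 <= m < 1
    return 2.0 ** ((e - 1) - offset)


def quantize_mag(mag):
    """Round a non-negative scaled magnitude to the grid, saturating at 6."""
    mag = min(mag, E2M1_GRID[-1])
    return min(E2M1_GRID, key=lambda g: abs(g - mag))


def block_stats(block):
    """Return (scale ratio, deadzone count, summed squared error) for a block."""
    amax = max(abs(x) for x in block)
    s = pot_scale(amax)
    ideal = amax / E2M1_GRID[-1]  # assumption: ideal scale maps amax onto 6
    dead = sum(1 for x in block if 0.0 < abs(x) / s < 0.25)
    sq_err = sum((abs(x) - s * quantize_mag(abs(x) / s)) ** 2 for x in block)
    return s / ideal, dead, sq_err


rng = random.Random(0)
# Laplace samples: an exponential magnitude with a random sign.
blocks = [[rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(32)]
          for _ in range(200)]
stats = [block_stats(b) for b in blocks]

ratios = [r for r, _, _ in stats]  # scale bias: PoT scale / ideal scale
deadzone_fraction = sum(d for _, d, _ in stats) / (200 * 32)
mse = sum(e for _, _, e in stats) / (200 * 32)
print(min(ratios), max(ratios), deadzone_fraction, mse)
```

Under these assumptions the scale ratio always lies between 0.75 and 1.5, which is the multiplicative misalignment introduced by restricting the scale to powers of two.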

3. Outlier Sensitivity and Block-Level Dynamics

A central challenge for MXFP4 is that a single extreme “outlier” within a block can inflate the shared scale $s_b$ by several orders of magnitude, which makes the quantization grid for the other 31 elements extremely coarse. The worst-case per-element error grows in proportion to $s_b$, so a single activation spike can force all normal activations in the block onto a coarse grid (Lin et al., 20 Apr 2026, Shao et al., 6 Nov 2025). This is particularly problematic in transformer LLMs, where down-projection and up-projection layers are highly sensitive: diagnostic studies show a $\Delta$PPL of +8 when comparing FP16 protection of these layers against full MXFP4 quantization (Cim et al., 5 Mar 2026).
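A small numerical illustration of the outlier effect, assuming a power-of-two scale with offset 2 and round-to-nearest with saturation. The values are synthetic and chosen to be exact binary fractions, and the helper name is hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, offset=2):
    """Return (scale, dequantized values) for one block."""
    amax = max(abs(x) for x in block)
    _, e = math.frexp(amax)  # floor(log2(amax)) == e - 1
    scale = 2.0 ** ((e - 1) - offset)
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_GRID[-1])
        # Nearest grid point; exact ties go to the smaller magnitude here.
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q, x) * scale)
    return scale, out


small = [(i + 1) / 32 for i in range(31)]  # 31 values in (0, 1)
calm = small + [0.5]    # block without an outlier
spiky = small + [96.0]  # same block with one activation spike

scale_calm, q_calm = quantize_block(calm)
scale_spiky, q_spiky = quantize_block(spiky)
zeros_calm = sum(1 for v in q_calm[:31] if v == 0.0)
zeros_spiky = sum(1 for v in q_spiky[:31] if v == 0.0)

print(scale_calm, zeros_calm)    # 0.125 1
print(scale_spiky, zeros_spiky)  # 16.0 31
```

With the spike present the scale jumps from 0.125 to 16, and every one of the 31 ordinary values falls into the deadzone.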

Empirically, block-size effects are critical. MXFP4 with group size 32 and a power-of-two scale yields worse deadzone and scale error than NVFP4 (group size 16, with an FP8 E4M3 scale that carries mantissa bits), explaining the larger average performance drop of MXFP4 relative to competing block-FP4 formats (Egiazarian et al., 27 Sep 2025, Hu et al., 27 Jan 2026).

4. Post-Training Quantization, Block-Wise Transformations, and Error Mitigation

MXFP4 quantization presents unique challenges; methods developed for int4 or tensor-level floating-point quantization often collapse in accuracy due to format mismatch. Leading findings include:

Global orthogonal rotations (e.g., QuaRot, SpinQuant): These propagate outlier energy across blocks, inflating the scales of otherwise regular blocks and causing severe codebook underutilization (“bimodal clusters”), with accuracy recovery often below 90% (Li et al., 17 Mar 2026, Zhang et al., 14 Jan 2026, Shao et al., 6 Nov 2025).
Block-only transformations: Blockwise rotations (BRQ), outlier-aware greedy blockwise Givens or Householder constructions (DuQuant++), and block-diagonal learnable affine transforms (BATQuant) operate strictly inside each block. These prevent cross-block outlier propagation and permit both smoother value distributions and finer codebook coverage, yielding state-of-the-art accuracy in aggressive W4A4 quantization regimes (Lin et al., 20 Apr 2026, Shao et al., 6 Nov 2025, Li et al., 17 Mar 2026).
Learnable blockwise clipping: Fine-grained learnable clipping within each 32-element block suppresses residual extreme values without biasing the rest of the codebook, further reducing the effective quantization error (Li et al., 17 Mar 2026).
Affine histogram shaping: Relaxing the orthogonality constraint (as in BATQuant) favors compact, unimodal block-distributions to maximize codebook utilization and minimize deadzone truncation (Li et al., 17 Mar 2026, Xu et al., 19 May 2026).
Macro-block scaling and metadata augmentation: Techniques such as Overflow-Aware Scaling (OAS), Macro Block Scaling (MBS), and element/subgroup-level metadata (M$^2$XFP) allow a selective increase of block-scale precision or mantissa width at negligible hardware cost, shrinking the accuracy gap to NVFP4 or BF16 by a factor of 2–5 or more (Chhugani et al., 30 Jan 2026, Hu et al., 27 Jan 2026, Lee et al., 16 Oct 2025).

Method	Outlier Mitigation	Block Coupling	Reported Result (W4A4)
Global Rotation (QuaRot)	Spreads outliers	Cross-block	<90% recovery
Block Rotation (BRQ, DuQuant++, BATQuant)	Localizes outliers	None	95–99% recovery
Macro Block Scaling (OAS/MBS)	Selective scale refinement	None	<1% from NVFP4
Metadata Augmentation (M$^2$XFP)	Top-1 correction	None	70% loss reduction

Block-local strategies consistently outperform global rotation-based ones when using power-of-two (PoT) block scaling and 4-bit codebooks.
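The difference between block-local and global transforms can be made concrete with a fixed block-wise Hadamard rotation. This is a simplified stand-in chosen for illustration: it is not the greedy Givens/Householder construction of DuQuant++ or the learned affine transform of BATQuant, and the function names are hypothetical. The sketch applies a normalized fast Walsh-Hadamard transform independently to each 32-element block, so an outlier is spread only over its own block.

```python
import math


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (orthogonal, self-inverse)."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def rotate_blocks(tensor, block_size=32):
    """Rotate each block independently; no value crosses a block boundary."""
    if len(tensor) % block_size:
        raise ValueError("tensor length must be a multiple of block_size")
    rotated = []
    for start in range(0, len(tensor), block_size):
        rotated.extend(fwht(tensor[start:start + block_size]))
    return rotated


x = [0.0] * 64
x[5] = 32.0   # outlier in block 0
x[40] = 1.0   # ordinary value in block 1
y = rotate_blocks(x)

print(max(abs(v) for v in y[:32]))  # outlier energy spread within block 0
print(max(abs(v) for v in y[32:]))  # block 1 is unaffected by the outlier
```

Because the normalized transform is its own inverse, applying `rotate_blocks` a second time recovers the original tensor up to floating-point error, which is what allows the rotation to be folded into adjacent layers.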

5. Hardware and Software Support

MXFP4 adoption is driven by its extremely efficient hardware mapping:

Hardware targets: NVIDIA Blackwell Tensor Cores, AMD Ryzen, Intel AMX, Apple M-series, and RISC-V VMXDOTP extensions are all reported to support MXFP4 GEMM, with a per-block E8M0 exponent and 32×4-bit packed data (Wipfli et al., 5 Mar 2026, Lin et al., 20 Apr 2026, Liu et al., 4 Aug 2025).
Compact data layout: Storage is 4.25 bits/element (32×4 bit values plus an 8-bit scale per block).
Software frameworks: PyTorch TorchAO, DeepSpeed, vLLM, ML-SpecQD, and custom CUDA or AVX2 microkernels offer flexible kernel deployment and runtime quantization, often using tensor subclassing to represent MXFP4 natively within computational graphs (Or et al., 21 Jul 2025, Georganas et al., 17 Mar 2025, Liu et al., 4 Aug 2025).
Compute-in-memory: MXFormer demonstrates analog acceleration with CTT arrays, full digital/analog pipelines with per-block exponent alignment, and 10-bit ADC sampling, yielding 3–4× area and energy efficiency improvements at <1% accuracy loss (Karfakis et al., 12 Feb 2026).
ISA extensions: RVV 1.0 VMXDOTP supports block-FP dot products with software-definable block sizes at near-peak vector utilization; ~4.5× energy efficiency improvement vs. software emulation (Wipfli et al., 5 Mar 2026).
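The 4.25 bits/element storage figure follows directly from the packed layout. The sketch below packs one block into 17 bytes. The layout is an assumption made for illustration (scale byte first, then two codes per byte with the even-indexed code in the low nibble); real kernels often store scales in a separate tensor and may order nibbles differently. The function names are hypothetical.

```python
def pack_block(codes, scale_byte):
    """Pack 32 four-bit E2M1 codes plus one E8M0 scale byte into 17 bytes."""
    if len(codes) != 32:
        raise ValueError("an MXFP4 block holds exactly 32 elements")
    packed = bytearray([scale_byte])
    for lo, hi in zip(codes[0::2], codes[1::2]):
        # Assumed nibble order: even-indexed code low, odd-indexed code high.
        packed.append(((hi & 0xF) << 4) | (lo & 0xF))
    return bytes(packed)


def unpack_block(buf):
    """Inverse of pack_block; returns (codes, scale_byte)."""
    if len(buf) != 17:
        raise ValueError("a packed block is 17 bytes long")
    codes = []
    for byte in buf[1:]:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes, buf[0]


codes = [i % 16 for i in range(32)]
packed = pack_block(codes, scale_byte=127)
print(len(packed) * 8 / 32)  # 4.25 bits per element
print(unpack_block(packed) == (codes, 127))  # True
```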

6. Training and Deployment in LLMs

MXFP4 is used both for inference-only scenarios and, with significant algorithmic innovation, for efficient training:

Training: Bias and convergence issues are addressed using unbiased stochastic rounding (SR), blockwise pre-scaling (e.g., by a factor of 3/4), random Hadamard transforms to bound SR variance, and truncation-free scaling (Tseng et al., 27 Feb 2025, Chen et al., 28 Feb 2025). Co-designs for MoE architectures enable critical memory and communication compression (Zhang et al., 3 Mar 2026).
Inference: State-of-the-art LLM quantization typically uses W4A8 or mixed-precision pathways, with robust recovery in vision, reasoning, and zero-shot tasks once block-local error structures are addressed (Egiazarian et al., 27 Sep 2025, Zhang et al., 14 Jan 2026, Cim et al., 5 Mar 2026). Strict block rotation eliminates the cross-block variance imbalance and codebook collapse responsible for MXFP4's naive underperformance.
Speculative decoding: MXFP4 enables plug-and-play drafts in multi-level speculative decoding pipelines, directly cast from full-precision weights, yielding 1.8–2.7× end-to-end speedup over BF16, with negligible accuracy loss (Georganas et al., 17 Mar 2025).
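The unbiased stochastic rounding used in the training recipes above can be sketched on the E2M1 magnitude grid. This covers only the rounding step, under the assumption that the input has already been divided by the block scale; pre-scaling and Hadamard transforms are not shown. The function name is hypothetical, and the uniform sample is passed in explicitly so the behavior is easy to check.

```python
import bisect
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(mag, u):
    """Round a scaled magnitude to a neighboring grid point.

    `u` is a uniform sample in [0, 1). The upper neighbor is chosen with
    probability proportional to the distance from the lower neighbor, so the
    expected result equals `mag` for in-range inputs.
    """
    mag = min(max(mag, 0.0), E2M1_GRID[-1])  # clamp to the representable range
    hi_idx = bisect.bisect_left(E2M1_GRID, mag)
    if E2M1_GRID[hi_idx] == mag:
        return mag  # already on the grid
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    return hi if u < p_up else lo


print(stochastic_round(2.5, 0.49))  # 3.0
print(stochastic_round(2.5, 0.50))  # 2.0

# Empirical check of unbiasedness with a seeded generator.
rng = random.Random(0)
samples = [stochastic_round(2.5, rng.random()) for _ in range(20000)]
mean = sum(samples) / len(samples)
print(abs(mean - 2.5) < 0.05)  # True
```

Round-to-nearest would map 2.5 deterministically to one neighbor and introduce a systematic bias in accumulated gradients; stochastic rounding trades that bias for variance, which is why the cited methods pair it with variance-bounding transforms.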

7. Limitations and Format Evolution

Despite hardware and bandwidth advantages, MXFP4 with naive quantization displays large accuracy deficits (5–15% average accuracy loss, or PPL gaps >1.5) versus FP16/BF16, and is consistently outperformed by formats with fine-grained scaling or more metadata (e.g., NVFP4, SMX4, MX+). The main limiting factors are:

Irreducible error floor: Even after macro-block scaling and outlier correction, grid noise and deadzone truncation set a lower bound on quantization error (Li et al., 19 May 2026).
Block sensitivity: Small transformer blocks (MLP up/down, early/late blocks) are disproportionately sensitive and often require full-precision fallback or mixed-precision assignment to avoid output collapse (Cim et al., 5 Mar 2026).
Scale misalignment: Power-of-two scale grids introduce systematic bias relative to the ideal real-valued maxima (Hu et al., 27 Jan 2026).

Promising directions include low-overhead metadata (e.g., M$^2$XFP’s 0.25 bits/element), hybrid block sizes, and codebook-aligned intra-block transformations (e.g., TORQ), which can close the downstream quality gap to within 1–2% for major LLM tasks (Xu et al., 19 May 2026, Hu et al., 27 Jan 2026, Chhugani et al., 30 Jan 2026).

References:

(Lin et al., 20 Apr 2026) DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
(Li et al., 17 Mar 2026) BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization
(Shao et al., 6 Nov 2025) Block Rotation is All You Need for MXFP4 Quantization
(Li et al., 19 May 2026) Decomposing MXFP4 quantization error for LLM reinforcement learning
(Hu et al., 27 Jan 2026) M$^2$XFP: A Metadata-Augmented Microscaling Data Format…
(Chhugani et al., 30 Jan 2026) Unveiling the Potential of Quantization with MXFP4…
(Egiazarian et al., 27 Sep 2025) Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
(Zhang et al., 14 Jan 2026) Benchmarking Post-Training Quantization of LLMs under Microscaling Floating Point Formats
(Zhang et al., 3 Mar 2026) Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
(Cim et al., 5 Mar 2026) Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4
(Xu et al., 19 May 2026) TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization
(Georganas et al., 17 Mar 2025) ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
(Chen et al., 28 Feb 2025) Oscillation-Reduced MXFP4 Training for Vision Transformers
(Wipfli et al., 5 Mar 2026) VMXDOTP: A RISC-V Vector ISA Extension for Efficient Microscaling (MX) Format Acceleration
(Karfakis et al., 12 Feb 2026) MXFormer: A Microscaling Floating-Point Charge-Trap Transistor Compute-in-Memory Transformer Accelerator
(Liu et al., 4 Aug 2025) MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for LLMs
(Or et al., 21 Jul 2025) TorchAO: PyTorch-Native Training-to-Serving Model Optimization
(Vasilev, 8 Jun 2026) An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors…