MXFP4: 4-Bit Floating-Point Microscaling

Updated 19 November 2025
  • Microscaling FP4 (MXFP4) is a 4-bit floating-point quantization format that uses shared block exponents to balance extreme data compaction with a wide dynamic range for neural networks.
  • It employs block-wise shared scaling, stochastic rounding, and Hadamard transforms to mitigate quantization artifacts, ensuring stable performance in large language and vision models.
  • MXFP4 has become central to hardware, training, and post-training quantization research, enabling 4–5× acceleration and enhanced memory efficiency on next-generation GPU and NPU architectures.

Microscaling FP4 (MXFP4) is a 4-bit floating-point quantization format that, through block-wise shared scaling, enables substantial improvements in computational throughput and memory efficiency for deep neural networks, notably LLMs and vision transformers. Its design strategically balances the extreme data compaction of low-precision arithmetic with a dynamic range suited for real-world neural workloads, leveraging per-block exponent sharing to mitigate the quantization artifacts that would otherwise cripple direct 4-bit FP deployments. MXFP4 has rapidly become central to hardware, training, and post-training quantization research in the wake of widespread support for FP4 microscaling on next-generation GPU and NPU architectures.

1. Data Format Specification and Representation

MXFP4 (Microscaling FP4, E2M1) encodes each real value using a 4-bit floating-point element and augments this packing with a shared block exponent covering contiguous groups of typically 16 or 32 elements. The element-level code includes:

  • 1 sign bit ($s$)
  • 2 exponent bits ($e$), bias 1
  • 1 mantissa bit ($m$)

For a block of $k$ elements, all values share an 8-bit group exponent $E_s$ (commonly stored as a power-of-two E8M0 scale, or as an FP8 E4M3 scale in related variants). The decoded value is:

$x_i = (-1)^{s_i} \times 2^{(E_s - \mathrm{Bias}_s) + (e_i - 1)} \times (1.0 + 0.5 \cdot m_i)$

where $E_s \in [0, 255]$, providing a dynamic range of $2^{-128} \ldots 2^{130}$ in typical microkernel and accelerator implementations (Cuyckens et al., 28 May 2025, Cuyckens et al., 9 Nov 2025). The shared exponent allows for subnormal representation through group scaling, which is essential for handling the large dynamic range and kurtosis of neural activations.

Each 4-bit element is reconstructed using the block's scale:

$\widehat{x}_i = s \times q_i$

where $q_i \in \{\text{all possible 4-bit E2M1 codes}\}$ and $s = 2^{E_s - 1}$ for a pure E8M0 scale, or $s$ is quantized in the selected block FP8 format (Gorodecky et al., 5 Nov 2024, Egiazarian et al., 27 Sep 2025). Storage overhead is typically 4.25 bits per value (4 bits per element plus 8 bits per 32-element block).
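As a concrete illustration, the following sketch decodes 4-bit E2M1 codes against a shared E8M0 block exponent. It is a minimal reference model, not a library API: the helper names and the bias value of 127 are assumptions, and the subnormal branch ($e = 0$, giving $0$ and $0.5$) follows the standard E2M1 code set, which the normalized formula above leaves implicit.

```python
import numpy as np

# Minimal MXFP4 decode sketch (assumed helper names; the E8M0 bias of 127 is an
# assumption for illustration). The subnormal branch (e == 0) follows the
# standard E2M1 code set {0, 0.5, 1, 1.5, 2, 3, 4, 6} with sign.
E8M0_BIAS = 127

def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (s:1, e:2, m:1) to its unscaled value."""
    s = (code >> 3) & 0x1
    e = (code >> 2) & 0x3
    m = code & 0x1
    if e == 0:                               # subnormal codes: 0 and 0.5
        mag = 0.5 * m
    else:                                    # normal codes: (1 + 0.5*m) * 2^(e-1)
        mag = (1.0 + 0.5 * m) * 2.0 ** (e - 1)
    return (-1.0) ** s * mag

def decode_block(codes: np.ndarray, shared_exp: int) -> np.ndarray:
    """Apply the shared block scale s = 2^(E_s - bias) to the decoded elements."""
    scale = 2.0 ** (shared_exp - E8M0_BIAS)
    return scale * np.array([decode_e2m1(int(c)) for c in codes])
```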

2. Quantization Algorithm, Error Properties, and Rounding

The MXFP4 quantization process is blockwise and comprises three principal steps:

  1. Partition the tensor into non-overlapping groups of $k$ elements.
  2. For each block, compute the shared scale $s$ as a power-of-two or high-resolution FP8 value covering the maximum absolute value in the block.
  3. Quantize each element individually by dividing by $s$ and rounding to the nearest representable FP4 code (using either deterministic round-to-nearest, or stochastic rounding for unbiasedness in gradients and backward passes).

The error per quantized value is proportional to $0.25\,s$ (the FP4 machine epsilon) (Gorodecky et al., 5 Nov 2024, Cuyckens et al., 28 May 2025). However, when outliers exist within a block, $s$ becomes large, and the precision with which the remaining elements are represented decreases dramatically.
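A minimal sketch of the three steps, assuming a pure power-of-two (E8M0-style) shared scale and operating on already-unpacked floating-point values; the helper names are illustrative and deliberately simple rather than bit-exact:

```python
import numpy as np

# The 8 non-negative E2M1 magnitudes; the full signed grid drops the duplicate zero.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FULL_GRID = np.concatenate([-E2M1_GRID[:0:-1], E2M1_GRID])

def quantize_block(x, stochastic=False, rng=None):
    """Quantize one block to MXFP4 and return the dequantized values
    (a minimal reference sketch, not a packed-bit kernel)."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x)
    # Step 2: power-of-two shared scale chosen so the block max stays within
    # the E2M1 maximum of 6.0.
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    y = np.clip(x / scale, FULL_GRID[0], FULL_GRID[-1])
    if not stochastic:
        # Step 3, deterministic: round to the nearest representable code.
        q = FULL_GRID[np.abs(y[:, None] - FULL_GRID[None, :]).argmin(axis=1)]
    else:
        # Step 3, stochastic: round up or down between the two neighbouring
        # codes with probability proportional to proximity (unbiased in
        # expectation, as used for gradients in training recipes).
        rng = rng or np.random.default_rng()
        hi = np.clip(np.searchsorted(FULL_GRID, y), 1, len(FULL_GRID) - 1)
        lo = hi - 1
        p_up = (y - FULL_GRID[lo]) / (FULL_GRID[hi] - FULL_GRID[lo])
        q = np.where(rng.random(y.shape) < p_up, FULL_GRID[hi], FULL_GRID[lo])
    return scale * q
```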

Stabilization strategies target this outlier sensitivity. In post-training quantization (PTQ), hybrid methods such as MR-GPTQ (blockwise Hadamard rotations with groupwise scale search) recover most of the accuracy loss observed in naive MXFP4 quantizers (Egiazarian et al., 27 Sep 2025).
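The blockwise-rotation idea behind such methods can be illustrated as follows: each block is multiplied by a small orthogonal (Hadamard) matrix before quantization and by its transpose after dequantization, so outlier energy is spread within the block rather than leaked across blocks. This is a conceptual sketch reusing `quantize_block` from the sketch above, not the MR-GPTQ algorithm itself:

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_quantize(x, k=32, stochastic=False):
    """Blockwise rotation sketch: rotate each k-element block with a normalized
    Hadamard matrix, quantize with quantize_block (Section 2 sketch), then
    rotate back. Assumes len(x) is a multiple of k and k is a power of two."""
    H = hadamard(k) / np.sqrt(k)             # orthogonal: H @ H.T == I
    blocks = x.reshape(-1, k)
    out = np.empty_like(blocks)
    for i, b in enumerate(blocks):
        rotated = H @ b                      # spread outlier energy within the block
        out[i] = H.T @ quantize_block(rotated, stochastic=stochastic)
    return out.reshape(x.shape)
```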

3. Applications in Neural Network Training and Inference

MXFP4 is used in both training and inference across a range of domains:

  • LLM Training: Recent work has demonstrated near-lossless fully quantized training of GPT and LLaMA-style LLMs, with gradients, activations, and weights stored and manipulated in MXFP4 or its variants (NVFP4). These schemes often deploy stochastic rounding for gradients/updates and round-to-nearest for forward passes. Notably, group sizes and scale format selection (e.g., block size 16 for NVFP4, E4M3 scale) are critical for stability in large LLMs (Chmiel et al., 25 May 2025, Hu et al., 22 Sep 2025, Wang et al., 28 Jan 2025).
  • Vision Transformers: MXFP4 with stabilization methods (e.g., Q-EMA, Q-Ramping) enables sub-1% accuracy loss compared to BF16, provided oscillation of quantized weights is actively countered (Chen et al., 28 Feb 2025).
  • Diffusion Transformers: FP4 with hybrid groupwise scaling, in conjunction with outlier mitigation via Hadamard transforms, achieves sub-0.2 sFID degradation and $>5\times$ speedup for DiT models, even without fine-tuning (Liu et al., 30 May 2024).
  • Attention Acceleration: MXFP4 enables 4–5× speedup over FP16 in attention (e.g., SageAttention3) with negligible quality loss; here, the block size for FP4 microscaling is typically 16 or 32 (Zhang et al., 16 May 2025).

In inference, careful allocation of precision budget via channel-wise quantization thresholding maximizes the efficiency/accuracy trade-off, and mixed-precision kernels (e.g., MicroMix) dynamically select FP4, FP6, or FP8 per channel or block (Liu et al., 4 Aug 2025).
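The channel-wise precision budgeting can be pictured with a simple thresholding rule. The policy below is hypothetical (the thresholds and the max/median outlier proxy are assumptions, not the MicroMix criterion); it only illustrates how channels might be promoted from FP4 to FP6 or FP8:

```python
import numpy as np

def assign_channel_precision(weights, fp6_threshold=16.0, fp8_threshold=64.0):
    """Hypothetical per-channel precision policy: promote channels with a large
    max/median magnitude ratio (a crude outlier proxy) from FP4 to FP6 or FP8.
    Thresholds and criterion are illustrative, not taken from MicroMix."""
    mags = np.abs(weights)
    ratios = mags.max(axis=0) / (np.median(mags, axis=0) + 1e-12)
    precision = np.full(weights.shape[1], "fp4", dtype=object)
    precision[ratios > fp6_threshold] = "fp6"
    precision[ratios > fp8_threshold] = "fp8"
    return precision
```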

4. Hardware Architectures and Implementation

MXFP4 formats are natively supported in state-of-the-art hardware platforms, including NVIDIA Blackwell, AMD MI400, AWS Trainium, and recent neural processing units (NPUs) (Cuyckens et al., 28 May 2025, Cuyckens et al., 9 Nov 2025). Key features and techniques:

  • Sub-word parallelism: E2M1 FP4 arithmetic decomposes into arrays of 2×2 integer multipliers and 2-bit adders, maximizing lane utilization.
  • Block exponent management: Block exponent distribution/reduction logic aligns group mantissas for accumulation, with typical grouping along tile, row, or channel axes.
  • Hybrid accumulator trees: Mixed-precision accumulation (e.g., 16- or 24-bit mantissa) ensures adder errors remain well below quantization noise, yielding high throughput (up to 4× the INT8-mode rate) and energy efficiency ($\sim 4000$ GOPS/W in MXFP4 mode) (Cuyckens et al., 9 Nov 2025, Cuyckens et al., 28 May 2025).
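A software emulation of this dataflow, reusing `decode_e2m1` from the Section 1 sketch, shows how element products reduce to small-integer arithmetic while the two shared block exponents are applied once after a wide accumulation (an illustrative model, not an RTL description):

```python
import numpy as np

def mxfp4_block_dot(codes_a, exp_a, codes_b, exp_b):
    """Emulate one block of an MXFP4 dot product: element values (multiples of
    0.5) are doubled into exact small integers, multiplied and summed in a wide
    integer accumulator, and the two shared E8M0 block exponents are applied
    once at the end (bias of 127 assumed, as in the Section 1 sketch)."""
    qa = (np.array([decode_e2m1(int(c)) for c in codes_a]) * 2).astype(np.int32)
    qb = (np.array([decode_e2m1(int(c)) for c in codes_b]) * 2).astype(np.int32)
    acc = int(np.sum(qa * qb))               # wide accumulator, |product| <= 144
    return acc * 2.0 ** (exp_a + exp_b - 2 * 127) / 4.0  # undo the two *2 factors
```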

An FPGA-based converter (32 inputs, E2M1, no RAM or DSP blocks) uses under 2,000 LUTs and $<200$ mW at 15 MHz, with zero-cycle latency and $<5\%$ worst-case relative error (Gorodecky et al., 5 Nov 2024). ASIC implementations achieve sub-mm² area cost for full GEMM crossbars in MXFP4.

System-level integration exposes direct control over block size, precision mode, tiling, and streamers for grouped data movement, and enables on-the-fly switching among MXFP4, MXFP8, MXINT8, and hybrid precision modes depending on workload and accuracy requirements.

5. Limitations, Extensions, and Mitigation Strategies

MXFP4's power-of-two group scale covers a wide dynamic range, but it remains sensitive to outliers and introduces significant groupwise asymmetry error:

  • Outlier Amplification: A single "block-max" forces the exponent so high that the rest of the block quantizes to zero, with the outlier value itself often represented with high absolute error due to the 1-bit mantissa (Lee et al., 16 Oct 2025).
  • Group Asymmetry: Especially for small block sizes or nonzero-mean group distributions, a single shared scale fails to fit all elements, introducing nonzero mean-bias and increased MSE (Lee et al., 15 Nov 2024).
  • Incompatibility with Global Rotation: Standard rotation-based PTQ (QuaRot, SpinQuant) spreads outliers, increasing most blocks’ maximal value and thus escalating group exponent and global quantization error. Blockwise rotation resolves this by restricting rotation within each block.
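The outlier-amplification effect is easy to reproduce with the `quantize_block` sketch from Section 2: one large element pushes the power-of-two scale up by several binades, the small elements fall below the smallest nonzero code, and the outlier itself is still coarsely represented because of the 1-bit mantissa.

```python
import numpy as np

# A 32-element block with one outlier (55.0): the power-of-two scale becomes 16,
# so the sub-0.5*scale entries quantize to 0.0, while the outlier itself lands
# on the coarse code 3.0 (dequantized to 48.0) due to the 1-bit mantissa.
block = np.array([0.3, -0.2, 0.25, 55.0] + [0.1] * 28)
print(quantize_block(block))
```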

Mitigation and extensions:

  • AMXFP4: Incorporates asymmetric scales (separate for positive and negative group entries) and doubles group scale registers per block, eliminating much of groupwise mean bias at minimal area cost (+10%) (Lee et al., 15 Nov 2024).
  • MX+: Repurposes the block-max element's exponent field as additional mantissa bits, raising the effective resolution of the statistical outlier that dominates the group scale. Empirically, the MXFP4+ variant achieves a roughly 20-point accuracy increase on Llama-3.1-8B with <1% latency overhead and <0.25 bits/element of additional storage (Lee et al., 16 Oct 2025).
  • MR-GPTQ and Block Rotation PTQ: GPTQ-inspired direct minimization of MSE with fused blockwise Hadamard transformations or orthogonal rotations within each block. This approach restores the majority of performance lost to naive PoT scaling, outperforming global rotation and classic PTQ in MXFP4 settings (Egiazarian et al., 27 Sep 2025, Shao et al., 6 Nov 2025).
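The asymmetric-scale idea behind AMXFP4 can be sketched by giving the positive and negative entries of each block their own scale, which removes the mean bias a single shared scale introduces for skewed blocks. This is a conceptual illustration reusing `quantize_block` from Section 2, not the paper's exact encoding or register layout:

```python
import numpy as np

def quantize_block_asymmetric(x):
    """Conceptual AMXFP4-style sketch: quantize the positive and negative
    entries of a block with independent shared scales (via quantize_block from
    Section 2), removing the mean bias of a single symmetric scale."""
    out = np.zeros_like(x)
    pos, neg = x > 0, x < 0
    if pos.any():
        out[pos] = quantize_block(x[pos])
    if neg.any():
        out[neg] = quantize_block(x[neg])
    return out
```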

6. Practical Deployment Guidelines and Performance

Practical recipes and considerations for deploying MXFP4 include:

  • Scale format selection: E8M0 (PoT) for maximal range, E4M3 (NVFP4) or UE5M3 for finer granularity in LLMs above 1B parameters (Hu et al., 22 Sep 2025, Chmiel et al., 25 May 2025).
  • Block size: $k=16$ (NVFP4) for enhanced accuracy; $k=32$ (MXFP4) for standard deployment; $k=64$ for hardware-optimized workloads with minor degradation.
  • Rounding methods: Round-to-nearest for forward passes; stochastic rounding is essential for gradient updates in training (Wang et al., 28 Jan 2025, Chmiel et al., 25 May 2025, Hu et al., 22 Sep 2025).
  • Hadamard/block rotations: Only effective when the block size is not minimal; rotations give a nontrivial benefit for MXFP4 at $k=32$ but not for NVFP4 at $k=16$ (Egiazarian et al., 27 Sep 2025).
  • Hardware support: Confirm native FP4/INT4 math and on-core support for shared exponent grouping, with attention to integration of scale buffers and quantize/dequantize microcode.
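These guidelines can be collected into a configuration sketch. The field names and defaults below are illustrative placeholders, not an existing library's API:

```python
from dataclasses import dataclass

@dataclass
class FP4QuantConfig:
    """Illustrative deployment knobs for microscaled FP4 (hypothetical names)."""
    block_size: int = 32           # 16 (NVFP4-style), 32 (MXFP4), 64 (hardware-optimized)
    scale_format: str = "e8m0"     # "e8m0" for range; "e4m3" / "ue5m3" for finer scales
    forward_rounding: str = "rtn"          # round-to-nearest for forward passes
    gradient_rounding: str = "stochastic"  # stochastic rounding for gradients/updates
    block_rotation: bool = True    # blockwise Hadamard; mainly useful at block_size >= 32

# Example: a standard MXFP4 training setup following the guidelines above.
mxfp4_train = FP4QuantConfig(block_size=32, scale_format="e8m0")
```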

Performance benchmarks demonstrate up to 4–5× acceleration over FP16 with negligible loss in accuracy when applied to LLM serving, diffusion models, and vision transformers, provided mitigation strategies for outlier and block-max artifacts are implemented (Liu et al., 4 Aug 2025, Zhang et al., 16 May 2025, Cuyckens et al., 9 Nov 2025). Direct inference with standard MXFP4 on challenging sequences yields severe accuracy collapse; extensions such as MXFP4+ or AMXFP4 deliver near-BF16 results at minimal additional hardware or latency cost (Lee et al., 16 Oct 2025, Lee et al., 15 Nov 2024).

