MXFP4 Datatype: 4-bit Floating-Point for AI

Updated 29 October 2025
  • MXFP4 datatype is a 4-bit floating-point representation that uses an E2M1 per-element format with an 8-bit shared scale to achieve efficient low-precision computation.
  • It employs block quantization by partitioning tensors into 32-element blocks and applying power-of-two scaling to optimize inference on edge devices and GPUs.
  • Its performance trade-offs include quantization error from coarse scale grids and outlier sensitivity, driving innovations like MXFP4+ and MR-GPTQ.

The MXFP4 datatype is a 4-bit floating-point numerical representation designed for efficient, low-precision computation in AI workloads. MXFP4 has become a focal point for both hardware and algorithmic innovation, supporting a spectrum of applications from inference acceleration in edge devices to LLM quantization on advanced GPUs. The key feature of MXFP4 is its microscaling design: numerical values are divided into blocks, each block sharing a scale exponent to enable a broad dynamic range even at only four bits per element. This approach combines compact storage and high throughput with hardware-friendly implementations, but also imposes significant format-specific constraints on quantization fidelity and error management.

1. Format Definition and Structural Properties

MXFP4 implements a block floating-point structure using the E2M1 format (1 sign bit, 2 exponent bits, 1 mantissa bit) per element and an 8-bit shared scale (E8M0) per block, typically grouping 32 elements. Representation of an element involves:

  • Bit structure (per element):
    • 1 sign bit
    • 2 exponent bits (with bias)
    • 1 mantissa bit
  • Block structure:
    • 32 elements, each 4 bits
    • 1 shared scaling factor encoded as E8M0 (power-of-two only)
    • Total: 128 bits for data, 8 bits for scale per block

The mathematical encoding for reconstructing a value $x_i$ from its quantized codeword $q_i$ is:

$$x_i = s \cdot \mathrm{FP4}(q_i)$$

where $s = 2^{e-127}$ for $e$ the 8-bit shared exponent and $\mathrm{FP4}(\cdot)$ denotes decoding of the 4-bit E2M1 codeword. The representable grid is non-uniform, covering the values $\{-6, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ in the default configuration (Liu et al., 23 Jul 2025, Gorodecky et al., 5 Nov 2024).
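
To make the encoding concrete, the following Python sketch decodes an E2M1 codeword and applies the shared scale; the function names are illustrative, not from any particular library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 codeword: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 0x1 else 1.0
    exp = (code >> 1) & 0x3  # 2 exponent bits (bias 1)
    man = code & 0x1         # 1 mantissa bit
    if exp == 0:
        value = man * 0.5                             # subnormal: no implicit leading 1
    else:
        value = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)  # normal: implicit 1, bias 1
    return sign * value

def decode_mxfp4(code: int, e: int) -> float:
    """Reconstruct x_i = s * FP4(q_i), with s = 2^(e - 127) from the E8M0 scale."""
    return 2.0 ** (e - 127) * decode_e2m1(code)

# Positive half of the representable grid: [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print([decode_e2m1(c) for c in range(8)])
```

Iterating over the eight positive codewords reproduces the positive half of the grid listed above.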

2. Quantization Methodology and Workflow

Blockwise quantization in MXFP4 consists of:

  1. Partitioning: Split the input tensor into blocks of 32 elements.
  2. Scale selection: Compute the maximum absolute value in the block and round it to the nearest representable E8M0 (power-of-two) value to determine the shared scale. In formal terms:

$$\mathrm{shared\_exp} = \max_i \left( \left\lfloor \log_2 |x_i| \right\rfloor \right) - e_{\mathrm{max}}$$

$$s = 2^{\mathrm{shared\_exp}}$$

  3. Elementwise quantization: Divide each value by the scale, map it to the nearest E2M1 codeword, and clamp if necessary.
  4. Packing & storage: Store the resulting 4-bit codes and the 8-bit scale per block compactly.

This form of block floating-point quantization is compatible with efficient hardware implementations; because the scaling is power-of-two, scale multiplication can be realized via bit-shifting, avoiding high-latency multiply operations (Gorodecky et al., 5 Nov 2024, Lokhande et al., 16 Dec 2024).
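
A minimal NumPy sketch of this four-step workflow follows; the grid constant and function names are illustrative, and real kernels would pack the 4-bit codes rather than keep index arrays.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # positive E2M1 values
E_MAX = 2  # exponent of the largest E2M1 magnitude (6 = 1.5 * 2^2)

def quantize_mxfp4_block(x: np.ndarray):
    """Steps 1-3: quantize one 32-element block to (shared_exp, signs, grid indices)."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return 0, np.ones_like(x), np.zeros(x.shape, dtype=int)
    # Step 2: shared_exp = floor(log2 max|x_i|) - e_max; the stored E8M0 byte
    # would be shared_exp + 127 (biased).
    shared_exp = int(np.floor(np.log2(amax))) - E_MAX
    # Step 3: divide by the scale, clamp to the grid range, round to nearest codeword.
    scaled = np.clip(np.abs(x) / 2.0 ** shared_exp, 0.0, FP4_GRID[-1])
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return shared_exp, np.sign(x), idx

def dequantize_mxfp4_block(shared_exp, signs, idx):
    """Reconstruct x_hat = s * FP4(q)."""
    return signs * FP4_GRID[idx] * 2.0 ** shared_exp

x = np.random.randn(32).astype(np.float32)
x_hat = dequantize_mxfp4_block(*quantize_mxfp4_block(x))
```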

3. Accuracy, Outlier Handling, and Quantization Error

The efficacy of MXFP4 depends critically on the characteristics of the activations or weights in a block. Its non-uniform quantization grid enhances representation of long-tailed distributions common in LLM weights/activations, outperforming uniform INT4 in fidelity for these cases (Liu et al., 23 Jul 2025). However, MXFP4 imposes two key sources of error:

  • Coarse scale quantization: The E8M0 power-of-two scale introduces substantial error if the optimal scale is not a power-of-two.
  • Outlier dominance: When a single outlier is present in a block, the shared scale becomes suboptimal for the rest of the block’s values, leading to quantization of non-outliers to zero (or high error).

Mean squared error (MSE) in MXFP4 can therefore be decomposed elementwise over a block $G$:

$$\mathrm{MSE}(G) = \mathbb{E}\left[(X_1 - \hat{X}_1)^2\right] + \cdots + \mathbb{E}\left[(X_{32} - \hat{X}_{32})^2\right]$$

with $\mathrm{MSE}_{\text{top}}$ (the block maximum) and $\mathrm{MSE}_{\text{non-max}}$ (the remaining elements) showing that the precision of the block maximum is a critical failure point (Lee et al., 16 Oct 2025, Egiazarian et al., 27 Sep 2025).
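
A toy example (with assumed values) makes the outlier failure mode concrete: one large element fixes the shared scale, and the remaining 31 elements then fall below the smallest nonzero grid point.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

block = np.full(32, 0.01)   # 31 small but nonzero values ...
block[0] = 100.0            # ... plus one outlier that dictates the shared scale

s = 2.0 ** (int(np.floor(np.log2(np.abs(block).max()))) - 2)  # e_max = 2 for E2M1
scaled = np.clip(np.abs(block) / s, 0.0, 6.0)
x_hat = FP4_GRID[np.abs(scaled[:, None] - FP4_GRID).argmin(axis=1)] * s

print(x_hat[0])   # 96.0 -- the outlier survives, coarsely
print(x_hat[1])   # 0.0  -- every non-outlier is flushed to zero
```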

4. Algorithmic Extensions and Recent Innovations

Advanced quantization algorithms have emerged to address the limitations of MXFP4:

  • Micro-Rotated-GPTQ (MR-GPTQ) (Egiazarian et al., 27 Sep 2025): Applies blockwise Hadamard rotations before quantization, normalizing within-block distributions, and thus making the maximum value less dominant and distributing MSE more evenly.
  • MX+ (MXFP4+) (Lee et al., 16 Oct 2025): Repurposes the exponent bits of the block max element as extra mantissa bits, increasing its precision with negligible memory overhead—reaching near-MXFP6 accuracy at MXFP4 storage cost.
  • Redundant Zero Remapping (RaZeR) (Chen et al., 6 Jan 2025): For FP4, replaces the redundant negative zero bit pattern with customizable per-block special values; improves effective representation and model accuracy, especially beneficial at very low bitwidths.
  • AMXFP4 (Lee et al., 15 Nov 2024): Introduces groupwise asymmetric scaling, encoding differing positive/negative scales to counteract the symmetry assumption in MXFP4 and address the group-mean shift caused by microscaling.
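
The rotation idea behind MR-GPTQ can be illustrated in a few lines (this is only the rotation step under assumed shapes, not the full GPTQ procedure): an orthonormal Hadamard matrix spreads a single spike's energy across the whole block, so no one element dominates the shared scale.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(32)
block = np.zeros(32)
block[5] = 10.0                  # worst case: all energy in one element
rotated = H @ block              # spike spread to +/- 10/sqrt(32) per element

print(np.abs(block).max(), np.abs(rotated).max())   # 10.0 vs. ~1.77
# Quantize `rotated` with MXFP4, then invert exactly with H.T after dequantization.
```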

5. Hardware and Performance Considerations

The block-based structure of MXFP4 is natively or efficiently supported in modern hardware architectures:

  • High-throughput SIMD datapaths: Up to 16×-32× higher throughput per PE compared to standard 32-bit or 16-bit architectures (Lokhande et al., 16 Dec 2024).
  • Energy and area efficiency: Reported energy efficiency can reach 8.42 GOPS/W (CIFAR-100 workload, ASIC) and iterative designs enable ultra-compact implementations (e.g. 78µm² area for 4-bit mode) (Lokhande et al., 16 Dec 2024).
  • Direct-cast quantization: MXFP4 supports plug-and-play quantization from BF16/FP32 models without retraining or calibration, which is critical for fast LLM deployment scenarios (Georganas et al., 17 Mar 2025).

The power-of-two scale in MXFP4 enables efficient on-the-fly dequantization using bit shifts, which reduces critical path length in hardware inference and accelerates bandwidth-limited workloads, such as transformer inference or vision model pre-training (Lokhande et al., 16 Dec 2024, Tseng et al., 27 Feb 2025).
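
As a small illustration of why this is cheap, Python's `math.ldexp` stands in for the hardware exponent-add (the function name below is illustrative):

```python
import math

def apply_e8m0_scale(fp4_value: float, e: int) -> float:
    """Apply s = 2^(e - 127): an addition to the exponent field, never a multiply."""
    return math.ldexp(fp4_value, e - 127)

print(apply_e8m0_scale(1.5, 130))   # 1.5 * 2^3 = 12.0
```

In fixed-point datapaths the same operation is a literal bit shift.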

6. Applications, Limitations, and Format-Specific Challenges

MXFP4 is widely applied in LLM quantization, speculative decoding with quantized drafts, and quantized training of large vision/LLMs. Key use cases include:

  • LLM inference: As a weight-only quantization format, MXFP4 enables significant memory and bandwidth savings without accuracy loss under advanced algorithms (Georganas et al., 17 Mar 2025).
  • Transformer/Vision models: Per-group scaling allows accurate training and inference at 4-bit precision beyond what per-tensor scaling achieves (Chen et al., 28 Feb 2025).
  • Training acceleration: Used for GEMMs in backward passes, MXFP4 achieves $>1.3\times$ speedup over FP8/BF16 at negligible accuracy loss with stochastic rounding and transforms (Tseng et al., 27 Feb 2025), as sketched below.
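
A hedged sketch of stochastic rounding onto the FP4 grid, the mechanism referenced in the training bullet above (the interface and the pre-scaling assumption are illustrative):

```python
import numpy as np

FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(x: np.ndarray, rng=None) -> np.ndarray:
    """Round |x| (assumed already scaled into [0, 6]) to a neighboring grid point
    with probability proportional to proximity, so rounding is unbiased."""
    rng = rng or np.random.default_rng()
    mag = np.abs(x)
    hi_idx = np.searchsorted(FP4_POS, mag, side="left").clip(1, len(FP4_POS) - 1)
    lo, hi = FP4_POS[hi_idx - 1], FP4_POS[hi_idx]
    p_up = (mag - lo) / (hi - lo)               # closer to hi -> round up more often
    return np.sign(x) * np.where(rng.random(x.shape) < p_up, hi, lo)

vals = np.full(100_000, 1.2)                    # sits between grid points 1.0 and 1.5
print(stochastic_round_fp4(vals).mean())        # ~1.2: unbiased in expectation
```

Unlike round-to-nearest, this keeps gradient statistics unbiased across many accumulation steps, which is why it matters for backward-pass GEMMs.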

Notable limitations arise if format-specific pitfalls are not addressed:

  • Power-of-two scale grid can induce marked quantization error without format-tailored grid search or pre-processing (Egiazarian et al., 27 Sep 2025).
  • Sensitivity to outliers: A single extreme value in a block can dominate the chosen scale, harming the quantization of the other 31 elements. MX+ directly targets this issue (Lee et al., 16 Oct 2025).
  • Non-universal transferability: INT4-optimized quantization methods (e.g., optimized pre-rotation) do not necessarily generalize; scaling-based pre-processing is more effective for MXFP4 (Liu et al., 23 Jul 2025).
The trade-offs among MXFP4 and its closest variants are summarized below:

| Format | Per-Element Bits | Group Size | Scale Format | Outlier Mitigation | Performance |
|---|---|---|---|---|---|
| MXFP4 | 4 (E2M1) | 32 | E8M0 (PoT) | None (base) | Highest efficiency |
| MXFP4+ (MX+) | 4 (block max: E2M3) | 32 | E8M0 | Extra mantissa for block max | Near-MXFP6 accuracy |
| NVFP4 | 4 (E2M1) | 16 | E4M3 (full) | Outlier "promotion" | Highest accuracy |

7. Outlook and Future Research Directions

MXFP4 and its variants are now standard in major AI accelerators, including NVIDIA Blackwell and AMD XR architectures (Egiazarian et al., 27 Sep 2025, Lee et al., 16 Oct 2025). Directions for future work include:

  • Format-specific quantization strategies, such as rotation-optimized GPTQ or Hadamard transforms, to minimize scale quantization errors and outlier-induced losses.
  • Extensions like MXFP4+ (MX+) that repurpose exponent bits to extend mantissa for outliers, preserving bandwidth efficiency and boosting accuracy at 4 bits (Lee et al., 16 Oct 2025).
  • Groupwise asymmetric and per-block learning of scale/offsets, as in AMXFP4, to further mitigate microscaling-induced asymmetry (Lee et al., 15 Nov 2024).
  • Hardware-software co-design, ensuring support for flexible scale encoding (beyond strict power-of-two), and direct-cast quantization support in deployment toolchains.

A plausible implication is that MXFP4, when paired with advanced, format-aware quantization (such as MR-GPTQ), is capable of narrowing the gap with higher-precision or more flexible formats, but naive application can result in severe accuracy loss. The emergent research consensus is that MXFP4 is not universally superior to INT4 or NVFP4; its effectiveness is contingent on contextually tailored algorithms and hardware support.
