MXFP4 Quantization for Efficient LLMs
- MXFP4 quantization is a 4-bit floating-point format with an 8-bit block scaling mechanism that compresses weights and activations for efficient computation on next-gen hardware.
- It partitions tensors into blocks of 32 elements using power-of-two scaling, requiring specialized outlier mitigation strategies to counteract quantization errors.
- Advanced blockwise techniques like BRQ, BATQuant, and DuQuant++ overcome limitations of global rotations and deliver near state-of-the-art performance on large models.
MXFP4 Quantization
MXFP4 quantization refers to the application of a blockwise microscaling 4-bit floating-point format, E2M1, with an 8-bit exponent block-scale (E8M0), to compress weights and activations for efficient inference and training of LLMs and multimodal models. This format is natively supported by next-generation tensor core architectures (e.g., NVIDIA Blackwell, AMD AI MAX+), with block granularity typically set to 32 elements for maximized hardware throughput. MXFP4 delivers substantial memory and compute savings, though it requires specialized quantization and error mitigation strategies to contend with coarse power-of-two scaling and extreme sensitivity to activation outliers.
1. Format Definition and Blockwise Quantization
MXFP4 is defined as a 4-bit floating-point format (E2M1, i.e., 1-bit sign, 2-bit exponent, 1-bit mantissa) with a blockwise shared exponent encoded as an 8-bit E8M0 FP8 value. Each tensor is partitioned into disjoint blocks of 32 elements. Within a block $B_j$, the shared scale $s_j$ is set to a power of two determined by the largest absolute value in the block: $s_j = 2^{\lfloor \log_2 \max_{x_i \in B_j} |x_i| \rfloor - b}$, where $b$ is a small format-specific bias (usually the maximum E2M1 exponent, $b = 2$).
Each element $x_i$ in block $B_j$ is quantized as follows: $\mathcal{Q}(x_i) = \mathrm{clamp}\Bigl(\mathrm{round}(x_i / s_j),\, q_\min,\, q_\max\Bigr)$ where $\mathrm{round}(\cdot)$ denotes rounding to the nearest representable E2M1 value and $q_\min$, $q_\max$ are the minimum and maximum encodable E2M1 values. Dequantization reconstructs each element as $\hat{x}_i = s_j\,\mathcal{Q}(x_i)$ (Lin et al., 20 Apr 2026, Shao et al., 6 Nov 2025, Li et al., 17 Mar 2026).
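As a concrete reference, the following is a minimal NumPy sketch of this quantize/dequantize path (assuming the standard E2M1 codepoint set and the amax-derived power-of-two scale described above; function names are illustrative, and the sketch performs fake quantization rather than bit packing):

```python
import numpy as np

# Representable E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit); the largest normal is 6.0.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_MAX_EXP = 2  # exponent of the largest E2M1 normal (6.0 = 1.5 * 2**2)

def mxfp4_block_scale(block):
    """Power-of-two shared scale for one block, following the amax rule above (E8M0 semantics)."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return 1.0
    return 2.0 ** (np.floor(np.log2(amax)) - E2M1_MAX_EXP)

def round_to_e2m1(v):
    """Round each scaled value to the nearest representable E2M1 codepoint, clamping at +/-6."""
    sign = np.sign(v)
    mag = np.minimum(np.abs(v), E2M1_GRID[-1])
    nearest = E2M1_GRID[np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)]
    return sign * nearest

def mxfp4_quant_dequant(x, block_size=32):
    """Fake-quantize a 1-D tensor with MXFP4 blockwise scaling (length must be a multiple of 32)."""
    blocks = x.reshape(-1, block_size)
    out = np.empty_like(blocks)
    for j, block in enumerate(blocks):
        s = mxfp4_block_scale(block)
        out[j] = s * round_to_e2m1(block / s)  # dequantized reconstruction x_hat = s_j * Q(x_i)
    return out.reshape(x.shape)
```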
This blockwise approach enables hardware to efficiently shift-and-round 32-element groups, combining compact storage (4.25 bits per value including scale) with very high dynamic range due to the exponent-only format of the block scale (Taghian et al., 9 Apr 2026, Ding et al., 5 Apr 2026, Hu et al., 27 Jan 2026).
2. Challenges of Shared-Scale Block Quantization
The principal challenge in MXFP4 quantization is the "shared-scale outlier problem." Because the block scale is set by the largest-magnitude element, the presence of a single outlier inflates the shared scale $s_j$ so that the remaining 31 elements are mapped into a much smaller region of the E2M1 codebook, causing severe quantization error and dynamic range compression for the non-outlier values (Lin et al., 20 Apr 2026).
This phenomenon is particularly acute for activations and gradient tensors with heavy-tailed distributions, as a single blockwise outlier induces underutilization of available code points and exacerbates underflow/clipping for the rest of the block (Hu et al., 27 Jan 2026, Cim et al., 5 Mar 2026).
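A small synthetic illustration of this effect (illustrative numbers; using the same amax-derived scale rule as above):

```python
import numpy as np

rng = np.random.default_rng(0)
block = rng.normal(scale=0.1, size=32)   # 32 "regular" activations, |x| around 0.1
outlier_block = block.copy()
outlier_block[0] = 8.0                   # a single heavy-tailed outlier

def shared_scale(b, emax=2):
    return 2.0 ** (np.floor(np.log2(np.max(np.abs(b)))) - emax)

print(shared_scale(block), shared_scale(outlier_block))
# The outlier inflates s_j by roughly two orders of magnitude. After dividing by the
# inflated scale, the 31 regular values fall far below the smallest nonzero E2M1
# codepoint (0.5 * s_j), so most of them underflow to zero: the rest of the block
# loses essentially all of its resolution.
```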
3. Limitations of Rotational and INT4-Designed Remedies
Many INT4-oriented quantization pipelines rely on global rotations (Hadamard or learned orthogonals) to "spread" outlier energy, reducing per-channel peaks and flattening the distribution prior to quantization. However, such global transforms are incompatible with the block-scaling architecture of MXFP4:
- Global rotations mix values across blocks, introducing outlier energy into otherwise regular blocks. As each block independently selects its scale $s_j$ after the rotation, many blocks experience an inflated scale and higher overall quantization loss.
- Even blockwise Hadamard rotations do not target the specific concentration of outlier energy, treating all channels agnostically and offering limited improvement.
- Learnable global rotations are computationally expensive and, by breaking block independence, increase hardware and software complexity (Shao et al., 6 Nov 2025, Li et al., 17 Mar 2026, Lin et al., 20 Apr 2026).
Empirical results confirm that while global rotations improve INT4 accuracy, they may actually increase error and perplexity under MXFP4 (Shao et al., 6 Nov 2025, Zhang et al., 14 Jan 2026, Egiazarian et al., 27 Sep 2025).
4. Outlier-Aware Blockwise Techniques
MXFP4's format dictates that any outlier mitigation must be performed within the block, not across blocks. This has led to several blockwise or block-diagonal approaches:
Block Rotation Quantization (BRQ) and Micro-Rotated-GPTQ (MR-GPTQ): Apply a blockwise orthogonal transform (typically a 32×32 Hadamard or learned rotation) within each block before MXFP4 quantization. The rotation is fused into the quantized weight representation and can often be optimized for per-block outlier suppression, without inter-block contamination (Shao et al., 6 Nov 2025, Egiazarian et al., 27 Sep 2025).
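A minimal sketch of the blockwise-rotation idea (Sylvester-constructed orthonormal Hadamard applied independently per 32-element block; this illustrates the concept rather than the fused MR-GPTQ kernel):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def block_rotate(x, block_size=32, inverse=False):
    """Apply an orthonormal Hadamard rotation independently within each 32-element block,
    spreading outlier energy across the block without contaminating neighbouring blocks."""
    H = hadamard(block_size) / np.sqrt(block_size)  # orthonormal, so the inverse is H.T
    R = H.T if inverse else H
    return (x.reshape(-1, block_size) @ R).reshape(x.shape)

# Conceptual BRQ-style weight path: rotate, MXFP4 fake-quantize, rotate back
# (in practice the rotation is fused into the stored quantized weights), e.g. using a
# fake-quantizer like the one sketched in Section 1:
# w_hat = block_rotate(mxfp4_quant_dequant(block_rotate(w)), inverse=True)
```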
Blockwise Affine and Clipping (BATQuant and FlatQuant): Replace global rotation with locally learned blockwise affine transformations $T_j$, where $T_j$ is parameterized by a small (often Kronecker-decomposed) matrix specific to block $j$ (Li et al., 17 Mar 2026). Learnable blockwise clipping bounds further suppress tail outliers before quantization, yielding unimodal, compact block distributions and outperforming both Hadamard and channelwise static transforms.
DuQuant++: Advances beyond prior blockwise schemes by greedily searching for an outlier-aware orthogonal rotation per block that minimizes the peak value and outlier penalty (sum of squared large entries). The rotation block size is matched exactly to the MXFP4 block size, and cross-block variance is eliminated. Both the transformed activations and weights are quantized to MXFP4. Inference overhead is minimized to a single X→X·R rotation per layer (Lin et al., 20 Apr 2026).
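The exact search procedure is specific to the cited work; the following is only an illustrative sketch of the kind of per-block selection involved, keeping whichever candidate orthogonal transform (e.g., signed or permuted 32×32 Hadamards) minimizes a peak-plus-outlier-energy score:

```python
import numpy as np

def outlier_score(v, tau=2.0, lam=0.1):
    """Peak magnitude plus a penalty on the energy of entries above tau * RMS (illustrative objective)."""
    rms = np.sqrt(np.mean(v ** 2) + 1e-12)
    big = np.abs(v) > tau * rms
    return np.max(np.abs(v)) + lam * np.sum(v[big] ** 2)

def pick_block_rotation(block, candidates):
    """Keep whichever orthogonal candidate flattens this 32-element block the most; the chosen
    rotation is then folded into the weights, so inference needs only one X -> X*R per layer."""
    best_R, best = np.eye(len(block)), outlier_score(block)
    for R in candidates:
        score = outlier_score(block @ R)
        if score < best:
            best_R, best = R, score
    return best_R
```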
| Approach | Cross-block mixing | Data adaptivity | Online cost | Performance gain |
|---|---|---|---|---|
| Hadamard BRQ | No | No | Low | Modest |
| MR-GPTQ | No | No / mild | Low | Moderate |
| FlatQuant | Yes | Yes | High | Strong (if block size = group size) |
| BATQuant | No | Yes | Low–medium | Strong |
| DuQuant++ | No | Yes, outlier-aware | Lowest | State-of-the-art |
5. Algorithmic and Hardware Considerations
MXFP4 quantization is characterized by several algorithmic design points:
- Block Size Alignment: Hardware (e.g., NVIDIA Blackwell Tensor Cores) expects a group size of 32, and best results are obtained when the algorithmic block size matches the hardware block size.
- Blockwise Scale (E8M0): Block scales are constrained to exact powers of two, simplifying exponent shifts in GEMM but inducing coarse quantization error if group extrema fall between two scale bins.
- Pre-Scale and MSE-Optimal Scaling: Pre-scaling (multiplying by a tuned scalar factor before the block scale is computed) and closed-form MSE-optimal scale search significantly mitigate scale-induced bias; a sketch of the scale-search idea follows this list. These are recommended as a foundation in all PTQ pipelines targeting MXFP4 (Zhang et al., 14 Jan 2026).
- Blockwise Metadata: Methods such as M9XFP inject minimal per-subgroup metadata to further refine local scales or mantissas, improving accuracy with a small area/power penalty and a total effective bit-width of ≈4.5 (Hu et al., 27 Jan 2026).
- Hybrid and Mixed-Precision Recipes: Mixed-precision schedules in which only MLP up/down projections or early/late blocks are kept in FP16/FP8, with all others quantized to MXFP4, recover over 95% of FP16 accuracy (Cim et al., 5 Mar 2026).
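As referenced in the pre-scale/MSE bullet above, here is a hedged sketch of an MSE-driven block-scale search (the cited work describes a closed-form variant; a small brute-force search over neighbouring power-of-two exponents stands in for it here):

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(v):
    sign = np.sign(v)
    mag = np.minimum(np.abs(v), E2M1_GRID[-1])
    return sign * E2M1_GRID[np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)]

def mse_optimal_block_scale(block, search_below=4, search_above=1):
    """Try a few power-of-two exponents around the amax-derived scale and keep the one
    minimizing blockwise reconstruction MSE (counteracts the coarse E8M0 scale bins)."""
    base_exp = int(np.floor(np.log2(np.max(np.abs(block)) + 1e-12))) - 2  # amax rule
    best_s, best_mse = 2.0 ** base_exp, np.inf
    for e in range(base_exp - search_below, base_exp + search_above + 1):
        s = 2.0 ** e
        mse = np.mean((block - s * round_to_e2m1(block / s)) ** 2)
        if mse < best_mse:
            best_s, best_mse = s, mse
    return best_s
```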
6. Training, Inference, and Applications
Training: Pure MXFP4 training is feasible using unbiased stochastic rounding (SR) and blockwise random Hadamard transforms. SR ensures unbiased gradients, while Hadamard mixing bounds the variance introduced by outlier blocks. Weight oscillations during training can be mitigated via momentum-based EMA quantizers and adaptive ramping optimizers (Tseng et al., 27 Feb 2025, Chen et al., 28 Feb 2025).
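A minimal NumPy sketch of unbiased stochastic rounding onto the E2M1 grid (the blockwise scale and random Hadamard mixing of the cited recipes are applied around this step; the helper below is illustrative):

```python
import numpy as np

E2M1_GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                      0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_e2m1(v, rng):
    """Round each scaled value to one of its two neighbouring E2M1 codepoints, choosing the
    upper one with probability proportional to proximity, so that E[q] = v (unbiased)."""
    v = np.clip(v, E2M1_GRID[0], E2M1_GRID[-1])
    hi = np.clip(np.searchsorted(E2M1_GRID, v, side="left"), 1, len(E2M1_GRID) - 1)
    low, high = E2M1_GRID[hi - 1], E2M1_GRID[hi]
    p_up = np.where(high > low, (v - low) / (high - low), 0.0)
    return np.where(rng.random(np.shape(v)) < p_up, high, low)

# Usage sketch (placeholder names): gradients scaled into a block's E2M1 range are
# rounded without systematic bias, e.g. q = stochastic_round_e2m1(grads / s_j, rng).
```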
Inference: Weight-only quantization with blockwise scale achieves negligible degradation relative to BF16/FP16. For full W4A4 pipelines, calibration over 128–256 samples and blockwise affine/rotation methods deliver the best results. On hardware, GEMM performance is 2–4× faster than FP16 with up to 4× memory reduction (Georganas et al., 17 Mar 2025, Lin et al., 20 Apr 2026).
Speculative Decoding and Mixed-Precision Attention: MXFP4 enables advanced inference techniques (e.g., plug-and-play draft models for speculative decoding, mixed-precision diagonal tiled attention for efficient Transformer prefill and decode) with minimal loss in accuracy or throughput (Georganas et al., 17 Mar 2025, Ding et al., 5 Apr 2026).
Scaling: Applying overflow-aware (OAS) and macro block scaling (MBS) techniques closes the ~10% end-to-end accuracy gap between vanilla MXFP4 and NVFP4 formats to below 1%, with only minor (∼6%) GEMM overhead (Chhugani et al., 30 Jan 2026).
7. Empirical Performance and Best Practices
Extensive evaluation of MXFP4 quantization on architectures such as LLaMA3-8B, Qwen3-8B, and Mistral-7B consistently shows state-of-the-art results from data-adaptive, blockwise pipelines. For example, DuQuant++ reduces perplexity and increases zero-shot accuracy by up to 1% over previous baselines (e.g., MR-GPTQ) and halves the online cost associated with rotation (Lin et al., 20 Apr 2026). FlatQuant and BATQuant both reach 96–99% relative accuracy at W4A4 precision on challenging multimodal and LLM benchmarks (Li et al., 17 Mar 2026).
Empirical guidelines include:
- Always align block size to hardware granularity (32 elements).
- Begin with pre-scale optimization to reduce scale quantization bias.
- Employ blockwise, data-aware transformations, favoring affine or outlier-aware rotations over global orthogonals.
- For mixed-precision deployment, keep the most sensitive MLP projections and boundary blocks at higher precision.
- Monitor calibration-induced clipping and adapt scale search or metadata if >1–2% of values are clipped in calibration.
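A small sketch of the last check, measuring the fraction of calibration values that saturate the E2M1 clamp under the amax-derived block scale (function name and thresholds are illustrative):

```python
import numpy as np

def clipped_fraction(x, block_size=32, qmax=6.0, emax=2):
    """Share of calibration values whose scaled magnitude exceeds the E2M1 clamp; a value
    above ~1-2% suggests revisiting the scale search or adding per-subgroup metadata."""
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, block_size)
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    scales = 2.0 ** (np.floor(np.log2(amax + 1e-12)) - emax)
    return float(np.mean(np.abs(blocks) / scales > qmax))
```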
Advances such as DuQuant++ (outlier-aware rotation alignment), BATQuant (blockwise learned affine/clip), M1XFP (subgroup-scale metadata), and OAS/MBS (scale refinement) collectively establish robust recipes for accurate, efficient 4-bit FP quantization of large models on MXFP4-capable hardware (Lin et al., 20 Apr 2026, Li et al., 17 Mar 2026, Hu et al., 27 Jan 2026, Chhugani et al., 30 Jan 2026).