MXFP4 Training Methods
- MXFP4 training is a quantization technique that employs a 4-bit floating-point format (E2M1) with group-wise scaling to drastically reduce computational cost while preserving accuracy.
- It integrates stochastic rounding and variance reduction transforms, like the Random Hadamard Transform, to mitigate quantization noise and support stable gradient propagation.
- Empirical results show MXFP4 enables 1.8–2.6× training speedups for language models and near-baseline ImageNet accuracy for vision tasks.
Microscaling FP4 (MXFP4) training refers to training neural networks—especially large language and vision models—using a 4-bit floating-point format (E2M1) combined with efficient group-wise scaling to achieve aggressive reductions in computational cost and bandwidth without significant loss in accuracy. The technique is anchored in both hardware support (e.g., NVIDIA Blackwell GPUs, FPGAs) and recent algorithmic advances that address the unique quantization noise and precision challenges of sub-8-bit floating-point computation.
1. MXFP4 Format and Quantization Scheme
MXFP4 is a 4-bit floating-point format based on a standard E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), with an explicit block-level scaling factor applied to each group of elements:
- Element encoding:
- Each value: 1 sign bit, 2 exponent bits (bias=1), 1 mantissa bit
- Representable values: {–6, –4, –3, –2, –1.5, –1, –0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6}
- Microscaling scheme:
- Contiguous blocks of G = 16 (NVFP4, (Chmiel et al., 25 May 2025)) or G = 32 (MXFP4, (Tseng et al., 27 Feb 2025, Egiazarian et al., 27 Sep 2025)) share a single 8-bit floating-point scale (E8M0 for MXFP4, E4M3 for NVFP4).
- Quantization: each real tensor block is divided by the block scale and mapped to the nearest representable FP4 value.
- Dequantization: x̂ = s · q, where q is the E2M1 code and s the shared block scale.
- Scale selection: generally s = max_i |x_i| / q_max, where q_max = 6 is the largest representable FP4 magnitude (Chmiel et al., 25 May 2025, Samson et al., 1 Jul 2024).
- Block-wise scaling makes the dynamic range of the overall format much wider than that of a single E2M1 element (from a 12× span for base FP4 to on the order of 12 × 2^254 when an E8M0 group scale is applied).
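To make the scheme concrete, here is a minimal NumPy sketch of block-wise E2M1 quantization with a shared power-of-two (E8M0-style) scale per group of 32 elements. The function name `quantize_mxfp4` and all implementation details are illustrative assumptions, not a reference implementation from any of the cited works.

```python
import numpy as np

# Positive E2M1 magnitudes; the sign bit is handled separately below.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x, group_size=32):
    """Round-to-nearest MXFP4 quantize/dequantize with a shared power-of-two
    scale per block (E8M0-style). Assumes x.size is a multiple of group_size;
    returns the dequantized tensor x_hat = s * q."""
    blocks = x.reshape(-1, group_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    # Scale so the largest element lands at or below 6, rounded up to a power of two.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(absmax, 1e-30) / 6.0))
    scaled = blocks / scale
    # Nearest E2M1 grid point for each |scaled| element; the sign is restored after.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(x.shape)

x = np.random.randn(4 * 32).astype(np.float32)
x_hat = quantize_mxfp4(x)
print("max abs error:", np.abs(x - x_hat).max())
```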
2. Training Algorithms and Stochastic Rounding
A key challenge in using MXFP4 for training is accurate, unbiased propagation of gradient updates despite aggressive quantization noise.
- Stochastic rounding (SR):
- Whether applied to activations, weights, or gradients in the forward or backward pass, stochastic rounding maps each real value to one of its two nearest quantized neighbors with probability proportional to proximity.
- For a value x with nearest grid neighbors x_lo ≤ x ≤ x_hi and spacing Δ = x_hi − x_lo: P(x → x_hi) = (x − x_lo)/Δ and P(x → x_lo) = (x_hi − x)/Δ.
- This ensures E[Q_SR(x)] = x, which is crucial for unbiased SGD/Adam updates (Tseng et al., 27 Feb 2025, Castro et al., 20 May 2025, Chmiel et al., 25 May 2025); a combined SR + RHT sketch follows this list.
- Variance reduction:
- Block-wise quantization can cause high error variance due to outlier-induced scale inflation.
- Random Hadamard Transform (RHT):
- Preprocess each group block with Hadamard mixing: for a block x, compute x' = H S x (an orthonormal Hadamard matrix H composed with a random diagonal sign matrix S), so that outliers are spread across all elements of the block, lowering their exposure and greatly reducing quantization error variance.
- This transformation provably lowers the quantization-error variance from a bound dominated by the largest element in a block to one proportional to the block's average per-element energy (Tseng et al., 27 Feb 2025, Egiazarian et al., 27 Sep 2025, Castro et al., 20 May 2025).
- Empirically, RHT is essential for near-lossless LLM training in MXFP4.
- Rounding schedule:
- Forward pass: Round-to-nearest (deterministic) for weights and activations
- Backward and parameter update: Stochastic rounding for gradients (Chmiel et al., 25 May 2025, Castro et al., 20 May 2025, Tseng et al., 27 Feb 2025).
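The sketch below combines the two ingredients, a per-block random Hadamard transform followed by unbiased stochastic rounding onto the scaled E2M1 grid. The helper names (`hadamard`, `stochastic_round_to_grid`, `quantize_mxfp4_sr_rht`) are illustrative assumptions; real implementations fuse these steps into the GEMM kernel rather than running them as separate NumPy passes.

```python
import numpy as np

# Signed E2M1 grid (the magnitudes listed in Section 1).
FULL_GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                      0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def stochastic_round_to_grid(x, grid, rng):
    """Unbiased stochastic rounding of x onto a sorted 1-D grid."""
    hi_idx = np.clip(np.searchsorted(grid, x), 1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_hi = np.clip((x - lo) / (hi - lo), 0.0, 1.0)  # probability proportional to proximity
    return np.where(rng.random(x.shape) < p_hi, hi, lo)

def quantize_mxfp4_sr_rht(x, group_size=32, seed=0):
    """MXFP4 quantize/dequantize with RHT preprocessing and stochastic rounding."""
    rng = np.random.default_rng(seed)
    blocks = x.reshape(-1, group_size)
    # Random Hadamard transform: random sign flips, then orthonormal mixing
    # (one sign vector shared across blocks, for brevity).
    signs = rng.choice([-1.0, 1.0], size=group_size)
    H = hadamard(group_size)
    mixed = (blocks * signs) @ H.T
    # Shared power-of-two scale per block, then unbiased SR onto the E2M1 grid.
    absmax = np.abs(mixed).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(absmax, 1e-30) / 6.0))
    q = stochastic_round_to_grid(mixed / scale, FULL_GRID, rng)
    # Dequantize and invert the transform (H is orthonormal, the signs are ±1).
    return (((q * scale) @ H) * signs).reshape(x.shape)

g = np.random.randn(8 * 32)
g_hat = quantize_mxfp4_sr_rht(g)
print("mean error (≈ 0 in expectation):", (g_hat - g).mean())
```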
3. Architectural and Algorithmic Enhancements
The literature identifies both architectural bottlenecks and algorithmic remedies that make MXFP4 training feasible across applications.
| Scheme / Format | Block Size | Scale Format | Rounding fwd/bwd | Extra transform | Notes |
|---|---|---|---|---|---|
| MXFP4 (Tseng et al., 27 Feb 2025, Castro et al., 20 May 2025) | 32 | E8M0/E4M3 | RTN/SR | Hadamard (SR + RHT) | LLM, Vision, QAT |
| NVFP4 (Chmiel et al., 25 May 2025) | 16 | E4M3 | RTN/SR | n/a | LLM, QAT |
Block size is chosen according to hardware and application: G = 32 on Blackwell/NVIDIA for MXFP4 (Tseng et al., 27 Feb 2025, Egiazarian et al., 27 Sep 2025), G = 16 for NVFP4 (Chmiel et al., 25 May 2025). FPGA implementations (Samson et al., 1 Jul 2024) also use G = 32 (with some improvement seen at smaller group sizes).
- Forward/backward block-shape symmetry: to ensure correct group scaling for different GEMM layouts, modern methods (e.g., TetraJet (Chen et al., 28 Feb 2025), Quartet (Castro et al., 20 May 2025)) double-quantize operands with both 1×32 and 32×1 block layouts (see the sketch after this list).
- Oscillation reduction (Vision): "TetraJet" (Chen et al., 28 Feb 2025) introduces Q-EMA (Exponential Moving Average quantizer) and Q-Ramping (adaptive optimizer) to suppress weight flipping near block grid thresholds, closing >50% of the accuracy gap in vision transformers.
- Unbiased STE: Consistent application of double quantization plus unbiased rounding across forward and backward ensures SGD/Adam convergence under standard theory (Chen et al., 28 Feb 2025).
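A hedged sketch of the block-shape symmetry idea, reusing `quantize_mxfp4` from the first snippet: the same matrix is quantized once with 1×32 blocks along rows and once with 32×1 blocks along columns, so each GEMM consumes an operand whose scales were computed along its own reduction axis. The helper names are illustrative; TetraJet and Quartet implement this inside fused kernels.

```python
import numpy as np

def quantize_rowwise(W, group_size=32):
    """1 x group_size blocks: scales computed along each row."""
    return np.stack([quantize_mxfp4(row, group_size) for row in W])

def quantize_colwise(W, group_size=32):
    """group_size x 1 blocks: scales computed along each column."""
    return quantize_rowwise(W.T, group_size).T

# Assumes quantize_mxfp4 from the earlier sketch is in scope.
W = np.random.randn(64, 64).astype(np.float32)
W_row = quantize_rowwise(W)  # operand for the GEMM that reduces along rows
W_col = quantize_colwise(W)  # operand for the transposed GEMM in the backward pass
```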
4. Empirical Results and Scaling Laws
MXFP4 training, enabled by efficient rounding and block-wise mixing, achieves near-baseline quality across modalities with substantial computational gains.
- LLMs:
- (Tseng et al., 27 Feb 2025) demonstrates GPT-style models up to 6.7B parameters trained on hundreds of billions of tokens using BF16 in the forward and MXFP4+SR+RHT in the backward pass: validation perplexity within 0.01–0.02 of full-precision BF16.
- (Castro et al., 20 May 2025) ("Quartet") presents end-to-end MXFP4 training that closes the loss gap to FP16/FP8, with scaling law analysis. MXFP4 enables 1.8–2.6× end-to-end training speedup over FP8/BF16.
- (Chmiel et al., 25 May 2025) ("FP4 All the Way") shows that fully quantized NVFP4 (MXFP4-analogous) training of a 7B LLM on 256 Gaudi2 accelerators, with mixed RTN/SR rounding, matches or beats BF16 downstream accuracy (e.g., Lambada, HellaSwag, Winogrande).
- Vision:
- (Chen et al., 28 Feb 2025) achieves ImageNet top-1 accuracy within ≲1–2% of the full-precision baseline for DeiT/Swin transformer models trained for 90 epochs, using Q-EMA/Q-Ramping for oscillation reduction.
- (Samson et al., 1 Jul 2024) demonstrates ResNet-18 on ImageNet matched to within 2.6% of FP32 using QAT with MXFP4 and a Brevitas extension.
- Quantization-aware training (QAT) vs post-training quantization (PTQ):
- QAT recovers the majority of accuracy lost, whereas naive PTQ in pure MXFP4 format results in >10% performance drop (Samson et al., 1 Jul 2024).
- Scaling laws:
- Quartet's low-precision scaling law (Castro et al., 20 May 2025) quantitatively relates model performance to forward/backward bitwidths and parameter/data counts through empirically fitted "parameter efficiency" and "data efficiency" factors, and predicts substantial compute-vs-accuracy improvements at 4-bit precision.
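As a rough illustration of how such a law is used, the snippet below evaluates a Chinchilla-style loss with precision-dependent efficiency multipliers. Both the functional form and every constant here are assumptions chosen for illustration, not the coefficients fitted in Quartet.

```python
def low_precision_loss(N, D, eff_param=1.0, eff_data=1.0,
                       E=1.7, A=400.0, B=1800.0, alpha=0.34, beta=0.28):
    """Chinchilla-style loss with efficiency multipliers for low precision.

    N: parameters, D: training tokens; eff_param / eff_data in (0, 1] model the
    fraction of "effective" parameters / data retained at a given forward and
    backward bitwidth. All constants are placeholders, not fitted values.
    """
    return E + A / (eff_param * N) ** alpha + B / (eff_data * D) ** beta

# Hypothetical comparison: full precision vs. an assumed 4-bit efficiency hit.
print(low_precision_loss(7e9, 2e12))
print(low_precision_loss(7e9, 2e12, eff_param=0.9, eff_data=0.85))
```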
5. Post-Training Quantization and Model Compression
MXFP4 is effective for both QAT and high-throughput PTQ workflows:
- PTQ for LLMs:
- (Egiazarian et al., 27 Sep 2025) introduces MR-GPTQ (Micro-Rotated GPTQ), combining block-wise Hadamard transforms, GPTQ error compensation, and static act-ordering. It boosts MXFP4 inference to 3.6× per-layer and 2.2× end-to-end speedups over FP16, reaching ~93–99% of FP16 accuracy in large LLMs (a simplified sketch of the rotation-plus-compensation idea follows the table below).
- Pre-quantization with optimized channel scaling and low-rank branches (GPTQ + Low-Rank) is highly effective—rotation further helps in INT4 but is less critical for MXFP4 due to the E2M1 grid's inherent dynamic range (Liu et al., 23 Jul 2025, Egiazarian et al., 27 Sep 2025).
| Method | INT4 PPL | MXFP4 PPL |
|---|---|---|
| RTN | 923.7 | 15.2 |
| +GPTQ | 1007.3 | 13.2 |
| +Low-Rank | 723.5 | 15.6 |
| +GPTQ+Low-Rank | 578.6 | 12.7 |
| Rot+Scale+GPTQ | 11.73 | 12.29 |
Table: With optimized scaling and error mitigation, MXFP4 perplexity (PPL) approaches that of INT4 (Liu et al., 23 Jul 2025).
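For intuition about how rotation and error compensation interact, the sketch below applies a per-group Hadamard rotation to the input dimension of a weight matrix and then quantizes columns left to right, folding each column's quantization error into the remaining columns via the inverse Hessian. This is a drastically simplified stand-in for MR-GPTQ (per-column scales instead of 32-element groups, no act-ordering, no Cholesky/lazy-batch tricks), with illustrative names throughout; it assumes `hadamard` and `quantize_mxfp4` from the earlier sketches are in scope.

```python
import numpy as np

def mr_gptq_sketch(W, X, group_size=32, damp=0.01):
    """Simplified GPTQ-style PTQ with a per-group Hadamard 'micro-rotation'.

    W: (d_out, d_in) layer weights, X: (n, d_in) calibration activations.
    Illustrates the error-compensation idea only, not MR-GPTQ itself.
    """
    d_in = W.shape[1]
    # Block-diagonal rotation of the input dimension in group_size-wide groups.
    R = np.kron(np.eye(d_in // group_size), hadamard(group_size))
    Wr, Xr = W @ R.T, X @ R.T
    H = Xr.T @ Xr / len(Xr) + damp * np.eye(d_in)   # damped Hessian proxy
    Hinv = np.linalg.inv(H)
    Q = Wr.copy()
    for j in range(d_in):
        col = Q[:, j].copy()
        Q[:, j] = quantize_mxfp4(col, group_size=col.size)  # FP4, per-column scale
        if j + 1 < d_in:
            err = (col - Q[:, j]) / Hinv[j, j]
            Q[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])  # compensate later columns
    return Q, R  # rotated, quantized weights; rotate activations by R at inference

W, X = np.random.randn(16, 64), np.random.randn(256, 64)
Q, R = mr_gptq_sketch(W, X)
print("layer-output MSE:", np.mean((X @ W.T - (X @ R.T) @ Q.T) ** 2))
```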
- FPGA implementations:
- Open-source designs (Samson et al., 1 Jul 2024) support all OCP-MX formats and MXFP4-specific arithmetic, with block size G = 32 preferred for area/power efficiency.
- Typical end-to-end flows: PyTorch+QAT → Brevitas+MX quantization → ONNX export → Vivado HLS → hardware deployment, achieving 67.2% ImageNet top-1 for ResNet-18 at much lower LUT/energy cost than INT4 GPU.
6. Practical Deployment and Best Practices
Deployment of MXFP4 for both training and inference is now supported across major hardware and software stacks, with the following considerations:
- Hardware
- NVIDIA Blackwell (SM100/SM120) Tensor Cores natively accelerate MXFP4 GEMMs with block-wise scaling (Castro et al., 20 May 2025, Tseng et al., 27 Feb 2025).
- AMD accelerators, Intel Gaudi2 (via emulation; Chmiel et al., 25 May 2025), and FPGAs with OCP MX IP (Samson et al., 1 Jul 2024) also support MXFP4 or near-variants.
- Recommended block sizes:
- G = 32 for MXFP4 on Blackwell/NVIDIA and FPGAs.
- G = 16 for NVFP4 (smaller group granularity).
- Quantization recipes
- Use per-block RMSE-optimal scaling (i.e., choose s to minimize the block's quantization error, starting from s = max_i |x_i| / 6), with deterministic rounding/clipping in the forward pass and unbiased stochastic rounding in the backward pass.
- Always bundle Hadamard transforms in the kernel for stochastic rounding variance reduction.
- For oscillation: in vision, supplement with Q-EMA and/or Q-Ramping if weight flipping is evident (Chen et al., 28 Feb 2025).
- Model initialization and finetuning:
- Start from accurate FP32/BF16 weights, then QAT for stability; QAF (Quantization-Aware Finetuning in higher precision) can close remaining accuracy gap as gradient noise becomes a limiting factor (Chmiel et al., 25 May 2025).
- Monitoring quantization effectiveness:
- Monitor the ratio of the gradient norm to the quantization-noise norm (Chmiel et al., 25 May 2025): once quantization noise dominates the gradient signal, quantized updates stop being useful; switch to higher precision for updates or finetune (see the sketch below).
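A minimal sketch of such monitoring, assuming access to both the full-precision gradient and its quantized counterpart (here produced by the `quantize_mxfp4_sr_rht` sketch above); the 1.0 threshold is a placeholder, not the criterion derived in (Chmiel et al., 25 May 2025).

```python
import numpy as np

def grad_to_noise_ratio(grad_fp, grad_q):
    """Norm of the gradient divided by the norm of its quantization noise."""
    noise = grad_q - grad_fp
    return np.linalg.norm(grad_fp) / max(np.linalg.norm(noise), 1e-30)

g = np.random.randn(8 * 32)
ratio = grad_to_noise_ratio(g, quantize_mxfp4_sr_rht(g))  # quantizer from the SR+RHT sketch
if ratio < 1.0:  # placeholder threshold
    print(f"ratio {ratio:.2f}: quantization noise dominates; consider higher precision")
else:
    print(f"ratio {ratio:.2f}: quantized updates still carry useful signal")
```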
7. Limitations and Future Directions
- Scale quantization and MSE:
- Power-of-two scale quantization, as in E8M0, introduces scale flipping effects (up to ±50% relative error at block extremes); block-wise Hadamard and per-block scale optimization are essential (Egiazarian et al., 27 Sep 2025).
- Current approaches require careful block size selection; smaller blocks improve accuracy but increase overhead.
- Oscillation phenomena:
- Long-term stability of MXFP4-trained weights may suffer from flipping near quantization thresholds; advanced mitigations (Q-EMA, Q-Ramping) are required for vision but less studied in LLMs.
- Numerical range and outliers:
- While E2M1 provides nonuniform resolution and reasonable dynamic range, outliers can still degrade accuracy; rotation and block-wise transforms alleviate this but do not fully resolve it in extreme cases.
- Hardware support:
- MXFP4 is fully supported only on recent hardware (Blackwell, select FPGAs, Gaudi2 with emulation); portability to legacy devices is pending.
- Open research avenues:
- Unified architectures for MXFP4 kernels (GEMM+Hadamard+quantize in fused hardware) are under active development (Tseng et al., 27 Feb 2025).
- Application to non-GEMM operations (softmax, layer norm) may require new variance-reducing transforms (Tseng et al., 27 Feb 2025).
- Further research is ongoing on optimal block sizes, dynamic scaling logic, and extensions to convolutional and graph architectures.
References
- (Samson et al., 1 Jul 2024) Exploring FPGA designs for MX and beyond
- (Tseng et al., 27 Feb 2025) Training LLMs with MXFP4
- (Chen et al., 28 Feb 2025) Oscillation-Reduced MXFP4 Training for Vision Transformers
- (Castro et al., 20 May 2025) Quartet: Native FP4 Training Can Be Optimal for LLMs
- (Chmiel et al., 25 May 2025) FP4 All the Way: Fully Quantized Training of LLMs
- (Liu et al., 23 Jul 2025) A Comprehensive Evaluation on Quantization Techniques for LLMs
- (Egiazarian et al., 27 Sep 2025) Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization