MXFP8 Training Recipe: Efficient FP8 Pre-Training

Updated 5 October 2025
  • MXFP8 training recipes are a set of methodologies utilizing the MXFP8-E4M3 format to enable efficient, low-precision pre-training of transformer models.
  • They employ block-level scaling, fine-grained per-block quantization, and upward rounding of exponents to minimize quantization errors while matching or exceeding BF16 performance.
  • Integrating structured sparsity with these recipes achieves significant hardware efficiency gains, including improved throughput, reduced memory usage, and lower energy consumption.

MXFP8 training recipes define a set of methodologies and algorithmic choices for efficient large-scale pre-training of transformer models using the MXFP8-E4M3 floating-point format, and address practical trade-offs between computational performance, numerical fidelity, and hardware efficiency. These recipes combine microscaling quantization approaches, fine-grained per-block scaling, rigorous conversion algorithms, and careful parameter selection to match or exceed the performance of higher-precision baselines (e.g., BF16), while achieving significant improvements in throughput and resource utilization.

1. MXFP8-E4M3 Datatype and Quantization Fundamentals

The MXFP8-E4M3 format is an 8-bit floating-point type with 1 sign bit, 4 exponent bits, and 3 mantissa bits, supporting approximately 17.8 binades of dynamic range. In MX quantization, critical tensors—including weights, activations, and gradients—are cast into MXFP8 blocks, each scaled by a fine-grained block-level scale factor.

A floating-point number in this format is

$$(-1)^s \times (1.m) \times 2^{\,e - \text{bias}},$$

with $s$ the sign bit, $e$ the exponent, and $m$ the mantissa.
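
For example, E4M3 uses an exponent bias of 7, and its largest finite value corresponds to the bit pattern $s=0$, $e=1111_2=15$, $m=110_2$:

$$(-1)^0 \times 1.75 \times 2^{15-7} = 448,$$

which is the $\text{destmax}$ value used for scaling below (the pattern $e=1111_2$, $m=111_2$ is reserved for NaN).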

Quantization proceeds as follows:

  • Each block of 32 contiguous elements $V_i$ is divided by a shared scale factor $X$.
  • The scale factor is computed as $X = \text{amax}/\text{destmax}$, where $\text{amax} = \max_i |V_i|$ and $\text{destmax}$ is the largest representable magnitude in MXFP8-E4M3.
  • The exponent of $X$ is computed as $\log_2(X)$ and rounded upward ($\lceil\cdot\rceil$). This upward rounding, together with clamping to the representable UE8M0 range, ensures no overflow occurs during quantization (a worked numeric example follows this list).
  • Each value is quantized with round-to-nearest-ties-to-even (RN) and saturating conversions.
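
As a worked illustration with a hypothetical block whose largest magnitude is $\text{amax} = 100$ (and $\text{destmax} = 448$):

$$X = \frac{100}{448} \approx 0.223, \qquad \log_2(0.223) \approx -2.16, \qquad \lceil -2.16 \rceil = -2,$$

so the block scale is $2^{-2} = 0.25$ and the largest scaled magnitude is $100/0.25 = 400 \le 448$, which fits in E4M3. Rounding the exponent downward to $-3$ instead would give a scale of $0.125$ and a largest scaled magnitude of $800$, which would saturate and clip.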

This approach allows all core tensors to remain in low-precision format, maximizing hardware throughput and minimizing memory usage (Mishra et al., 30 May 2025).

2. Block-Level Scaling and Parameter Choices

Implementations fix the block size at $K=32$, such that each block shares a scale factor. All tensor types (weights, activations, and gradients) are consistently quantized to E4M3. The scale factor is critical to minimize quantization error and clipping; rounding the exponent upward avoids representational overflow and excess quantization noise, a strategy contrasted with earlier OCP v1.0 schemes that round downward.

Empirical evaluations demonstrate that using E4M3 throughout is preferable. Attempts to use E5M2 for some tensors (e.g., gradients in 8B LLMs) resulted in degraded perplexity.

A summary table contrasting parameter choices appears below:

| Parameter | MXFP8 Recipe | OCP v1.0 Variant |
|---|---|---|
| Block size $K$ | 32 | 64–128 |
| Scale exponent rounding | Upward ($\lceil\cdot\rceil$) | Downward ($\lfloor\cdot\rfloor$) |
| Data type | E4M3 | E5M2/BF16 mix |
| Rounding mode | Round-to-nearest-even (RN) | Varies |
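
These choices can also be summarized as a configuration-style sketch; the dictionary and field names below are illustrative only and do not correspond to any particular library's configuration API:

```python
# Illustrative summary of the MXFP8 recipe's parameter choices (not a real library config).
MXFP8_RECIPE = {
    "block_size": 32,                          # K: elements sharing one block scale
    "element_dtype": "E4M3",                   # used for weights, activations, and gradients
    "scale_dtype": "UE8M0",                    # power-of-two per-block scale
    "scale_exponent_rounding": "ceil",         # round the scale exponent upward
    "element_rounding": "round_to_nearest_even",
    "conversion": "saturating",
}
```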

3. Quantization Algorithms and Hardware Integration

The quantization algorithm centers on per-block scaling, saturating conversion, and careful rounding. The workflow involves:

  • Compute $\text{amax} = \max_i |V_i|$ over the block.
  • Derive $X_{\text{float}} = \text{amax} / \text{destmax}$.
  • Compute $\text{exp}_X = \log_2(X_{\text{float}})$.
  • Round the exponent upward: $\text{exp}_X^{\text{int}} = \lceil \text{exp}_X \rceil$.
  • Clamp and bias for UE8M0 storage: $\text{clamp}(\text{exp}_X^{\text{int}}, -127, 127) + \text{bias}$; the effective block scale is $X = 2^{\text{exp}_X^{\text{int}}}$.
  • Quantize values: $Q_i = \text{Quantize\_to\_fp8}(V_i / X)$ (a code sketch of this flow follows the list).
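
For concreteness, here is a NumPy sketch that emulates this per-block conversion; the function names (`quantize_mx_block`, `fake_cast_to_e4m3`) and the crude mantissa-rounding emulation are illustrative assumptions, not the Transformer Engine API or the hardware path:

```python
import numpy as np

E4M3_MAX = 448.0                          # destmax: largest finite E4M3 magnitude
UE8M0_EXP_MIN, UE8M0_EXP_MAX = -127, 127  # representable scale-exponent range

def fake_cast_to_e4m3(x: np.ndarray) -> np.ndarray:
    """Crude E4M3 emulation: keep 1 implicit + 3 explicit mantissa bits (ignores subnormals)."""
    m, e = np.frexp(x)                # x = m * 2**e with |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0     # round mantissa to 4 significant bits (ties to even)
    return np.ldexp(m, e)

def quantize_mx_block(block: np.ndarray):
    """Emulate MXFP8-E4M3 quantization of one 32-element block (illustrative only)."""
    assert block.size == 32
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block), 0                # all-zero block: use unit scale
    # Scale exponent: ceil(log2(amax / destmax)), clamped to the UE8M0 range.
    exp_x = int(np.ceil(np.log2(amax / E4M3_MAX)))
    exp_x = int(np.clip(exp_x, UE8M0_EXP_MIN, UE8M0_EXP_MAX))
    scale = 2.0 ** exp_x
    # Saturating conversion: divide by the power-of-two scale, clip to the E4M3 range, round.
    scaled = np.clip(block / scale, -E4M3_MAX, E4M3_MAX)
    return fake_cast_to_e4m3(scaled), exp_x           # quantized values plus the scale exponent

# Example: quantize one random block and reconstruct its values.
block = np.random.randn(32).astype(np.float32)
q, exp_x = quantize_mx_block(block)
reconstructed = q * (2.0 ** exp_x)
```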

Implementation is natively supported in NVIDIA Blackwell-generation GPUs, leveraging Tensor Core acceleration for MX-format tensors. Quantization occurs for all tensor types, unlike previous schemes, and two copies of each tensor may be kept for different compute axes, incurring a memory overhead (Mishra et al., 30 May 2025).

Efficiency gains include a $\sim 2\times$ increase in throughput versus BF16, reduced quantization error due to block granularity, and simplified mixed-precision handling across transformer blocks.

4. Structured Sparsity and Decaying Pruning Mask Integration

Structured sparsity techniques, particularly N:M sparsity, can be synergistically combined with MXFP8 recipes (Kao et al., 2022). In N:M sparsity, each block of $M$ weights retains $N$ nonzeros; e.g., 2:4 sparsity keeps 2 nonzeros in every group of 4 weights.
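
As an illustration of N:M masking (a minimal magnitude-based sketch, not the exact procedure of Kao et al.), the following NumPy function keeps the two largest-magnitude weights in every contiguous group of four:

```python
import numpy as np

def apply_2to4_sparsity(weights: np.ndarray) -> np.ndarray:
    """Zero out the two smallest-magnitude weights in each contiguous group of four (illustrative)."""
    assert weights.size % 4 == 0, "weight count must be a multiple of 4"
    w = weights.reshape(-1, 4)
    keep = np.argsort(np.abs(w), axis=1)[:, 2:]      # indices of the 2 largest magnitudes per group
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)      # mark the kept positions
    return (w * mask).reshape(weights.shape)
```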

Key recipe innovations include:

  • Pruning Mask Decay: The pruning threshold decays over training iterations, e.g., $\text{threshold}(t) = \text{threshold}_0 \cdot \exp(-\lambda t)$, allowing recovery from premature pruning and adaptive adjustment (a schedule sketch follows this list).
  • Sparse Structure Decay: The eligible nonzero blocks are refined over time, smoothing the transition from dense to sparse configuration.
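
A minimal sketch of the threshold schedule above, assuming hypothetical values for the initial threshold and the decay constant $\lambda$:

```python
import math

def pruning_threshold(step: int, threshold0: float = 1e-2, decay: float = 1e-4) -> float:
    """Exponentially decaying pruning threshold: threshold(t) = threshold0 * exp(-decay * t)."""
    return threshold0 * math.exp(-decay * step)

# Example: the threshold shrinks smoothly over training, which the recipe uses
# to allow recovery from premature pruning.
print(pruning_threshold(0), pruning_threshold(10_000), pruning_threshold(50_000))
```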

The integration of structured sparsity with MXFP8 yields additional benefits: further reduced FLOPs, enhanced hardware acceleration, and lower energy consumption. However, hyperparameter tuning must account for quantization noise and decay dynamics.

5. Training Results and Empirical Validation

In models up to 8B parameters and 15T tokens, MXFP8 recipes produce validation perplexity within 0.5% of BF16, and downstream performance on MMLU and reasoning benchmarks (ARC, Winogrande, PIQA) is at least comparable. Training curves match closely with BF16, demonstrating the numerical robustness of the conversion algorithms.

Efficiency outcomes include up to 22% reductions in training time, up to 14% lower peak memory, and a 19% increase in throughput compared with BF16 (Wang et al., 26 Sep 2025). In continual pre-training over 160B tokens, FP8 models remained stable and lossless (i.e., no significant accuracy drop), even outperforming BF16 in some reasoning benchmarks.

For structured sparsity integration, models with decaying pruning schemes achieve SOTA accuracy, comparable to unstructured sparsity approaches, at marginally increased training compute.

6. Practical Implementation and Scaling Considerations

MXFP8 recipes are implemented in NVIDIA’s Transformer Engine with cuDNN and cuBLAS library support. The conversion process and rounding mode require hardware acceleration for practical performance.

Quantization schemes allow practitioners to scale to large batch sizes and sequence lengths without excess overhead. However, keeping two versions of each tensor for orthogonal axes increases memory footprint—a focus for future optimization.

A plausible implication is that integrating MXFP8 quantization with structured sparsity using mask/structure decay facilitates training efficiency, energy reduction, and improved data locality. However, managing the interplay of quantization-induced errors and pruning dynamics remains a nontrivial challenge for recipe tuning.

7. Implications and Future Directions

Widespread adoption of MXFP8 and related FP8 recipes is supported by comprehensive code releases (Wang et al., 26 Sep 2025), facilitating democratization of large-scale LLM training. Hybrid quantization strategies—blockwise for weights, tokenwise for activations—combined with structured sparsity provide a practical path for efficient, stable, and numerically sound low-precision training.

Parametric choices such as scale factor rounding modes, block size, and selection of E4M3 for all tensor types are critical for achieving robustness. Ongoing improvements are aimed at reducing memory overhead, refining hardware-software integration, and dynamically tuning hyperparameters for quantized sparse models.

The convergence of structured sparsity with advanced FP8 formats is a promising approach to addressing the computational demands of modern LLM training, particularly as model scales and dataset sizes continue to increase.
