NVFP4 Low-Precision Pretraining
- The paper introduces NVFP4, a hardware-accelerated 4-bit pretraining paradigm that achieves near-BF16 accuracy with significant speedup and memory reduction.
- It employs quantization techniques such as per-block scaling of E2M1 values with FP8 scale factors, stochastic rounding, and selective high-precision layers to ensure stable training.
- Experimental validation on models up to 120B parameters demonstrates that these techniques close over half the loss gap to full precision while delivering efficiency gains of 2–4× over conventional 16-bit training.
NVFP4 Low-Precision Pretraining is a hardware-accelerated, block-microscaled 4-bit floating-point (FP4) training paradigm for large Transformer and Mixture-of-Experts models, designed to maximize training throughput and memory efficiency while minimizing loss and accuracy degradation relative to standard 16-bit floating-point workflows. NVFP4 builds on per-block E2M1 quantization with FP8-scale factors, uses multi-pronged suppression of quantization artifacts (including stochastic rounding, outlier management, and oscillation control), is supported natively on NVIDIA Blackwell GPUs, and has been validated in multi-trillion-token pretraining runs up to the 120B parameter scale (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025, NVIDIA et al., 14 Apr 2026, Panferov et al., 30 Jan 2026, Chmiel et al., 25 May 2025).
1. NVFP4 Format and Quantization Principles
NVFP4 is a mixed-precision, block-microscaled FP4 number format. Every micro-block of 16 floating-point E2M1 values shares a single signed block scale in E4M3 (FP8) format, plus (optionally) a global FP32 scale for large tensors (NVIDIA et al., 29 Sep 2025, Chmiel et al., 25 May 2025, NVIDIA et al., 14 Apr 2026). This multi-stage scaling extends the effective per-block dynamic range well beyond that of E2M1 alone and improves resistance to outlier-induced clipping. Quantization proceeds as follows for a block B = {x_1, ..., x_16} (a minimal sketch follows this list):
- Compute the block scale s_B = max_i |x_i| / 6, where 6 is the largest E2M1 magnitude; the scale is stored in E4M3, optionally combined with the global FP32 tensor scale.
- Each element is quantized: q_i = round_E2M1(x_i / s_B), i.e., rounded onto the E2M1 grid.
- Dequantization is simply x_i ≈ s_B · q_i.
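The arithmetic above can be illustrated with a minimal NumPy sketch. It follows the per-block scheme as described, but omits E4M3 quantization of the scale, bit packing, and the global FP32 tensor scale, and uses round-to-nearest; it is an illustration, not the production kernel.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable E2M1 magnitudes
BLOCK = 16

def quantize_block(x):
    """Quantize one block of 16 values to (scale, signed E2M1 codes)."""
    amax = np.abs(x).max()
    scale = amax / 6.0 if amax > 0 else 1.0   # map the block maximum onto the largest E2M1 value
    scaled = np.abs(x) / scale
    idx = np.abs(scaled[:, None] - E2M1_GRID[None, :]).argmin(axis=1)  # round to nearest grid point
    return scale, np.sign(x) * E2M1_GRID[idx]

def dequantize_block(scale, q):
    return scale * q

x = np.random.randn(BLOCK).astype(np.float32)
scale, q = quantize_block(x)
print("max abs error:", np.abs(dequantize_block(scale, q) - x).max())
```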
Hybrid workflows include two-dimensional (row/column) block scaling for weights (enabling exact forward/backward GEMM consistency), and one-dimensional (contiguous axis) block scaling for activations and gradients (to maximize hardware GEMM kernel efficiency) (NVIDIA et al., 29 Sep 2025, NVIDIA et al., 14 Apr 2026).
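A rough illustration of these hybrid layouts, assuming a 16×16 weight tile and a 16-element activation block; the tile shapes, the quantize-dequantize helper, and the divisibility of tensor dimensions by the block size are simplifying assumptions, not the actual kernels.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def _qdq(block):
    """Quantize-dequantize one flat block with a single shared scale."""
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0
    idx = np.abs((np.abs(block) / scale)[:, None] - E2M1[None, :]).argmin(axis=1)
    return scale * np.sign(block) * E2M1[idx]

def quantize_weight_2d(W, tile=16):
    """2D scaling: one shared scale per 16x16 tile, so W and W.T see identical quantized values."""
    Wq = np.empty_like(W)
    for r in range(0, W.shape[0], tile):
        for c in range(0, W.shape[1], tile):
            blk = W[r:r + tile, c:c + tile]
            Wq[r:r + tile, c:c + tile] = _qdq(blk.ravel()).reshape(blk.shape)
    return Wq

def quantize_activation_1d(X, block=16):
    """1D scaling: one scale per contiguous 16-element block along the last (reduction) axis."""
    flat = X.reshape(-1, block)
    return np.stack([_qdq(row) for row in flat]).reshape(X.shape)
```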
2. Core Algorithmic Advances for Stable Training
Low-precision training introduces unique numerical challenges: dynamic range underflow/overflow, quantized-gradient bias, and block-level representational collapse driven by outliers. NVFP4 pretraining addresses these through a suite of techniques:
- Random Hadamard Transform (RHT): Applied before quantization (especially weight gradients), RHT spreads isolated outlier values across a block, reducing maximal quantization error and block kurtosis (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025, Dong et al., 2 Feb 2026).
- 2D Quantization: For weights, simultaneous row- and column-block scaling ensures consistent quantized values for the same block, in both forward and backward passes, avoiding chain-rule and transpose inconsistencies (NVIDIA et al., 29 Sep 2025).
- Stochastic Rounding (SR): Gradients are quantized via probabilistic selection between adjacent FP4 levels, yielding an unbiased estimator with controlled variance (see the sketch after this list) (NVIDIA et al., 29 Sep 2025, Chmiel et al., 25 May 2025, Chen et al., 31 Oct 2025).
- Selective High-Precision Layers: Final and sensitive layers (typically last 10–20%, embeddings, output projections) are retained in BF16, as their quantization error most affects downstream accuracy (NVIDIA et al., 29 Sep 2025, NVIDIA et al., 14 Apr 2026).
- Double-Block Quantization, Outlier Clamping/Compensation: Additional error suppression and representational uniformity via clamping activations to high quantiles and computing residual corrections only on the subspace of “hot” persistent outlier channels (Chmiel et al., 25 May 2025, Chen et al., 31 Oct 2025, Dong et al., 2 Feb 2026).
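As referenced in the stochastic-rounding bullet above, the sketch below emulates SR onto the E2M1 grid in NumPy. The bracketing logic and grid handling are illustrative assumptions; real training relies on hardware rounding instructions rather than this emulation.

```python
import numpy as np

# Full signed E2M1 grid (sorted), used to pick the two neighboring levels.
E2M1_GRID = np.array([-6., -4., -3., -2., -1.5, -1., -0.5, 0., 0.5, 1., 1.5, 2., 3., 4., 6.])

def stochastic_round(v, rng=np.random.default_rng()):
    """Round already-scaled values to a neighboring grid level, choosing the
    upper level with probability proportional to proximity (unbiased: E[q] = v)."""
    v = np.clip(np.asarray(v, dtype=np.float64), E2M1_GRID[0], E2M1_GRID[-1])
    hi_idx = np.clip(np.searchsorted(E2M1_GRID, v, side="left"), 1, len(E2M1_GRID) - 1)
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    p_up = (v - lo) / (hi - lo)
    return np.where(rng.random(v.shape) < p_up, hi, lo)
```

Averaged over many steps, the expected quantized gradient equals the true scaled gradient, which is why SR avoids the systematic bias that round-to-nearest introduces for gradient tensors.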
The integration of these techniques enables end-to-end NVFP4 quantization in all GEMM-heavy linear modules within LLMs and hybrid MoE architectures (NVIDIA et al., 14 Apr 2026, Chen et al., 31 Oct 2025).
3. Advanced Outlier and Oscillation Mitigation
Two dominant sources of accuracy degradation are (i) persistent outlier channels (driven by softmax, gating/nonlinearities, or SwiGLU) and (ii) quantization-induced weight oscillation. NVFP4 pipelines have developed specific extensions:
- Hot-Channel Patch (HCP) & CHON Recipe: Identify persistent hot channels via per-channel quantization residual norms, reinject their residuals through a hardware-fused second-order correction during GEMM, and keep "post-QK" and other outlier-sensitive projections in higher precision (usually BF16) (Dong et al., 2 Feb 2026).
- OutControl: Static masking of a small fraction (e.g., 10%) of channel dimensions, selected by running norm, which are kept in higher precision while all remaining compute stays in NVFP4; this applies to both the forward and backward linear passes (see the sketch immediately after this list) (Chen et al., 31 Oct 2025).
- Oscillation Control (OsciReset, Q-EMA, Q-Ramping): Periodically detect and reset weights that oscillate at FP4 quantization thresholds, forcing the master weights to the bin center and thus reducing the frequency of precision-induced output jumps. Q-EMA uses an exponential moving average of master weights for forward quantizer selection; Q-Ramping delays updates for high-oscillation weights and amplifies their gradient steps to push weights off quantization midpoints (a minimal reset sketch appears below, after the summary paragraph) (Chen et al., 31 Oct 2025, Chen et al., 28 Feb 2025).
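The OutControl-style channel split referenced above can be sketched as follows; the function name, the default 10% fraction, and the running-norm statistic are illustrative assumptions rather than the published implementation.

```python
import numpy as np

def split_outlier_channels(x, running_norms, frac=0.10):
    """x: [tokens, channels]; running_norms: per-channel statistic tracked during training.
    Returns index sets: 'hot' channels kept in high precision, 'cold' channels quantized to NVFP4."""
    k = max(1, int(frac * x.shape[1]))
    hot = np.argsort(running_norms)[-k:]              # top-k channels by running norm
    cold = np.setdiff1d(np.arange(x.shape[1]), hot)   # everything else
    return hot, cold
```

A caller would then route x[:, hot] (and the matching weight rows) through a BF16 GEMM and x[:, cold] through an NVFP4 GEMM, summing the two partial outputs.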
Empirically, this unified set of mitigation techniques closes more than half the loss/accuracy gap between naive FP4 and full precision, and in large runs brings NVFP4 accuracy to within 1% of BF16 baselines (Chen et al., 31 Oct 2025, Dong et al., 2 Feb 2026).
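In the spirit of the oscillation-control recipes above, the following sketch tracks an exponential moving average of per-weight quantized-value flips and snaps persistently oscillating master weights onto their current quantized value (the bin center). The decay, threshold, and reset rule are assumptions for illustration, not the exact OsciReset procedure.

```python
import numpy as np

def update_flip_ema(flip_ema, prev_q, new_q, decay=0.99):
    """EMA of how often each weight's quantized value changes between steps."""
    flipped = (new_q != prev_q).astype(np.float32)
    return decay * flip_ema + (1.0 - decay) * flipped

def reset_oscillating_weights(master_w, dequant_w, flip_ema, threshold=0.1):
    """Snap high-oscillation master weights to their dequantized value (bin center),
    damping further threshold crossings."""
    out = master_w.copy()
    mask = flip_ema > threshold
    out[mask] = dequant_w[mask]
    return out
```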
4. Key Innovations and Algorithmic Recipes
Notable algorithmic contributions include:
- Four-Over-Six (4/6) Adaptive Block Scaling: Each block is quantized twice, with its absolute maximum scaled to 4 and to 6, and the scale with the lowest quantization mean-squared error is selected (see the sketch after this list). This approach directly targets the elevated error on near-maximal FP4 values, avoids divergence in models where standard NVFP4 is unstable, and is highly efficient on NVIDIA Blackwell GPUs (Cook et al., 1 Dec 2025, Panferov et al., 30 Jan 2026).
- MS-EDEN (Micro-Scaling EDEN) Gradient Quantizer: Provides unbiased, blockwise, post-RHT quantization for gradients, with more than 2× lower MSE than stochastic rounding at 4 bits (Panferov et al., 30 Jan 2026).
- Progressive Learning and MoR (Mixture of Representations): Progressive learning decays the influence of an ancillary BF16 or FP16 branch during training, guiding the low-precision network into a good solution manifold before “dropping” the full-precision branch (Zhou et al., 2019). MoR dynamically analyzes numerical statistics per-tensor or block and assigns the lowest-precision format (NVFP4, FP8, BF16) meeting a per-block error threshold, maximizing efficiency while controlling rare high-error blocks (Su et al., 28 Dec 2025).
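The Four-Over-Six selection described above can be emulated as in the sketch below: each block is quantized with its absolute maximum mapped to 4 and then to 6, and the candidate with the lower mean-squared error is kept. Helper names and grid handling are illustrative assumptions.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def _quantize_with_target(x, target):
    """Scale so the block maximum lands on `target` (4 or 6), then round onto the grid."""
    amax = np.abs(x).max()
    scale = amax / target if amax > 0 else 1.0
    idx = np.abs((np.abs(x) / scale)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return scale, np.sign(x) * E2M1_GRID[idx]

def quantize_block_4over6(x):
    """Try both scalings and keep whichever minimizes block MSE."""
    candidates = [_quantize_with_target(x, t) for t in (4.0, 6.0)]
    errs = [np.mean((s * q - x) ** 2) for s, q in candidates]
    return candidates[int(np.argmin(errs))]
```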
These recipes are compatible with and often synergize with the standard NVFP4 convergence- and stability-improving measures. For example, “Quartet II” combines (4/6)-enhanced forward quantization with MS-EDEN unbiased gradient estimation and achieves state-of-the-art fully-NVFP4 LLM pretraining (Panferov et al., 30 Jan 2026).
5. Large-Scale Experimental Validations
NVFP4 low-precision pretraining has been validated in models up to 120B parameters and sequence lengths up to 1M, with total training tokens exceeding 25T (NVIDIA et al., 14 Apr 2026). Experiments on 7B-, 12B-, and 13B-parameter Llama, OLMo, and Nemotron family models confirm:
- Downstream evaluation loss (validation and task accuracy) remains within 1% of FP8 or BF16 baselines (NVIDIA et al., 29 Sep 2025, NVIDIA et al., 14 Apr 2026, Chen et al., 31 Oct 2025).
- Task accuracy for representative LLM benchmarks (MMLU, GSM8K-CoT, HumanEval, HellaSwag, Winograd) is indistinguishable in most cases (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025, NVIDIA et al., 14 Apr 2026).
- State-of-the-art FQT (fully quantized training) with NVFP4 can match BF16 quality while exceeding its efficiency (~2–4× speedup, ~2× memory reduction) (Panferov et al., 30 Jan 2026, NVIDIA et al., 29 Sep 2025, NVIDIA et al., 14 Apr 2026, Chmiel et al., 25 May 2025).
- In vision models and smaller LLMs, Q-EMA/Q-Ramping, OutControl, and hybrid MoR further improve loss convergence and stability, with up to 1.32× memory-traffic reduction compared to FP8 (Chen et al., 28 Feb 2025, Su et al., 28 Dec 2025, NVIDIA et al., 29 Sep 2025).
6. Hardware and Implementation Aspects
NVFP4 is natively supported on NVIDIA Blackwell GPUs, including fused block quantization/dequantization, tensor-core GEMMs operating directly on packed FP4/E4M3 pairs, and fast stochastic rounding/PTX instructions for unbiased quantization (NVIDIA et al., 14 Apr 2026, NVIDIA et al., 29 Sep 2025). Custom CUDA kernels implement RHT, hot-channel compensation, and blockwise scale selection (e.g., 4/6). The underlying software typically extends NVIDIA’s Transformer Engine, cuBLAS, or PyTorch/TensorFlow AMP for fine-grained, mixed-precision graph annotation.
Best-practice reproducibility recommendations include:
- Do not increase learning rate solely due to NVFP4 regularization; adjust only if empirical validation suggests benefit (NVIDIA et al., 14 Apr 2026).
- Monitor per-layer and per-block underflows and fallback to higher precision if excessive zero gradients are observed (NVIDIA et al., 14 Apr 2026, Chen et al., 31 Oct 2025).
- Use conservative error thresholds (e.g., 1% mean relative error for NVFP4 blocks in MoR) to preserve accuracy (Su et al., 28 Dec 2025).
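A minimal sketch of the per-block error check behind the last recommendation, assuming some block quantize-dequantize routine such as the one sketched under Section 1; the function name and epsilon are illustrative, while the 1% threshold comes from the text.

```python
import numpy as np

def block_needs_fallback(x, x_dequant, threshold=0.01, eps=1e-8):
    """True if the block's mean relative quantization error exceeds the budget
    (e.g., 1%), signalling a fallback to FP8/BF16 for this block (MoR-style gating)."""
    rel_err = np.mean(np.abs(x_dequant - x) / (np.abs(x) + eps))
    return rel_err > threshold
```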
7. Impact, Limitations, and Future Directions
NVFP4-based pretraining makes it feasible to scale to massive model and data regimes with substantially reduced hardware cost and memory footprint (Chmiel et al., 25 May 2025, NVIDIA et al., 14 Apr 2026, Panferov et al., 30 Jan 2026). Persistent limitations include:
- Block-level outlier resistance is not perfect; rare pathological blocks continue to require higher-precision fallback, and fine-grained control remains an open challenge (Dong et al., 2 Feb 2026, Su et al., 28 Dec 2025).
- Some oscillation-mitigation strategies (e.g., Q-Ramping) imply tracking per-weight statistics, introducing minor implementation overhead (Chen et al., 28 Feb 2025, Chen et al., 31 Oct 2025).
- The 4/6 grid scaling is currently incompatible with MXFP4 (E8M0-scaled FP4) (Cook et al., 1 Dec 2025).
Future research is directed at (i) eliminating all remaining high-precision blocks/layers, (ii) generalizing block-adaptive quantization, (iii) combining learnable block rotations and latent exponent sharing, and (iv) extending robust NVFP4 recipes to mixture-of-experts, retrieval-augmented, and extremely long-context LLMs (Cook et al., 1 Dec 2025, Panferov et al., 30 Jan 2026, Su et al., 28 Dec 2025, Dong et al., 2 Feb 2026).
Key References: (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025, NVIDIA et al., 14 Apr 2026, Panferov et al., 30 Jan 2026, Chmiel et al., 25 May 2025, Cook et al., 1 Dec 2025, Su et al., 28 Dec 2025, Chen et al., 28 Feb 2025, Zhou et al., 2019, Dong et al., 2 Feb 2026)