NVFP4 Low-Precision Pretraining

Updated 17 April 2026
  • The paper introduces NVFP4, a hardware-accelerated 4-bit pretraining paradigm that achieves near-BF16 accuracy with significant speedup and memory reduction.
  • It employs advanced quantization techniques such as per-block E2M1 scaling, stochastic rounding, and selective high-precision layers to ensure stable training.
  • Experimental validations on models up to 120B parameters demonstrate that NVFP4 closes over half the loss gap, delivering efficiency gains of 2–4× compared to conventional methods.

NVFP4 Low-Precision Pretraining is a hardware-accelerated, block-microscaled 4-bit floating-point (FP4) training paradigm for large Transformer and Mixture-of-Experts models, designed to maximize training throughput and memory efficiency while minimizing loss and accuracy degradation relative to standard 16-bit floating-point workflows. NVFP4 builds on per-block E2M1 quantization with FP8-scale factors, uses multi-pronged suppression of quantization artifacts (including stochastic rounding, outlier management, and oscillation control), is supported natively on NVIDIA Blackwell GPUs, and has been validated in multi-trillion-token pretraining runs up to the 120B parameter scale (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025, NVIDIA et al., 14 Apr 2026, Panferov et al., 30 Jan 2026, Chmiel et al., 25 May 2025).

1. NVFP4 Format and Quantization Principles

NVFP4 is a mixed-precision, block-microscaled FP4 number format. Every micro-block of 16 E2M1 floating-point values (representable magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, each signed) shares a single block scale in E4M3 (FP8) format, plus (optionally) a global FP32 scale for large tensors (NVIDIA et al., 29 Sep 2025, Chmiel et al., 25 May 2025, NVIDIA et al., 14 Apr 2026). This multi-stage scaling enables an effective block dynamic range of [-6·448, +6·448] and improved resistance to outlier-induced clipping. Quantization proceeds as follows for a block x_{1:16}:

  • Compute the block scale s = Q_{E4M3}(max_j |x_j| / 6), mapping the block's absolute maximum onto the largest E2M1 magnitude (6).
  • Quantize each element: x̂_j = s · Q_{E2M1}(x_j / s).
  • Dequantization is simply x_j ≈ x̂_j.
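As a concrete illustration, the per-block procedure above can be sketched in NumPy. This is a simplified model: `quantize_e4m3` is a crude stand-in that keeps 3 mantissa bits and omits E4M3 subnormal handling, and hardware rounding/tie-breaking may differ.

```python
import numpy as np

# Representable E2M1 magnitudes (sign is carried separately)
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    # Round-to-nearest onto the E2M1 grid, preserving sign
    idx = np.argmin(np.abs(np.abs(x)[:, None] - E2M1[None, :]), axis=1)
    return np.sign(x) * E2M1[idx]

def quantize_e4m3(s):
    # Crude stand-in for FP8 E4M3 rounding: keep 3 mantissa bits and
    # clamp to at most 448 (the E4M3 maximum); subnormals are omitted
    if s <= 0.0:
        return 0.0
    e = np.floor(np.log2(s))
    m = np.round(s / 2.0**e * 8.0) / 8.0
    return float(min(m * 2.0**e, 448.0))

def nvfp4_quantize_block(x):
    """Quantize one 16-element block: returns (FP8-rounded scale, E2M1 values)."""
    assert x.size == 16
    s = quantize_e4m3(np.max(np.abs(x)) / 6.0)  # map block amax onto +/-6
    if s == 0.0:
        return 0.0, np.zeros_like(x)
    return s, round_to_e2m1(x / s)

def nvfp4_dequantize_block(s, q):
    return s * q  # dequantization is just the scale multiply

block = np.linspace(-3.0, 3.0, 16)
s, q = nvfp4_quantize_block(block)
recon = nvfp4_dequantize_block(s, q)
```

For this block the per-element reconstruction error is bounded by the scale times half the largest E2M1 gap, which is why the 4-to-6 region of the grid dominates quantization error.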

Hybrid workflows include two-dimensional (row/column) block scaling for weights (enabling exact forward/backward GEMM consistency), and one-dimensional (contiguous axis) block scaling for activations and gradients (to maximize hardware GEMM kernel efficiency) (NVIDIA et al., 29 Sep 2025, NVIDIA et al., 14 Apr 2026).

2. Core Algorithmic Advances for Stable Training

Low-precision training introduces unique numerical challenges: dynamic-range underflow and overflow, quantized-gradient bias, and block-level representational collapse driven by outliers. NVFP4 pretraining addresses these with a suite of techniques, including stochastic rounding, outlier management, and oscillation control (detailed in the following sections). Their integration enables end-to-end NVFP4 quantization in all GEMM-heavy linear modules within LLMs and hybrid MoE architectures (NVIDIA et al., 14 Apr 2026, Chen et al., 31 Oct 2025).
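Stochastic rounding, one of these techniques, replaces round-to-nearest with a probabilistic choice between the two neighboring grid values so that the rounding is unbiased in expectation. A minimal sketch on a 1-D grid standing in for the E2M1 code points (a hypothetical helper, not the hardware PTX path):

```python
import numpy as np

def stochastic_round(x, grid, rng):
    """Round each element of x to one of its two neighboring grid values,
    choosing the upper neighbor with probability proportional to proximity,
    so that E[stochastic_round(x)] == x (unbiased)."""
    grid = np.sort(np.asarray(grid, dtype=float))
    x = np.clip(np.asarray(x, dtype=float), grid[0], grid[-1])
    hi = np.clip(np.searchsorted(grid, x, side="right"), 1, len(grid) - 1)
    lo = hi - 1
    gap = grid[hi] - grid[lo]
    p_up = np.where(gap > 0, (x - grid[lo]) / np.where(gap > 0, gap, 1.0), 0.0)
    go_up = rng.random(x.shape) < p_up
    return np.where(go_up, grid[hi], grid[lo])

grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # E2M1 magnitudes
rng = np.random.default_rng(0)
samples = stochastic_round(np.full(20000, 0.75), grid, rng)
```

An input of 0.75 rounds to 0.5 and 1.0 with equal probability, so the sample mean converges to 0.75; round-to-nearest would instead introduce a systematic bias, which is what makes stochastic rounding important for gradient tensors.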

3. Advanced Outlier and Oscillation Mitigation

Two dominant sources of accuracy degradation are (i) persistent outlier channels (driven by softmax, gating/nonlinearities, or SwiGLU) and (ii) quantization-induced weight oscillation. NVFP4 pipelines have developed specific extensions:

  • Hot-Channel Patch (HCP) & CHON Recipe: Identify persistent hot channels (per-channel quantization residual norms), reinject residuals in a hardware-fused second-order correction during GEMM, and maintain higher-precision quantization (usually BF16) for “post-QK” and other outlier-sensitive projections (Dong et al., 2 Feb 2026).
  • OutControl: Static masking of a small fraction (e.g., 10%) of channel dimensions, selected by running ℓ2-norm, which are kept in higher precision, with all remaining compute in NVFP4. This is implemented in both forward and backward linear passes (Chen et al., 31 Oct 2025).
  • Oscillation Control (OsciReset, Q-EMA, Q-Ramping): Periodically detect and reset weights that oscillate at FP4 quantization thresholds, forcing the master weights to the bin center and thus reducing the frequency of precision-induced output jumps. Q-EMA uses an exponential moving average of master weights for forward quantizer selection; Q-Ramping delays updates for high-oscillation weights and amplifies gradient steps to push weights off quantization midpoints (Chen et al., 31 Oct 2025, Chen et al., 28 Feb 2025).
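The oscillation-control idea can be sketched as follows. This is a simplified heuristic assuming a history buffer of at least three recent quantized-weight snapshots; the published methods instead track flip frequencies with EMAs and fuse detection into the optimizer step.

```python
import numpy as np

def oscillation_mask(q_history, threshold=0.5):
    """Flag weights whose quantized value keeps reversing direction.

    q_history: (T, N) array of a weight vector's quantized values over the
    last T >= 3 steps. Returns a boolean mask of oscillating weights.
    """
    deltas = np.diff(q_history, axis=0)        # quantized jumps per step
    flips = (deltas[1:] * deltas[:-1]) < 0     # consecutive sign reversals
    freq = flips.mean(axis=0)                  # fraction of reversals
    return freq > threshold

def reset_to_bin(master_w, grid, mask):
    """OsciReset-style reset: snap flagged master weights onto their nearest
    quantization point, so subsequent steps restart from the bin center."""
    snapped = grid[np.argmin(np.abs(master_w[:, None] - grid[None, :]), axis=1)]
    return np.where(mask, snapped, master_w)

# Toy history: weight 0 flips between bins every step, weight 1 is stable
hist = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, 1.0]])
mask = oscillation_mask(hist)
```

Only the flapping weight is flagged and snapped; the stable weight's master value is left untouched, keeping the intervention sparse.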

Empirically, this unified set of mitigation techniques closes more than half the loss/accuracy gap between naive FP4 and full precision, and in large runs brings NVFP4 accuracy to within 1% of BF16 baselines (Chen et al., 31 Oct 2025, Dong et al., 2 Feb 2026).

4. Key Innovations and Algorithmic Recipes

Notable algorithmic contributions include:

  • Four-Over-Six (4/6) Adaptive Block Scaling: Each block is quantized twice (scaling min/max to ±4 and ±6), selecting the scale with lowest quantization mean-squared error. This approach directly targets the elevated error on near-maximal FP4 values, avoids divergence in models where standard NVFP4 is unstable, and is highly efficient on NVIDIA Blackwell GPUs (Cook et al., 1 Dec 2025, Panferov et al., 30 Jan 2026).
  • MS-EDEN (Micro-Scaling EDEN) Gradient Quantizer: Provides unbiased, blockwise, post-RHT quantization for gradients, with more than 2× lower MSE than stochastic rounding at 4 bits (Panferov et al., 30 Jan 2026).
  • Progressive Learning and MoR (Mixture of Representations): Progressive learning decays the influence of an ancillary BF16 or FP16 branch during training, guiding the low-precision network into a good solution manifold before “dropping” the full-precision branch (Zhou et al., 2019). MoR dynamically analyzes numerical statistics per-tensor or block and assigns the lowest-precision format (NVFP4, FP8, BF16) meeting a per-block error threshold, maximizing efficiency while controlling rare high-error blocks (Su et al., 28 Dec 2025).
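Of these, the Four-Over-Six selection is the simplest to sketch: quantize each block under two candidate scales (block amax mapped onto 4 or onto 6) and keep whichever yields the lower mean-squared error. The sketch below omits the E4M3 rounding of the scale and assumes a nonzero block.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_block(x, target_max):
    """Dequantized block with its absolute maximum mapped onto target_max.
    Assumes x has at least one nonzero element."""
    s = np.max(np.abs(x)) / target_max
    idx = np.argmin(np.abs(np.abs(x / s)[:, None] - E2M1[None, :]), axis=1)
    return np.sign(x) * E2M1[idx] * s

def four_over_six(x):
    """Try amax -> 6 (standard) and amax -> 4, keep the lower-MSE result."""
    best = None
    for tmax in (6.0, 4.0):
        q = quant_block(x, tmax)
        mse = np.mean((q - x) ** 2)
        if best is None or mse < best[0]:
            best = (mse, tmax, q)
    return best

rng = np.random.default_rng(1)
block = rng.standard_normal(16)
mse, tmax, q = four_over_six(block)
```

Mapping amax onto 4 trades a slightly larger step size for placing near-maximal values in the denser 3-to-4 region of the grid rather than the sparse 4-to-6 gap, which is exactly the elevated-error regime the recipe targets.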

These recipes are compatible with and often synergize with the standard NVFP4 convergence- and stability-improving measures. For example, “Quartet II” combines (4/6)-enhanced forward quantization with MS-EDEN unbiased gradient estimation and achieves state-of-the-art fully-NVFP4 LLM pretraining (Panferov et al., 30 Jan 2026).

5. Large-Scale Experimental Validations

NVFP4 low-precision pretraining has been validated in models up to 120B parameters and sequence lengths up to 1M, with total training tokens exceeding 25T (NVIDIA et al., 14 Apr 2026). Experiments on 7B-, 12B-, and 13B-parameter Llama, OLMo, and Nemotron family models corroborate these results at smaller scales.

6. Hardware and Implementation Aspects

NVFP4 is natively supported on NVIDIA Blackwell GPUs, including fused block quantization/dequantization, tensor-core GEMMs operating directly on packed FP4/E4M3 pairs, and fast stochastic rounding/PTX instructions for unbiased quantization (NVIDIA et al., 14 Apr 2026, NVIDIA et al., 29 Sep 2025). Custom CUDA kernels implement RHT, hot-channel compensation, and blockwise scale selection (e.g., 4/6). The underlying software typically extends NVIDIA’s Transformer Engine, cuBLAS, or PyTorch/TensorFlow AMP for fine-grained, mixed-precision graph annotation.
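For illustration, the packed-FP4 storage layout such kernels consume (two 4-bit codes per byte) can be modeled on the host. These are hypothetical helpers; actual nibble ordering and tiling are kernel-specific.

```python
import numpy as np

def pack_fp4(codes):
    """Pack an even-length array of 4-bit code indices (0..15) into bytes,
    two codes per uint8: element 2i in the low nibble, 2i+1 in the high."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0 and codes.max(initial=0) < 16
    return ((codes[1::2].astype(np.uint16) << 4) | codes[0::2]).astype(np.uint8)

def unpack_fp4(packed):
    """Inverse of pack_fp4: recover the interleaved 4-bit codes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.array([3, 10, 0, 15], dtype=np.uint8)
packed = pack_fp4(codes)            # 4 codes stored in 2 bytes
restored = unpack_fp4(packed)
```

This halved storage relative to FP8 (plus one E4M3 scale byte per 16-element block) is the source of the memory-footprint reduction cited above.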

The published recipes also document best-practice recommendations for reproducibility.

7. Impact, Limitations, and Future Directions

NVFP4-enabled pretraining enables scaling to massive model and data regimes with substantially reduced hardware cost and memory footprint (Chmiel et al., 25 May 2025, NVIDIA et al., 14 Apr 2026, Panferov et al., 30 Jan 2026), though some limitations persist.

Future research is directed at (i) eliminating all remaining high-precision blocks/layers, (ii) generalizing block-adaptive quantization, (iii) combining learnable block rotations and latent exponent sharing, and (iv) extending robust NVFP4 recipes to mixture-of-experts, retrieval-augmented, and extremely long-context LLMs (Cook et al., 1 Dec 2025, Panferov et al., 30 Jan 2026, Su et al., 28 Dec 2025, Dong et al., 2 Feb 2026).


Key References: (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025, NVIDIA et al., 14 Apr 2026, Panferov et al., 30 Jan 2026, Chmiel et al., 25 May 2025, Cook et al., 1 Dec 2025, Su et al., 28 Dec 2025, Chen et al., 28 Feb 2025, Zhou et al., 2019, Dong et al., 2 Feb 2026)
