NVIDIA FP4 Quantization Algorithm
- NVFP4 is a 4-bit floating-point quantization scheme using E2M1 encoding with block microscaling to enable aggressive precision reduction in LLMs.
- The algorithm employs techniques such as unbiased stochastic rounding, double-block scaling, and four-over-six adaptive scaling to maintain near-lossless accuracy.
- Empirical results show NVFP4 reduces memory and energy consumption and delivers up to 3.6× inference speedup, with a 1–2% accuracy gap relative to higher-precision baselines.
NVIDIA FP4 (NVFP4) Quantization Algorithm
NVIDIA FP4 (NVFP4) quantization is a hardware-native, block-microscaled, 4-bit floating-point scheme implemented in NVIDIA Blackwell-class GPUs to accelerate LLM training and inference. NVFP4 enables aggressive precision reduction for weights, activations, and gradients, yielding substantial increases in throughput and reductions in memory and energy consumption, while maintaining near-lossless accuracy on large-scale transformers. NVFP4 achieves this through carefully engineered per-block scaling, unbiased stochastic rounding, algorithmic outlier suppression, and adaptive scaling extensions, as described in recent works including TetraJet-v2 (Chen et al., 31 Oct 2025), Four Over Six (4/6) adaptive scaling (Cook et al., 1 Dec 2025), and a suite of benchmark studies and production frameworks.
1. Format Definition: E2M1 Floating-Point and Block Microscaling
NVFP4 encodes each value as an E2M1 4-bit floating-point number, consisting of:
- 1 sign bit
- 2 exponent bits (bias=1)
- 1 mantissa bit
The representable value set is $\{0,\ \pm 0.5,\ \pm 1,\ \pm 1.5,\ \pm 2,\ \pm 3,\ \pm 4,\ \pm 6\}$, yielding a dynamic range of $[-6, 6]$ (Chen et al., 31 Oct 2025, Cook et al., 1 Dec 2025).
To address the limited representational capacity of FP4, NVFP4 attaches an 8-bit E4M3 floating-point scale to each consecutive block of 16 elements (block size $B = 16$), known as block microscaling. The E4M3 scale can encode scaling factors up to 448 (with subnormals reaching down to $2^{-9}$), so each block may cover magnitudes up to roughly $6 \times 448 \approx 2.7 \times 10^{3}$ after scaling. For full-tensor scaling, an additional FP32 scale per tensor may be applied to maximize representability in high-dynamic-range data (NVIDIA et al., 29 Sep 2025, Chmiel et al., 25 May 2025, Chen et al., 29 Oct 2025).
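As a concrete illustration of the format, the following minimal Python sketch enumerates the E2M1 grid and rounds arbitrary values onto it. The grid follows directly from the bit layout above; the helper names (`E2M1_GRID`, `round_to_e2m1`) are illustrative and not part of any NVIDIA API.

```python
import numpy as np

# Positive magnitudes representable by E2M1 (1 sign bit, 2 exponent bits with
# bias 1, 1 mantissa bit); with the sign bit this gives {0, ±0.5, ..., ±6}.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x: np.ndarray) -> np.ndarray:
    """Round each element to the nearest representable E2M1 value (sign preserved)."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

print(round_to_e2m1(np.array([0.3, -1.7, 2.4, 7.2])))  # [ 0.5 -1.5  2.   6. ]
```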
Comparison to MXFP4: MXFP4 uses blocks of 32 and an 8-bit E8M0 (power-of-two, exponent-only) scale, resulting in greater quantization error due to less precise scaling (Chen et al., 31 Oct 2025, Egiazarian et al., 27 Sep 2025).
2. Block-wise and Double-block Scaling Quantization Schemes
The NVFP4 quantization process proceeds as follows for each block of 16 values $x_1, \dots, x_{16}$ (a minimal sketch follows the list):
- Block scale selection: $s = \max_i |x_i| / 6$, mapping the block maximum onto the largest E2M1 magnitude. Quantize $s$ to E4M3, denoted $\hat{s}$.
- Quantize each element: $\hat{x}_i = \mathrm{Q}_{\mathrm{E2M1}}(x_i / \hat{s})$, where the rounding is to the nearest representable E2M1 value.
- Storage: Store the 16 4-bit FP4 values plus the single E4M3 scale; dequantization recovers $\hat{s}\,\hat{x}_i$.
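A minimal sketch of this block-wise scheme, assuming the amax-to-6 scale selection above; the crude E4M3 rounding helper (3-bit mantissa, clamp at 448, subnormals ignored) is an illustrative approximation, not production kernel code.

```python
import math
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def quantize_to_e4m3(s: float) -> float:
    """Crude E4M3 approximation: keep 3 mantissa bits, clamp to the E4M3
    maximum of 448 (subnormal handling omitted for brevity)."""
    if s == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(s)))
    m = round(abs(s) / 2.0 ** e * 8) / 8.0            # 3-bit mantissa
    return math.copysign(min(m * 2.0 ** e, 448.0), s)

def nvfp4_quantize_block(block: np.ndarray):
    """Quantize one 16-element block: scale = amax / 6 (stored as E4M3),
    elements rounded to the nearest E2M1 value.  Dequantize as q * s."""
    amax = float(np.max(np.abs(block)))
    if amax == 0.0:
        return np.zeros_like(block), 0.0
    s = quantize_to_e4m3(amax / 6.0)
    return round_to_e2m1(block / s), s

block = np.random.randn(16).astype(np.float32)
q, s = nvfp4_quantize_block(block)
print(np.abs(block - q * s).max())                    # block-wise quantization error
```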
Double-block scaling (TetraJet-v2): For improved utilization of the scaling range,
- An outer scale $s_{\mathrm{outer}}$ is computed over a larger block (e.g., 128 elements) to compress the global dynamic range.
- Each inner block of 16 then has an additional local scale $s_{\mathrm{inner}}$.
- Final quantization applies both scales: $\hat{x}_i = \mathrm{Q}_{\mathrm{E2M1}}\!\bigl(x_i / (s_{\mathrm{outer}}\, s_{\mathrm{inner}})\bigr)$, with dequantization as $s_{\mathrm{outer}}\, s_{\mathrm{inner}}\, \hat{x}_i$.
This hierarchical scaling provides better coverage of both global and local outlier variation (Chen et al., 31 Oct 2025); a sketch follows.
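The hierarchical idea can be sketched as below, assuming the outer scale is kept in FP32 over 128-element super-blocks and the inner scale covers each 16-element block (E4M3 rounding of the scales is omitted); TetraJet-v2's exact kernel layout may differ.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def double_block_quantize(x, outer=128, inner=16):
    """Outer scale compresses the global range of each 128-element super-block;
    each 16-element inner block then receives an additional local scale."""
    xb = x.reshape(-1, outer)
    s_outer = np.max(np.abs(xb), axis=-1, keepdims=True) / 6.0
    s_outer = np.where(s_outer == 0, 1.0, s_outer)
    y = (xb / s_outer).reshape(-1, inner)
    s_inner = np.max(np.abs(y), axis=-1, keepdims=True) / 6.0
    s_inner = np.where(s_inner == 0, 1.0, s_inner)
    return round_to_e2m1(y / s_inner), s_inner, s_outer

x = np.random.randn(256).astype(np.float32)
q, s_in, s_out = double_block_quantize(x)
x_hat = ((q * s_in).reshape(-1, 128) * s_out).reshape(-1)   # apply both scales
print(np.abs(x - x_hat).max())
```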
Unbiased SGD: Stochastic rounding (SR) is used in the backward pass, while round-to-nearest is used in the forward pass. This guarantees $\mathbb{E}[\hat{x}] = x$, enabling the unbiased stochastic gradient descent (SGD) essential for training stability (Chen et al., 31 Oct 2025, Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025, Wang et al., 28 Jan 2025). A sketch of stochastic rounding onto the E2M1 grid follows.
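A minimal sketch of stochastic rounding onto the E2M1 grid, illustrating the unbiasedness property $\mathbb{E}[\hat{x}] = x$ for inputs within $[-6, 6]$; the helper names are illustrative.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[::-1], E2M1[1:]])        # full signed grid, sorted

def stochastic_round_e2m1(x, rng=None):
    """Round to one of the two bracketing E2M1 values with probability
    proportional to proximity, so that E[x_hat] = x for |x| <= 6."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(x, GRID[0], GRID[-1])
    hi_idx = np.clip(np.searchsorted(GRID, x, side="left"), 1, len(GRID) - 1)
    lo, hi = GRID[hi_idx - 1], GRID[hi_idx]
    p_hi = (x - lo) / (hi - lo)                       # adjacent grid points are distinct
    return np.where(rng.random(x.shape) < p_hi, hi, lo)

x = np.full(100_000, 2.7)
print(stochastic_round_e2m1(x).mean())                # ~2.7: the rounding is unbiased
```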
3. Outlier Handling: Random Hadamard Transform and High-Precision Retention
Quantization error in NVFP4 is dominated by outliers and groupwise maxima, especially in transformer activations and gradients. Distinct approaches have been proposed:
- Random Hadamard Transform (RHT): Before quantization, multiply the relevant tensor (typically gradients or activations during backpropagation) by a random (diagonal-signed) Hadamard matrix. This spreads outlier mass across the block, reducing per-channel variance and making the distribution better suited to the limited E2M1 grid (Chen et al., 31 Oct 2025, NVIDIA et al., 29 Sep 2025). After quantization and GEMM, the inverse transform is applied. RHT is highly effective at NVFP4's small block sizes, and its overhead is minimal in hardware (a sketch follows this list).
- Static high-precision retention: Persistent outlier channels, detected via offline profiling, are kept in higher precision (e.g., BF16/FP8), while the remainder use NVFP4. This is applied to only a small fraction of channels (e.g., the highest-norm channels in a tensor) (Chen et al., 31 Oct 2025, NVIDIA et al., 29 Sep 2025, Hooper et al., 19 Apr 2025).
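A minimal sketch of a blockwise randomized Hadamard transform as described above, assuming a Sylvester-constructed 16×16 Hadamard matrix and a shared random sign vector; production kernels fuse this into the GEMM, and because the transform is orthogonal, the inverse is simply its transpose applied after the low-precision matmul.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard_transform(x: np.ndarray, block: int = 16, seed: int = 0):
    """Multiply each 16-element block by diag(random signs) @ H / sqrt(16),
    spreading outlier mass evenly across the block before NVFP4 quantization."""
    rng = np.random.default_rng(seed)
    H = hadamard(block) / np.sqrt(block)               # orthonormal
    R = np.diag(rng.choice([-1.0, 1.0], size=block)) @ H
    return (x.reshape(-1, block) @ R).reshape(x.shape), R

x = np.random.randn(64)
x[3] = 50.0                                            # inject an outlier
y, R = random_hadamard_transform(x)
print(np.abs(x).max(), np.abs(y).max())                # outlier mass is spread out
print(np.allclose((y.reshape(-1, 16) @ R.T).reshape(-1), x))  # inverse recovers x
```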
Table: Summary of Outlier Mitigation Components
| Method | Source | Applied to | Principle |
|---|---|---|---|
| RHT | (Chen et al., 31 Oct 2025) | Gradients | Spreads outlier variance, bounds quantization noise |
| Static retention | (Chen et al., 31 Oct 2025) | Forward/backward | Profiled high-norm channels kept in high precision (BF16/FP8) |
4. Oscillation Suppression and Rounding Strategies
NVFP4 quantization at extremely low learning rates can induce bin-boundary oscillations ("ping-ponging") of weights between adjacent FP4 bins, degrading convergence.
- OsciReset (TetraJet-v2): This method tracks, per weight, the ratio of FP4 quantization jumps to the true weight movement over a sliding window $W$ of recent steps:
$$\rho = \frac{\sum_{t \in W} \bigl|\mathrm{Q}(w_t) - \mathrm{Q}(w_{t-1})\bigr|}{\sum_{t \in W} \bigl|w_t - w_{t-1}\bigr|}.$$
If this oscillation risk exceeds a threshold, the weight is recentered at the midpoint of its current FP4 bin (Chen et al., 31 Oct 2025). A sketch of this metric follows.
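The metric can be sketched as follows; here `quantize` stands in for the block-scaled FP4 quantize-dequantize mapping, and the exact statistic and window handling in TetraJet-v2 may differ in detail.

```python
import numpy as np

def oscillation_risk(weight_history, quantize):
    """Ratio of quantized-weight movement (FP4 bin jumps) to true weight
    movement over a sliding window of snapshots (axis 0).  Values near or
    above 1 indicate the weight is mostly ping-ponging between adjacent bins."""
    w = np.asarray(weight_history)                     # shape: (window, ...)
    q = quantize(w)
    quant_jumps = np.abs(np.diff(q, axis=0)).sum(axis=0)
    true_motion = np.abs(np.diff(w, axis=0)).sum(axis=0) + 1e-12
    return quant_jumps / true_motion

# Toy example: a weight hovering near a bin boundary oscillates in its
# quantized value even though the underlying weight barely moves.
history = 1.74 + 0.02 * np.sin(np.arange(32))          # near the 1.5/2.0 boundary
risk = oscillation_risk(history[:, None], lambda w: np.where(w < 1.75, 1.5, 2.0))
print(risk)   # >> 1; above the threshold, OsciReset would re-center the weight
```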
Rounding strategy and SGD unbiasedness: NVFP4 implementations employ round-to-nearest (RTN) in the forward path (minimizing representational error) and stochastic rounding (SR) in the backward path (yielding unbiased gradient estimates). Empirically, SR on gradients alone suffices for stable training; activations and weights use RTN (Chmiel et al., 25 May 2025, Chen et al., 31 Oct 2025, NVIDIA et al., 29 Sep 2025).
5. Adaptive Scaling: Four Over Six (4/6) and MSE-Optimal Schemes
Block scaling induces particularly severe quantization error for values just below the block maximum, because the E2M1 step size is coarsest at the top of the grid (between the representable values 4 and 6, the step is 2; a value of 5 must round to 4 or 6, a 20% relative error). This leads to relative error spikes at high-magnitude inputs.
- Four Over Six (4/6) Adaptive Scaling: For each block, both the standard scale $s_6 = \max_i |x_i| / 6$ and the alternative $s_4 = \max_i |x_i| / 4$ are considered. The block is quantized twice (once with the block maximum mapped to the E2M1 value 6, once with it mapped to 4). The scale yielding lower mean-squared error (MSE) is selected for actual storage (a sketch follows below):
$$\hat{s} = \arg\min_{s \in \{s_4,\, s_6\}} \sum_{i} \bigl(x_i - s\,\mathrm{Q}_{\mathrm{E2M1}}(x_i / s)\bigr)^2.$$
This modification efficiently improves representation of near-maximum values, prevents divergence in both pretraining and post-training quantization scenarios for large-scale architectures, and can be implemented as a minor extension of the standard kernel. Empirical results show universal improvements for common LLMs and substantial reductions in perplexity and accuracy gaps (Cook et al., 1 Dec 2025).
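A minimal sketch of 4/6 scale selection under the assumptions of Section 2's basic scheme (round-to-nearest onto the E2M1 grid; E4M3 rounding of the scale omitted); helper names are illustrative.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def four_over_six_quantize(block: np.ndarray):
    """Try mapping the block max onto E2M1 value 6 (standard) and onto 4
    (finer steps near the top of the range); keep whichever scale gives the
    lower block MSE."""
    amax = float(np.max(np.abs(block)))
    if amax == 0.0:
        return np.zeros_like(block), 0.0
    best = None
    for target in (6.0, 4.0):
        s = amax / target
        q = round_to_e2m1(block / s)
        mse = float(np.mean((block - q * s) ** 2))
        if best is None or mse < best[2]:
            best = (q, s, mse)
    return best[0], best[1]

block = np.array([1.0, 0.85, 0.8, -0.9] + [0.0] * 12)  # many values just below the max
q, s = four_over_six_quantize(block)
print(np.max(np.abs(block)) / s)   # 4.0 here: the amax->4 scale gave lower MSE
```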
6. Empirical Performance and Model-Scale Impact
NVFP4 quantization, when integrated with advanced scaling, unbiased rounding, outlier mitigation, and oscillation suppression, yields large practical gains for LLM training and inference.
- TetraJet-v2: On OLMo-2 LLMs up to 370M parameters and 200B tokens, NVFP4 with double-block scaling, RHT, OsciReset, and OutControl yields validation perplexities close to BF16, and downstream accuracy drops under 1% (Chen et al., 31 Oct 2025).
- Four Over Six: Applying 4/6 adaptive scaling universally prevents training divergence and closes the loss gap to BF16 on transformer and hybrid models, reduces WikiText-2 perplexity by 3–7%, and boosts downstream accuracy by up to 1 point with negligible runtime overhead (Cook et al., 1 Dec 2025).
- End-to-end LLM training: Training with NVFP4 (with stochastic rounding, RHT, and partial high-precision layers as needed) achieves within 1–2% of FP8/BF16 baseline accuracy, with stable convergence on models up to 12B parameters trained on up to 10T tokens (NVIDIA et al., 29 Sep 2025, Chmiel et al., 25 May 2025). MXFP4 requires up to 36% more tokens to reach comparable loss (NVIDIA et al., 29 Sep 2025).
- Inference speedups: Direct NVFP4 execution on Blackwell GPUs delivers up to 3.6× speedup on a single linear layer and end-to-end LLM inference speedups of 2.0× or more (Egiazarian et al., 27 Sep 2025, Liu et al., 4 Aug 2025).
7. Hardware, Mixed-Precision, and Comparative Context
NVFP4 is directly supported by Blackwell-class NVIDIA GPUs and is the canonical low-bit format for 4-bit microscaling in the hardware and software ecosystem. Key low-level details include:
- Hardware implementation: Each NVFP4 block of sixteen 4-bit E2M1 values is stored with an E4M3 scale. Processing units perform blockwise dequantization and accumulation in FP16/FP32, with tightly integrated rounding and scale application. Optimized CUDA and CUTLASS kernels fuse quantization, dequantization, and matrix multiplication while utilizing FP4 instruction sets (Liu et al., 4 Aug 2025, Egiazarian et al., 27 Sep 2025). A simulation-level sketch of the resulting GEMM appears after this list.
- Comparison to INT4 and MXFP4: For small block sizes, NVFP4 typically outperforms INT4 and MXFP4 on both accuracy and stability, unless INT4 is augmented with advanced outlier mitigation (e.g., Hadamard rotations) (Chen et al., 29 Oct 2025, Egiazarian et al., 27 Sep 2025). In hardware area and energy, NVFP4 improves over larger FP scalar formats but is moderately less efficient than NVINT4.
- Mixed-precision integration: For robustness, NVFP4 is frequently deployed with selective retention of FP8/BF16 in a small set of high-impact layers or persistent outlier channels (Chen et al., 31 Oct 2025, NVIDIA et al., 29 Sep 2025, Hooper et al., 19 Apr 2025). These strategies allow >30% weight memory reduction at <1% accuracy loss on major LLMs.
- Software ecosystem and kernels: Production kernels support fused rotation, scale search (MSE-optimized grids), static activation permutation, and blockwise quantization on NVFP4. Recent quantization frameworks (e.g., MR-GPTQ, TetraJet-v2, MicroMix) specifically optimize for FP4 hardware constraints and accuracy restoration (Egiazarian et al., 27 Sep 2025, Chen et al., 31 Oct 2025, Liu et al., 4 Aug 2025).
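To make the blockwise dequantize-and-accumulate behavior concrete, the sketch below emulates an NVFP4 GEMM numerically, assuming quantize-dequantize emulation with FP32 scales (no E4M3 rounding, no fused rotation); it is not CUDA/CUTLASS code, only a numerical model.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_e2m1(x):
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def fake_quantize_nvfp4(x, block=16):
    """Quantize-dequantize each row in 16-element blocks (scale = amax/6,
    kept in FP32 here), returning a tensor of NVFP4-representable values."""
    rows, cols = x.shape
    xb = x.reshape(rows, cols // block, block)
    s = np.max(np.abs(xb), axis=-1, keepdims=True) / 6.0
    s = np.where(s == 0, 1.0, s)
    return (round_to_e2m1(xb / s) * s).reshape(rows, cols)

def simulated_nvfp4_gemm(a, b):
    """Emulate an NVFP4 GEMM: both operands are block-quantized along the
    reduction dimension, then multiplied with full-precision accumulation."""
    return fake_quantize_nvfp4(a) @ fake_quantize_nvfp4(b.T).T

a = np.random.randn(8, 64).astype(np.float32)
b = np.random.randn(64, 8).astype(np.float32)
print(np.abs(simulated_nvfp4_gemm(a, b) - a @ b).max())   # quantization-induced error
```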
In summary, the NVIDIA FP4 (NVFP4) quantization algorithm—by combining small-block E2M1 floating-point values, high-resolution block E4M3 scaling, unbiased stochastic quantization, algorithmic oscillation suppression (OsciReset), adaptive scaling (4/6), and hybrid outlier control—restores near-full-precision convergence and accuracy for LLM training and inference at 4-bit cost. The format is now foundational in Blackwell platforms and informs the algorithm–hardware co-design of next-generation low-precision AI accelerators (Chen et al., 31 Oct 2025, Cook et al., 1 Dec 2025, NVIDIA et al., 29 Sep 2025, Chmiel et al., 25 May 2025, Egiazarian et al., 27 Sep 2025).