NVFP4 Quantization Algorithm Overview

Updated 2 December 2025
  • NVFP4 is a 4-bit floating-point quantization algorithm that uses a hybrid scheme combining per-block FP8 scaling with FP4 values to reduce memory usage and boost computation speed.
  • It employs blockwise quantization with a global FP32 scale and per-block FP8 scales, achieving significant throughput improvements and minimal accuracy loss.
  • Advanced techniques like Four Over Six adaptive scaling and MR-GPTQ error compensation help mitigate quantization noise, ensuring stable training and inference performance.

NVIDIA FP4 (NVFP4) Quantization Algorithm

NVIDIA FP4 (NVFP4) is a hardware-supported, fine-grained, blockwise 4-bit floating-point (FP) quantization format designed for both efficient training and inference of LLMs on modern accelerators, particularly NVIDIA Blackwell GPUs. NVFP4 employs a hybrid numerical format, combining per-block FP8 “microscaling” with E2M1 FP4 values, delivering substantial memory reduction and compute speedup while maintaining minimal accuracy degradation when applied with appropriate quantization strategies and algorithmic safeguards.

1. NVFP4 Number Representation and Scaling Structure

NVFP4 encodes tensor elements using the following scheme:

  • FP4 Value (E2M1): Each element is stored as a 4-bit value with 1 sign bit, 2 exponent bits (exponent bias 1), and 1 mantissa bit. The codebook of representable values is:

\{0, \pm0.5, \pm1, \pm1.5, \pm2, \pm3, \pm4, \pm6\}

  • Per-Block FP8 E4M3 Scale: Each group of 16 elements (a “block”) shares an FP8 (E4M3: 4 exponent bits, 3 mantissa bits) scaling factor Δ_j, with dynamic range [−448, +448].
  • Per-Tensor Global FP32 Scale: Some implementations apply an additional global FP32 scale, especially for global normalization or ultra-high dynamic range.

The quantized value for an element x_i in block j is:

\bar{x}_i = \mathrm{Quant}_{\text{FP4}}\left( \frac{x_i}{\alpha \Delta_j} \right)

where α is the tensor-wide scale and Δ_j is the block’s FP8 scale. The corresponding real value after dequantization is

\hat{x}_i = \bar{x}_i \cdot \alpha \cdot \Delta_j

Bit-packed storage places two 4-bit values in each byte. The layout is optimized for rapid access by custom CUDA/Triton kernels and Blackwell tensor core support (Cook et al., 1 Dec 2025, NVIDIA et al., 29 Sep 2025, Hooper et al., 19 Apr 2025, Chen et al., 29 Oct 2025, Chen et al., 31 Oct 2025, Egiazarian et al., 27 Sep 2025).
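
As an illustration of this layered scaling, the following minimal numpy sketch decodes a flattened code stream back to real values via x̂_i = code · α · Δ_j (illustrative only; array layouts and names are assumptions, not the kernels from the cited papers):

import numpy as np

# E2M1 magnitude levels representable in NVFP4 (sign is carried separately).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_dequantize(codes, block_scales, alpha, block_size=16):
    # codes:        FP4 values already decoded to signed floats (entries of ±FP4_LEVELS), flattened 1-D
    # block_scales: one FP8 (E4M3) scale per block of `block_size` elements
    # alpha:        tensor-wide FP32 scale
    scales = np.repeat(block_scales, block_size)   # broadcast Δ_j over its 16 elements
    return codes * alpha * scales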

2. Quantization and Dequantization Procedure

2.1 Standard NVFP4 Block Quantization

For tensor X:

  1. Global scale (optional):

\alpha = \frac{\max |X|}{M^{\mathrm{FP4}} M^{\mathrm{FP8}}}

with typical values M^FP4 = 6 and M^FP8 = 448.

  2. Blockwise scale: For block j,

\Delta_j = \frac{\max_{i \in \text{block}_j} |X_i|}{\alpha \cdot M^{\mathrm{FP4}}}

Δ_j is then quantized to FP8 E4M3.

  3. Element quantization: For X_i in block j, compute y = X_i / (α Δ_j) and round to the nearest representable FP4 level.
  4. Bit-packing: Store two 4-bit quantized values per byte plus one 8-bit scale per block. A worked numeric example follows this list.
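
To make the two-level scaling concrete, here is a small worked example with invented numbers (they are not taken from the cited papers):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# Step 1: global scale from an assumed tensor-wide absolute maximum.
tensor_absmax = 268.8
alpha = tensor_absmax / (6 * 448)            # -> 0.1

# Step 2: per-block scale from an assumed block absolute maximum.
block_absmax = 2.4
delta_j = block_absmax / (alpha * 6)         # -> 4.0 (then stored as FP8 E4M3)

# Step 3: quantize one element to the nearest FP4 level and dequantize it.
x = 1.1
y = x / (alpha * delta_j)                    # -> 2.75 in FP4 units
code = np.sign(y) * FP4_LEVELS[np.argmin(np.abs(FP4_LEVELS - abs(y)))]   # -> 3.0
x_hat = code * alpha * delta_j               # dequantized value -> 1.2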

2.2 Quantization Mapping

Given a real value x in a block with per-block scale s:

\hat{x} = Q_{\text{NVFP4}}(x; s) = s \cdot \mathrm{round}(\mathrm{clip}(x/s, q_{\min}, q_{\max}))

where clip enforces the codebook range, and “round” implements either nearest or stochastic rounding (the latter for unbiased gradient propagation in training).
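
A vectorized nearest-rounding version of this mapping might look as follows (a sketch assuming numpy arrays and ignoring the FP8 quantization of the scale itself; a stochastic-rounding variant is sketched under Section 4.2):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def q_nvfp4(x, s):
    # Scale, clip to the codebook range, round to the nearest E2M1 level, rescale.
    y = np.clip(x / s, -6.0, 6.0)
    idx = np.argmin(np.abs(np.abs(y)[..., None] - FP4_LEVELS), axis=-1)
    return s * np.sign(y) * FP4_LEVELS[idx]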

2.3 Block and Tensor Granularity

  • Block size is consistently B = 16 elements.
  • For weights: 2D 16×16 block scaling is typical, keeping weight quantization consistent between the forward and backward passes (NVIDIA et al., 29 Sep 2025).
  • Activations and gradients often use 1×16 block scaling. A granularity sketch follows this list.
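
The granularity sketch referenced above computes per-block absolute maxima for the two layouts (assuming dimensions divisible by 16; illustrative helper names, not library code):

import numpy as np

def block_absmax_1x16(A):
    # Per-block max|.| over 1×16 blocks along the last dimension (activations/gradients).
    r, c = A.shape
    return np.abs(A.reshape(r, c // 16, 16)).max(axis=-1)        # shape (r, c/16)

def block_absmax_16x16(W):
    # Per-block max|.| over 2D 16×16 tiles (weights), so one scale serves forward and backward.
    r, c = W.shape
    tiles = W.reshape(r // 16, 16, c // 16, 16).transpose(0, 2, 1, 3)
    return np.abs(tiles).max(axis=(-1, -2))                      # shape (r/16, c/16)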

3. Error Characteristics and Outlier Management

3.1 Quantization Noise

  • Static noise per weight:

\Delta\epsilon = \hat{W} - W = S^{FP32} S_b^{E4M3} \tilde{W} - W

  • Noise variance (uniform-error approximation): for step size q_b = S^{FP32} · S_b^{E4M3} · ULP_FP4,

\mathrm{Var}[\Delta \epsilon_i] \approx \frac{q_b^2}{12}

A small numerical check of this approximation follows this list. Blockwise empirical variance is measured as

\sigma_b^2 = \frac{1}{|b|} \sum_{i \in b} (\hat{W}_i - W_i)^2

  • Effect on downstream tasks: Quantization noise flattens the logits’ distribution, empirically increasing policy entropy in RL settings, which encourages greater exploration (Huang et al., 13 Oct 2025).
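
The numerical check referenced in the noise-variance bullet: when the step size is much smaller than the data’s spread, the rounding error is approximately uniform and its variance approaches q_b²/12. The toy check below uses a scalar-step uniform quantizer rather than the full non-uniform FP4 grid, so it only validates the approximation itself:

import numpy as np

rng = np.random.default_rng(0)
q_b = 0.25                          # illustrative effective step size S_fp32 * S_b * ULP_FP4
w = rng.normal(size=100_000)
w_hat = np.round(w / q_b) * q_b     # uniform quantizer with step q_b
err = w_hat - w
print(err.var(), q_b**2 / 12)       # both ≈ 0.0052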

3.2 Outlier Mitigation Strategies

  • Random Hadamard Transform (RHT): Used on gradient paths (especially the weight-gradient path) to spread activation outliers and reduce their impact on blockwise scaling (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025); see the sketch after this list.
  • OutControl (TetraJet-v2): Static identification of outlier channels (via ℓ2-norm ranking) routed to higher precision. The Random Hadamard Transform and higher-precision channels combine to reduce overall error (Chen et al., 31 Oct 2025).
  • FGMP Policy: Blocks with large Fisher information are retained in FP8, while non-critical blocks are quantized to NVFP4, determined by sensitivity-weighted error aggregation (Hooper et al., 19 Apr 2025).
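
The RHT sketch referenced in the first bullet applies a randomized 16×16 Hadamard rotation to each contiguous block (a minimal illustration; the cited works fuse this into the gradient-path matmul kernels rather than materializing it like this):

import numpy as np

def random_hadamard_16(x, seed=0):
    # Random diagonal sign flip followed by an orthonormal 16x16 Hadamard transform per block.
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = H2
    for _ in range(3):                  # Sylvester construction: 2 -> 4 -> 8 -> 16
        H = np.kron(H, H2)
    H /= np.sqrt(16.0)                  # make the rotation orthonormal
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=16)
    blocks = x.reshape(-1, 16) * signs
    return (blocks @ H).reshape(x.shape)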

Effectiveness Caveat

Standard transform-based outlier mitigation (e.g., random rotation) is not beneficial (and can be detrimental) at NVFP4’s block size G = 16: Lemma 2.1 of (Egiazarian et al., 27 Sep 2025) shows blockwise Hadamard rotations increase top-element MSE for these small blocks. Instead, MR-GPTQ exploits format-specific compensations to benefit from blockwise rotations only within a compensation loop.

4. Extensions: Four Over Six and Double-Block Scaling

4.1 Four Over Six (4/6) Adaptive Block Scaling

Large quantization errors in NVFP4 arise for values near the maximal codebook value (scaled values y ≈ 4–6). The Four Over Six algorithm, introduced in (Cook et al., 1 Dec 2025), addresses this by adaptively choosing between:

  • \Delta_j^{(6)} = \max|\mathbf{X}| / 6
  • \Delta_j^{(4)} = \max|\mathbf{X}| / 4

For each block, 4/6 quantizes both ways and selects the one yielding lower mean squared error (MSE) to the original values. This reduces large discretization jumps for near-maximal values, stabilizes training, and improves PTQ results by up to 19.9% gap reduction to BF16 under AWQ recipes, with minimal computational overhead and no extra memory.
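
At block level, the 4/6 selection rule can be sketched as follows (illustrative Python with invented names; the global FP32 scale, FP8 quantization of Δ_j, and zero-block edge cases are omitted for brevity):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, delta):
    # Round block/delta to the nearest E2M1 level, then rescale.
    y = block / delta
    idx = np.argmin(np.abs(np.abs(y)[:, None] - FP4_LEVELS), axis=1)
    return np.sign(y) * FP4_LEVELS[idx] * delta

def four_over_six(block):
    # Quantize with both candidate scales and keep the one with the lower block MSE.
    absmax = np.abs(block).max()
    candidates = [absmax / 6.0, absmax / 4.0]
    recons = [quantize_block(block, d) for d in candidates]
    errors = [np.mean((r - block) ** 2) for r in recons]
    best = int(np.argmin(errors))
    return recons[best], candidates[best]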

4.2 Double-Block Quantization

TetraJet-v2 (Chen et al., 31 Oct 2025) introduces an unbiased double-block quantization method:

  • Outer block: 1×128
  • Micro-blocks: 1×16 within each outer block
  • Two-level scaling: outer global scale S_global and micro-block scale S_block

Deterministic rounding in forward and stochastic rounding in backward ensure unbiased gradients. This improves stability and narrows the perplexity gap to full precision, especially when coupled with oscillation and outlier controls.
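
The unbiased backward-path rounding can be illustrated with a standalone stochastic-rounding sketch onto the E2M1 grid (a simplification; TetraJet-v2’s actual kernels and the double-block scale handling are not reproduced here):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(y, rng):
    # Round |y| (already scaled into the FP4 range) to a neighboring codebook level,
    # taking the upper level with probability proportional to proximity, so the
    # rounding is unbiased within the representable range.
    mag = np.clip(np.abs(y), 0.0, 6.0)
    hi = np.clip(np.searchsorted(FP4_LEVELS, mag), 1, len(FP4_LEVELS) - 1)
    lo = hi - 1
    low, high = FP4_LEVELS[lo], FP4_LEVELS[hi]
    p_up = (mag - low) / (high - low)
    up = rng.random(mag.shape) < p_up
    return np.sign(y) * np.where(up, high, low)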

5. Implementation: Algorithm, Pseudocode, and Hardware

5.1 Core Quantization/Dequantization

function NVFP4_Quantize(W: FP32 tensor, B=16):
    # Global FP32 scale: alpha = max|W| / (M_FP4 * M_FP8), with qmax_FP4 = 6, qmax_FP8 = 448
    S_fp32 = max(abs(W)) / (qmax_FP4 * qmax_FP8)
    for each block b in W (size B):
        # Per-block scale Delta_b = max|block| / (alpha * M_FP4), stored as FP8 E4M3
        S_E4M3[b] = FP8_E4M3( max(abs(block)) / (S_fp32 * qmax_FP4) )
        for element i in block:
            y = block[i] / (S_fp32 * S_E4M3[b])
            # Clamp into the FP4 range and round to the nearest E2M1 level
            tildeW[i] = round_to_nearest_FP4( clamp(y, -qmax_FP4, qmax_FP4) )
    packed = pack_nibbles(tildeW)
    return packed, S_fp32, S_E4M3

function NVFP4_Dequant(packed, S_fp32, S_E4M3, B=16):
    tildeW = unpack_nibbles(packed)
    for b in 0..len(S_E4M3)-1:
        # Rescale each block: x_hat = code * alpha * Delta_b
        tildeW[b*B:(b+1)*B] *= (S_fp32 * S_E4M3[b])
    return tildeW as FP32 tensor
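
The pack_nibbles / unpack_nibbles helpers above are left abstract; a plausible numpy packing of two 4-bit codes per byte (assuming the sign/E2M1 bit pattern has already been encoded as integers 0–15) could look like this:

import numpy as np

def pack_nibbles(codes4bit):
    # Pack pairs of 4-bit codes (integers 0..15) into single bytes, low nibble first.
    codes = np.asarray(codes4bit, dtype=np.uint8).reshape(-1, 2)
    return (codes[:, 0] | (codes[:, 1] << np.uint8(4))).astype(np.uint8)

def unpack_nibbles(packed):
    # Inverse of pack_nibbles: each byte yields two 4-bit codes.
    packed = np.asarray(packed, dtype=np.uint8)
    return np.stack([packed & np.uint8(0x0F), packed >> np.uint8(4)], axis=1).reshape(-1)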

5.2 GPU and Datapath Details

  • Custom CUDA/Triton kernels (e.g. Marlin, QuTLASS) fuse unpacking, scale application, and matmul, supporting efficient NVFP4 deployment on Blackwell SM100/SM120 Tensor Cores (Cook et al., 1 Dec 2025, Egiazarian et al., 27 Sep 2025). Fused epilogues apply quantization, scaling, and packing within the matrix-multiply kernels.
  • Two values per byte are packed for memory efficiency.
  • On-device per-block scale quantization leverages GPU “cvt.rn” instructions for FP32→FP8 and packed FP4 conversion.
  • FGMP hardware: PEs support mixed FP8/FP4 VMAC datapaths, with a lightweight post-processing unit for quantization-error-based block format selection (Hooper et al., 19 Apr 2025); a sketch of this selection policy follows this list.
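
A sketch of the FGMP selection policy referenced here and in Section 3.2, which keeps a block in FP8 when its sensitivity-weighted FP4 error is too large (hypothetical names and thresholding; the paper’s exact aggregation is not reproduced):

import numpy as np

def fgmp_select_format(w_block, w_block_fp4, fisher_diag, threshold):
    # w_block:      original high-precision block
    # w_block_fp4:  the block after NVFP4 quantize/dequantize
    # fisher_diag:  per-element Fisher-information estimates (sensitivity weights)
    # threshold:    calibration-chosen per-block error budget
    weighted_err = np.sum(fisher_diag * (w_block_fp4 - w_block) ** 2)
    return "FP8" if weighted_err > threshold else "NVFP4"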

6. Training, PTQ, and RL Integration

6.1 Fully Quantized Training

  • TetraJet-v2 (Chen et al., 31 Oct 2025): End-to-end 4-bit training with NVFP4 for all linear layer operands; employs unbiased double-block quantization and suppression techniques (OsciReset, OutControl) for near-lossless convergence.
  • Random Hadamard Transform (RHT): Applied only before the weight-gradient matmuls; it improves quantization statistics for heavy-tailed, outlier-dominated blocks (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025).
  • Selective High-Precision Layers: First/last FFN blocks and other sensitive operations are retained in BF16/FP32 for stability (NVIDIA et al., 29 Sep 2025).

6.2 Post-Training Quantization (PTQ) and Specialized Algorithms

  • Micro-Rotated-GPTQ (MR-GPTQ): Format-specialized GPTQ for NVFP4. Employs block-wise Hadamard fused rotations, static activation reordering, and MSE-optimized scale/grid search. Delivers state-of-the-art recovery (96–99%) and hardware-efficient deployment (Egiazarian et al., 27 Sep 2025).
  • Four Over Six (4/6): A drop-in for any NVFP4-based PTQ method (AWQ, GPTQ, RTN, SmoothQuant), consistently reducing the BF16-to-NVFP4 performance gap with negligible runtime overhead (Cook et al., 1 Dec 2025).

6.3 LoRA, RL, and Exploration

7. Empirical Benchmarks and Real-World Performance

7.1 Model Performance and Scaling

A selection of key results (metric values exactly as reported):

| Scenario | Method/Model | Reported result (perplexity / accuracy / memory / speedup) |
| --- | --- | --- |
| Memory use (7B) | BF16 LoRA | ≈15.2 GB |
| Memory use (7B) | QLoRA (NF4+LoRA) | ≈5.7 GB (37%) |
| Memory use (7B) | QeRL (NVFP4+LoRA) | ≈5.9 GB (39%) |
| GSM8K, QeRL RL final accuracy | BF16 LoRA | 88.1% |
| GSM8K, QeRL RL final accuracy | QLoRA | 85.0% |
| GSM8K, QeRL RL final accuracy | NVFP4+LoRA, no AQN | 88.5% |
| GSM8K, QeRL RL final accuracy | NVFP4+LoRA+AQN | 90.8% |
| RL rollout throughput (14B) | BF16 LoRA | ≈65 tokens/s |
| RL rollout throughput (14B) | NVFP4 QeRL | ≈95 tokens/s (1.46×) |
| Pretraining (12B, 10T tokens) | NVFP4 | Validation loss within 1–1.5% of FP8 |
| PTQ, Wikitext-103 (7B) | FP8 | 5.06 |
| PTQ, Wikitext-103 (7B) | NVFP4 | 5.18 |
| PTQ, Wikitext-103 (7B) | FGMP (70% NVFP4) | 5.11 |
| Downstream zero-shot, Llama-3.1-8B-Instruct | FP16 | 78.93 (average) |
| Downstream zero-shot, Llama-3.1-8B-Instruct | NVFP4 MR-GPTQ | 75.84 (96.1% recovery) |
| Hardware efficiency | NVFP4 vs. MXFP8 | 0.55× energy/area |

7.2 Hardware and Efficiency

8. Limitations, Comparisons, and Recommendations

8.1 Trade-offs and Context

8.2 Practical Deployment

  • Employ NVFP4 (G=16, E4M3 per-block scale) with MR-GPTQ for top accuracy and hardware compatibility.
  • Integrate Four Over Six for enhanced robustness in PTQ and training, especially on models exhibiting heavy-tailed block distributions.
  • For RL, use NVFP4+AQN to combine memory/computation gains with improved exploration.
  • Retain ≈10–15% of layers in higher precision for critical network operations.
  • Avoid generic outlier mitigation transforms for NVFP4; rely on static outlier retention policies and MR-GPTQ’s internal Hadamard compensation.

9. References
