NVFP4 Quantization Algorithm Overview

Updated 2 December 2025
  • NVFP4 is a 4-bit floating-point quantization algorithm that uses a hybrid scheme combining per-block FP8 scaling with FP4 values to reduce memory usage and boost computation speed.
  • It employs blockwise quantization with a global FP32 scale and per-block FP8 scales, achieving significant throughput improvements and minimal accuracy loss.
  • Advanced techniques like Four Over Six adaptive scaling and MR-GPTQ error compensation help mitigate quantization noise, ensuring stable training and inference performance.

NVIDIA FP4 (NVFP4) Quantization Algorithm

NVIDIA FP4 (NVFP4) is a hardware-supported, fine-grained, blockwise 4-bit floating-point (FP) quantization format designed for both efficient training and inference of LLMs on modern accelerators, particularly NVIDIA Blackwell GPUs. NVFP4 employs a hybrid numerical format, combining per-block FP8 “microscaling” with E2M1 FP4 values, delivering substantial memory reduction and compute speedup while maintaining minimal accuracy degradation when applied with appropriate quantization strategies and algorithmic safeguards.

1. NVFP4 Number Representation and Scaling Structure

NVFP4 encodes tensor elements using the following scheme:

  • FP4 Value (E2M1): Each element is stored as a 4-bit value with 1 sign bit, 2 exponent bits (exponent bias 1), and 1 mantissa bit. The codebook of representable values is:

\{0, \pm0.5, \pm1, \pm1.5, \pm2, \pm3, \pm4, \pm6\}

  • Per-Block FP8 E4M3 Scale: Each group of 16 elements (a “block”) shares an FP8 (E4M3: 4 exponent bits, 3 mantissa bits) scaling factor Δ_j, with dynamic range [−448, +448].
  • Per-Tensor Global FP32 Scale: Some implementations apply an additional global FP32 scale, especially for global normalization or ultra-high dynamic range.

The quantized value for an element x_i in block j is:

\bar{x}_i = \mathrm{Quant}_{\text{FP4}}\left( \frac{x_i}{\alpha \Delta_j} \right)

where α is the tensor-wide scale and Δ_j is the block’s FP8 scale. The corresponding real value after dequantization is

\hat{x}_i = \bar{x}_i \cdot \alpha \cdot \Delta_j

Bit-packed storage places two 4-bit values in each byte. The layout is optimized for rapid access by custom CUDA/Triton kernels and Blackwell tensor core support (Cook et al., 1 Dec 2025, NVIDIA et al., 29 Sep 2025, Hooper et al., 19 Apr 2025, Chen et al., 29 Oct 2025, Chen et al., 31 Oct 2025, Egiazarian et al., 27 Sep 2025).
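
As an illustration of this layered scaling, the following minimal numpy sketch decodes a flattened code stream back to real values via x̂_i = code · α · Δ_j (illustrative only; array layouts and names are assumptions, not the kernels from the cited papers):

import numpy as np

# E2M1 magnitude levels representable in NVFP4 (sign is carried separately).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_dequantize(codes, block_scales, alpha, block_size=16):
    # codes:        FP4 values already decoded to signed floats (entries of ±FP4_LEVELS), flattened 1-D
    # block_scales: one FP8 (E4M3) scale per block of `block_size` elements
    # alpha:        tensor-wide FP32 scale
    scales = np.repeat(block_scales, block_size)   # broadcast Δ_j over its 16 elements
    return codes * alpha * scales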

2. Quantization and Dequantization Procedure

2.1 Standard NVFP4 Block Quantization

For tensor X:

  1. Global scale (optional):

\alpha = \frac{\max |X|}{M^{\mathrm{FP4}} M^{\mathrm{FP8}}}

with typical values M^FP4 = 6 and M^FP8 = 448.

  2. Blockwise scale: For block j,

\Delta_j = \frac{\max_{i \in \text{block}_j} |X_i|}{\alpha \cdot M^{\mathrm{FP4}}}

Δ_j is then quantized to FP8 E4M3.

  3. Element quantization: For X_i in block j, compute y = X_i / (α Δ_j) and round to the nearest representable FP4 level.
  4. Bit-packing: Store two 4-bit quantized values per byte plus one 8-bit scale per block. A worked numeric example follows this list.
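
To make the two-level scaling concrete, here is a small worked example with invented numbers (they are not taken from the cited papers):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# Step 1: global scale from an assumed tensor-wide absolute maximum.
tensor_absmax = 268.8
alpha = tensor_absmax / (6 * 448)            # -> 0.1

# Step 2: per-block scale from an assumed block absolute maximum.
block_absmax = 2.4
delta_j = block_absmax / (alpha * 6)         # -> 4.0 (then stored as FP8 E4M3)

# Step 3: quantize one element to the nearest FP4 level and dequantize it.
x = 1.1
y = x / (alpha * delta_j)                    # -> 2.75 in FP4 units
code = np.sign(y) * FP4_LEVELS[np.argmin(np.abs(FP4_LEVELS - abs(y)))]   # -> 3.0
x_hat = code * alpha * delta_j               # dequantized value -> 1.2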

2.2 Quantization Mapping

Given a real value x in a block with per-block scale s:

\hat{x} = Q_{\text{NVFP4}}(x; s) = s \cdot \mathrm{round}(\mathrm{clip}(x/s, q_{\min}, q_{\max}))

where clip enforces the codebook range, and “round” implements either nearest or stochastic rounding (the latter for unbiased gradient propagation in training).
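
A vectorized nearest-rounding version of this mapping might look as follows (a sketch assuming numpy arrays and ignoring the FP8 quantization of the scale itself; a stochastic-rounding variant is sketched under Section 4.2):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def q_nvfp4(x, s):
    # Scale, clip to the codebook range, round to the nearest E2M1 level, rescale.
    y = np.clip(x / s, -6.0, 6.0)
    idx = np.argmin(np.abs(np.abs(y)[..., None] - FP4_LEVELS), axis=-1)
    return s * np.sign(y) * FP4_LEVELS[idx]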

2.3 Block and Tensor Granularity

  • Block size is consistently B = 16 elements.
  • For weights: 2D 16×16 block scaling is typical, keeping weight quantization consistent between the forward and backward passes (NVIDIA et al., 29 Sep 2025).
  • Activations and gradients often use 1×16 block scaling. A granularity sketch follows this list.
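
The granularity sketch referenced above computes per-block absolute maxima for the two layouts (assuming dimensions divisible by 16; illustrative helper names, not library code):

import numpy as np

def block_absmax_1x16(A):
    # Per-block max|.| over 1×16 blocks along the last dimension (activations/gradients).
    r, c = A.shape
    return np.abs(A.reshape(r, c // 16, 16)).max(axis=-1)        # shape (r, c/16)

def block_absmax_16x16(W):
    # Per-block max|.| over 2D 16×16 tiles (weights), so one scale serves forward and backward.
    r, c = W.shape
    tiles = W.reshape(r // 16, 16, c // 16, 16).transpose(0, 2, 1, 3)
    return np.abs(tiles).max(axis=(-1, -2))                      # shape (r/16, c/16)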

3. Error Characteristics and Outlier Management

3.1 Quantization Noise

  • Static noise per weight:

\Delta\epsilon = \hat{W} - W = S^{FP32} S_b^{E4M3} \tilde{W} - W

  • Noise variance (uniform-error approximation): for step size q_b = S^{FP32} · S_b^{E4M3} · ULP_FP4,

\mathrm{Var}[\Delta \epsilon_i] \approx \frac{q_b^2}{12}

A small numerical check of this approximation follows this list. Blockwise empirical variance is measured as

\sigma_b^2 = \frac{1}{|b|} \sum_{i \in b} (\hat{W}_i - W_i)^2

  • Effect on downstream tasks: Quantization noise flattens the logits’ distribution, empirically increasing policy entropy in RL settings, which encourages greater exploration (Huang et al., 13 Oct 2025).
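
The numerical check referenced in the noise-variance bullet: when the step size is much smaller than the data’s spread, the rounding error is approximately uniform and its variance approaches q_b²/12. The toy check below uses a scalar-step uniform quantizer rather than the full non-uniform FP4 grid, so it only validates the approximation itself:

import numpy as np

rng = np.random.default_rng(0)
q_b = 0.25                          # illustrative effective step size S_fp32 * S_b * ULP_FP4
w = rng.normal(size=100_000)
w_hat = np.round(w / q_b) * q_b     # uniform quantizer with step q_b
err = w_hat - w
print(err.var(), q_b**2 / 12)       # both ≈ 0.0052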

3.2 Outlier Mitigation Strategies

  • Random Hadamard Transform (RHT): Used on gradient paths (especially the weight-gradient path) to spread activation outliers and reduce their impact on blockwise scaling (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025); see the sketch after this list.
  • OutControl (TetraJet-v2): Static identification of outlier channels (via ℓ2-norm ranking) routed to higher precision. The Random Hadamard Transform and higher-precision channels combine to reduce overall error (Chen et al., 31 Oct 2025).
  • FGMP Policy: Blocks with large Fisher information are retained in FP8, while non-critical blocks are quantized to NVFP4, determined by sensitivity-weighted error aggregation (Hooper et al., 19 Apr 2025).
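
The RHT sketch referenced in the first bullet applies a randomized 16×16 Hadamard rotation to each contiguous block (a minimal illustration; the cited works fuse this into the gradient-path matmul kernels rather than materializing it like this):

import numpy as np

def random_hadamard_16(x, seed=0):
    # Random diagonal sign flip followed by an orthonormal 16x16 Hadamard transform per block.
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = H2
    for _ in range(3):                  # Sylvester construction: 2 -> 4 -> 8 -> 16
        H = np.kron(H, H2)
    H /= np.sqrt(16.0)                  # make the rotation orthonormal
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=16)
    blocks = x.reshape(-1, 16) * signs
    return (blocks @ H).reshape(x.shape)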

Effectiveness Caveat

Standard transform-based outlier mitigation (e.g., random rotation) is not beneficial (and can be detrimental) at NVFP4’s block size G = 16: Lemma 2.1 of (Egiazarian et al., 27 Sep 2025) shows blockwise Hadamard rotations increase top-element MSE for these small blocks. Instead, MR-GPTQ exploits format-specific compensations to benefit from blockwise rotations only within a compensation loop.

4. Extensions: Four Over Six and Double-Block Scaling

4.1 Four Over Six (4/6) Adaptive Block Scaling

Large quantization errors in NVFP4 arise for values near the maximal codebook value (scaled values y ≈ 4–6). The Four Over Six algorithm, introduced in (Cook et al., 1 Dec 2025), addresses this by adaptively choosing between:

  • \Delta_j^{(6)} = \max|\mathbf{X}| / 6
  • \Delta_j^{(4)} = \max|\mathbf{X}| / 4

For each block, 4/6 quantizes both ways and selects the one yielding lower mean squared error (MSE) to the original values. This reduces large discretization jumps for near-maximal values, stabilizes training, and improves PTQ results by up to 19.9% gap reduction to BF16 under AWQ recipes, with minimal computational overhead and no extra memory.
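
At block level, the 4/6 selection rule can be sketched as follows (illustrative Python with invented names; the global FP32 scale, FP8 quantization of Δ_j, and zero-block edge cases are omitted for brevity):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block, delta):
    # Round block/delta to the nearest E2M1 level, then rescale.
    y = block / delta
    idx = np.argmin(np.abs(np.abs(y)[:, None] - FP4_LEVELS), axis=1)
    return np.sign(y) * FP4_LEVELS[idx] * delta

def four_over_six(block):
    # Quantize with both candidate scales and keep the one with the lower block MSE.
    absmax = np.abs(block).max()
    candidates = [absmax / 6.0, absmax / 4.0]
    recons = [quantize_block(block, d) for d in candidates]
    errors = [np.mean((r - block) ** 2) for r in recons]
    best = int(np.argmin(errors))
    return recons[best], candidates[best]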

4.2 Double-Block Quantization

TetraJet-v2 (Chen et al., 31 Oct 2025) introduces an unbiased double-block quantization method:

  • Outer block: 1×128
  • Micro-blocks: 1×16 within each outer block
  • Two-level scaling: outer global scale S_global and micro-block scale S_block

Deterministic rounding in forward and stochastic rounding in backward ensure unbiased gradients. This improves stability and narrows the perplexity gap to full precision, especially when coupled with oscillation and outlier controls.
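
The unbiased backward-path rounding can be illustrated with a standalone stochastic-rounding sketch onto the E2M1 grid (a simplification; TetraJet-v2’s actual kernels and the double-block scale handling are not reproduced here):

import numpy as np

FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(y, rng):
    # Round |y| (already scaled into the FP4 range) to a neighboring codebook level,
    # taking the upper level with probability proportional to proximity, so the
    # rounding is unbiased within the representable range.
    mag = np.clip(np.abs(y), 0.0, 6.0)
    hi = np.clip(np.searchsorted(FP4_LEVELS, mag), 1, len(FP4_LEVELS) - 1)
    lo = hi - 1
    low, high = FP4_LEVELS[lo], FP4_LEVELS[hi]
    p_up = (mag - low) / (high - low)
    up = rng.random(mag.shape) < p_up
    return np.sign(y) * np.where(up, high, low)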

5. Implementation: Algorithm, Pseudocode, and Hardware

5.1 Core Quantization/Dequantization

function NVFP4_Quantize(W: FP32 tensor, B=16):
    # Global FP32 scale: alpha = max|W| / (M_FP4 * M_FP8), with qmax_FP4 = 6, qmax_FP8 = 448
    S_fp32 = max(abs(W)) / (qmax_FP4 * qmax_FP8)
    for each block b in W (size B):
        # Per-block scale Delta_b = max|block| / (alpha * M_FP4), stored as FP8 E4M3
        S_E4M3[b] = FP8_E4M3( max(abs(block)) / (S_fp32 * qmax_FP4) )
        for element i in block:
            y = block[i] / (S_fp32 * S_E4M3[b])
            # Clamp into the FP4 range and round to the nearest E2M1 level
            tildeW[i] = round_to_nearest_FP4( clamp(y, -qmax_FP4, qmax_FP4) )
    packed = pack_nibbles(tildeW)
    return packed, S_fp32, S_E4M3

function NVFP4_Dequant(packed, S_fp32, S_E4M3, B=16):
    tildeW = unpack_nibbles(packed)
    for b in 0..len(S_E4M3)-1:
        # Rescale each block: x_hat = code * alpha * Delta_b
        tildeW[b*B:(b+1)*B] *= (S_fp32 * S_E4M3[b])
    return tildeW as FP32 tensor
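
The pack_nibbles / unpack_nibbles helpers above are left abstract; a plausible numpy packing of two 4-bit codes per byte (assuming the sign/E2M1 bit pattern has already been encoded as integers 0–15) could look like this:

import numpy as np

def pack_nibbles(codes4bit):
    # Pack pairs of 4-bit codes (integers 0..15) into single bytes, low nibble first.
    codes = np.asarray(codes4bit, dtype=np.uint8).reshape(-1, 2)
    return (codes[:, 0] | (codes[:, 1] << np.uint8(4))).astype(np.uint8)

def unpack_nibbles(packed):
    # Inverse of pack_nibbles: each byte yields two 4-bit codes.
    packed = np.asarray(packed, dtype=np.uint8)
    return np.stack([packed & np.uint8(0x0F), packed >> np.uint8(4)], axis=1).reshape(-1)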

5.2 GPU and Datapath Details

  • Custom CUDA/Triton kernels (e.g. Marlin, QuTLASS) fuse unpacking, scale application, and matmul, supporting efficient NVFP4 deployment on Blackwell SM100/SM120 Tensor Cores (Cook et al., 1 Dec 2025, Egiazarian et al., 27 Sep 2025). Fused epilogues apply quantization, scaling, and packing within the matrix-multiply kernels.
  • Two values per byte are packed for memory efficiency.
  • On-device per-block scale quantization leverages GPU “cvt.rn” instructions for FP32→FP8 and packed FP4 conversion.
  • FGMP hardware: PEs support mixed FP8/FP4 VMAC datapaths, with a lightweight post-processing unit for quantization-error-based block format selection (Hooper et al., 19 Apr 2025); a sketch of this selection policy follows this list.
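
A sketch of the FGMP selection policy referenced here and in Section 3.2, which keeps a block in FP8 when its sensitivity-weighted FP4 error is too large (hypothetical names and thresholding; the paper’s exact aggregation is not reproduced):

import numpy as np

def fgmp_select_format(w_block, w_block_fp4, fisher_diag, threshold):
    # w_block:      original high-precision block
    # w_block_fp4:  the block after NVFP4 quantize/dequantize
    # fisher_diag:  per-element Fisher-information estimates (sensitivity weights)
    # threshold:    calibration-chosen per-block error budget
    weighted_err = np.sum(fisher_diag * (w_block_fp4 - w_block) ** 2)
    return "FP8" if weighted_err > threshold else "NVFP4"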

6. Training, PTQ, and RL Integration

6.1 Fully Quantized Training

  • TetraJet-v2 (Chen et al., 31 Oct 2025): End-to-end 4-bit training with NVFP4 for all linear layer operands; employs unbiased double-block quantization and suppression techniques (OsciReset, OutControl) for near-lossless convergence.
  • Random Hadamard Transform (RHT): Applied only before the weight-gradient matmuls; it improves quantization statistics for heavy-tailed, outlier-dominated blocks (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025).
  • Selective High-Precision Layers: First/last FFN blocks and other sensitive operations are retained in BF16/FP32 for stability (NVIDIA et al., 29 Sep 2025).

6.2 Post-Training Quantization (PTQ) and Specialized Algorithms

  • Micro-Rotated-GPTQ (MR-GPTQ): Format-specialized GPTQ for NVFP4. Employs block-wise Hadamard fused rotations, static activation reordering, and MSE-optimized scale/grid search. Delivers state-of-the-art recovery (96–99%) and hardware-efficient deployment (Egiazarian et al., 27 Sep 2025).
  • Four Over Six (4/6): A drop-in for any NVFP4-based PTQ method (AWQ, GPTQ, RTN, SmoothQuant), consistently reducing the BF16-to-NVFP4 performance gap with negligible runtime overhead (Cook et al., 1 Dec 2025).

6.3 LoRA, RL, and Exploration

7. Empirical Benchmarks and Real-World Performance

7.1 Model Performance and Scaling

A selection of key results (metric values exactly as reported):

| Scenario | Method/Model | Reported result (perplexity / accuracy / memory / speedup) |
| --- | --- | --- |
| Memory use (7B) | BF16 LoRA | ≈15.2 GB |
| Memory use (7B) | QLoRA (NF4+LoRA) | ≈5.7 GB (37%) |
| Memory use (7B) | QeRL (NVFP4+LoRA) | ≈5.9 GB (39%) |
| GSM8K, QeRL RL final accuracy | BF16 LoRA | 88.1% |
| GSM8K, QeRL RL final accuracy | QLoRA | 85.0% |
| GSM8K, QeRL RL final accuracy | NVFP4+LoRA, no AQN | 88.5% |
| GSM8K, QeRL RL final accuracy | NVFP4+LoRA+AQN | 90.8% |
| RL rollout throughput (14B) | BF16 LoRA | ≈65 tokens/s |
| RL rollout throughput (14B) | NVFP4 QeRL | ≈95 tokens/s (1.46×) |
| Pretraining (12B, 10T tokens) | NVFP4 | Validation loss within 1–1.5% of FP8 |
| PTQ, Wikitext-103 (7B) | FP8 | 5.06 |
| PTQ, Wikitext-103 (7B) | NVFP4 | 5.18 |
| PTQ, Wikitext-103 (7B) | FGMP (70% NVFP4) | 5.11 |
| Downstream zero-shot, Llama-3.1-8B-Instruct | FP16 | 78.93 (average) |
| Downstream zero-shot, Llama-3.1-8B-Instruct | NVFP4 MR-GPTQ | 75.84 (96.1% recovery) |
| Hardware efficiency | NVFP4 vs. MXFP8 | 0.55× energy/area |

7.2 Hardware and Efficiency

8. Limitations, Comparisons, and Recommendations

8.1 Trade-offs and Context

8.2 Practical Deployment

  • Employ NVFP4 (G=16, E4M3 per-block scale) with MR-GPTQ for top accuracy and hardware compatibility.
  • Integrate Four Over Six for enhanced robustness in PTQ and training, especially on models exhibiting heavy-tailed block distributions.
  • For RL, use NVFP4+AQN to combine memory/computation gains with improved exploration.
  • Retain ≈10–15% of layers in higher precision for critical network operations.
  • Avoid generic outlier mitigation transforms for NVFP4; rely on static outlier retention policies and MR-GPTQ’s internal Hadamard compensation.

9. References
