NVFP4 Quantization Algorithm Overview
- NVFP4 is a 4-bit floating-point quantization algorithm that uses a hybrid scheme combining per-block FP8 scaling with FP4 values to reduce memory usage and boost computation speed.
- It employs blockwise quantization with a global FP32 scale and per-block FP8 scales, achieving significant throughput improvements and minimal accuracy loss.
- Advanced techniques like Four Over Six adaptive scaling and MR-GPTQ error compensation help mitigate quantization noise, ensuring stable training and inference performance.
NVIDIA FP4 (NVFP4) Quantization Algorithm
NVIDIA FP4 (NVFP4) is a hardware-supported, fine-grained, blockwise 4-bit floating-point (FP) quantization format designed for both efficient training and inference of LLMs on modern accelerators, particularly NVIDIA Blackwell GPUs. NVFP4 employs a hybrid numerical format, combining per-block FP8 “microscaling” with E2M1 FP4 values, delivering substantial memory reduction and compute speedup while maintaining minimal accuracy degradation when applied with appropriate quantization strategies and algorithmic safeguards.
1. NVFP4 Number Representation and Scaling Structure
NVFP4 encodes tensor elements using the following scheme:
- FP4 Value (E2M1): Each element is stored as a 4-bit value with 1 sign bit, 2 exponent bits (exponent bias 1), and 1 mantissa bit. The codebook of representable values is $\{0, \pm 0.5, \pm 1, \pm 1.5, \pm 2, \pm 3, \pm 4, \pm 6\}$ (a decoding sketch follows this list).
- Per-Block FP8 E4M3 Scale: Each group of 16 elements (“block”) shares an FP8 (E4M3: 4 exponent bits, 3 mantissa bits) scaling factor $s_b$, with dynamic range up to $\pm 448$.
- Per-Tensor Global FP32 Scale: Some implementations apply an additional global FP32 scale, especially for global normalization or ultra-high dynamic range.
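To make the E2M1 encoding concrete, here is a minimal decoding sketch (plain Python; the function name and bit-layout conventions are illustrative, not taken from any referenced kernel):

```python
# Illustrative E2M1 decode: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
def decode_e2m1(code: int) -> float:
    """Map a 4-bit code (0..15) to its real E2M1 value."""
    sign = -1.0 if (code & 0b1000) else 1.0
    exp = (code >> 1) & 0b11      # 2 exponent bits
    man = code & 0b1              # 1 mantissa bit
    if exp == 0:                  # subnormal: man * 0.5
        mag = man * 0.5
    else:                         # normal: (1 + man/2) * 2^(exp - bias), bias = 1
        mag = (1.0 + man / 2.0) * 2.0 ** (exp - 1)
    return sign * mag

# Recovers the codebook magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
print(sorted({abs(decode_e2m1(c)) for c in range(16)}))
```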
The quantized value for an element $w_i$ in block $b$ is
$$\tilde{w}_i = \mathrm{round}\!\left(\frac{w_i}{s_{\mathrm{global}}\, s_b}\right),$$
where $s_{\mathrm{global}}$ is the tensor-wide scale and $s_b$ is the block’s FP8 scale. The corresponding real value after dequantization is
$$\hat{w}_i = s_{\mathrm{global}}\, s_b\, \tilde{w}_i.$$
Bit-packed storage places two 4-bit values in each byte. The layout is optimized for rapid access by custom CUDA/Triton kernels and Blackwell tensor core support (Cook et al., 1 Dec 2025, NVIDIA et al., 29 Sep 2025, Hooper et al., 19 Apr 2025, Chen et al., 29 Oct 2025, Chen et al., 31 Oct 2025, Egiazarian et al., 27 Sep 2025).
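A minimal sketch of the two-codes-per-byte packing (NumPy; the low-nibble-first layout is an assumption for illustration, and production kernels may order nibbles differently):

```python
import numpy as np

def pack_nibbles(codes: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes (values 0..15) into bytes, low nibble first."""
    codes = codes.astype(np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def unpack_nibbles(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_nibbles."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

codes = np.array([1, 7, 0, 15], dtype=np.uint8)
assert np.array_equal(unpack_nibbles(pack_nibbles(codes)), codes)
```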
2. Quantization and Dequantization Procedure
2.1 Standard NVFP4 Block Quantization
For a tensor $W$:
- Global scale (optional): $s_{\mathrm{global}} = \dfrac{\max_i |w_i|}{q_{\max}^{\mathrm{FP4}} \cdot q_{\max}^{\mathrm{FP8}}}$, with typical $q_{\max}^{\mathrm{FP4}} = 6$ and $q_{\max}^{\mathrm{FP8}} = 448$.
- Blockwise scale: For block $b$, $s_b = \dfrac{\max_{i \in b} |w_i|}{q_{\max}^{\mathrm{FP4}} \cdot s_{\mathrm{global}}}$; $s_b$ is then quantized to FP8 E4M3.
- Element quantization: For $w_i$ in block $b$, compute $w_i / (s_{\mathrm{global}}\, s_b)$ and round to the nearest representable FP4 level.
- Bit-packing: Store two 4-bit quantized values per byte plus one 8-bit scale per block.
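As a worked sketch of the scale computation in the procedure above (NumPy; the FP8 E4M3 cast of the per-block scale is omitted for clarity):

```python
import numpy as np

Q4, Q8 = 6.0, 448.0                                 # max E2M1 and E4M3 magnitudes
W = np.random.default_rng(0).standard_normal(32).astype(np.float32)  # two 16-element blocks

s_global = np.abs(W).max() / (Q4 * Q8)              # per-tensor FP32 scale
blocks = W.reshape(-1, 16)
s_block = np.abs(blocks).max(axis=1) / (Q4 * s_global)  # per-block scales (cast to E4M3 in practice)

print(s_global, s_block)                            # each s_block entry fits the E4M3 range (<= 448)
```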
2.2 Quantization Mapping
Given a real value $w_i$ in block $b$ with per-block scale $s_b$:
$$\tilde{w}_i = \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{w_i}{s_{\mathrm{global}}\, s_b}\right),\, -6,\, 6\right),$$
where clamp enforces the codebook range and “round” implements either nearest or stochastic rounding (the latter for unbiased gradient propagation in training).
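A sketch of this element-level mapping, including the optional stochastic-rounding variant (NumPy; `LEVELS` hard-codes the non-negative E2M1 codebook, and the tie-breaking rule here is illustrative rather than the hardware’s round-to-nearest-even):

```python
import numpy as np

LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # non-negative E2M1 levels

def quantize_to_fp4(x: np.ndarray, stochastic: bool = False, rng=None) -> np.ndarray:
    """Map already-scaled values x = w / (s_global * s_b) onto the E2M1 grid."""
    x = np.clip(np.asarray(x, dtype=np.float64), -6.0, 6.0)
    sign, mag = np.sign(x), np.abs(x)
    hi = np.clip(np.searchsorted(LEVELS, mag), 1, len(LEVELS) - 1)  # nearest level at or above mag
    lo = hi - 1
    low, high = LEVELS[lo], LEVELS[hi]
    if stochastic:       # round up with probability proportional to position in the interval
        p_up = (mag - low) / np.maximum(high - low, 1e-12)
        up = (rng or np.random.default_rng()).random(mag.shape) < p_up
    else:                # nearest rounding (ties resolved downward in this sketch)
        up = (high - mag) < (mag - low)
    return sign * np.where(up, high, low)

print(quantize_to_fp4(np.array([0.3, 2.4, 5.1, -4.9])))   # [0.5, 2.0, 6.0, -4.0]
```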
2.3 Block and Tensor Granularity
- Block size is consistently $B = 16$ elements.
- For weights: 2D 16×16 block scaling is typical, providing forward–backward quantization chain-rule consistency (NVIDIA et al., 29 Sep 2025).
- Activations and gradients often use 1×16 block scaling (see the sketch after this list).
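A small sketch of the two blocking granularities (NumPy; scales are computed relative to the FP4 maximum only, omitting the global scale for brevity):

```python
import numpy as np

W = np.random.default_rng(0).standard_normal((128, 256)).astype(np.float32)

# 1x16 blocking (activations/gradients): one scale per 16 consecutive elements of each row.
row_blocks = W.reshape(W.shape[0], -1, 16)
scales_1x16 = np.abs(row_blocks).max(axis=-1) / 6.0

# 16x16 blocking (weights): one scale per 16x16 tile, shared by forward and backward views.
tiles = W.reshape(W.shape[0] // 16, 16, W.shape[1] // 16, 16)
scales_16x16 = np.abs(tiles).max(axis=(1, 3)) / 6.0

print(scales_1x16.shape, scales_16x16.shape)   # (128, 16) and (8, 16)
```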
3. Error Characteristics and Outlier Management
3.1 Quantization Noise
- Static noise per weight: $\varepsilon_i = \hat{w}_i - w_i$.
- Noise variance (under the uniform error approximation): for step size $\Delta$, $\mathrm{Var}(\varepsilon) \approx \Delta^2 / 12$. Blockwise empirical variance is measured as $\hat{\sigma}_b^2 = \frac{1}{B}\sum_{i \in b} (\hat{w}_i - w_i)^2$ (a numerical check follows this list).
- Effect on downstream tasks: Quantization noise flattens the logits’ distribution, empirically increasing policy entropy in RL settings, which encourages greater exploration (Huang et al., 13 Oct 2025).
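The $\Delta^2/12$ approximation referenced above can be checked numerically with a plain uniform quantizer (a self-contained NumPy sketch, not tied to any particular NVFP4 implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000).astype(np.float32)

delta = 0.05                              # illustrative step size
w_hat = np.round(w / delta) * delta       # uniform quantizer with step delta
err = w_hat - w

print(err.var())          # empirical noise variance
print(delta**2 / 12)      # uniform-error prediction: Delta^2 / 12
```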
3.2 Outlier Mitigation Strategies
- Random Hadamard Transform (RHT): Used on gradient paths (esp. weight-gradient) to spread activation outliers and reduce the impact on blockwise scaling (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025); a sketch follows this list.
- OutControl (TetraJet-v2): Static identification of outlier channels (via norm ranking) routed to higher precision. Random Hadamard Transform and higher-precision channels combine to reduce overall error (Chen et al., 31 Oct 2025).
- FGMP Policy: Blocks with large Fisher information are retained in FP8, while non-critical blocks are quantized to NVFP4, determined by sensitivity-weighted error aggregation (Hooper et al., 19 Apr 2025).
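A minimal sketch of a blockwise random Hadamard transform (NumPy; the Sylvester construction, sign randomization, and 16-element block size are illustrative assumptions):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def random_hadamard_transform(x: np.ndarray, rng, block: int = 16) -> np.ndarray:
    """Apply a sign-randomized, orthonormal Hadamard rotation to each block of x."""
    H = hadamard(block) / np.sqrt(block)            # orthonormal rotation
    signs = rng.choice([-1.0, 1.0], size=block)     # random diagonal sign flips
    rotated = (x.reshape(-1, block) * signs) @ H.T
    return rotated.reshape(x.shape)

rng = np.random.default_rng(0)
x = np.zeros(16); x[3] = 10.0                            # a single large outlier...
print(np.abs(random_hadamard_transform(x, rng)).max())   # ...is spread to magnitude 2.5
```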
Effectiveness Caveat
Standard transform-based outlier mitigation (e.g., random rotation) is not beneficial (and can be detrimental) at NVFP4’s block size $B = 16$: Lemma 2.1 of (Egiazarian et al., 27 Sep 2025) shows blockwise Hadamard rotations increase top-element MSE for these small blocks. Instead, MR-GPTQ exploits format-specific compensations to benefit from blockwise rotations only within a compensation loop.
4. Extensions: Four Over Six and Double-Block Scaling
4.1 Four Over Six (4/6) Adaptive Block Scaling
Large quantization errors in NVFP4 arise for values near the maximal codebook value ($\pm 6$ after scaling). The Four Over Six algorithm, introduced in (Cook et al., 1 Dec 2025), addresses this by adaptively choosing, per block, between a scale that maps the block maximum to the top FP4 level (6) and one that maps it to the next level (4).
For each block, 4/6 quantizes both ways and selects the one yielding lower mean squared error (MSE) to the original values. This reduces large discretization jumps for near-maximal values, stabilizes training, and improves PTQ results by up to 19.9% gap reduction to BF16 under AWQ recipes, with minimal computational overhead and no extra memory.
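A hedged sketch of the per-block 4-vs-6 selection described above (NumPy; `nearest_fp4` is an illustrative helper, and the real algorithm additionally accounts for the FP8 cast of the selected scale):

```python
import numpy as np

LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # non-negative E2M1 levels

def nearest_fp4(x: np.ndarray) -> np.ndarray:
    """Round already-scaled values to the nearest E2M1 level."""
    idx = np.abs(np.abs(x)[:, None] - LEVELS[None, :]).argmin(axis=1)
    return np.sign(x) * LEVELS[idx]

def four_over_six_block(block: np.ndarray):
    """Quantize one block twice (max -> 6 and max -> 4) and keep the lower-MSE variant."""
    best = None
    for target in (6.0, 4.0):
        scale = np.abs(block).max() / target
        deq = nearest_fp4(block / scale) * scale
        mse = float(np.mean((deq - block) ** 2))
        if best is None or mse < best[0]:
            best = (mse, deq, scale, target)
    return best

block = np.random.default_rng(0).standard_normal(16).astype(np.float32)
mse, deq, scale, target = four_over_six_block(block)
print(f"chose max->{target}, MSE={mse:.5f}")
```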
4.2 Double-Block Quantization
TetraJet-v2 (Chen et al., 31 Oct 2025) introduces an unbiased double-block quantization method:
- Outer blocks: the tensor is partitioned into larger outer blocks, each carrying a higher-precision outer scale.
- Micro-blocks: 16-element micro-blocks within each outer block, each with its own scale.
- Two-level scaling: the outer scale and the per-micro-block scale multiply to dequantize each element.
Deterministic rounding in forward and stochastic rounding in backward ensure unbiased gradients. This improves stability and narrows the perplexity gap to full precision, especially when coupled with oscillation and outlier controls.
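The unbiasedness of the stochastic rounding used in the backward pass can be checked directly (a self-contained sketch on a uniform grid; the actual TetraJet-v2 rounding operates on the scaled FP4 grid with its double-block scales):

```python
import numpy as np

def stochastic_round(x: np.ndarray, delta: float, rng) -> np.ndarray:
    """Round x to the grid {k * delta} so that E[result] == x (unbiased)."""
    lo = np.floor(x / delta) * delta
    p_up = (x - lo) / delta                     # probability of rounding up
    return lo + delta * (rng.random(x.shape) < p_up)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.37)
print(stochastic_round(x, 0.25, rng).mean())    # ~0.37: matches the input in expectation
```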
5. Implementation: Algorithm, Pseudocode, and Hardware
5.1 Core Quantization/Dequantization
Pseudocode for Standard NVFP4 (see also (Huang et al., 13 Oct 2025, Chen et al., 31 Oct 2025, Cook et al., 1 Dec 2025)):
```
function NVFP4_Quantize(W: FP32 tensor, B = 16):
    # Per-tensor scale chosen so that per-block scales fit the E4M3 range.
    S_fp32 = max(abs(W)) / (qmax_FP4 * qmax_E4M3)      # qmax_FP4 = 6, qmax_E4M3 = 448
    for each block b in W (size B):
        S_E4M3[b] = FP8_E4M3( max(abs(block)) / (qmax_FP4 * S_fp32) )
        for element i in block:
            code = round( block[i] / (S_fp32 * S_E4M3[b]) )
            tildeW[i] = clamp( code, -qmax_FP4, qmax_FP4 )
    packed = pack_nibbles(tildeW)      # two 4-bit codes per byte
    return packed, S_fp32, S_E4M3

function NVFP4_Dequant(packed, S_fp32, S_E4M3):
    tildeW = unpack_nibbles(packed)
    for b in 0 .. len(S_E4M3) - 1:
        tildeW[b*B : (b+1)*B] *= (S_fp32 * S_E4M3[b])
    return tildeW as FP32 tensor
```
5.2 GPU and Datapath Details
- Custom CUDA/Triton kernels (e.g. Marlin, QuTLASS) fuse unpacking, scale application, and matmul, and support efficient NVFP4 deployment on Blackwell SM100/SM120 Tensor Cores (Cook et al., 1 Dec 2025, Egiazarian et al., 27 Sep 2025). Fused epilogues apply quantization, scaling, and packing within the matrix-multiply kernels.
- Two values per byte are packed for memory efficiency.
- On-device per-block scale quantization leverages GPU “cvt.rn” instructions for FP32→FP8 and packed FP4 conversion.
- FGMP hardware: PEs support mixed FP8/FP4 VMAC datapaths, and a lightweight post-processing unit for quantization error-based block format selection (Hooper et al., 19 Apr 2025).
6. Training, PTQ, and RL Integration
6.1 Fully Quantized Training
- TetraJet-v2 (Chen et al., 31 Oct 2025): End-to-end 4-bit training with NVFP4 for all linear layer operands; employs unbiased double-block quantization and suppression techniques (OsciReset, OutControl) for near-lossless convergence.
- Random Hadamard Transform (RHT): Applied before weight-gradient matmuls only, improves quantization statistics for heavy-tailed outlier-dominated blocks (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025).
- Selective High-Precision Layers: First/last FFN blocks and other sensitive operations are retained in BF16/FP32 for stability (NVIDIA et al., 29 Sep 2025).
6.2 Post-Training Quantization (PTQ) and Specialized Algorithms
- Micro-Rotated-GPTQ (MR-GPTQ): Format-specialized GPTQ for NVFP4. Employs block-wise Hadamard fused rotations, static activation reordering, and MSE-optimized scale/grid search. Delivers state-of-the-art recovery (96–99%) and hardware-efficient deployment (Egiazarian et al., 27 Sep 2025).
- Four Over Six (4/6): Drop-in for any NVFP4-based PTQ method (AWQ, GPTQ, RTN, SmoothQuant), consistently reducing the BF16-to-NVFP4 performance gap with negligible runtime overhead (Cook et al., 1 Dec 2025).
6.3 LoRA, RL, and Exploration
- QeRL (Huang et al., 13 Oct 2025): Combines NVFP4-quantized backbone with LoRA adapters in full precision, exploiting quantization noise to raise policy entropy. Adaptive Quantization Noise (AQN) further enhances RL exploration by scheduling dynamic Gaussian noise into LayerNorm scaling, accelerating reward discovery.
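A loosely hedged sketch of the AQN idea (NumPy; the linear noise schedule, noise magnitude, and folding the perturbation into the LayerNorm scale are assumptions for illustration, not the exact QeRL recipe):

```python
import numpy as np

def aqn_layernorm_scale(gamma: np.ndarray, step: int, total_steps: int,
                        sigma_start: float = 0.05, sigma_end: float = 0.0,
                        rng=None) -> np.ndarray:
    """Perturb the LayerNorm scale with scheduled Gaussian noise to encourage exploration."""
    rng = rng or np.random.default_rng()
    frac = step / max(total_steps, 1)
    sigma = sigma_start + (sigma_end - sigma_start) * frac      # assumed linear decay
    return gamma * (1.0 + rng.normal(0.0, sigma, size=gamma.shape))

gamma = np.ones(4096, dtype=np.float32)
noisy_gamma = aqn_layernorm_scale(gamma, step=100, total_steps=10_000)
```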
7. Empirical Benchmarks and Real-World Performance
7.1 Model Performance and Scaling
A selection of key results (metric values exactly as reported):
| Scenario | Method / Model | Reported metric |
|---|---|---|
| Memory use (7B) | BF16 LoRA | ≈15.2 GB |
| | QLoRA (NF4+LoRA) | ≈5.7 GB (37%) |
| | QeRL (NVFP4+LoRA) | ≈5.9 GB (39%) |
| GSM8K, QeRL RL final accuracy | BF16 LoRA | 88.1% |
| | QLoRA | 85.0% |
| | NVFP4+LoRA, no AQN | 88.5% |
| | NVFP4+LoRA+AQN | 90.8% |
| RL rollout throughput | BF16 LoRA (14B) | ≈65 tokens/s |
| | NVFP4 QeRL (14B) | ≈95 tokens/s (1.46×) |
| Pretraining (12B, 10T tokens) | NVFP4 | Validation loss within 1–1.5% of FP8 |
| PTQ, Wikitext-103 perplexity (7B) | FP8 | 5.06 |
| | NVFP4 | 5.18 |
| | FGMP, 70% NVFP4 | 5.11 |
| Downstream zero-shot, Llama-3.1-8B-Instruct | FP16 | 78.93 (average) |
| | NVFP4 MR-GPTQ | 75.84 (96.1% recovery) |
| Hardware efficiency | NVFP4 vs MXFP8 | 0.55× energy/area |
7.2 Hardware and Efficiency
- NVFP4 delivers up to 3× higher throughput and 2× lower memory cost versus FP8 microscaling, with ~4× storage and MAC energy reduction compared to BF16 (Chen et al., 31 Oct 2025, Chen et al., 29 Oct 2025, Hooper et al., 19 Apr 2025).
- No additional memory overhead for advanced scaling (e.g. 4/6 adaptive) (Cook et al., 1 Dec 2025).
8. Limitations, Comparisons, and Recommendations
8.1 Trade-offs and Context
- At block size 16, blockwise floating-point (NVFP4) often outperforms INT4 (NVINT4) on unrotated quantization, but with Hadamard rotation NVINT4 can surpass NVFP4 (Chen et al., 29 Oct 2025, Egiazarian et al., 27 Sep 2025).
- NVFP4 is not universally superior to MXFP4/INT4; format-specific algorithms and scale search are essential for optimal results.
8.2 Practical Deployment
- Employ NVFP4 (G=16, E4M3 per-block scale) with MR-GPTQ for top accuracy and hardware compatibility.
- Integrate Four Over Six for enhanced robustness in PTQ and training, especially on models exhibiting heavy-tailed block distributions.
- For RL, use NVFP4+AQN to combine memory/computation gains with improved exploration.
- Retain ≈10–15% of layers in higher precision for critical network operations.
- Avoid generic outlier mitigation transforms for NVFP4; rely on static outlier retention policies and MR-GPTQ’s internal Hadamard compensation.
9. References
- QeRL: Quantization-enhanced RL for LLMs (Huang et al., 13 Oct 2025)
- TetraJet-v2: NVFP4 training/oscillation/outlier control (Chen et al., 31 Oct 2025)
- Four Over Six: Adaptive NVFP4 scaling (Cook et al., 1 Dec 2025)
- Pretraining LLMs with NVFP4 (NVIDIA et al., 29 Sep 2025)
- FGMP: Mixed-precision quantization, hardware datapath (Hooper et al., 19 Apr 2025)
- INT vs. FP: Quantization format paper (Chen et al., 29 Oct 2025)
- Micro-Rotated-GPTQ for NVFP4 (Egiazarian et al., 27 Sep 2025)