NVFP4: 4-Bit Block-Scaled FP Format
- NVFP4 is a 4-bit block-scaled floating-point format that partitions data into blocks with shared quantized scale factors, enabling efficient low-precision neural network operations.
- Its design introduces nonuniform quantization error: small-magnitude values are finely resolved, while near-maximal values incur larger absolute errors, so blocks containing outliers dominate the aggregate quantization loss.
- Adaptive extensions like IF4, MixFP4, and power-of-two grids enhance stability and accuracy, while native hardware support on NVIDIA GPUs optimizes training and inference efficiency.
NVFP4 is a 4-bit block-scaled floating-point format, widely adopted as the numeric foundation for low-precision deep learning on NVIDIA Blackwell GPUs and related hardware. It is engineered to deliver extreme arithmetic and memory efficiency while preserving the representational fidelity required for training and inference of large-scale neural networks, especially LLMs. The NVFP4 design employs block-local floating-point scaling, enabling native hardware support for full W4A4 (4-bit weights, 4-bit activations) quantization pipelines. Despite these advantages, the error geometry and nonuniformity of NVFP4 quantization have motivated recent advances in adaptive data types, grid mixing, and quantization-aware training strategies.
1. NVFP4 Format Specification and Arithmetic
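NVFP4 is an E2M1 “FP4” microformat: a 4-bit floating-point encoding with 1 sign bit, 2 exponent bits, and 1 mantissa bit. To extend its dynamic range, it partitions weights and activations into blocks of 16 elements; each block is multiplied by a shared, quantized 8-bit FP8 (E4M3) scale factor and by a tensor-wide FP32 scalar. The dequantization path is thus

$$\hat{x}_i = q_i \cdot s_b \cdot s_g,$$

where $q_i$ is the signed 4-bit E2M1 value, $s_b$ is the block scale (E4M3, bias 7), and $s_g$ is the global FP32 tensor scale (Cook et al., 30 Mar 2026). The FP4 data path is decoded as

$$q = (-1)^{s} \cdot \begin{cases} 2^{\,e-1}\left(1 + \tfrac{m}{2}\right), & e > 0, \\ \tfrac{m}{2}, & e = 0, \end{cases}$$

with sign bit $s$, 2-bit exponent field $e$, and mantissa bit $m$, which yields the magnitudes $\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$.

Significantly, the block scale is derived by normalizing the block maximum:

$$s_b = \frac{\max_{i \in b} |x_i|}{q_{\max} \, s_g},$$

with $q_{\max} = 6$ for E2M1. This maps the highest-magnitude value in each block to the top of the FP4 codebook (up to rounding of the scale), so an outlier in one block does not inflate the scale of every other block. All block scales are rounded to the nearest E4M3 value.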
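The format described above can be made concrete with a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names (`decode_e2m1`, `quantize_block`, `dequantize_block`) are hypothetical, the bit layout assumed for a code is sign, exponent, mantissa from most to least significant bit, ties are broken toward the smaller grid value rather than by round-to-nearest-even, and values that exceed the grid after scale rounding simply saturate at 6.

```python
import math

# Non-negative magnitudes representable by E2M1.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
Q_MAX = 6.0


def decode_e2m1(code):
    """Decode a 4-bit code, assuming bit 3 = sign, bits 2-1 = exponent, bit 0 = mantissa."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:  # subnormal: 0 or 0.5
        return sign * 0.5 * man
    return sign * 2.0 ** (exp - 1) * (1.0 + 0.5 * man)


def e4m3_values():
    """All non-negative finite E4M3 values (bias 7, largest value 448)."""
    vals = []
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # this bit pattern encodes NaN
            if exp == 0:
                vals.append(man / 8.0 * 2.0 ** -6)  # subnormals
            else:
                vals.append((1.0 + man / 8.0) * 2.0 ** (exp - 7))
    return vals


E4M3 = e4m3_values()


def nearest(grid, x):
    """Nearest grid point; ties go to the earlier (smaller) entry."""
    return min(grid, key=lambda g: abs(g - x))


def quantize_block(block, global_scale=1.0):
    """Return (E2M1 values, E4M3 block scale) for one block of up to 16 elements."""
    amax = max(abs(v) for v in block)
    scale = nearest(E4M3, amax / (Q_MAX * global_scale))
    if scale == 0.0:
        return [0.0] * len(block), 0.0
    step = scale * global_scale
    q = [math.copysign(nearest(E2M1_GRID, abs(v) / step), v) for v in block]
    return q, scale


def dequantize_block(q, scale, global_scale=1.0):
    """x_hat_i = q_i * s_b * s_g."""
    return [qi * scale * global_scale for qi in q]


block = [3.0] + [0.25] * 15
q, s_b = quantize_block(block)
x_hat = dequantize_block(q, s_b)
```

In this example the block maximum 3.0 gives an ideal scale of 0.5, which E4M3 represents exactly, so the maximum maps to the code value 6 and is recovered without error.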
2. Quantization Error and Its Distribution
While block-wise scaling in NVFP4 increases representational fidelity, it introduces a highly non-uniform quantization error profile. Small-magnitude values enjoy fine quantization (the FP4 step near zero is $0.5$), but near-maximal values sustain disproportionate absolute error (FP4’s largest step is $2$, between the code values 4 and 6) (Cook et al., 30 Mar 2026). Empirically, this results in a “long tail” of high-MSE blocks, dominated by blocks containing outlier values. Explicitly, for block $b$, the empirical MSE is:

$$\mathrm{MSE}_b = \frac{1}{16} \sum_{i \in b} \left(x_i - \hat{x}_i\right)^2,$$

where $\hat{x}_i$ is the dequantized value of $x_i$.
Blocks with outliers account for most of the aggregate quantization loss.
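The step sizes and the outlier effect can be checked numerically. The following sketch is illustrative only: `fake_quant` and `block_mse` are hypothetical helper names, the block scale is kept as an exact float (no E4M3 rounding), and the two example blocks are made up for demonstration.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def grid_steps(grid):
    """Gaps between consecutive grid magnitudes."""
    return [b - a for a, b in zip(grid, grid[1:])]


def fake_quant(block):
    """Quantize then dequantize one block with an unrounded max-based scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0
    out = []
    for v in block:
        q = min(E2M1_GRID, key=lambda g: abs(g - abs(v) / scale))
        out.append(math.copysign(q * scale, v))
    return out


def block_mse(block):
    """Mean squared error between a block and its quantized reconstruction."""
    x_hat = fake_quant(block)
    return sum((a - b) ** 2 for a, b in zip(block, x_hat)) / len(block)


steps = grid_steps(E2M1_GRID)  # [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 2.0]

smooth = [0.1 * i for i in range(1, 17)]  # magnitudes 0.1 .. 1.6
with_outlier = smooth[:-1] + [16.0]       # same block, one large outlier

mse_smooth = block_mse(smooth)
mse_outlier = block_mse(with_outlier)
```

The outlier itself is reconstructed exactly, but it stretches the scale so that the remaining fifteen values fall onto only the lowest grid points, and `mse_outlier` ends up roughly an order of magnitude above `mse_smooth`.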
3. Adaptive and Mixed-Grid Extensions
Several adaptive block-scaling schemes improve on baseline NVFP4 by dynamically selecting quantization grids per block, reducing systemic error concentration:
- IF4 (Int/Float 4): For each block, selects between E2M1 FP4 and signed INT4 (uniform grid), using block MSE to choose the representation. The E4M3 scale's sign bit is repurposed as a mode indicator, incurring zero metadata overhead. Dequantization automatically applies a fixed rescaling factor to INT4 blocks. IF4 achieves lower pretraining loss and higher PTQ accuracy than standard NVFP4 and reduces block MSE, for example on $\mathcal{N}(0,1)$-distributed data (Cook et al., 30 Mar 2026).
- MixFP4: Generalizes to mixed microformats: for each block, selects between E2M1 (FP4) and E1M2 (INT-like FP4), again storing the mode bit in the E4M3 scale’s sign. Both formats are decoded to E2M2 as the internal compute representation (Zou et al., 29 May 2026). MixFP4 improves quantization robustness and accuracy, yielding best-in-class perplexity on LLMs at a tensor-core overhead of only ~3% area and 1.5% power.
- Power-of-Two Grids (SFP4/PO2): Permits selection among multiple shifted or asymmetric FP4 grids, storing the format choice in the top bits of the scale byte. Up to three grids can be supported (e.g., centered and ±0.5 shifted), with each block picking the argmin-MSE grid (Egiazarian et al., 12 May 2026). This approach reduces blockwise MSE by 11–21% relative to pure NVFP4 and approaches the theoretical lower bound for per-block error with modest per-block metadata overhead.
- "Four Over Six" (4/6): Instead of always scaling blocks to fit max magnitude to FP4’s 6, the 4/6 method checks both scale-to-6 and scale-to-4 quantizations, selecting the lower-error configuration per block (Cook et al., 1 Dec 2025). This simple adaptive rule curbs error for near-maximal values and dramatically improves pretraining stability and PTQ accuracy.
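The per-block selection rule shared by these schemes can be sketched for the simplest case, "Four Over Six". This is a hedged illustration of the idea as described above, not the authors' implementation: the names `quantize_to` and `four_over_six` are hypothetical, scales are kept as exact floats rather than rounded to E4M3, and ties are resolved in favor of the standard scale-to-6 choice.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_to(block, target_max):
    """Scale the block so its max magnitude maps to target_max, then round to the grid."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max
    out = []
    for v in block:
        q = min(E2M1_GRID, key=lambda g: abs(g - abs(v) / scale))
        out.append(math.copysign(q * scale, v))
    return out


def mse(block, x_hat):
    return sum((a - b) ** 2 for a, b in zip(block, x_hat)) / len(block)


def four_over_six(block):
    """Try scale-to-6 and scale-to-4 and keep the candidate with lower block MSE."""
    candidates = {6.0: quantize_to(block, 6.0), 4.0: quantize_to(block, 4.0)}
    best = min(candidates, key=lambda t: mse(block, candidates[t]))
    return best, candidates[best]


# 0.75 lands between the code values 4 and 6 under scale-to-6,
# but exactly on the code value 3 under scale-to-4.
block = [1.0, 0.75] + [0.0] * 14
target, x_hat = four_over_six(block)
```

Scaling to 4 leaves the code value 6 unused, but it places near-maximal inputs on the denser part of the grid, which is where the standard rule loses the most accuracy.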
4. Quantization-Aware Training, Distillation, and Error Mitigation
Achieving high accuracy from NVFP4 quantization requires both improved quantizers and quantization-aware training or recovery methods:
- Stochastic Rounding (SR) and MS-EDEN: NVFP4-based training recipes employ stochastic rounding for unbiased gradient estimation, and random Hadamard transforms (RHT) to mitigate block-level outlier error in weight gradients. MS-EDEN further halves gradient quantization error versus standard SR, leveraging RHT with unbiased group-scaling and stochastic group-scale rounding (Panferov et al., 30 Jan 2026). These techniques are integrated into recipes such as Quartet II.
- Double-block scaling and OutControl: Double-block scaling applies scaling at both the 128- and 16-element levels; OutControl identifies persistent outlier channels and stores them at higher precision (Chen et al., 31 Oct 2025). OsciReset suppresses oscillatory quantization noise in ill-behaved weight trajectories. Ablations confirm that each component is critical for minimizing the full-precision–NVFP4 gap.
- Quantization-Aware Distillation (QAD): KL-divergence-based distillation from a full-precision teacher significantly narrows the performance gap for PTQ-quantized NVFP4 models across SFT, RL, and mixture training. For LLMs, QAD achieves accuracy recovery within 0.1–1% of BF16, outperforming standard QAT and PTQ (Xin et al., 27 Jan 2026). Recent advances advocate CKA-guided loss regularization (CKA-QAD) to mitigate representational drift and explicitly preserve internal feature geometry, which is essential for reasoning/coding accuracy post-quantization (Tu et al., 4 Jun 2026).
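Why stochastic rounding gives unbiased estimates can be seen in a small sketch on the E2M1 grid. It illustrates plain stochastic rounding only; it does not implement MS-EDEN, the random Hadamard transform, or any other component of the cited recipes. The function names are hypothetical, and inputs are assumed to be already scaled into the range [-6, 6].

```python
import math
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Deterministic round-to-nearest on the E2M1 grid (biased for a fixed x)."""
    q = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(q, x)


def stochastic_round(x, rng):
    """Round to one of the two neighbouring grid points so that E[result] == x."""
    mag = min(abs(x), 6.0)  # saturate at the largest grid value
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_hi = (mag - lo) / (hi - lo)  # probability of rounding up
            q = hi if rng.random() < p_hi else lo
            return math.copysign(q, x)
    raise AssertionError("unreachable: mag is always within [0, 6]")


rng = random.Random(0)
x = 2.2
samples = [stochastic_round(x, rng) for _ in range(20000)]
sr_mean = sum(samples) / len(samples)  # close to 2.2
rtn = round_nearest(x)                 # always 2.0, a systematic error of -0.2
```

Each individual sample has a larger error than round-to-nearest, but the errors average out across many gradient accumulations, which is the property the training recipes rely on.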
5. Hardware Implementations and Inference Pipelines
Native NVFP4 support in Blackwell GPUs enables highly efficient compute implementations:
- Tensor Core Execution: NVFP4 matrix-multiply (MMA/GEMM) is performed with 4-bit operand blocks and E4M3 scaling. Both weights and activations can be quantized (W4A4), allowing 2–4× reduction in bandwidth and up to 3× increase in peak throughput relative to FP8 (NVIDIA et al., 24 Dec 2025). Internal accumulations are performed in FP16/BF16 to avoid catastrophic loss.
- MAC Circuitry: The IF4 MAC unit (for IF4 adaptive quantization) demonstrates that blockwise adaptive type selection incurs only modest hardware overhead: +4.7% latency, +66% area, +27.8% power (datapath only), which is amortized in full-chip context (Cook et al., 30 Mar 2026).
- Edge Inference: The NVLUT architecture decomposes FP4 multiplications into XOR, integer add, and LUT-based mantissa multiplies; two-level scaling and ECC-protected sign/exponent bits enable >20× energy savings and >1.5× area reduction for edge devices (Sen et al., 3 Jun 2026).
- Prefill/Decode Splitting: For LLM agents, aggressive NVFP4 quantization is used for parallel-prefilling (memory/compute bottleneck), while decoding falls back to BF16 to avoid snowballing error in autoregressive trajectories. This phase-aware strategy yields 2–3× speedup with <4% task performance loss (Lu et al., 19 May 2026).
- Small-Batch GEMV: In low-latency settings, such as LLM step-wise decoding, CUDA-core small-M kernels recover 1.7–2.5× kernel-level speedups, capitalizing on the bandwidth and arithmetic advantage of NVFP4 even for small batch sizes (Lee et al., 11 Jun 2026).
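The decomposition of an FP4 multiply into a sign XOR, an integer exponent add, and a table lookup for the mantissa product can be sketched in software. This is only an illustration of that arithmetic identity, not the NVLUT design: the bit layout (sign, exponent, mantissa from most to least significant bit), the names `fields` and `lut_multiply`, and the 4×4 table are assumptions, and two-level scaling and ECC are omitted.

```python
# Integer significands are 0..3 (mantissa bit plus hidden bit for normal codes),
# so the full product table has only 16 entries.
SIG_LUT = [[a * b for b in range(4)] for a in range(4)]


def fields(code):
    """Split a 4-bit E2M1 code into (sign, effective exponent, integer significand)."""
    sign = (code >> 3) & 1
    exp = (code >> 1) & 0b11
    man = code & 1
    sig = man if exp == 0 else 2 + man  # hidden bit only for normal codes
    eff_exp = max(exp, 1)               # subnormals share the exponent of exp == 1
    return sign, eff_exp, sig


def decode(code):
    """Reference decode: value = (-1)^sign * sig * 2^(eff_exp - 2)."""
    sign, eff_exp, sig = fields(code)
    return (-1.0 if sign else 1.0) * sig * 2.0 ** (eff_exp - 2)


def lut_multiply(a, b):
    """Multiply two E2M1 codes using XOR, an integer add, and one table lookup."""
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    sign = sa ^ sb            # sign of the product
    exp = ea + eb             # exponents add
    sig = SIG_LUT[ma][mb]     # significand product from the table
    return (-1.0 if sign else 1.0) * sig * 2.0 ** (exp - 4)


# Exhaustive check over all 256 code pairs against ordinary multiplication.
all_match = all(
    lut_multiply(a, b) == decode(a) * decode(b)
    for a in range(16)
    for b in range(16)
)
```

Because every E2M1 product is a small integer times a power of two, the result is exact and no floating-point multiplier is needed at the element level.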
6. Empirical Performance, Sensitivity, and Deployment Guidelines
NVFP4-based quantization is now evaluated extensively across pretraining, PTQ, QAT, LLM distillation, agentic inference, and edge deployment:
- Training/Inference Stability: Pretraining instability and divergence observed with baseline NVFP4 can be resolved with 4/6 scaling, IF4/MixFP4, or dual-grid methods (Cook et al., 1 Dec 2025, Cook et al., 30 Mar 2026). Recipes such as TetraJet-v2 and Quartet II combine block-adaptive quantization, unbiased stochastic backward estimation, and outlier control to close up to 51% of the FP4–FP32 loss gap (Chen et al., 31 Oct 2025, Panferov et al., 30 Jan 2026).
- Component Sensitivity: Systematic diagnostic studies show that NVFP4 sensitivity is largest in MLP up/down-projections; attention projections are substantially less critical, and block-wise sensitivity is heaviest in the final layers (Cim et al., 5 Mar 2026). Mixed-precision schedules prioritizing full precision for a small set of blocks/components achieve near-BF16 performance at minor memory overhead.
- Post-Training Quantization (PTQ): MR-GPTQ (Hadamard-rotated, NVFP4-optimized GPTQ) and SOAR (joint scale optimization + decoupled scale search) yield state-of-the-art PTQ results, with up to 2% zero-shot accuracy gain over classical NVFP4, and outperform MXFP4 and INT4 baselines (Egiazarian et al., 27 Sep 2025, Bao et al., 12 May 2026). ARCQuant recovers the error profile of 8-bit quantization with only unified NVFP4 4-bit artifacts via residual channel augmentation (Meng et al., 12 Jan 2026).
- Downstream and Throughput Results: NVFP4 models track BF16 and FP8 LLMs within 1–1.5% validation loss over tens of billions of tokens on models up to 12–15B parameters, with near-perfect recovery on SFT and RL fine-tuned downstream task benchmarks (NVIDIA et al., 29 Sep 2025, NVIDIA et al., 24 Dec 2025). Production deployment sees 2–4× memory and compute reduction, validated in both long-context LLMs and 720p video generation (Chen et al., 18 May 2026).
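The memory side of these reductions follows directly from the format definition: each element costs 4 bits plus its share of one 8-bit scale per 16-element block. A small sketch of that arithmetic, with hypothetical helper names and the per-tensor FP32 scale ignored as negligible:

```python
def nvfp4_bits_per_element(block_size=16, scale_bits=8):
    """Payload bits plus the amortized block-scale bits per element."""
    return 4 + scale_bits / block_size


def compression_ratio(baseline_bits):
    """Storage ratio of a baseline format to NVFP4."""
    return baseline_bits / nvfp4_bits_per_element()


def footprint_gib(num_params, bits_per_element):
    """Weight storage in GiB for a given parameter count."""
    return num_params * bits_per_element / 8 / 2**30


bits = nvfp4_bits_per_element()   # 4.5 bits per element
vs_bf16 = compression_ratio(16)   # about 3.56x smaller than BF16
vs_fp8 = compression_ratio(8)     # about 1.78x smaller than FP8

# Example: weights of a 12B-parameter model (size chosen for illustration).
bf16_gib = footprint_gib(12e9, 16)
nvfp4_gib = footprint_gib(12e9, bits)
```

The ratio against BF16 sits inside the 2–4× range quoted above; end-to-end savings also depend on which tensors are kept in higher precision.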
7. Future Directions and Open Challenges
Ongoing investigation targets the remaining tradeoffs and limitations of NVFP4 quantization:
- The development of even more adaptive grids and finer block sizes (e.g., block size 8), as in IF3/IF6 and multi-grid SFP4, continues to push representational efficiency (Cook et al., 30 Mar 2026, Egiazarian et al., 12 May 2026).
- Theoretical analysis of block size, bit-width, and scaling depth for new IF and PO2 grid families.
- Co-designing quantization methods with layer-local mixed-precision and task-adaptive block format switching (Cook et al., 30 Mar 2026).
- Embedding NVFP4 or IF4 MACs as integral sub-components in streaming, low-latency inference and generation engines.
- Exploring quantization-aware distillation methods further tuned to preserve hidden-state manifolds and feature geometry (e.g., via CKA regularization) to fully close the gap on challenging code and reasoning tasks (Tu et al., 4 Jun 2026).
NVFP4, with its block-scaled E2M1 microformat, two-level scaling, and accompanying suite of quantization-aware algorithms, currently defines the state of the art for extreme low-precision deep learning that maintains accuracy, efficiency, and hardware compatibility in both research and production deployments (Cook et al., 30 Mar 2026).