FP4 Quantization: Ultra-Low Precision Neural Ops
- FP4 Quantization is an ultra-low precision numerical method that replaces traditional 16/32-bit arithmetic with 4-bit operations, offering significant speed, memory, and energy advantages.
- It utilizes a two-stage quantization pipeline—combining blockwise scaling and per-element packing—with advanced strategies like Overflow-Aware Scaling and Macro Block Scaling to mitigate precision loss.
- FP4 methods have been successfully applied to large language models, diffusion models, and multimodal systems, achieving near-baseline accuracy while optimizing hardware efficiency and resource utilization.
Four-bit floating-point quantization (FP4) is an ultra-low-precision numerical representation and quantization methodology that enables highly efficient deployment and training of large neural networks. FP4 achieves significant speed, memory, and energy advantages by replacing standard 16- or 32-bit floating-point arithmetic with 4-bit floating-point numbers, subject to the challenges of coarse resolution and limited dynamic range. Recent research has systematically advanced both the theoretical underpinnings and practical hardware/software systems for FP4 inference and training across LLMs, diffusion models, and multi-modal architectures.
1. FP4 Fundamentals and Quantization Formats
FP4 encompasses a family of 4-bit floating-point representations, each specified by a choice of sign (S), exponent (E), and mantissa (M) bit allocations and bias:
| Format | (S,E,M) | Bias | Value Range (normals) | Typical Block Size | Scale Format |
|---|---|---|---|---|---|
| E2M1 (NVFP4) | 1,2,1 | 1 | ±6 (max) | 16 | E4M3 (8 bits) |
| E2M1 (MXFP4) | 1,2,1 | 1 | ±6 (max) | 32 | E8M0 (8 bits, pow2) |
| E3M0 | 1,3,0 | varies | varies | varies | varies |
| IF4 | – (int/float mux) | – | INT4 or FP4 per block | 16 | E4M3 (sign as mux flag) |
NVFP4 (NVIDIA) and MXFP4 (AMD/OCP) variants dominate hardware support. Both block macroscaling (16–32 values per scale) and fine-grained quantization are standard. Blockwise scaling is essential for keeping quantization error bounded under ultra-low bitwidths (Cim et al., 5 Mar 2026, Cook et al., 30 Mar 2026).
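Concretely, the E2M1 element format above can be decoded with a few lines of bit manipulation. The sketch below assumes the standard OCP-style layout (sign in bit 3, two exponent bits, one mantissa bit, exponent bias 1); it is an illustration, not vendor reference code:

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code: bit 3 = sign, bits 2-1 = exponent,
    bit 0 = mantissa, exponent bias 1. Exponent field 0 encodes the
    subnormal range (value = mantissa * 0.5)."""
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        return sign * man * 0.5           # subnormals: 0 or +-0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
```

Enumerating all eight positive codes recovers the familiar E2M1 magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6}, whose maximum of 6 is what blockwise scales are normalized against.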
2. Algorithms and Quantization Schemes
Fundamental to FP4 is a two-stage quantization pipeline:
- Blockwise scaling: For block $B$, the scale is set as $s_B = \max_{x \in B} |x| \, / \, q_{\max}$, where $q_{\max}$ is the largest representable FP4 magnitude (6 for E2M1); each block value is then quantized as $\hat{x} = Q_{\mathrm{FP4}}(x / s_B)$ and stored as 4 bits.
- Per-element packing: Each 4-bit result encodes a sign, exponent, and mantissa.
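The two-stage pipeline can be sketched as fake-quantization in NumPy (a minimal illustration, not any paper's reference implementation; `E2M1_GRID` lists the representable E2M1 magnitudes, and the scale here is kept in full precision rather than an 8-bit format):

```python
import numpy as np

# Positive magnitudes representable in E2M1; sign is handled separately.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x, block_size=16):
    """Blockwise FP4 (E2M1) fake-quantization: scale each block so its
    absmax maps onto the largest representable magnitude (6.0), round
    every element to the nearest grid point, then rescale back."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        blk = x[start:start + block_size]
        scale = max(np.abs(blk).max() / E2M1_GRID[-1], 1e-12)
        scaled = blk / scale
        # Round-to-nearest onto the E2M1 grid, preserving sign.
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
        out[start:start + block_size] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out
```

Values that land exactly on a scaled grid point survive the round trip unchanged; everything else incurs an error bounded by half the local grid spacing times the block scale, which is why blockwise (rather than per-tensor) scales keep the error small.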
Several advanced strategies are developed to overcome inherent limitations:
- Overflow-Aware Scaling (OAS): Dynamically shifts scale exponent if the block maximum falls near the lower edge of a scale bin, doubling dynamic range at negligible error cost (Chhugani et al., 30 Jan 2026).
- Macro Block Scaling (MBS): Allocates enhanced precision at the macro-block level (spanning multiple sub-blocks) to mitigate outlier-induced overflow (Chhugani et al., 30 Jan 2026).
- Adaptive Data Types (IF4): Mixes blockwise FP4/INT4 encodings depending on per-block MSE, flagged via scale sign bit, further lowering error (Cook et al., 30 Mar 2026).
- Mixed-precision allocation and thresholds: Hybridization with higher-precision channels for outlier-heavy rows/channels (as in MicroMix, HQ-DiT) (Liu et al., 4 Aug 2025, Liu et al., 2024).
- Block-wise affine and Kronecker transformations: In BATQuant, block-local, learnable affine maps re-shape input distributions without cross-block outlier spread, outperforming previous global or orthogonal rotations (Li et al., 17 Mar 2026).
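As a toy illustration of why overflow-aware scale selection matters for MXFP4's power-of-two scales: the usual floor-based shared-exponent rule can leave the block maximum above the representable E2M1 range, which an OAS-style check repairs by bumping the exponent. The function below is a hedged sketch of the idea; the published OAS rule may differ in detail:

```python
import math

FP4_MAX = 6.0  # largest E2M1 magnitude

def e8m0_scale(absmax, overflow_aware=True):
    """Pick a power-of-two (E8M0-style) block scale for MXFP4.
    Baseline rule: exponent = floor(log2(absmax)) - 2, the usual
    shared-exponent choice for E2M1 elements (whose max exponent is 2).
    The overflow-aware variant raises the exponent by one whenever the
    scaled block maximum would still exceed FP4_MAX, trading a little
    resolution for clip-free encoding."""
    if absmax == 0.0:
        return 1.0
    e = math.floor(math.log2(absmax)) - 2
    if overflow_aware and absmax / (2.0 ** e) > FP4_MAX:
        e += 1
    return 2.0 ** e
```

For example, a block absmax of 7.0 gets baseline scale 1.0 (so 7.0 would clip to 6.0), whereas the overflow-aware variant picks scale 2.0 and keeps the maximum representable.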
The quantization process balances dynamic range, outlier suppression, and hardware efficiency, and is adapted by algorithmic innovations such as differentiable quantization estimators (Wang et al., 28 Jan 2025), mixture-of-formats layer-wise selection (Zhang et al., 2023), and scale-aware AdaRound (Chen et al., 19 Mar 2025).
3. FP4 for LLM Training and Inference
Modern LLMs exploit FP4 for weights, activations, and gradients (full W4A4G4), enabled by blockwise NVFP4 or MXFP4 quantization:
- Training: Differentiable gradient estimators (beyond naive STE), mixed-precision scheduling (e.g., FP16/BF16 for non-matmul ops), vector-wise/channel-wise scaling, and outlier clamping plus compensation restore training stability. The loss gap to BF16 shrinks to a small margin at the 13B model scale, and zero-shot task accuracy can even improve over baseline (Wang et al., 28 Jan 2025, Chmiel et al., 25 May 2025).
- Inference: Large-scale systematic analysis shows:
- MLP up/down projections are consistently most sensitive to FP4 errors; attention projections are robust (Cim et al., 5 Mar 2026).
- Per-layer and per-channel scale selection (LLM-FP4) closes the gap with BF16/FP16 inference, e.g., LLaMA-13B suffers only a 5.8-point accuracy drop (63.1 vs. 68.9), dramatically outperforming pure INT4 (Liu et al., 2023).
- Mixture-of-formats quantization (MoFQ) and MR-GPTQ boost recovery, bringing MXFP4 accuracy to within 1–2% of NVFP4 (Egiazarian et al., 27 Sep 2025).
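The benefit of finer-grained scale selection can be reproduced in miniature: when one channel carries high-magnitude activations, a single per-tensor scale crushes the resolution available to the remaining channels, while per-channel scales do not. The snippet below is a synthetic NumPy experiment illustrating this effect, not the LLM-FP4 procedure itself:

```python
import numpy as np

rng = np.random.default_rng(1)
GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quant_error(x, scale):
    """MSE after scaling onto the E2M1 grid and back (sign preserved)."""
    s = np.abs(x) / scale
    q = GRID[np.abs(s[..., None] - GRID).argmin(-1)] * np.sign(x) * scale
    return float(((q - x) ** 2).mean())

# Activations with one high-magnitude channel, a common LLM pattern.
x = rng.normal(size=(64, 8))
x[:, 0] *= 50.0
per_tensor = quant_error(x, np.abs(x).max() / 6.0)
per_channel = quant_error(x, np.abs(x).max(axis=0, keepdims=True) / 6.0)
```

With the per-tensor scale dominated by the outlier channel, the small-magnitude channels round almost entirely to zero; per-channel scaling cuts the MSE by orders of magnitude on this toy input.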
FP4 quantization is now the standard for high-throughput kernel implementations in Blackwell/TensorCore GPUs, with up to 5× throughput over FP16 and sub-1% end-to-end quality degradation (Zhang et al., 16 May 2025, Egiazarian et al., 27 Sep 2025).
4. Architectural and Hardware Considerations
The FP4 design space includes E2M1, E3M0, and extended-range UE5M3 variants, with the following trade-offs (Hu et al., 22 Sep 2025):
- NVFP4 (block 16, E4M3 scales): Optimal runtime accuracy, moderate area overhead (+12–15%), often selected.
- MXFP4 (block 32, E8M0 scales): Lower area, initially higher error unless enhanced by OAS/MBS or MR-GPTQ.
- IF4 (adaptive float/int): Minimal overhead for logic and storage, with an area of only ∼0.016 mm² per MAC, and only ∼5% slower at the system level (Cook et al., 30 Mar 2026).
- Physical MAC area: 6–7 μm² for INT4, 8–9 μm² for FP4 on TSMC 40 nm at 1.0 GHz (Liu et al., 2023).
Efficient memory layouts, scale-factor packing, and direct format conversions (e.g., MXFP4 ↔ FP8) are necessary for contemporary accelerator architectures, including Hopper- and Blackwell-class GPUs (Zhang et al., 3 Mar 2026, Liu et al., 4 Aug 2025).
5. FP4 in Diffusion, Vision, and Multimodal Models
FP4 methods increasingly target diffusion models and vision transformers:
- Diffusion Models: FP4-based methods (HQ-DiT, FP4DiT, MSFP) outperform INT4 PTQ in both FID and CLIP metrics, retaining image quality at extreme low-bit settings (Liu et al., 2024, Chen et al., 19 Mar 2025, Zhao et al., 27 May 2025). Hybrid channel-wise composition, token-wise minmax scaling, and per-layer mixup-sign quantization address the high dynamic range and asymmetry of activations.
- Multi-Modal LLMs: BATQuant demonstrates MXFP4 can achieve 96.4% recovery in multimodal evaluations under W4A4KV16, by block-wise transformations closely aligned to hardware granularity (Li et al., 17 Mar 2026).
Token- and channel-wise activation quantization and data-driven format selection are fundamental in these architectures.
6. Quantization-Aware Training, Stability, and Error Analysis
Stability in FP4 training is achieved via:
- Differentiable Quantization Estimators: DGE, spline-based, or non-uniform (non-STE) estimators improve backward-pass stability (Wang et al., 28 Jan 2025, Hu et al., 22 Sep 2025).
- Stochastic Rounding (SR): Applied on backward and update passes, limits gradient bias and helps convergence in low-precision regimes (Chmiel et al., 25 May 2025).
- Mean-subtraction ("Averis") and SVD Outlier Filtering: Rank-one bias in activations compresses block dynamic range and scales; mean-centering each block recovers near-BF16 performance at negligible cost (Cao et al., 11 Mar 2026, Liu et al., 2024, Liu et al., 2023).
- Theoretical Thresholds: Quantized training ceases to make learning progress when gradient norms drop below a threshold proportional to the quantization noise, setting a lower bound for effective FP4-only training (Chmiel et al., 25 May 2025).
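Of these, stochastic rounding is the simplest to state precisely: round each value to one of its two neighbouring grid points with probability proportional to proximity, so the result is unbiased in expectation. A generic grid-agnostic sketch (in practice the grid would be the scaled FP4 levels):

```python
import numpy as np

def stochastic_round(x, grid, rng):
    """Stochastically round each value to one of its two neighbouring
    grid points, with probability proportional to proximity, so that
    E[sr(x)] = x for x inside the grid range. Values outside the grid
    are clipped first."""
    x = np.asarray(x, dtype=np.float64)
    grid = np.sort(np.asarray(grid, dtype=np.float64))
    x_cl = np.clip(x, grid[0], grid[-1])
    hi_idx = np.clip(np.searchsorted(grid, x_cl, side="left"), 1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_up = (x_cl - lo) / np.where(hi > lo, hi - lo, 1.0)
    return np.where(rng.random(x.shape) < p_up, hi, lo)
```

Averaged over many draws, the rounded values recover the input mean, which is exactly the property that keeps accumulated gradient updates from acquiring a systematic bias.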
A principled recipe for highly robust FP4 training includes block-/tensor-wise scaling, Hadamard scattering (for outlier reduction), and combined use of FP4/INT4/MX/UE5M3 as needed according to workload and model scale (Hu et al., 22 Sep 2025).
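The Hadamard-scattering step can be illustrated directly: an orthonormal Hadamard rotation spreads a single outlier's energy across the whole block, reducing the absmax that the FP4 scale must cover while preserving the vector norm. A toy example (not a full quantization recipe):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# A single outlier forces a huge block scale; after rotation its energy
# is spread across all coordinates, shrinking the absmax the FP4 scale
# must cover while the norm is preserved (the rotation is orthonormal).
x = np.array([100.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
y = hadamard(8) @ x
```

Here the pre-rotation absmax of 100 drops to roughly 38 after the transform, so the blockwise scale no longer sacrifices all its resolution to one coordinate; the transform is undone after the low-precision matmul.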
7. Performance, Accuracy Tradeoffs, and Recommendations
Comprehensive empirical benchmarks show:
| Format/Method | BOPs/Element | Perplexity (LLaMA-4B/WT2) | Downstream Recovery | Speedup vs. FP16 | Hardware Area Overhead | Main Use-case |
|---|---|---|---|---|---|---|
| NVFP4 | 4.5 | 18.03 | 94–99% | 2–4× | +12–14% | LLM train/infer |
| MXFP4 (OAS+MBS) | 4.25 | 19.77 (→17.78 with IF4/OAS) | 89–97% | 2–4× | baseline | LLM/MLLM |
| IF4 | 4.5 | 17.78 | 96%+ | ~5% slower than NVFP4 | Negligible | Niche, mixed |
| MR-GPTQ | 4–4.5 | 75.84 (Llama-8B) | 93–96% | minor kernel overhead | Low | Post-training quantization |
| HQ-DiT | – | sFID +0.12 @ W4A4 | 98–99% | 3–5× | None | Diffusion |
For best-practice deployment:
- Use NVFP4 or IF4 on Blackwell or compatible accelerators. For MXFP4 hardware, augment with OAS+MBS for near-parity.
- For training, apply DGE or advanced estimators, mean-subtraction, blockwise SR, and switch to QAF if gradient noise dominates.
- For vision or diffusion, combine channel-/token-wise scaling, mixup-sign quantization, and universal Hadamard transforms.
- For memory/compute-constrained scenarios or MoE, integrate FP4 storage/comm with row→col fused format conversion (Zhang et al., 3 Mar 2026).
FP4 has matured to the point where, with appropriate algorithmic, calibration, and hardware support, it offers an unrivaled trade-off between efficiency and accuracy for large-model deployment, with continued research closing the remaining gaps in extreme-scale and cross-modal regimes.