
FP4 Quantization: Ultra-Low Precision Neural Ops

Updated 10 April 2026
  • FP4 Quantization is an ultra-low precision numerical method that replaces traditional 16/32-bit arithmetic with 4-bit operations, offering significant speed, memory, and energy advantages.
  • It utilizes a two-stage quantization pipeline—combining blockwise scaling and per-element packing—with advanced strategies like Overflow-Aware Scaling and Macro Block Scaling to mitigate precision loss.
  • FP4 methods have been successfully applied in large language models, diffusion models, and multimodal systems, achieving near-baseline accuracy while optimizing hardware efficiency and resource utilization.

Four-bit floating-point quantization (FP4) is an ultra-low-precision numerical representation and quantization methodology that enables highly efficient deployment and training of large neural networks. FP4 achieves significant speed, memory, and energy advantages by replacing standard 16- or 32-bit floating-point arithmetic with 4-bit floating-point numbers, subject to the challenges of coarse resolution and limited dynamic range. Recent research has systematically advanced both the theoretical underpinnings and practical hardware/software systems for FP4 inference and training across LLMs, diffusion models, and multi-modal architectures.

1. FP4 Fundamentals and Quantization Formats

FP4 encompasses a family of 4-bit floating-point representations, each specified by a choice of sign (S), exponent (E), and mantissa (M) bit allocations and bias:

| Format | (S,E,M) | Bias | Value range (normals) | Typical block size | Scale format |
|---|---|---|---|---|---|
| E2M1 (NVFP4) | 1,2,1 | $2^{1}-1=1$ | $\pm\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ | 16 | E4M3 (8 bits) |
| E2M1 (MXFP4) | 1,2,1 | $1$ | $\pm\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$ | 32 | E8M0 (8 bits, power-of-two) |
| E3M0 | 1,3,0 | $2^{2}-1=3$ | $\pm\{0.125, 0.25, 0.5, \dots, 4\}$ | varies | varies |
| IF4 (int/float mux) | — | — | $\pm\{0, 1, \dots, 7\}$ (INT4) or FP4 | 16 | E4M3 (sign bit as mux flag) |

NVFP4 (NVIDIA) and MXFP4 (AMD/OCP) variants dominate hardware support. Both block macroscaling (16–32 values per scale) and fine-grained quantization are standard. Blockwise scaling is essential for keeping quantization error bounded under ultra-low bitwidths (Cim et al., 5 Mar 2026, Cook et al., 30 Mar 2026).
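The E2M1 value set above follows directly from the bit fields. A minimal Python sketch of the decoding rule (standard S/E/M layout with bias 1; the function name is illustrative):

```python
# Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit, bias 1).
# Exponent 0 encodes subnormals; this reproduces the NVFP4/MXFP4 value
# set +/-{0, 0.5, 1, 1.5, 2, 3, 4, 6}.

def decode_e2m1(code):
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                       # subnormal: (m/2) * 2^(1 - bias)
        mag = man * 0.5
    else:                              # normal: (1 + m/2) * 2^(exp - bias)
        mag = (1.0 + man / 2.0) * 2.0 ** (exp - 1)
    return sign * mag

print([decode_e2m1(c) for c in range(8)])  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Note the coarse spacing at the top of the range (4 to 6 in one step), which is why accurate per-block scale selection matters so much.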

2. Algorithms and Quantization Schemes

Fundamental to FP4 is a two-stage quantization pipeline:

  • Blockwise scaling: For a block $B = \{x_1, \ldots, x_G\}$, the scale $s$ is set as

$$s = \frac{\max_i |x_i|}{\mathrm{FP4}_{\max}}$$

and each $x_i / s$ is then rounded to the nearest representable FP4 value (for E2M1, $\pm\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$) and stored in 4 bits.

  • Per-element packing: Each 4-bit result encodes a sign, exponent, and mantissa.
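The two-stage pipeline above can be sketched in NumPy (a simplification: the per-block scale is kept in full precision here, whereas NVFP4/MXFP4 store it as E4M3/E8M0):

```python
import numpy as np

# Two-stage FP4 quantization sketch: blockwise scaling + nearest-grid
# rounding onto the E2M1 value set.

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX = 6.0

def quantize_blockwise(x, block=16):
    """Split x into blocks, compute s = max|x| / FP4_max per block,
    then round each |x|/s to the nearest E2M1 magnitude."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_MAX
    scales = np.where(scales == 0, 1.0, scales)           # all-zero blocks
    mags = np.abs(x) / scales
    idx = np.abs(mags[..., None] - GRID).argmin(axis=-1)  # 3-bit magnitude code
    return idx, np.sign(x), scales

def dequantize(idx, signs, scales):
    return signs * GRID[idx] * scales

x = np.random.default_rng(0).standard_normal(4 * 16)
idx, signs, scales = quantize_blockwise(x)
xq = dequantize(idx, signs, scales).ravel()
# per-element absolute error is bounded by scale * (largest half-gap = 1.0)
```

Because the block maximum maps exactly onto FP4_max, the worst-case error within a block is set by the grid's largest half-gap (1.0, between 4 and 6) times the block scale.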

Several advanced strategies are developed to overcome inherent limitations:

  • Overflow-Aware Scaling (OAS): Dynamically shifts scale exponent if the block maximum falls near the lower edge of a scale bin, doubling dynamic range at negligible error cost (Chhugani et al., 30 Jan 2026).
  • Macro Block Scaling (MBS): Allocates enhanced precision at the macro-block level (a group of contiguous sub-blocks) to mitigate outlier-induced overflow (Chhugani et al., 30 Jan 2026).
  • Adaptive Data Types (IF4): Mixes blockwise FP4/INT4 encodings depending on per-block MSE, flagged via scale sign bit, further lowering error (Cook et al., 30 Mar 2026).
  • Mixed-precision allocation and thresholds: Hybridization with higher-precision channels for outlier-heavy rows/channels (as in MicroMix, HQ-DiT) (Liu et al., 4 Aug 2025, Liu et al., 2024).
  • Block-wise affine and Kronecker transformations: In BATQuant, block-local, learnable affine maps re-shape input distributions without cross-block outlier spread, outperforming previous global or orthogonal rotations (Li et al., 17 Mar 2026).
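The IF4-style adaptive choice among these strategies can be illustrated as a per-block MSE comparison between the INT4 and FP4 grids (a sketch only; the real scheme flags the winning format via the scale's sign bit, and the grids/scaling here are illustrative):

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1
INT4_GRID = np.arange(8.0)                                       # +/-{0..7}

def quant_error(block, grid):
    """Absmax-scale the block onto `grid`; return (MSE, dequantized)."""
    s = np.abs(block).max() / grid.max()
    if s == 0:
        return 0.0, np.zeros_like(block)
    mags = np.abs(block) / s
    q = grid[np.abs(mags[:, None] - grid).argmin(axis=1)]
    deq = np.sign(block) * q * s
    return float(np.mean((block - deq) ** 2)), deq

def if4_quantize(block):
    mse_fp, deq_fp = quant_error(block, FP4_GRID)
    mse_int, deq_int = quant_error(block, INT4_GRID)
    # in IF4 the winning format is signalled via the scale's sign bit
    return ("FP4", deq_fp) if mse_fp <= mse_int else ("INT4", deq_int)
```

Intuitively, Gaussian-like blocks tend to prefer FP4's finer steps near zero, while near-uniform blocks tend to prefer INT4's equal spacing.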

The quantization process balances dynamic range, outlier suppression, and hardware efficiency, and is adapted by algorithmic innovations such as differentiable quantization estimators (Wang et al., 28 Jan 2025), mixture-of-formats layer-wise selection (Zhang et al., 2023), and scale-aware AdaRound (Chen et al., 19 Mar 2025).

3. FP4 for LLM Training and Inference

Modern LLMs exploit FP4 for weights, activations, and gradients (full W4A4G4), enabled by blockwise NVFP4 or MXFP4 quantization:

  • Training: Differentiable gradient estimators (beyond naive STE), mixed-precision scheduling (e.g., FP16/BF16 for non-matmul ops), vector-wise/channel-wise scaling, and outlier clamping with compensation restore training stability. The loss gap to BF16 shrinks to a small margin at 13B model scale, and zero-shot task accuracy can even improve over baseline (Wang et al., 28 Jan 2025, Chmiel et al., 25 May 2025).
  • Inference: Large-scale systematic analysis shows:
    • MLP up/down projections are consistently most sensitive to FP4 errors; attention projections are robust (Cim et al., 5 Mar 2026).
    • Per-layer and per-channel scale selection (LLM-FP4) narrows the gap to BF16/FP16 inference; e.g., LLaMA-13B suffers only a 5.8-point accuracy drop (63.1 vs. 68.9), dramatically outperforming pure INT4 (Liu et al., 2023).
    • Mixture-of-formats quantization (MoFQ) and MR-GPTQ boost recovery, bringing MXFP4 accuracy to within 1–2% of NVFP4 (Egiazarian et al., 27 Sep 2025).

FP4 quantization is now the standard for high-throughput kernel implementations on Blackwell/TensorCore GPUs, with multi-fold throughput gains over FP16 and sub-1% end-to-end quality degradation (Zhang et al., 16 May 2025, Egiazarian et al., 27 Sep 2025).
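The outlier clamping with compensation used in the training recipes above can be sketched as a percentile clamp plus a sparse high-precision residual (the 99.9th-percentile threshold is an illustrative choice, not a value from the cited papers):

```python
import numpy as np

def clamp_and_compensate(x, pct=99.9):
    """Clamp |x| at its pct-th percentile; return (clamped, residual).
    The clamped tensor is what gets quantized to FP4; the sparse residual
    (nonzero only at the clipped outlier positions) is kept in higher
    precision and added back after the low-precision matmul."""
    t = np.percentile(np.abs(x), pct)
    clamped = np.clip(x, -t, t)
    return clamped, x - clamped
```

Clamping shrinks the block maxima, so the FP4 scales tighten and the bulk of the distribution is quantized far more finely, at the cost of carrying a few high-precision outliers.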

4. Architectural and Hardware Considerations

The FP4 design space includes E2M1, E3M0, and extended-range UE5M3 variants, with the following trade-offs (Hu et al., 22 Sep 2025):

  • NVFP4 (block 16, E4M3 scales): Optimal runtime accuracy, moderate area overhead (+12–15%), often selected.
  • MXFP4 (block 32, E8M0 scales): Lower area, initially higher error unless enhanced by OAS/MBS or MR-GPTQ.
  • IF4 (adaptive float/int): Minimal overhead for logic and storage, with area only ∼0.016 mm² per MAC, 5% slower at system level (Cook et al., 30 Mar 2026).
  • Physical MAC area: An INT4 MAC occupies less silicon area than an FP4 MAC at matched frequency on TSMC 40 nm, though both are small in absolute terms (Liu et al., 2023).

Efficient memory layouts, scale-factor packing, and direct format conversions (e.g., MXFP4 → FP8) are necessary for contemporary accelerator architectures, including Hopper- and Blackwell-class GPUs (Zhang et al., 3 Mar 2026, Liu et al., 4 Aug 2025).
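The difference between the two scale formats can be sketched numerically: E8M0 restricts scales to powers of two, while E4M3 keeps a 3-bit mantissa (a simplification that ignores E4M3's exponent range and special values; both functions round the scale up so the block maximum never overflows):

```python
import math

FP4_MAX = 6.0  # largest E2M1 magnitude

def e8m0_scale(amax):
    """MXFP4-style scale: a pure power of two. Rounding the exponent up
    guarantees amax/scale <= FP4_MAX, at the cost of up to ~1 bit of
    wasted range (the inefficiency that OAS-style schemes recover)."""
    if amax == 0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / FP4_MAX))

def e4m3_scale(amax):
    """NVFP4-style scale: amax/FP4_MAX with its mantissa rounded up to
    3 bits, mimicking FP8 E4M3 granularity."""
    s = amax / FP4_MAX
    if s == 0:
        return 1.0
    e = math.floor(math.log2(s))
    frac = math.ceil((s / 2.0 ** e) * 8) / 8   # mantissa in [1, 2], 3 bits
    return frac * 2.0 ** e
```

For amax = 7.0, the E8M0 scale jumps to 2.0 while the E4M3 scale stays at 1.25, which is one reason MXFP4 typically starts with higher quantization error than NVFP4 before OAS/MBS-style corrections.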

5. FP4 in Diffusion, Vision, and Multimodal Models

FP4 methods increasingly target diffusion models and vision transformers:

  • Diffusion Models: FP4-based methods (HQ-DiT, FP4DiT, MSFP) outperform INT4 PTQ in both FID and CLIP metrics, retaining image quality at extreme low-bit settings such as W4A4 (Liu et al., 2024, Chen et al., 19 Mar 2025, Zhao et al., 27 May 2025). Hybrid channel-wise composition, token-wise min-max scaling, and per-layer mixup-sign quantization address the high dynamic range and asymmetry of activations.
  • Multi-Modal LLMs: BATQuant demonstrates MXFP4 can achieve 96.4% recovery in multimodal evaluations under W4A4KV16, by block-wise transformations closely aligned to hardware granularity (Li et al., 17 Mar 2026).

Token- and channel-wise activation quantization and data-driven format selection are fundamental in these architectures.
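A small NumPy experiment illustrates why token-wise (per-row) scaling matters when one token has far larger dynamic range than the rest (synthetic data; the 50× outlier token is illustrative):

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quant(x, s):
    """Absmax-style rounding of x/s onto the E2M1 grid, sign preserved."""
    mags = np.abs(x) / s
    q = GRID[np.abs(mags[..., None] - GRID).argmin(axis=-1)]
    return np.sign(x) * q * s

acts = np.random.default_rng(0).standard_normal((8, 64))
acts[0] *= 50.0                        # one token with 50x dynamic range

s_tensor = np.abs(acts).max() / 6.0                        # one global scale
s_token = np.abs(acts).max(axis=1, keepdims=True) / 6.0    # per-row scales

# measure error on the ordinary (non-outlier) tokens
err_tensor = np.mean((acts[1:] - quant(acts[1:], s_tensor)) ** 2)
err_token = np.mean((acts[1:] - quant(acts[1:], s_token[1:])) ** 2)
# the outlier token inflates the global scale, crushing ordinary tokens
# onto a handful of coarse levels; per-token scales avoid this entirely
```

With the global scale, the ordinary tokens collapse almost entirely to zero; per-token scales keep their error orders of magnitude lower.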

6. Quantization-Aware Training, Stability, and Error Analysis

Stability in FP4 training is achieved through a combination of scaling, outlier-management, and format-selection techniques. A principled recipe for highly robust FP4 training includes block-/tensor-wise scaling, Hadamard scattering (for outlier reduction), and combined use of FP4/INT4/MX/UE5M3 formats as dictated by workload and model scale (Hu et al., 22 Sep 2025).
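Hadamard scattering works because an orthonormal Hadamard transform spreads a single outlier's energy evenly across all coordinates while preserving the vector's norm. A minimal sketch:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n = 2^k)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

x = np.zeros(64)
x[3] = 8.0                      # a single large outlier
y = hadamard(64) @ x            # energy spread over all 64 coordinates
print(np.abs(x).max(), np.abs(y).max())  # 8.0 1.0 -- same norm, 8x smaller peak
```

After the transform, the blockwise absmax scale shrinks by a factor of sqrt(n) in this extreme case, so the FP4 grid covers the bulk of the values far more finely; the transform is inverted after the quantized matmul.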

7. Performance, Accuracy Tradeoffs, and Recommendations

Comprehensive empirical benchmarks show:

| Format/Method | BOPs/element | Perplexity (LLaMA-4B / WT2) | Downstream recovery | Speedup vs. FP16 | Hardware area overhead | Main use-case |
|---|---|---|---|---|---|---|
| NVFP4 | 4.5 | 18.03 | 94–99% | 2–4× | +12–14% | LLM train/infer |
| MXFP4 (OAS+MBS) | 4.25 | 19.77 (→ 17.78 with IF4/OAS) | 89–97% | 2–4× | baseline | LLM/MLLM |
| IF4 | 4.5 | 17.78 | 96%+ | ~5% slower than NVFP4 | negligible | Niche, mixed-format |
| MR-GPTQ | 4–4.5 | 75.84 (Llama-8B) | 93–96% | ≈1% overhead | low | Post-training quantization |
| HQ-DiT | — | sFID +0.12 @ W4A4 | ~99% | 3–5× | none | Diffusion |

For best-practice deployment:

  • Use NVFP4 or IF4 on Blackwell or compatible accelerators. For MXFP4 hardware, augment with OAS+MBS for near-parity.
  • For training, apply DGE (differentiable gradient estimation) or other advanced estimators, mean subtraction, and blockwise stochastic rounding (SR), and switch to QAF if gradient noise dominates.
  • For vision or diffusion, combine channel-/token-wise scaling, mixup-sign quantization, and universal Hadamard transforms.
  • For memory/compute-constrained scenarios or MoE, integrate FP4 storage/comm with row→col fused format conversion (Zhang et al., 3 Mar 2026).

FP4 has matured to the point where, with appropriate algorithmic, calibration, and hardware support, it offers an unrivaled trade-off between efficiency and accuracy for large-model deployment, with continued research closing the remaining gaps in extreme-scale and cross-modal regimes.
