LLM-FP4 Quantization Methods
- LLM-FP4 is a 4-bit quantization approach for LLMs that enables efficient memory usage and throughput while maintaining competitive accuracy.
- It combines post-training quantization and fully quantized training techniques, including blockwise scaling and error compensation, to preserve model quality at 4-bit precision.
- Multiple FP4 formats (e.g., MXFP4, NVFP4) align with diverse hardware accelerators, achieving significant speedups and reduced quantization error over traditional methods.
LLM-FP4 refers to a family of quantization and training strategies for LLMs that rely on 4-bit floating-point (FP4) representations. Building on recent hardware support in dedicated AI accelerators, LLM-FP4 enables aggressive memory and throughput optimizations for both training and inference without incurring the prohibitive accuracy loss seen in earlier ultra-low-precision approaches. LLM-FP4 methods span post-training quantization, fully quantized training, blockwise and mixed-format variants, and specialized error-compensation frameworks. Research in this area establishes FP4 as a practical drop-in replacement for traditional INT4 and INT8 quantization and, under carefully devised recipes, a competitive alternative to FP8 or BF16 for pretraining foundation models at scale.
1. FP4 Numerical Formats and Microscaling Strategies
Multiple FP4 numeric layouts have been adopted for LLM applications, each tailored for hardware compatibility and error characteristics:
- E2M1 (1 sign, 2 exponent, 1 mantissa bit): The most widely used layout; it represents magnitudes (1 + m/2)·2^(e−1) plus the subnormals {0, 0.5}, giving the signed quantization grid {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}.
- Blockwise/Grouped Scaling: Nearly all platforms use microscaling, quantizing a block (usually 16 or 32 elements) with a shared scale:
- MXFP4 (E2M1 per element, E8M0 scale): Power-of-two (8-bit) scales for blocks of 32; native to AMD CDNA and NVIDIA B200 Tensor Cores (Castro et al., 20 May 2025).
- NVFP4 (E2M1 per element, E4M3 scale): Finer-grained FP8 (4 exp/3 mantissa) scales for blocks of 16, affording improved dynamic range control (Chmiel et al., 25 May 2025, Panferov et al., 30 Jan 2026, NVIDIA et al., 29 Sep 2025).
- M0E4/E0M4: Used in mobile deployment; 4 bits of mantissa per value with shared sign and exponent at group level for efficient FP16 dequantization (Li et al., 2024).
Most approaches include additional global or tensor-level scaling (e.g., global FP32 or E6M2 on NPU/Ascend platforms) to extend dynamic range. A minimal sketch of the blockwise scheme is given below.
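To make the blockwise scheme concrete, the following is a minimal NumPy sketch of MXFP4-style quantization: each block of 32 values shares a power-of-two (E8M0-style) scale, and every element is rounded to the nearest point of the E2M1 grid. The scale-selection rule shown (the smallest power of two that avoids clipping) and the function names are illustrative simplifications; hardware and the OCP MX specification may use a different rule.

```python
import numpy as np

# E2M1 magnitude grid: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(x):
    """Quantize one block to E2M1 values with a shared power-of-two (E8M0-style) scale."""
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return np.zeros_like(x), 1.0
    # Smallest power-of-two scale such that amax / scale <= 6 (no clipping); real
    # implementations may follow a slightly different scale-selection rule.
    scale = 2.0 ** np.ceil(np.log2(amax / E2M1_MAGS[-1]))
    mags = np.abs(x) / scale
    # Round-to-nearest onto the E2M1 magnitude grid, then restore signs.
    idx = np.abs(mags[:, None] - E2M1_MAGS[None, :]).argmin(axis=1)
    return np.sign(x) * E2M1_MAGS[idx], scale

def fake_quantize_mxfp4(x, block_size=32):
    """Blockwise fake quantization of a 1-D tensor: returns the dequantized values."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, x.size, block_size):
        q, s = quantize_block_mxfp4(x[i:i + block_size])
        out[i:i + block_size] = q * s
    return out
```

Comparing fake_quantize_mxfp4(w) against w gives a quick per-block estimate of the quantization error that the formats above are designed to control.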
2. Post-Training Quantization and Inference-Pipeline Variants
LLM-FP4 can be deployed via inference-time-only, post-training quantization (PTQ) pipelines, achieving efficient W4A4 inference. Key contributions include:
- FP4 PTQ with Exponent/Mantissa Search: "LLM-FP4" introduced a layerwise search for the optimal exponent/mantissa split and bias per layer, using a calibration set and a mean-squared-error objective, outperforming previous integer and naive floating-point baselines (Liu et al., 2023). For channel-variant activation distributions, per-channel activation exponents are reparameterized as a "pre-shifted exponent bias" absorbed into the weight scaling, improving both accuracy and hardware efficiency (a minimal sketch of the calibration-driven search appears after this list).
- Mixed-Format Blockwise Quantization: DialectFP4 structures a 16-dialect formatbook of E2M1-like codebooks, chosen per block via a fast two-stage selection to minimize wasted representation range and match block-level histograms, synergizing with integer-only hardware (Jang et al., 2 Jan 2025). MicroMix composes mixed-precision (FP4/FP6/FP8) channels within layers, balancing quantization error and efficiency (Liu et al., 4 Aug 2025).
- Specialized Rotation and Compensation: DuQuant++ and MR-GPTQ adapt rotation-based error spreading to the block boundaries of MXFP4 and NVFP4, via outlier-aware rotations (DuQuant++) or blockwise Hadamard transforms (MR-GPTQ), closing the accuracy gap to FP16 (Lin et al., 20 Apr 2026, Egiazarian et al., 27 Sep 2025). Hot-channel patching (HCP) in CHON reinjects quantization residuals for persistent outlier channels in blockwise NVFP4 (Dong et al., 2 Feb 2026).
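As a hedged illustration of the calibration-driven search mentioned in the first bullet above, the sketch below scores a few candidate exponent/mantissa splits by mean-squared error on a flattened calibration tensor and keeps the best one. The toy format constructor, the fixed bias choice, and the absmax per-tensor scale are assumptions for illustration; the published LLM-FP4 method additionally searches the exponent bias per layer and reparameterizes per-channel activation biases.

```python
import numpy as np

def fp_grid(e_bits, m_bits, bias):
    """Non-negative values of a toy 1 + e_bits + m_bits floating format (no inf/nan)."""
    vals = {0.0}
    for e in range(2 ** e_bits):
        for m in range(2 ** m_bits):
            if e == 0:                                    # subnormal range
                vals.add((m / 2 ** m_bits) * 2.0 ** (1 - bias))
            else:                                         # normal range
                vals.add((1 + m / 2 ** m_bits) * 2.0 ** (e - bias))
    return np.array(sorted(vals))

def quantize_to_grid(w, grid, scale):
    """Round |w| / scale to the nearest grid point, then restore sign and scale."""
    idx = np.abs((np.abs(w) / scale)[:, None] - grid[None, :]).argmin(axis=1)
    return np.sign(w) * grid[idx] * scale

def search_fp4_split(w, candidates=((1, 2), (2, 1), (3, 0))):
    """Pick the exponent/mantissa split with the lowest MSE on a calibration tensor."""
    w = np.asarray(w, dtype=np.float64).ravel()           # assumes a nonzero tensor
    best = None
    for e_bits, m_bits in candidates:
        grid = fp_grid(e_bits, m_bits, bias=2 ** (e_bits - 1))  # illustrative bias
        scale = np.max(np.abs(w)) / grid[-1]              # absmax per-tensor scale
        mse = float(np.mean((w - quantize_to_grid(w, grid, scale)) ** 2))
        if best is None or mse < best[1]:
            best = ((e_bits, m_bits), mse)
    return best
```

In a layerwise PTQ pipeline, such a search would run per layer (or per tensor) over calibration weights and activations, with the winning format stored alongside the quantized data.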
Table: Principal FP4 Formats for LLM Inference
| Format Name | Block Size | Scale Format | Error Spreading | Hardware |
|---|---|---|---|---|
| MXFP4 | 32 | E8M0 (pow2) | Hadamard, GPTQ | NVIDIA, AMD |
| NVFP4 | 16 | E4M3 (FP8) | Hadamard | NVIDIA Blackwell |
| DialectFP4 | 32 | per-block | 16 codebooks | Integer hardware |
| M0E4/E0M4 | 128 | groupwise | None | Mobile (OpenCL) |
3. Fully Quantized Training and Stability Mechanisms
LLM-FP4 pretraining recipes combine blockwise quantization, unbiased or low-variance rounding, and error compensation to realize stable, large-scale LLM training entirely in FP4:
- Quartet (MXFP4/Blackwell): All linear GEMMs are performed in FP4. On the forward pass, blockwise Hadamard transforms and MSE-optimal quantization (QuEST) minimize outlier damage; the backward pass uses stochastic rounding and randomized Hadamard rotations to decorrelate quantization error (a sketch of this rotate-and-round pattern follows this list). Quartet demonstrates Chinchilla-style scaling efficiency, showing that, bit-for-bit and with hardware throughput benefits factored in, FP4 is optimal in practical batch regimes for Llama-type and other transformers (Castro et al., 20 May 2025).
- Quartet II/MS-EDEN (NVFP4): Stochastic rounding is replaced with blockwise Hadamard rotation, RTN quantization, and a scale-correction factor (MS-EDEN), which achieves unbiased gradient estimation at half the error variance of standard SR. This closes the loss gap to BF16 and delivers up to 4.2× end-to-end speedup on Blackwell (Panferov et al., 30 Jan 2026).
- Mean Bias Removal: Recent work demonstrates that the principal source of catastrophic quantization instability is a systematic, rank-one mean bias in layer activations. Explicit mean removal before quantization restores stability to nearly BF16 levels at minimal cost, outperforming more complex SVD-based spectral regularization (Cao et al., 11 Mar 2026); a small sketch of the mean-removal idea follows the table below.
- Vector-wise and Mixed-Precision Training: Earlier FP4 training frameworks (e.g., DGE/OCC) address weight-update bias with a differentiable quantization estimator and outliers with clamping plus residual compensation, keeping non-GEMM operators and Adam state in mixed precision (Wang et al., 28 Jan 2025).
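Two ingredients recur across these recipes: a blockwise random rotation that spreads outliers, and an unbiased rounding step. The following is a hedged sketch of that pattern, not the actual Quartet/Quartet II kernels; the grid, block handling, and helper names are assumptions for illustration.

```python
import numpy as np

# Full signed E2M1 grid (sorted), as used elementwise by MXFP4/NVFP4.
E2M1 = np.array([-6, -4, -3, -2, -1.5, -1, -0.5, 0,
                 0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float64)

def hadamard(n):
    """Unnormalized Hadamard matrix of size n (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def stochastic_round_fp4(x, rng):
    """Round each entry to one of its two neighbouring grid points with probability
    proportional to distance, so the quantizer is unbiased inside the grid range."""
    xc = np.clip(x, E2M1[0], E2M1[-1])
    hi_idx = np.clip(np.searchsorted(E2M1, xc, side="left"), 1, len(E2M1) - 1)
    lo, hi = E2M1[hi_idx - 1], E2M1[hi_idx]
    p_up = (xc - lo) / (hi - lo)
    return np.where(rng.random(x.shape) < p_up, hi, lo)

def fp4_block_fakequant(g_block, rng):
    """Sign-randomized Hadamard rotation, absmax scaling, stochastic rounding to E2M1,
    then inverse rotation; returns a fake-quantized copy of one block (power-of-two size)."""
    n = g_block.size
    signs = rng.choice([-1.0, 1.0], size=n)                # random diagonal
    H = hadamard(n) / np.sqrt(n)                           # orthonormal rotation
    rotated = H @ (signs * g_block)
    amax = float(np.max(np.abs(rotated)))
    scale = amax / E2M1[-1] if amax > 0 else 1.0
    q = stochastic_round_fp4(rotated / scale, rng) * scale
    return signs * (H.T @ q)                               # undo rotation
```

As described above, Quartet II's MS-EDEN swaps the stochastic-rounding step for RTN plus a scale-correction factor while keeping the same rotate-quantize-unrotate structure.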
Table: Empirical Effects of FP4 Training Strategies
| Technique | Loss gap vs. BF16 | Core Stabilizer | HW Platform |
|---|---|---|---|
| Quartet (MXFP4) | <0.1 (7B Llama) | Stochastic rounding, Hadamard | Blackwell |
| Quartet II (NVFP4) | +1.44% (1.9B) | MS-EDEN | Blackwell |
| DGE/OCC+Vec Quant | <1–2% (13B) | Outlier clamp/comp | H100 simulation |
| Mean-Removal (Averis) | 0.03 NLL (0.6B) | Source-level mean removal | Qwen3/H100 |
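A minimal sketch of the mean-removal idea from the list above, assuming per-token mean removal along the feature dimension; the actual recipe in (Cao et al., 11 Mar 2026) may use a different granularity or fold the correction into adjacent operations.

```python
import numpy as np

def fakequant_with_mean_removal(x, fake_quantize):
    """Subtract the per-token mean (a rank-one component) before quantization and
    reinject it in high precision afterwards; `fake_quantize` is any 1-D
    quantize-dequantize routine, e.g. a blockwise FP4 fake quantizer."""
    x = np.asarray(x, dtype=np.float64)            # shape: (tokens, features)
    mu = x.mean(axis=-1, keepdims=True)            # rank-one mean bias
    centered = x - mu                              # only this part sees the quantizer
    xq = np.stack([fake_quantize(row) for row in centered])
    return xq + mu                                 # add the mean back losslessly
```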
4. Blockwise and Format-Specific Error Analysis
Several studies systematically dissect the failure modes and sensitivities of FP4 quantization schemes:
- Component-wise and Blockwise Sensitivity: Empirical analysis shows that MLP up/down projections are the most sensitive to FP4 quantization, followed by gates, with attention projections the most robust. Under MXFP4, early blocks in smaller models can be as critical as final blocks; selective high-precision fallback therefore yields minimal perplexity increase while preserving most of the efficiency gain (Cim et al., 5 Mar 2026). A small sensitivity-scan sketch follows this list.
- Failure of Standard PTQ on Small Blocks: Outlier mitigation via Hadamard or global rotations is largely ineffective at NVFP4's small block size (G = 16), where the expected reduction in top-element error is neutralized. MXFP4's power-of-two scale quantization causes significant error amplification unless compensated by MR-GPTQ or similar blockwise error correction (Egiazarian et al., 27 Sep 2025).
- Hot-Channel Dynamics: The evolution from transient to stable "hot" outlier channels motivates hardware-efficient online patching (HCP, as in CHON), which recovers second-order residuals for ~9% of channels with negligible overhead (Dong et al., 2 Feb 2026).
- Mobile and Edge Cases: M0E4 (E0M4) format, as in Transformer-Lite, enables highly efficient groupwise FP4 quantization for on-device inference, with bitwise dequantization but without sub-block scaling or advanced error compensation, showing negligible mean-absolute-error loss versus INT4 PTQ (Li et al., 2024).
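To act on such sensitivity findings in practice, one can score every weight matrix by the relative error its FP4 quantization would introduce and keep only the worst offenders in higher precision. The sketch below uses a relative-error proxy and an illustrative keep fraction; published analyses typically rank modules by end-to-end perplexity on a calibration set instead.

```python
import numpy as np

def rank_fp4_sensitivity(named_weights, fake_quantize, keep_fraction=0.1):
    """Score each module by relative FP4 quantization error and return the names
    that should stay in high precision (the worst `keep_fraction` of modules)."""
    scores = {}
    for name, w in named_weights.items():
        w = np.asarray(w, dtype=np.float64).ravel()
        err = w - fake_quantize(w)                 # any quantize-dequantize routine
        scores[name] = float(np.linalg.norm(err) / (np.linalg.norm(w) + 1e-12))
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(round(len(ranked) * keep_fraction)))
    return set(ranked[:n_keep]), scores
```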
5. Hardware and Kernel Implementations
LLM-FP4 research is tightly coupled with hardware developments:
- General-Purpose AI Accelerators: Modern NVIDIA Blackwell (TCGen05.mma kernels), AMD CDNA4, Ascend NPU, and Intel Gaudi2 all support blockwise FP4 matmuls in hardware. Typical block sizes: 16 (Blackwell/NVFP4), 32 (MXFP4/CDNA4) (Castro et al., 20 May 2025, Chmiel et al., 25 May 2025, Taghian et al., 9 Apr 2026).
- Kernel Co-design: CUTLASS-based fused epilogues, Triton-fused residual patching kernels, and native post-hoc scale alignment (Quartet II) underpin maximum bandwidth use and minimize error (Panferov et al., 30 Jan 2026, Egiazarian et al., 27 Sep 2025, Dong et al., 2 Feb 2026).
- Mobile Hardware: Bitwise-only FP4 dequantization in M0E4 enables efficient OpenCL kernels for Snapdragon/Mali, without integer-float conversions.
- Overhead/Speedups: FP4 training and inference kernels report up to 4.2× end-to-end acceleration vs. BF16, 2.4× vs. FP8, and up to 6× per-layer speedups with MXFP4 kernels (Castro et al., 20 May 2025, Panferov et al., 30 Jan 2026, Egiazarian et al., 27 Sep 2025).
6. Method Selection, Deployment Best Practices, and Limitations
Practitioners are advised to:
- Prefer NVFP4 over MXFP4 when hardware support and cost permit: its finer blocks and FP8 scales yield lower quantization error at the expense of higher per-block scale overhead.
- Keep at least up/down-projection layers and, in smaller models, early and late blocks in FP16 for maximal accuracy (Cim et al., 5 Mar 2026).
- Use block- and channel-wise quantization and rotation, per-channel or per-block exponent bias, and, if possible, mean-removal or hot-channel compensation.
- Monitor per-layer and per-block quantization sensitivity when migrating models to FP4, especially when moving between hardware platforms or model sizes.
- On Ascend NPUs, the HiFloat4 format with three-level scaling and randomized-Hadamard-transform (RHT) stabilization is empirically superior for dense and MoE models, keeping relative error within 1% of full precision (Taghian et al., 9 Apr 2026).
- PTQ is practically efficient when using MR-GPTQ, DuQuant++, or DialectFP4, yielding near-FP8 or INT4 accuracy in W4A4 pipelines (Egiazarian et al., 27 Sep 2025, Lin et al., 20 Apr 2026, Jang et al., 2 Jan 2025).
Limitations include format-specific saturation and error amplification (MXFP4), the inefficacy of rotations for small-block NVFP4, and remaining gaps for ultra-large and sparsely activated architectures (Mixture-of-Experts). Ongoing work explores dynamic block sizes, hybrid formats, and in-core implementation of outlier-patching strategies.
7. Future Directions and Open Research Challenges
Active research in LLM-FP4 pursues:
- More general block formatbooks (DialectFP4+) for adaptivity to heterogeneous data distributions (Jang et al., 2 Jan 2025).
- End-to-end co-design of quantization, compensation, optimizer, learning-rate schedule, and hardware kernel (Dong et al., 2 Feb 2026, Panferov et al., 30 Jan 2026).
- Compositional strategies mixing MXFP4/NVFP4/HiFloat4 within or across layers, guided by error metrics or learned importance allocations (Liu et al., 4 Aug 2025).
- Extension to extreme-scale and long-context LLM pretraining runs (multi-trillion tokens) (NVIDIA et al., 29 Sep 2025).
- Tighter coupling with hardware design to accommodate bespoke rotation, residual recovery, and hybrid high/low-precision flows (Taghian et al., 9 Apr 2026, Egiazarian et al., 27 Sep 2025).
- Further theoretical work on quantization-induced anisotropy, block outlier statistics, and error localization—potentially refining current mean-removal, SVD, and patch-based error controls (Cao et al., 11 Mar 2026).
In summary, LLM-FP4 methodologies, grounded in blockwise, rotation/compensation-enhanced, and hardware-composable FP4 schemes, now enable both efficient deployment and accurate pretraining of state-of-the-art LLMs far beyond traditional INT/PTQ pipelines, with continuing research pushing the frontiers of algorithm-hardware co-design and statistical error control (Liu et al., 2023, Castro et al., 20 May 2025, Panferov et al., 30 Jan 2026, Chmiel et al., 25 May 2025, Egiazarian et al., 27 Sep 2025, Lin et al., 20 Apr 2026, Jang et al., 2 Jan 2025, NVIDIA et al., 29 Sep 2025, Cao et al., 11 Mar 2026, Taghian et al., 9 Apr 2026).