
LLM-FP4 Quantization Methods

Updated 25 April 2026
  • LLM-FP4 is a 4-bit quantization approach for LLMs that enables efficient memory usage and throughput while maintaining competitive accuracy.
  • It employs advanced post-training and fully quantized training techniques, including blockwise scaling and error compensation, to optimize model performance.
  • Multiple FP4 formats (e.g., MXFP4, NVFP4) align with diverse hardware accelerators, achieving significant speedups and reduced quantization error over traditional methods.

LLM-FP4 refers to a family of quantization and training strategies for LLMs that rely on 4-bit floating-point (FP4) representations. Building on recent hardware support in dedicated AI accelerators, LLM-FP4 enables aggressive memory and throughput optimizations for both training and inference without the prohibitive accuracy loss seen in earlier ultra-low-precision approaches. LLM-FP4 methods span post-training quantization, fully quantized training, blockwise and mixed-format variants, and specialized error-compensation frameworks. Research in this area establishes FP4 as a practical drop-in replacement for traditional INT4 and INT8 quantization and, under carefully devised recipes, a competitive alternative to FP8 or BF16 for pretraining foundation models at scale.

1. FP4 Numerical Formats and Microscaling Strategies

Multiple FP4 numeric layouts have been adopted for LLM applications, each tailored for hardware compatibility and error characteristics:

  • E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit): The most widely used layout; it represents values $x = (-1)^s \times 2^{e-1} \times (1 + \tfrac{m}{2})$, giving the quantization grid $\pm\{0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0\}$ (plus zero).
  • Blockwise/Grouped Scaling: Nearly all platforms use microscaling, quantizing each block (usually 16 or 32 elements) with a shared scale $s$, i.e. $\hat{x}_i = s \cdot Q_{\mathrm{FP4}}(x_i / s)$ for every element $x_i$ in the block.

Most approaches include additional global or tensor-level scaling (e.g., global FP32 or E6M2 on NPU/Ascend platforms) to extend dynamic range.
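The microscaling scheme above can be made concrete with a minimal NumPy sketch. This is an illustration only: the function names are invented here, and the power-of-two shared scale follows the MXFP4-style description above rather than any specific paper's implementation.

```python
import numpy as np

# Positive E2M1 magnitudes; the full grid is symmetric and includes zero.
E2M1_POS = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.concatenate([-E2M1_POS[::-1], [0.0], E2M1_POS])

def quantize_block_mxfp4(x):
    """Quantize one block with a shared power-of-two scale (MXFP4-style)."""
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x), 1.0
    # Power-of-two (E8M0-style) scale mapping amax onto the top code 6.0.
    s = 2.0 ** np.ceil(np.log2(amax / 6.0))
    # Round each scaled element to the nearest representable E2M1 value.
    idx = np.abs((x / s)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return E2M1_GRID[idx], s

def fake_quant(x, block=32):
    """Blockwise fake-quantization (quantize then dequantize) for error analysis."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        q, s = quantize_block_mxfp4(x[i:i + block])
        out[i:i + block] = q * s
    return out
```

Values that already lie on the scaled grid round-trip exactly; for generic Gaussian data the blockwise scale keeps the relative error at the few-percent level.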

2. Post-Training Quantization and Inference-Pipeline Variants

LLM-FP4 can be deployed via inference-only post-training quantization (PTQ) pipelines, achieving efficient W4A4 (4-bit weight, 4-bit activation) inference. Key contributions include:

  • FP4 PTQ with Exponent/Mantissa Search: "LLM-FP4" introduced a layerwise search for optimal exponent/mantissa/bias per layer using a calibration set and mean-squared error objective, outperforming previous integer or naive floating baselines (Liu et al., 2023). For channel-variant activation distributions, per-channel activation exponents are reparameterized as "pre-shifted exponent bias" absorbed into weight scaling, achieving superior accuracy and hardware efficiency.
  • Mixed-Format Blockwise Quantization: DialectFP4 structures a 16-dialect formatbook of E2M1-like codebooks, chosen per block via a fast two-stage selection to minimize wasted representation range and match block-level histograms, synergizing with integer-only hardware (Jang et al., 2 Jan 2025). MicroMix composes mixed-precision (FP4/FP6/FP8) channels within layers, balancing quantization error and efficiency (Liu et al., 4 Aug 2025).
  • Specialized Rotation and Compensation: DuQuant++ and MR-GPTQ adapt rotation-based error spreading to match block boundaries of MXFP4 and NVFP4, either via outlier-aware (DuQuant++) or blockwise Hadamard transforms (MR-GPTQ), closing the accuracy gap to FP16 (Lin et al., 20 Apr 2026, Egiazarian et al., 27 Sep 2025). Hot-channel patching (HCP) in CHON reinjects quantization residuals for persistent outlier channels in blockwise NVFP4 (Dong et al., 2 Feb 2026).
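The layerwise exponent/mantissa search idea can be sketched as follows. This is a deliberately simplified illustration: it enumerates symmetric, normals-only candidate grids over a small integer bias range and picks the one with minimum MSE on calibration data; the names, ranges, and grid construction are illustrative, not the exact LLM-FP4 recipe.

```python
import numpy as np

def fp_grid(e_bits, m_bits, bias):
    """Symmetric grid of a simplified sign/exponent/mantissa format (normals only)."""
    vals = []
    for e in range(2 ** e_bits):
        for m in range(2 ** m_bits):
            vals.append(2.0 ** (e - bias) * (1.0 + m / 2 ** m_bits))
    g = np.unique(vals)
    return np.concatenate([-g[::-1], [0.0], g])

def nearest(grid, x):
    """Round each element of x to its nearest grid point."""
    idx = np.abs(x[:, None] - grid[None, :]).argmin(axis=1)
    return grid[idx]

def search_format(x_calib, total_bits=4):
    """Pick (e_bits, m_bits, bias) minimizing MSE on a calibration tensor."""
    best, best_err = None, np.inf
    for e_bits in range(1, total_bits):          # 1 sign bit; rest split e/m
        m_bits = total_bits - 1 - e_bits
        for bias in range(-2, 5):                # illustrative bias range
            grid = fp_grid(e_bits, m_bits, bias)
            err = np.mean((x_calib - nearest(grid, x_calib)) ** 2)
            if err < best_err:
                best, best_err = (e_bits, m_bits, bias), err
    return best, best_err
```

In the actual method the bias is further reparameterized per channel ("pre-shifted exponent bias") and absorbed into the weight scales, which this sketch omits.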

Table: Principal FP4 Formats for LLM Inference

Format Name | Block Size | Scale Format    | Error Spreading | Hardware
----------- | ---------- | --------------- | --------------- | ----------------
MXFP4       | 32         | E8M0 (pow-of-2) | Hadamard, GPTQ  | NVIDIA, AMD
NVFP4       | 16         | E4M3 (FP8)      | Hadamard        | NVIDIA Blackwell
DialectFP4  | 32         | per-block       | 16 codebooks    | Integer hardware
M0E4/E0M4   | 128        | groupwise       | None            | Mobile (OpenCL)

3. Fully Quantized Training and Stability Mechanisms

LLM-FP4 pretraining recipes combine blockwise quantization, unbiased or low-variance rounding, and error compensation to realize stable, large-scale LLM training entirely in FP4:

  • Quartet (MXFP4/Blackwell): All linear GEMMs are performed in FP4. On the forward pass, blockwise Hadamard transforms and MSE-optimal quantization (QuEST) minimize outlier damage; the backward pass uses stochastic rounding and randomized Hadamard transforms to decorrelate quantization error. Quartet demonstrates Chinchilla-style efficiency scaling, arguing that once hardware throughput gains are accounted for, FP4 training is compute-optimal in the relevant regimes for Llama-type and other transformers (Castro et al., 20 May 2025).
  • Quartet II/MS-EDEN (NVFP4): Stochastic rounding is replaced with blockwise Hadamard rotation, RTN quantization, and a scale-correction factor (MS-EDEN), which achieves unbiased gradient estimation at half the error variance of standard SR. This closes the loss gap to BF16 and delivers up to 4.2× end-to-end speedup on Blackwell (Panferov et al., 30 Jan 2026).
  • Mean Bias Removal: Recent work demonstrates that the principal source of catastrophic quantization instability is systematic, rank-one mean bias in layer activations. By performing explicit mean removal before quantization, stability is restored to nearly BF16 levels with minimal computation, outperforming more complex SVD-based spectral regularization (Cao et al., 11 Mar 2026).
  • Vector-wise and Mixed-Precision Training: Earlier FP4 training frameworks (e.g. DGE/OCC) target weight update bias (differentiable quantization estimator) and outlier compensation via residual correction, with mixed-precision for non-GEMM operators and Adam state (Wang et al., 28 Jan 2025).
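Two of the stabilizers above, unbiased stochastic rounding and high-precision mean removal, can be sketched together in NumPy. This is a simplified illustration with a single global scale per tensor (real recipes use blockwise scales); function names are illustrative, not any paper's exact implementation.

```python
import numpy as np

E2M1_POS = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1_POS[::-1], [0.0], E2M1_POS])

def stochastic_round(x, grid, rng):
    """Round each element to a neighboring grid point with probability
    proportional to proximity, so that E[q] == x (unbiased)."""
    x = np.clip(x, grid[0], grid[-1])
    hi_idx = np.clip(np.searchsorted(grid, x, side="left"), 1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_up = (x - lo) / (hi - lo)
    return np.where(rng.random(x.shape) < p_up, hi, lo)

def fake_quant_sr(X, rng):
    """Global-scale fake quantization with stochastic rounding."""
    s = max(np.abs(X).max() / 6.0, 1e-12)
    return stochastic_round(X / s, GRID, rng) * s

def mean_removed_quant(X, rng):
    """Remove the per-token mean (a rank-one component) in high precision
    before FP4 quantization, then add it back afterwards."""
    mu = X.mean(axis=-1, keepdims=True)   # kept in high precision
    return fake_quant_sr(X - mu, rng) + mu
```

For activations with a large common-mode offset, removing the mean first shrinks the dynamic range the FP4 grid must cover, which is the mechanism behind the stability gains reported for mean-bias removal.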

Table: Empirical Effects of FP4 Training Strategies

Technique             | Loss gap vs. BF16 | Core Stabilizer               | HW Platform
--------------------- | ----------------- | ----------------------------- | ---------------
Quartet (MXFP4)       | <0.1 (7B Llama)   | Stochastic rounding, Hadamard | Blackwell
Quartet II (NVFP4)    | +1.44% (1.9B)     | MS-EDEN                       | Blackwell
DGE/OCC + Vec. Quant  | <1–2% (13B)       | Outlier clamp/compensation    | H100 (simulated)
Mean-Removal (Averis) | 0.03 NLL (0.6B)   | Source-level mean removal     | Qwen3/H100

4. Blockwise and Format-Specific Error Analysis

Several studies systematically dissect the failure modes and sensitivities of FP4 quantization schemes:

  • Component-wise and Blockwise Sensitivity: Empirical analysis demonstrates that MLP up/down projections are extremely sensitive to FP4 quantization, followed by gates, with attention projections being most robust. Early blocks (in smaller models) under MXFP4 can be as critical as final blocks; thus, selective high-precision fallback leads to minimal perplexity increase while maximizing efficiency (Cim et al., 5 Mar 2026).
  • Failure of Standard PTQ on Small Blocks: Outlier mitigation via Hadamard or global rotations is ineffectual in NVFP4 with small block sizes (G=16), neutralizing top-element error reduction. MXFP4's power-of-two scale quantization causes significant error amplification unless compensated by MR-GPTQ or similar blockwise error correction (Egiazarian et al., 27 Sep 2025).
  • Hot-Channel Dynamics: The evolution from transient to stable "hot" outlier channels motivates hardware-efficient online patching (HCP, as in CHON), which recovers second-order residuals for ~9% of channels with negligible overhead (Dong et al., 2 Feb 2026).
  • Mobile and Edge Cases: M0E4 (E0M4) format, as in Transformer-Lite, enables highly efficient groupwise FP4 quantization for on-device inference, with bitwise dequantization but without sub-block scaling or advanced error compensation, showing negligible mean-absolute-error loss versus INT4 PTQ (Li et al., 2024).
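The hot-channel patching idea above can be sketched as follows: quantize every channel in FP4, then re-inject the quantization residual for the highest-energy ~9% of channels in higher precision. The channel-selection heuristic (mean absolute magnitude) and all names here are illustrative, not the CHON implementation.

```python
import numpy as np

E2M1_POS = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1_POS[::-1], [0.0], E2M1_POS])

def fake_quant(X):
    """Per-channel fake quantization to the E2M1 grid with a floating scale."""
    s = np.maximum(np.abs(X).max(axis=0, keepdims=True) / 6.0, 1e-12)
    idx = np.abs((X / s)[..., None] - GRID).argmin(axis=-1)
    return GRID[idx] * s

def hot_channel_patch(X, frac=0.09):
    """Quantize all channels, then re-inject the residual of the hottest
    channels (largest mean magnitude) in higher precision."""
    Xq = fake_quant(X)
    energy = np.abs(X).mean(axis=0)
    k = max(1, int(frac * X.shape[1]))
    hot = np.argsort(energy)[-k:]              # indices of hot channels
    patched = Xq.copy()
    patched[:, hot] += (X - Xq)[:, hot]        # residual kept in FP16/FP32
    return patched, hot
```

Because the residual is only stored for a small fraction of channels, the memory and bandwidth overhead stays small while the error on persistent outlier channels is removed entirely.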

5. Hardware and Kernel Implementations

LLM-FP4 research is tightly coupled with hardware development. Native FP4 tensor-core support on NVIDIA Blackwell, power-of-two (E8M0) microscaling on NVIDIA and AMD parts, three-level scaling for HiFloat4 on Ascend NPUs, and bitwise-dequantization kernels for mobile OpenCL backends all shape which formats, block sizes, and compensation schemes are practical in deployment.

6. Method Selection, Deployment Best Practices, and Limitations

Practitioners are advised to:

  • Prefer NVFP4 over MXFP4 when hardware cost permits: it yields lower quantization error at the expense of larger per-block scale overhead.
  • Keep at least up/down-projection layers and, in smaller models, early and late blocks in FP16 for maximal accuracy (Cim et al., 5 Mar 2026).
  • Use block- and channel-wise quantization and rotation, per-channel or per-block exponent bias, and, if possible, mean-removal or hot-channel compensation.
  • Monitor per-layer and per-block quantization sensitivity when migrating models to FP4, especially when moving between hardware platforms or model sizes.
  • On Ascend NPUs, HiFloat4 format with three-level scaling and RHT stabilization is empirically superior for dense and MoE models, maintaining relative error within 1% of full precision (Taghian et al., 9 Apr 2026).
  • PTQ is practically efficient when using MR-GPTQ, DuQuant++, or DialectFP4, yielding near-FP8 or INT4 accuracy in W4A4 pipelines (Egiazarian et al., 27 Sep 2025, Lin et al., 20 Apr 2026, Jang et al., 2 Jan 2025).
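The sensitivity-monitoring recommendation above can be sketched as a simple audit: compute each layer's relative blockwise FP4 quantization error and flag layers exceeding an accuracy budget for FP16 fallback. The budget value and all names here are illustrative.

```python
import numpy as np

E2M1_POS = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1_POS[::-1], [0.0], E2M1_POS])

def blockwise_fp4_error(w, block=32):
    """Relative MSE of blockwise FP4 fake-quantization for one tensor."""
    w = np.asarray(w, dtype=np.float64).ravel()
    err = 0.0
    for i in range(0, w.size, block):
        b = w[i:i + block]
        s = max(np.abs(b).max() / 6.0, 1e-12)
        idx = np.abs((b / s)[:, None] - GRID[None, :]).argmin(axis=1)
        err += np.sum((b - GRID[idx] * s) ** 2)
    return err / max(np.sum(w ** 2), 1e-12)

def plan_fallback(layers, rel_mse_budget=0.01):
    """Flag layers whose relative FP4 error exceeds the budget for FP16 fallback."""
    report = {name: blockwise_fp4_error(w) for name, w in layers.items()}
    keep_fp16 = [n for n, e in report.items() if e > rel_mse_budget]
    return report, keep_fp16
```

Running such an audit before and after migrating between formats (e.g. MXFP4 to NVFP4) or model sizes gives a cheap early warning of which projections need high-precision fallback.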

Limitations include format-specific saturation/amplification (MXFP4), inefficacy of rotation for small block NVFP4, and remaining gaps in ultra-large and sparsely-activated architectures (Mixture-of-Experts). Ongoing work explores dynamic block sizes, hybrid formats, and in-core implementation of outlier-patching strategies.

7. Future Directions and Open Research Challenges

Active research in LLM-FP4 pursues dynamic block sizes, hybrid and mixed formats, in-core outlier-patching, and better recipes for ultra-large and sparsely activated (Mixture-of-Experts) architectures.

In summary, LLM-FP4 methodologies, grounded in blockwise, rotation/compensation-enhanced, and hardware-composable FP4 schemes, now enable both efficient deployment and accurate pretraining of state-of-the-art LLMs far beyond traditional INT/PTQ pipelines, with continuing research pushing the frontiers of algorithm-hardware co-design and statistical error control (Liu et al., 2023, Castro et al., 20 May 2025, Panferov et al., 30 Jan 2026, Chmiel et al., 25 May 2025, Egiazarian et al., 27 Sep 2025, Lin et al., 20 Apr 2026, Jang et al., 2 Jan 2025, NVIDIA et al., 29 Sep 2025, Cao et al., 11 Mar 2026, Taghian et al., 9 Apr 2026).
