M0E4 FP4: 4-bit Quantization for Neural Models
- The M0E4 FP4 quantization method is a 4-bit floating-point technique that uses specialized formats (E2M1, E4M0, E0M4) and blockwise scaling to represent neural network data efficiently.
- It employs tailored dequantization strategies and outlier management to ensure minimal accuracy loss while maximizing compute throughput on various hardware platforms.
- Mixed-precision quantization and adaptive block scaling enable near-baseline performance in large models, offering significant speed and memory gains.
The M0E4 FP4 quantization method refers to a class of 4-bit floating-point representations and associated quantization algorithms that target highly efficient training and inference for LLMs, diffusion transformers, and related neural architectures. This approach relies on an FP4 (4-bit floating-point) data type with zero or minimal mantissa, block-wise scaling, and hardware-tailored dequantization strategies to maximize compute throughput while keeping accuracy loss minimal. Several papers detail the construction, implementation, and performance of M0E4 variants, highlighting their utility for mobile deployment (Li et al., 2024), quantized training (Wang et al., 28 Jan 2025), and post-training quantization (Shao et al., 6 Nov 2025; Zhang et al., 2023).
1. FP4 Number Format: M0E4 Variants
The core M0E4 FP4 format adopts a 4-bit floating-point representation, commonly realized as E2M1 (1 sign, 2 exponent, 1 mantissa bit), E4M0 (1 sign, 4 exponent, no mantissa), or E0M4 (no exponent, 4 mantissa bits, sign handled per group). Typical M0E4 layouts include:
- E2M1 ("standard"): 1 sign, 2 exponent, 1 mantissa bit; 16 code points representing the symmetric value set {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6} (Wang et al., 28 Jan 2025; Zhang et al., 2023).
- E4M0 ("max exponent"): 1 sign, 4 exponent, 0 mantissa; used in some mobile GPU implementations (Li et al., 2024).
- E0M4 ("max mantissa"): 0 exponent, 4 mantissa; interpreted with groupwise scale and sign (Li et al., 2024).
All M0E4 variants prioritize coarse dynamic range at lowest bit cost, often with special handling for zero and outlier values.
| Format | Bit layout | Value range | Sign treatment |
|---|---|---|---|
| E2M1 | 1S / 2E / 1M | 0, ±0.5, ±1, ..., ±6 | Per-value |
| E4M0 | 1S / 4E / 0M | Wide, stepwise | Per-value or groupwise |
| E0M4 | 0S / 0E / 4M | Narrow, dense near 0 | Fixed per group |
The real value represented by an M0E4 FP4 code follows the standard floating-point decomposition $x = (-1)^s \cdot 2^{e-b} \cdot (1 + m/2^{M})$, with exponent field $e$, exponent bias $b$, $M$ mantissa bits $m$, and sign bit $s$; codes with $e = 0$ decode as subnormals $(-1)^s \cdot 2^{1-b} \cdot m/2^{M}$ (Liu et al., 2023).
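As a concrete illustration, the E2M1 code points can be decoded in a few lines of Python (a minimal sketch; the exponent bias of 1 and the subnormal handling follow the standard E2M1 convention):

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (1 sign / 2 exponent / 1 mantissa, exponent bias 1)."""
    assert 0 <= code <= 0xF
    sign = -1.0 if code & 0b1000 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                       # subnormal codes: 0 and 0.5
        mag = man * 0.5
    else:                              # normal codes: (1 + m/2) * 2^(e-1)
        mag = (1.0 + man / 2.0) * 2.0 ** (exp - 1)
    return sign * mag
```

Enumerating the positive codes 0-7 reproduces exactly the E2M1 value grid 0, 0.5, 1, 1.5, 2, 3, 4, 6.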
2. Blockwise Quantization and Scaling
Group or blockwise scaling is essential for practical M0E4 deployment. Tensors (weights, activations) are divided into fixed-length blocks (typically 16, 32, or 128 elements). Each block shares a scale parameter (commonly FP8 E4M3) and, in some variants, a shared sign and exponent header (Shao et al., 6 Nov 2025, Li et al., 2024).
- Block scaling process (Li et al., 2024):
- Partition tensor into blocks (e.g., $128$ elements).
- Determine groupwise statistics (e.g., the block's maximum and minimum values).
- Set the scale and bias so that all block values fall within the representable quantization range.
- Quantize each element's mantissa via round-to-nearest, with an optional $1$-bit round-up.
- Store block header (scale, bias, sign) and packed FP4 values.
Hardware interpretation: dequantization reduces to left-shifting the mantissa bits and bitwise-ORing them with the block header, followed by a reinterpret-cast to FP16; host-device format conversions are eliminated (Li et al., 2024).
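The shift-and-OR dequantization path can be sketched in NumPy (an illustrative reconstruction, not the vendor kernel; the FP16 bit layout and the 0x3800 pattern for 0.5 are standard IEEE half-precision facts):

```python
import numpy as np

def dequant_e2m1_bitops(codes, scale=1.0):
    """Dequantize E2M1 codes to FP16 using only shifts, ORs, and a
    reinterpret-cast (sketch of the fused-kernel trick)."""
    codes = np.asarray(codes, dtype=np.uint16)
    sign = (codes >> 3) & 1
    exp = (codes >> 1) & 0b11
    man = codes & 0b1
    # Normal codes: FP16 exponent field = e + 14; the single FP4 mantissa bit
    # lands in the top FP16 mantissa position (bit 9).
    normal = (sign << 15) | ((exp + 14) << 10) | (man << 9)
    # exp == 0 encodes 0 (m=0) or 0.5 (m=1); 0.5 is FP16 bit pattern 0x3800.
    bits = np.where(exp == 0, (sign << 15) | man * 0x3800, normal).astype(np.uint16)
    return bits.view(np.float16).astype(np.float32) * scale
```

The block scale can be folded into `scale` (or the block header), so the inner loop never leaves integer bit manipulation.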
- Power-of-two (PoT) block-scaling: each block's scale exponent is chosen as a power of two, typically $\lfloor \log_2 \max_i |x_i| \rfloor - e_{\max}$ where $e_{\max}$ is the element format's maximum exponent, so that the block's values pack precisely onto the FP4 grid (Shao et al., 6 Nov 2025).
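A PoT block-scaling step in this style can be sketched as follows (a simplified sketch, assuming the MX-style convention of a shared exponent equal to floor(log2 max|x|) minus the element format's maximum exponent, which is 2 for E2M1; not any paper's exact code):

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 values

def pot_block_scale(block, emax_elem=2):
    """Power-of-two shared scale: 2^(floor(log2 max|x|) - emax_elem)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** (math.floor(math.log2(amax)) - emax_elem)

def quantize_block(block):
    """Quantize one block to signed E2M1 values plus a shared PoT scale."""
    s = pot_block_scale(block)
    q = [math.copysign(min(E2M1_GRID, key=lambda g: abs(g - abs(v) / s)), v)
         for v in block]
    return s, q  # dequantized value = s * q
```

Because the scale is a power of two, dequantization is an exponent adjustment rather than a multiply.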
3. Quantization Algorithms and Outlier Management
Quantization maps each value to the closest FP4 code point via lookup or analytic rounding; sign bits, a scale factor, and, where needed, a bias or zero-point compensate for tensor-specific distributions.
- Differentiable gradient estimation (DGE, for training): quantized training employs a smooth differentiable surrogate on each quantization bin, whose gently sloped derivative replaces the straight-through estimator (STE) and propagates meaningful gradients (Wang et al., 28 Jan 2025).
- Outlier clamp and compensation (OCC): Activations are clamped to a quantile threshold (e.g., $99$th percentile); sparse high-magnitude residuals are computed and processed separately in high precision (Wang et al., 28 Jan 2025).
- Per-channel scaling for activations: To avoid precision loss, activation scaling is often performed per channel or token rather than globally (Liu et al., 2023, Wang et al., 28 Jan 2025, Chen et al., 19 Mar 2025).
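The outlier clamp-and-compensation step above can be sketched in NumPy (an illustrative sketch: the quantile threshold and sparse-residual bookkeeping follow the description in the text, not a specific released implementation):

```python
import numpy as np

def clamp_and_compensate(x: np.ndarray, q: float = 0.99):
    """Clamp activations at a quantile threshold; return the clamped tensor
    (to be sent down the FP4 path) plus a sparse high-precision residual
    that is added back after dequantization."""
    t = np.quantile(np.abs(x), q)        # magnitude threshold at quantile q
    clamped = np.clip(x, -t, t)
    residual = x - clamped               # nonzero only at clipped outliers
    mask = residual != 0
    sparse = (np.nonzero(mask)[0], residual[mask])  # (indices, values)
    return clamped, sparse
```

At inference, the FP4 output plus a scatter-add of the sparse residual reconstructs the original activations up to quantization error.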
4. Mixed-Precision, Format Selection, and Layerwise Adaptation
No single FP4 layout universally dominates; mixed-precision and per-layer or per-block format selection is a central theme.
- Mixture-of-Formats Quantization (MoFQ): Each layer in a model is quantized using the FP4 or INT4 format that minimally degrades output (as judged by tensor- or model-level MSE) (Zhang et al., 2023).
- Hybrid (E1M2, E2M1, E0M4, etc.) layer assignments: Layerwise grid search or Hessian-weighted optimization arrives at best allocations, often with higher exponent (E3M0, E4M0) for early linear layers and more mantissa for deep/final layers (Liu et al., 2024, Chen et al., 19 Mar 2025).
- Blockwise orthonormal rotation (BRQ): for PoT-scaled formats (MXFP4, M0E4), block-local orthonormal transforms (e.g., Hadamard/Kronecker) redistribute outlier energy within each quant block, minimizing the impact on blockwise PoT scaling and preserving quantization fidelity (Shao et al., 6 Nov 2025).
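A block-local orthonormal rotation can be sketched as follows (illustrative; the published transforms may differ in construction and how they are fused into the kernels). A single outlier of magnitude 16 in a 16-element block becomes 16 entries of magnitude 4, which is far friendlier to a shared PoT scale:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix for n a power of two (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_blocks(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Apply an orthonormal rotation independently inside each quant block,
    spreading outlier energy across the block before FP4 quantization."""
    H = hadamard(block)
    return (x.reshape(-1, block) @ H).reshape(x.shape)
```

Since the normalized Sylvester matrix is symmetric and orthonormal, applying `rotate_blocks` twice recovers the input, and the rotation can be folded into the weights so no extra matmul is paid at inference.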
5. Implementation and Hardware Efficiency
M0E4 FP4 quantization routinely achieves notable speed and memory gains on modern hardware:
- Mobile GPUs: M0E4 dequantization involves only two bitwise operations fused into the matmul kernel. Latency for mixed FP16/FP4 compute is lower on ARM Mali and matches INT4 on Snapdragon Adreno, while quantization error (MAE) is typically lower than INT4's (Li et al., 2024).
- NVIDIA GPUs: NVFP4 and MXFP4 blockwise scaling use FP8 E4M3 for scales. Quantization and dequantization leverage hardware cvt instructions for single-cycle conversion (Cook et al., 1 Dec 2025, Shao et al., 6 Nov 2025).
| Hardware | FP4 Matmul Speedup | Dequantization Overhead |
|---|---|---|
| Mali G720 | 1.5× | Zero |
| Adreno 750 | 1.0× | Zero |
| NVIDIA Blackwell | (inference) | Register-only |
6. Empirical Performance and Application Impact
LLMs trained or quantized with M0E4 FP4 recover near-baseline accuracy while drastically reducing memory and compute cost.
- Training: FP4 schemes with vectorwise scaling, DGE, and OCC reach training loss and downstream accuracy nearly matching BF16 and FP8 for multi-billion-parameter LLMs, with FP4 average downstream accuracy closely tracking BF16 (Wang et al., 28 Jan 2025).
- Post-Training Quantization: LLMs such as LLaMA-13B achieve average zero-shot reasoning accuracy at 4/4/4 bit-width only $5.8$ points below full precision, well above the prior PTQ state of the art (Liu et al., 2023). MoFQ schemes recover $96\%$ or more of FP16 accuracy (Zhang et al., 2023).
- Diffusion Transformers: W4A4 M0E4 quantization incurs only minor sFID degradation on ImageNet, whereas INT8 baselines exceed $25$-$200$ sFID (Liu et al., 2024). Mixup-Sign FP4 with timestep-aware LoRA achieves FID/IS nearly matching FP32 across multiple image-synthesis tasks (Zhao et al., 27 May 2025).
- Adaptive block-level scaling (Four Over Six): dynamically choosing, per block, whether the scale anchors the block maximum to the FP4 code $4$ or $6$ recovers a substantial share of the quantization loss on outlier-heavy blocks, consistently boosting accuracy to near-BF16 (Cook et al., 1 Dec 2025).
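The adaptive 4-versus-6 choice can be sketched as follows (a simplified reading of the idea, assuming round-to-nearest on the E2M1 grid; function names and the MSE criterion are illustrative, not the paper's exact procedure):

```python
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 code values

def quant_error(block, target):
    """Squared error when the block max is scaled onto E2M1 code `target` (4 or 6)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, 1.0
    s = amax / target
    err = sum((abs(v) - s * min(E2M1, key=lambda g: abs(g - abs(v) / s))) ** 2
              for v in block)
    return err, s

def choose_scale(block):
    """Adaptive block scaling: keep whichever anchor (6 or 4) gives lower MSE."""
    (e6, s6), (e4, s4) = quant_error(block, 6.0), quant_error(block, 4.0)
    return s6 if e6 <= e4 else s4
```

Anchoring to $4$ spends more of the grid on the block's smaller values, which can pay off when a single outlier would otherwise dominate the shared scale.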
7. Limitations, Controversies, and Future Directions
Several caveats and research directions arise in the context of M0E4 FP4 quantization:
- Dynamic range limitation: With $4$-bit formats, large outliers may be hard-clipped, risking activation collapse unless residual compensation is in place (Wang et al., 28 Jan 2025, Liu et al., 2023).
- Block scaling vs. global rotation incompatibility: Global orthonormal transforms, effective for INT4, degrade blockwise PoT scaling in MXFP4/M0E4 FP4 contexts (Shao et al., 6 Nov 2025). Local block rotation resolves this but introduces complexity.
- Format selection overhead: MoFQ layerwise format selection approximately doubles PTQ calibration time (but no inference overhead) (Zhang et al., 2023).
- Precision fallback: In training, monitoring and switching to higher-precision updates when noise dominates is necessary for convergence (Chmiel et al., 25 May 2025).
This suggests ongoing research will refine hybrid quantization schedules, residual outlier handling, and hardware abstraction layers for unified FP4 deployment.
Key References:
- "Optimizing LLM Training Using FP4 Quantization" (Wang et al., 28 Jan 2025)
- "LLM-FP4: 4-Bit Floating-Point Quantized Transformers" (Liu et al., 2023)
- "Transformer-Lite: High-efficiency Deployment of LLMs on Mobile Phone GPUs" (Li et al., 2024)
- "Integer or Floating Point? New Outlooks for Low-Bit Quantization on LLMs" (Zhang et al., 2023)
- "Block Rotation is All You Need for MXFP4 Quantization" (Shao et al., 6 Nov 2025)
- "FP4 All the Way: Fully Quantized Training of LLMs" (Chmiel et al., 25 May 2025)
- "Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling" (Cook et al., 1 Dec 2025)
- "Pioneering 4-Bit FP Quantization for Diffusion Models: Mixup-Sign Quantization and Timestep-Aware Fine-Tuning" (Zhao et al., 27 May 2025)