
M0E4 FP4: 4-bit Quantization for Neural Models

Updated 30 January 2026
  • The M0E4 FP4 quantization method is a 4-bit floating-point technique that uses specialized formats (E2M1, E4M0, E0M4) and blockwise scaling to represent neural-network data efficiently.
  • It employs tailored dequantization strategies and outlier management to ensure minimal accuracy loss while maximizing compute throughput on various hardware platforms.
  • Mixed-precision quantization and adaptive block scaling enable near-baseline performance in large models, offering significant speed and memory gains.

The M0E4 FP4 quantization method refers to a class of 4-bit floating-point representations and associated quantization algorithms that target highly efficient training and inference for LLMs, diffusion transformers, and related neural architectures. The approach relies on an FP4 (4-bit floating-point) data type with zero or minimal mantissa, block-wise scaling, and hardware-tailored dequantization strategies to maximize compute throughput while keeping accuracy loss minimal. Several papers detail the construction, implementation, and performance of M0E4 variants, highlighting their utility for mobile deployment (Li et al., 2024), quantized training (Wang et al., 28 Jan 2025), and post-training quantization (Shao et al., 6 Nov 2025, Zhang et al., 2023).

1. FP4 Number Format: M0E4 Variants

The core M0E4 FP4 format adopts a 4-bit floating-point representation, commonly realized as E2M1 (1 sign, 2 exponent, 1 mantissa), E4M0 (1 sign, 4 exponent, 0 mantissa), or E0M4 (0 exponent, 4 mantissa). Typical M0E4 layouts include:

  • E2M1 ("standard"): 1 sign, 2 exponent, 1 mantissa bit; 16 code points representing symmetric values in $\mathcal{S}_{\rm FP4} = \{-6, -4, -3, \dots, 0, \dots, 6\}$ (Wang et al., 28 Jan 2025, Zhang et al., 2023).
  • E4M0 ("max exponent"): 1 sign, 4 exponent, 0 mantissa; used in some mobile GPU implementations (Li et al., 2024).
  • E0M4 ("max mantissa"): 0 exponent, 4 mantissa; interpreted with groupwise scale and sign (Li et al., 2024).

All M0E4 variants prioritize coarse dynamic range at lowest bit cost, often with special handling for zero and outlier values.

| Format | Bits | Range | Sign treatment |
|--------|------|-------|----------------|
| E2M1 | 1S / 2E / 1M | ±0.5, ±1, ..., ±6 | Per-value |
| E4M0 | 1S / 4E / 0M | Wide, stepwise | Per-value or groupwise |
| E0M4 | 0E / 4M | Narrow, dense near 0 | Fixed per group |

The real value represented by M0E4 FP4 follows $X_{\rm FP4} = (-1)^s \cdot 2^{e-b} \cdot \left(1 + \sum_{i=1}^{m} d_i 2^{-i}\right)$, with $b$ an exponent bias, $d_i$ the mantissa bits, and $s$ the sign bit (Liu et al., 2023).
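The formula above can be checked by enumerating all 16 E2M1 code points. A minimal Python sketch, assuming the usual bias-1 convention with a subnormal row at $e = 0$ (the function name is illustrative):

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    s = (code >> 3) & 0x1          # sign bit
    e = (code >> 1) & 0x3          # exponent field
    m = code & 0x1                 # mantissa bit
    if e == 0:                     # subnormal row: magnitudes 0 and 0.5
        mag = m * 0.5
    else:                          # normal: (1 + m/2) * 2^(e - 1)
        mag = (1 + m / 2) * 2.0 ** (e - 1)
    return -mag if s else mag

# Enumerating all 16 codes recovers the symmetric set S_FP4
points = sorted({decode_e2m1(c) for c in range(16)})
# → [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Note that +0 and -0 collapse, leaving 15 distinct values among the 16 codes.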

2. Blockwise Quantization and Scaling

Group or blockwise scaling is essential for practical M0E4 deployment. Tensors (weights, activations) are divided into fixed-length blocks (typically 16, 32, or 128 elements). Each block shares a scale parameter (commonly FP8 E4M3) and, in some variants, a shared sign and exponent header (Shao et al., 6 Nov 2025, Li et al., 2024).

  • Block scaling process (Li et al., 2024):

    1. Partition tensor into blocks (e.g., $128$ elements).
    2. Determine groupwise statistics: $\max$ and $\min$.
    3. Set the scale and bias so that all block values fall within the representable quantization range.
    4. Quantize each element's mantissa via round-to-nearest, with possible rounding up by $1$ bit.
    5. Store the block header (scale, bias, sign) and the packed FP4 values.
  • Hardware interpretation: Dequantization involves only left-shifting the mantissa bits and a bitwise OR with the block header, followed by a reinterpret-cast to FP16; host-device conversions are eliminated (Li et al., 2024).

  • Power-of-two (PoT) block-scaling: Each block's exponent is chosen as $e_{\rm block} = \lfloor \log_2 (\max |W_{\rm block}|) \rfloor - e_{\max,\rm local}$, ensuring precise packing of statistical data (Shao et al., 6 Nov 2025).
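A generic absmax variant of the blockwise procedure can be sketched in NumPy. This is an illustration only: the papers' kernels pack bits and headers in hardware-specific ways, and the block size and grid here follow the E2M1 layout from Section 1.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # magnitude grid

def quantize_blockwise(x, block=128):
    """Absmax blockwise FP4 fake-quantization: each block's max |x|
    maps onto the largest code point, 6."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / E2M1[-1]
    scale = np.where(scale == 0, 1.0, scale)       # guard all-zero blocks
    scaled = xb / scale
    # round-to-nearest onto the symmetric E2M1 grid
    idx = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1)
    q = np.sign(scaled) * E2M1[idx]
    return q, scale                                # dequantize as q * scale

x = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, s = quantize_blockwise(x, block=128)
recon = (q * s).ravel()                            # reconstruction
```

Since the widest gap between adjacent grid magnitudes is 2 (between 4 and 6), the per-element reconstruction error is bounded by one scale unit per block.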

3. Quantization Algorithms and Outlier Management

Quantization follows a lookup or analytic mapping to the closest FP4 code point, $Q(x) = \arg\min_{s \in \mathcal{S}_{\rm FP4}} |x - s|$; sign bits, a scale factor, and, where needed, a bias and zero-point compensate for tensor-specific distributions.
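For a single (already scaled) value, the mapping reads directly in Python, with $\mathcal{S}_{\rm FP4}$ the symmetric E2M1 code set from Section 1:

```python
# Symmetric E2M1 code set S_FP4 from Section 1
S_FP4 = [-6, -4, -3, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3, 4, 6]

def quantize(x: float) -> float:
    """Q(x) = argmin over s in S_FP4 of |x - s|."""
    return min(S_FP4, key=lambda s: abs(x - s))

quantize(2.4)   # → 2
quantize(-5.1)  # → -6
```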

4. Mixed-Precision, Format Selection, and Layerwise Adaptation

No single FP4 layout universally dominates; mixed-precision and per-layer or per-block format selection is a central theme.

  • Mixture-of-Formats Quantization (MoFQ): Each layer in a model is quantized using the FP4 or INT4 format that minimally degrades output (as judged by tensor- or model-level MSE) (Zhang et al., 2023).
  • Hybrid (E1M2, E2M1, E0M4, etc.) layer assignments: Layerwise grid search or Hessian-weighted optimization arrives at the best $(e, m)$ allocations, often with higher exponent (E3M0, E4M0) for early linear layers and more mantissa bits for deep/final layers (Liu et al., 2024, Chen et al., 19 Mar 2025).
  • Blockwise orthonormal rotation (BRQ): For PoT-scaled formats (MXFP4, M0E4), block-local orthonormal transforms (e.g., Hadamard/Kronecker) redistribute outlier energy within each quantization block, minimizing the impact on the blockwise $\max$ and preserving quantization fidelity (Shao et al., 6 Nov 2025).
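A toy version of MoFQ-style per-layer format selection can be written as tensor-level MSE comparison between candidate grids. The grids and names below are illustrative, not the papers' exact configurations:

```python
import numpy as np

GRIDS = {  # magnitude grids, scale-normalized (illustrative)
    "fp4_e2m1": np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]),
    "int4": np.arange(8, dtype=np.float64),  # symmetric INT4 magnitudes 0..7
}

def fake_quant(w, grid):
    """Absmax-scale w onto the grid, round to nearest, scale back."""
    scale = np.abs(w).max() / grid[-1]
    if scale == 0:
        return w.copy()
    mag = grid[np.abs(np.abs(w / scale)[..., None] - grid).argmin(-1)]
    return np.sign(w) * mag * scale

def pick_format(w):
    """Choose the format with the lowest tensor-level MSE for this layer."""
    mse = {name: float(np.mean((fake_quant(w, g) - w) ** 2))
           for name, g in GRIDS.items()}
    return min(mse, key=mse.get)
```

Heavy-tailed weight distributions tend to favor the floating-point grid, since its code points cluster near zero where most mass lies.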

5. Implementation and Hardware Efficiency

M0E4 FP4 quantization routinely achieves notable speed and memory gains on modern hardware:

  • Mobile GPUs: M0E4 dequantization involves only two bitwise operations fused into the matmul kernel. Latency for FP16$\times$FP4 compute is up to $1.5\times$ lower on ARM Mali and matches INT4 on Snapdragon Adreno. Quantization error (MAE) is typically up to $4.5\%$ lower than INT4 (Li et al., 2024).
  • NVIDIA GPUs: NVFP4 and MXFP4 blockwise scaling use FP8 E4M3 for scales. Quantization and dequantization leverage hardware cvt instructions for single-cycle conversion (Cook et al., 1 Dec 2025, Shao et al., 6 Nov 2025).
| Hardware | FP4 matmul speedup | Dequantization overhead |
|----------|--------------------|-------------------------|
| Mali G720 | ~1.5× | Zero |
| Adreno 750 | 1.0× | Zero |
| NVIDIA Blackwell | $<2\%$ (inference) | Register-only |
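The shift-and-OR dequantization path described in Section 2 can be illustrated with NumPy's FP16 bit view. This is a hedged sketch of the groupwise E0M4 case: the block header is assumed to carry the shared FP16 sign/exponent bits, and each element stores only its 4 mantissa bits, shifted into the top of the FP16 mantissa field (real kernels fuse this into the matmul).

```python
import numpy as np

def dequant_e0m4_block(m4: np.ndarray, header: np.uint16) -> np.ndarray:
    """Shift-and-OR dequantization sketch: OR each element's 4 mantissa
    bits (shifted into the top of FP16's 10-bit mantissa) with the shared
    sign/exponent header, then reinterpret the bits as float16."""
    bits = (m4.astype(np.uint16) << 6) | header
    return bits.view(np.float16)

# Example: header encodes +1.0 (sign 0, FP16 exponent field 0b01111)
header = np.uint16(0b0_01111_0000000000)
vals = dequant_e0m4_block(np.arange(16, dtype=np.uint16), header)
# values run from 1.0 up to 1.9375 in steps of 1/16
```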

6. Empirical Performance and Application Impact

LLMs trained or quantized with M0E4 FP4 recover near-baseline accuracy while drastically reducing memory and compute cost.

  • Training: FP4 schemes with vectorwise scaling, DGE, and OCC reach training loss and downstream accuracy nearly matching BF16 and FP8 for $>13$B-parameter LLMs (Wang et al., 28 Jan 2025). Empirical results:
    • BF16: $54.44\%$ avg accuracy; FP4: $54.95\%$.
  • Post-Training Quantization: LLMs such as LLaMA-13B achieve $63.1\%$ average zero-shot reasoning accuracy (4/4/4 bit-width), only $5.8$ points below full precision and $+12.7$ above the prior PTQ SoTA (Liu et al., 2023). MoFQ schemes recover $96$-$98\%$ of FP16 accuracy (Zhang et al., 2023).
  • Diffusion Transformers: W4A4 M0E4 quantization leads to sFID degradation of $<0.12$ points on ImageNet $256\times 256$, with all INT8 baselines exceeding $25$-$200$ sFID (Liu et al., 2024). Mixup-Sign FP4 with timestep-aware LoRA achieves FID/IS nearly matching FP32 across multiple image-synthesis tasks (Zhao et al., 27 May 2025).
  • Adaptive block-level scaling (Four-Over-Six): Dynamically switching the blockwise scale target between $4$ and $6$ further recovers $20\%$ of the quantization loss at high values, consistently boosting accuracy to near-BF16 (Cook et al., 1 Dec 2025).
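The adaptive-scale idea can be sketched as a per-block choice between mapping the block's absmax to code point 4 or to code point 6, keeping whichever reconstructs better. This illustrates the mechanism only, not the paper's exact selection rule:

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quant_block(x, top):
    """Fake-quantize one block with its absmax mapped to code point `top`."""
    scale = np.abs(x).max() / top
    if scale == 0:
        return x.copy()
    mag = GRID[np.abs(np.abs(x / scale)[..., None] - GRID).argmin(-1)]
    return np.sign(x) * mag * scale

def four_over_six(x):
    """Per block, pick top = 4 or 6 by reconstruction MSE."""
    cands = [quant_block(x, t) for t in (4.0, 6.0)]
    return min(cands, key=lambda q: float(np.mean((q - x) ** 2)))
```

Mapping the absmax to 4 sacrifices the coarse 4-to-6 gap but shrinks the scale, giving finer effective steps for the bulk of the block's values.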

7. Limitations, Controversies, and Future Directions

Several caveats and research directions arise in the context of M0E4 FP4 quantization:

  • Dynamic range limitation: With $4$-bit formats, large outliers may be hard-clipped, risking activation collapse unless residual compensation is in place (Wang et al., 28 Jan 2025, Liu et al., 2023).
  • Block scaling vs. global rotation incompatibility: Global orthonormal transforms, effective for INT4, degrade blockwise PoT scaling in MXFP4/M0E4 FP4 contexts (Shao et al., 6 Nov 2025). Local block rotation resolves this but introduces complexity.
  • Format selection overhead: MoFQ layerwise format selection approximately doubles PTQ calibration time (but no inference overhead) (Zhang et al., 2023).
  • Precision fallback: In training, monitoring $\|\nabla L\|/(\sigma_q \sqrt{d})$ and switching to higher-precision updates when noise dominates is necessary for convergence (Chmiel et al., 25 May 2025).
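The fallback criterion above amounts to a simple signal-to-noise monitor. A hedged sketch, where the function name, the default `threshold`, and the noise model are illustrative assumptions:

```python
import math

def should_fall_back(grad_norm: float, sigma_q: float, dim: int,
                     threshold: float = 1.0) -> bool:
    """Compare the gradient signal ||grad L|| against the expected
    quantization-noise magnitude sigma_q * sqrt(d); when the ratio drops
    below `threshold`, noise dominates and the update should be taken in
    higher precision. Threshold value is an assumption, not from a paper."""
    return grad_norm / (sigma_q * math.sqrt(dim)) < threshold
```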

This suggests ongoing research will refine hybrid quantization schedules, residual outlier handling, and hardware abstraction layers for unified FP4 deployment.

