Mixed FP8 Quantization

Updated 17 December 2025
  • Mixed FP8 quantization is an adaptive method that strategically assigns FP8 formats (such as E4M3 and E5M2) based on tensor properties to balance precision and dynamic range.
  • It employs per-tensor, per-group, and blockwise strategies to minimize quantization error, reduce memory footprint, and boost training efficiency.
  • Empirical benchmarks demonstrate that mixed FP8 techniques achieve significant speed, memory, and energy improvements with near-lossless model accuracy on advanced hardware.

Mixed FP8 quantization refers to a class of quantization methodologies that judiciously combine distinct 8-bit floating-point (FP8) formats (usually E4M3 and E5M2, sometimes other variants) at fine or coarse granularity across the tensors of a deep neural network, or even within individual blocks of a tensor, with the objective of maximizing efficiency (memory, speed, energy) while nearly preserving full-precision model fidelity. Unlike uniform-precision quantization, where a single number format (e.g., INT8 or FP8-E4M3) is used globally, mixed FP8 quantization assigns formats adaptively, often guided by data-driven or analytic criteria (per-layer, per-group, or per-channel), and exploits the distinct trade-off between dynamic range and precision that each format offers. This entry surveys the mathematical foundations, canonical approaches, software and hardware implications, empirical results, and practical deployment considerations of mixed FP8 quantization.

1. FP8 Numeric Formats and Representation

The mixed FP8 quantization paradigm relies primarily on the two Open Compute Project (OCP) standard FP8 encodings, E4M3 and E5M2, though some frameworks employ additional variants such as E3M4 or E2M5. The FP8 binary layout consists of a sign bit, e exponent bits, and m mantissa bits, where dynamic range and precision vary inversely.

| Format | Exponent Bits | Mantissa Bits | Exponent Bias | Normalized Range | Machine Epsilon | Intended Use |
|--------|---------------|---------------|---------------|------------------|-----------------|--------------|
| E4M3 | 4 | 3 | 7 | ~[2^{-6}, 448] | 2^{-3} | Weights, optimizer 1st moment |
| E5M2 | 5 | 2 | 15 | ~[2^{-14}, 5.7×10^4] | 2^{-2} | Activations, optimizer 2nd moment |
| E3M4 | 3 | 4 | 3 | ~[2^{-2}, 17] | 2^{-4} | Vision models |

E4M3 provides finer quantization near zero and is robust for tensors with moderate dynamic range; E5M2 sacrifices mantissa accuracy for extended coverage of outliers, vital for quantities such as second-order optimizer states or activation spikes (Xi et al., 25 Oct 2024, Fishman et al., 19 Sep 2024, Wang et al., 26 Sep 2025, Shen et al., 2023, Peng et al., 2023).
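
The headline numbers in the table follow directly from the bit layout. The following minimal sketch (not taken from any of the cited frameworks) derives bias, normalized range, and machine epsilon from an (e, m) split; it distinguishes the IEEE-like convention used by E5M2 from the OCP E4M3 convention, which reserves only a single NaN code and therefore reaches 448 rather than the IEEE-style 240.

```python
# Minimal sketch (not from any cited framework): derive an FP8 format's headline
# numbers from its bit layout. E5M2 follows IEEE-754 conventions (the all-ones
# exponent is reserved for Inf/NaN); OCP E4M3 reserves only a single NaN code,
# which is why its maximum is 448 rather than the IEEE-style 240.

def fp8_params(exp_bits: int, man_bits: int, ieee_like: bool) -> dict:
    bias = 2 ** (exp_bits - 1) - 1
    # Largest usable biased exponent: the all-ones code is lost under IEEE rules.
    max_biased_exp = (2 ** exp_bits - 2) if ieee_like else (2 ** exp_bits - 1)
    # Largest mantissa pattern: OCP E4M3 gives up its last code for NaN.
    max_frac = (2 ** man_bits - 1) if ieee_like else (2 ** man_bits - 2)
    return {
        "bias": bias,
        "min_normal": 2.0 ** (1 - bias),
        "max_normal": 2.0 ** (max_biased_exp - bias) * (1 + max_frac / 2 ** man_bits),
        "machine_eps": 2.0 ** (-man_bits),
    }

print(fp8_params(4, 3, ieee_like=False))  # E4M3: bias 7, range ~[2**-6, 448], eps 0.125
print(fp8_params(5, 2, ieee_like=True))   # E5M2: bias 15, range ~[2**-14, 57344], eps 0.25
```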

2. Rationale and Statistical Motivation

Low-bit floating-point quantization is favored over integer quantization when activation or weight distributions exhibit high kurtosis, wide dynamic range, or frequent outliers. The dynamic nature of neural activations and the layer-to-layer variation in tensor statistics across large models, especially LLMs and vision transformers, render a uniform quantization approach suboptimal (Zhang et al., 2023, Shen et al., 2023). Mixed FP8 quantization seeks to:

  • Exploit complementary strengths: Assign E4M3 where higher resolution is critical, E5M2 (or further variants) where dynamic range is limiting, and in some frameworks, integrate INT8 or FP16 where needed.
  • Minimize quantization-induced loss: Adaptive format selection can empirically halve the mean-squared quantization error compared to uniform FP8 and recover 0.5%–1% of model accuracy on challenging tasks (Shen et al., 2023, Dotzel et al., 2023), as illustrated in the sketch after this list.
  • Efficiently encode rare outliers: Retain high dynamic range only in the portions of tensors where outlier amplitudes arise (activation “spikes” in LLM projections, second-moment optimizer states) (Xi et al., 25 Oct 2024, Maisonnave et al., 30 Apr 2025, Liang et al., 28 Nov 2025).
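
As a concrete illustration of MSE-driven format selection, the sketch below casts a tensor to a coarse surrogate of each candidate format and keeps the one with the smaller error. It is a simplification of the cited methods: no per-tensor scale is applied before the cast, and subnormal/NaN handling is omitted.

```python
import numpy as np

def fake_fp8(x, man_bits, max_normal):
    """Round-to-nearest cast to a coarse FP8 surrogate: clamp to +/- max_normal
    and keep man_bits of mantissa. Subnormal and NaN handling are omitted."""
    x = np.clip(x, -max_normal, max_normal)
    m, e = np.frexp(x)                          # x = m * 2**e with |m| in [0.5, 1)
    m = np.round(m * 2 ** (man_bits + 1)) / 2 ** (man_bits + 1)
    return np.ldexp(m, e)

def pick_format(tensor):
    """Return the FP8 variant with the smaller mean-squared quantization error."""
    candidates = {"E4M3": (3, 448.0), "E5M2": (2, 57344.0)}
    mse = {name: float(np.mean((tensor - fake_fp8(tensor, mb, mx)) ** 2))
           for name, (mb, mx) in candidates.items()}
    return min(mse, key=mse.get), mse

rng = np.random.default_rng(0)
print(pick_format(rng.standard_normal(10_000) * 0.05))        # typically E4M3 (no outliers)
print(pick_format(rng.standard_t(df=2, size=10_000) * 100))   # typically E5M2 (heavy tail)
```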

3. Methodological Approaches to Mixed FP8 Quantization

Implementation strategies primarily fall into three classes:

a. Per-Tensor/Per-Group Adaptive Assignments

Frameworks such as COAT (Xi et al., 25 Oct 2024), InfiR2 (Wang et al., 26 Sep 2025), and MoFQ (Zhang et al., 2023) select the FP8 format per tensor (or even per group) based on the statistical properties (max, std, kurtosis) of the tensor, minimizing either simple MSE or more involved information-theoretic costs.

  • Dynamic Range Expansion (COAT): optimizer states are non-linearly transformed via f(x) = \mathrm{sign}(x)\,|x|^k before quantization to fit their empirical spread to the native range of the FP8 format, then inverted after dequantization (Xi et al., 25 Oct 2024); a sketch of this transform follows this list.
  • Hybrid-Granularity Quantization (InfiR2): weights are quantized blockwise in E4M3, activations tokenwise in E5M2, with power-of-two rounding for scaling—yielding near-lossless training convergence (Wang et al., 26 Sep 2025).
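
A minimal sketch of the dynamic-range-expansion transform described above; the fixed exponent k and the toy data are illustrative only, whereas COAT derives the exponent from the tensor's statistics so that f(x) fills the target FP8 range.

```python
import numpy as np

def expand(x, k):
    """Dynamic range expansion f(x) = sign(x) * |x|**k, applied pre-quantization."""
    return np.sign(x) * np.abs(x) ** k

def contract(y, k):
    """Inverse transform, applied after dequantization."""
    return np.sign(y) * np.abs(y) ** (1.0 / k)

# Toy second-moment optimizer states: tiny, tightly clustered values.
rng = np.random.default_rng(0)
v = rng.uniform(1e-8, 1e-4, size=8)
k = 0.25                       # fixed here for illustration; COAT derives it from the data
y = expand(v, k)               # in practice y is then quantized to FP8
print(np.allclose(contract(y, k), v))   # round trip recovers the original values
```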

b. Fine-Grained/Blockwise Policies

FGMP (Hooper et al., 19 Apr 2025) employs Fisher-weighted per-block assignment:

  • Blocks of weights and activations are allocated to FP8 or even lower precision (e.g., NVFP4, FP4) based on each block’s estimated impact on model loss, leveraging the diagonal Fisher information matrix (a minimal scoring sketch follows this list).
  • Sensitivity-weighted clipping further reduces high-magnitude, low-importance errors.
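
The block-scoring idea can be sketched as follows. The diagonal Fisher information is approximated here by squared gradients, and the block size, precision split, and helper names are assumptions for illustration rather than FGMP's exact procedure.

```python
import numpy as np

def block_sensitivity(weights, grads, quantized, block_size=128):
    """Score each contiguous block by sum_i F_ii * (w_i - Q(w_i))^2, where the
    diagonal Fisher information F_ii is approximated by squared gradients."""
    fisher_diag = grads ** 2
    err2 = (weights - quantized) ** 2
    return (fisher_diag * err2).reshape(-1, block_size).sum(axis=1)

def assign_precision(scores, frac_high=0.1):
    """Keep the most loss-sensitive blocks in the higher-precision format and
    push the rest to a lower-precision one (labels are illustrative)."""
    k = max(1, int(frac_high * scores.size))
    formats = np.full(scores.size, "FP4", dtype=object)
    formats[np.argsort(scores)[-k:]] = "FP8"
    return formats

# Toy usage: random weights/gradients and a crude low-precision round-trip.
rng = np.random.default_rng(0)
w, g = rng.standard_normal(1024), rng.standard_normal(1024)
w_q = np.round(w * 8) / 8          # stand-in for the quantized weights Q(w)
print(assign_precision(block_sensitivity(w, g, w_q)))
```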

c. Outlier- and Architecture-Aware Schemes

  • Spike-Aware Mixed-Precision (SAMPQ): Detects and isolates rare spiking layers (e.g., initial/final projections) that require high-range FP8 or FP16, quantizing the bulk of the model in INT8/FP8 for large memory and compute savings (Maisonnave et al., 30 Apr 2025); a detection sketch follows this list.
  • TWEO: Introduces a regularization loss that eliminates mechanically-induced extreme outliers, allowing for 100% FP8 coverage and enabling standard low-bit quantization schemes to perform at full-precision fidelity (Liang et al., 28 Nov 2025).
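
A schematic of the spike-detection step described for SAMPQ; the max-to-percentile ratio statistic and its threshold are assumptions chosen for illustration, not the paper's criterion.

```python
import numpy as np

def find_spiking_layers(activations_by_layer, ratio_threshold=20.0):
    """Flag layers whose peak activation magnitude towers over the typical
    magnitude; flagged layers stay in a high-range format (E5M2/FP16) while
    the rest can drop to INT8/FP8. Statistic and threshold are illustrative."""
    spiking = []
    for name, act in activations_by_layer.items():
        peak = np.max(np.abs(act))
        typical = np.percentile(np.abs(act), 99)
        if typical > 0 and peak / typical > ratio_threshold:
            spiking.append(name)
    return spiking

# Toy usage with synthetic calibration activations.
rng = np.random.default_rng(0)
acts = {"block0.proj": rng.standard_normal(4096),
        "block1.proj": np.concatenate([rng.standard_normal(4095), [300.0]])}
print(find_spiking_layers(acts))   # -> ['block1.proj']
```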

4. Mathematical Formulation and Quantization Workflow

A typical quantization workflow proceeds as follows (notation follows Xi et al., 25 Oct 2024 and Wang et al., 26 Sep 2025):

  1. Collect Tensors: For each quantizable tensor (weight, activation, optimizer state), calibrate its empirical maxima.
  2. Select Format: Choose the FP8 variant F^* \in \{\text{E4M3}, \text{E5M2}, \dots\} minimizing a cost metric:

F^* = \arg\min_{F}\; \mathrm{MSE}_F \quad \text{(or an alternative cost)}

  3. Compute Scale: Per tensor/group/channel, set S = \max|X| / V_{\max}^{F^*}, possibly snapped to the nearest power of two (Wang et al., 26 Sep 2025).
  4. Quantization Mapping (see the sketch after this list):

Q(x_i; S) = \mathrm{Clip}\left(\mathrm{round}(x_i/S),\, q_{\min},\, q_{\max}\right) \cdot S

  5. Dynamic Range Expansion (when required, e.g., for optimizer states): apply f(x) and its inverse as detailed under COAT (Xi et al., 25 Oct 2024).
  6. Assignment and Kernel Routing: Pass the scale and format metadata forward for (re)quantization or dequantization within specialized GEMM kernels (e.g., NVIDIA FP8 tensor cores, Blackwell mxFP kernels).
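
Steps 3 and 4 can be written out compactly as below. The integer-grid rounding is a simplification of a true FP8 cast (which also rounds the mantissa, as in Section 1), the power-of-two snapping follows the InfiR2-style scaling mentioned above, and the function and constant names are illustrative.

```python
import numpy as np

FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}   # V_max per format (see Section 1)

def quantize_dequantize(x, fmt="E4M3", pow2_scale=True):
    """Steps 3-4: per-tensor scale S = max|X| / V_max, optionally snapped to a
    power of two, then clip(round(x / S)) * S as a stand-in for the FP8 cast."""
    v_max = FP8_MAX[fmt]
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return np.zeros_like(x)
    scale = amax / v_max
    if pow2_scale:
        # Snap up to the next power of two so the scaled maximum still fits in v_max.
        scale = 2.0 ** np.ceil(np.log2(scale))
    q = np.clip(np.round(x / scale), -v_max, v_max)
    return q * scale

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
x_hat = quantize_dequantize(x, fmt="E4M3")
print(float(np.mean((x - x_hat) ** 2)))      # quantization MSE of the round trip
```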

Granularity is highly implementation-dependent: Per-tensor schemes simplify kernel dispatch and minimize metadata; per-group/channel arrangements optimize accuracy but increase memory for scale storage and control logic.

5. Empirical Benchmarks and Engineering Impact

Across architectures (LLMs, CNNs, VLMs, ViTs), mixed FP8 quantization consistently demonstrates substantial speed, memory, and energy improvements while maintaining near-lossless model accuracy relative to full-precision baselines.

6. Hardware and Implementation Considerations

Mixed FP8 quantization is most effective when coupled with hardware supporting both exponent/mantissa parameterizations and dynamic routing of kernel operations:

  • Native Kernel Support: Modern architectures (e.g., NVIDIA Hopper, Blackwell) provide FP8 tensor cores—supporting E4M3/E5M2—and custom operators (e.g., MicroMix, FP8-LM) for blockwise/channelwise mapping (Liu et al., 4 Aug 2025, Peng et al., 2023).
  • Control Logic and Overhead: Adding per-block mixed-precision incurs minimal area overhead (<5%) and negligible per-operation energy, provided kernel fusion and dataflow are optimized (Hooper et al., 19 Apr 2025, Zhang et al., 2023).
  • Scale Metadata: The storage and update of scaling factors must be bandwidth- and memory-efficient; power-of-two rounding and quantized scale representations (e.g., E8M0) are frequently used (Wang et al., 26 Sep 2025, Liu et al., 4 Aug 2025); a small encoding sketch follows this list.
  • Software Integration: Model wrappers (PyTorch modules, custom quantize-dequantize operators), calibration routines, and format-selection heuristics are integrated into existing pipelines without hyperparameter tuning (Xi et al., 25 Oct 2024, Peng et al., 2023, Shen et al., 2023).
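
To make the scale-metadata point concrete: an E8M0 scale is simply a biased 8-bit exponent, i.e. a power of two stored in one byte. The encoding below is a sketch assuming the conventional bias of 127; it is not tied to any specific kernel implementation.

```python
import numpy as np

E8M0_BIAS = 127  # conventional 8-bit exponent bias, assumed here for illustration

def encode_e8m0(scale: float) -> int:
    """Store a power-of-two scale as a single biased-exponent byte."""
    exp = int(np.ceil(np.log2(scale)))          # snap up so scaled values still fit
    return int(np.clip(exp + E8M0_BIAS, 0, 255))

def decode_e8m0(code: int) -> float:
    return 2.0 ** (code - E8M0_BIAS)

code = encode_e8m0(0.0123)
print(code, decode_e8m0(code))   # one byte of scale metadata per tensor/block
```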

7. Best Practices and Deployment Guidelines


Mixed FP8 quantization is a principled, empirically validated approach to reducing memory, compute, and energy costs in modern large-scale neural networks, yielding near-lossless accuracy and throughput on par with, or better than, uniform quantization on current accelerator hardware (Xi et al., 25 Oct 2024, Wang et al., 26 Sep 2025, Hooper et al., 19 Apr 2025, Peng et al., 2023, Liang et al., 28 Nov 2025, Dotzel et al., 2023, Shen et al., 2023, Zhang et al., 2023).
