Mixed FP8 Quantization
- Mixed FP8 quantization is an adaptive method that strategically assigns FP8 formats (such as E4M3 and E5M2) based on tensor properties to balance precision and dynamic range.
- It employs per-tensor, per-group, and blockwise strategies to minimize quantization error, reduce memory footprint, and boost training efficiency.
- Empirical benchmarks demonstrate that mixed FP8 techniques achieve significant speed, memory, and energy improvements with near-lossless model accuracy on advanced hardware.
Mixed FP8 quantization refers to a class of quantization methodologies that judiciously combine distinct 8-bit floating-point (FP8) formats—usually E4M3 and E5M2, and sometimes other variants—at fine or coarse granularity across the tensors or even within the blocks of deep neural networks, with the objective of maximizing efficiency (memory, speed, energy) while nearly preserving full-precision model fidelity. Unlike uniform precision quantization, where a single number format (e.g., INT8, FP8-E4M3) is used globally, mixed FP8 quantization assigns formats adaptively, often guided by data-driven or analytic criteria (per-layer, per-group, or per-channel) and exploits unique trade-offs between dynamic range and precision in each format. This encyclopedic entry surveys the mathematical foundations, canonical approaches, software and hardware implications, empirical results, and practical deployment considerations of mixed FP8 quantization.
1. FP8 Numeric Formats and Representation
The mixed FP8 quantization paradigm relies primarily on the two Open Compute Project (OCP)-standard FP8 encodings, E4M3 and E5M2, though some frameworks employ additional variants such as E3M4 or E2M5. The FP8 binary layout consists of a sign bit, exponent bits, and mantissa bits; dynamic range and precision vary inversely as bits are shifted between exponent and mantissa.
| Format | Exponent Bits | Mantissa Bits | Exponent Bias | Normalized Range | Machine Epsilon | Intended Use |
|---|---|---|---|---|---|---|
| E4M3 | 4 | 3 | 7 | $\pm[2^{-6}, 448]$ | $2^{-3}$ | Weights, optimizer 1st moment |
| E5M2 | 5 | 2 | 15 | $\pm[2^{-14}, 57344]$ | $2^{-2}$ | Activations, optimizer 2nd moment |
| E3M4 | 3 | 4 | 3 | | $2^{-4}$ | Vision models |
E4M3 provides finer quantization near zero and is robust for tensors with moderate dynamic range; E5M2 sacrifices mantissa accuracy for extended coverage of outliers, vital for quantities such as second-order optimizer states or activation spikes (Xi et al., 25 Oct 2024, Fishman et al., 19 Sep 2024, Wang et al., 26 Sep 2025, Shen et al., 2023, Peng et al., 2023).
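To make the range/precision trade-off concrete, the following minimal sketch (assuming PyTorch ≥ 2.1, whose experimental `torch.float8_e4m3fn` and `torch.float8_e5m2` dtypes implement the OCP encodings; the `roundtrip` helper is ours) casts a few values through both formats without any scaling.

```python
# Round-trip a few values through both OCP FP8 formats to show the
# range/precision trade-off (no per-tensor scaling applied here).
import torch

def roundtrip(x: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Cast to an FP8 dtype and back to float32."""
    return x.to(dtype).to(torch.float32)

values = torch.tensor([0.001, 0.3, 1.7, 240.0, 3000.0])

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    # Clamp to the format's largest normal value before casting to avoid NaNs.
    approx = roundtrip(values.clamp(-info.max, info.max), dtype)
    print(f"{dtype}: max normal = {info.max}, round-tripped = {approx.tolist()}")
```

E4M3 resolves the moderate values more finely but saturates at 448, while E5M2 reaches 3000 at the cost of coarser steps everywhere, mirroring the intended-use split in the table above.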
2. Rationale and Statistical Motivation
Low-bit floating-point quantization is favored over integer quantization where activation or weight distributions exhibit high kurtosis, wide dynamic range, or frequent outliers. The dynamic nature of neural activations and the layer-to-layer variation in tensor statistics across large models—especially LLMs and vision transformers—render a uniform quantization approach suboptimal (Zhang et al., 2023, Shen et al., 2023, Zhang et al., 2023). Mixed FP8 quantization seeks to:
- Exploit complementary strengths: Assign E4M3 where higher resolution is critical, E5M2 (or further variants) where dynamic range is limiting, and in some frameworks, integrate INT8 or FP16 where needed.
- Minimize quantization-induced loss: Adaptive format selection can empirically halve the mean-squared quantization error compared to uniform FP8 and recover most of the accuracy lost to quantization on challenging tasks (Shen et al., 2023, Dotzel et al., 2023).
- Efficiently encode rare outliers: Retain high dynamic range only in the portions of tensors where outlier amplitudes arise (activation “spikes” in LLM projections, second-moment optimizer states) (Xi et al., 25 Oct 2024, Maisonnave et al., 30 Apr 2025, Liang et al., 28 Nov 2025).
3. Methodological Approaches to Mixed FP8 Quantization
Implementation strategies primarily fall into three classes:
a. Per-Tensor/Per-Group Adaptive Assignments
Frameworks such as COAT (Xi et al., 25 Oct 2024), InfiR2 (Wang et al., 26 Sep 2025), and MoFQ (Zhang et al., 2023) select the FP8 format per tensor (or even per group) based on the statistical properties of the tensor (max, standard deviation, kurtosis), minimizing either a simple MSE criterion or more involved information-theoretic costs; a minimal selection sketch follows this list.
- Dynamic Range Expansion (COAT): Optimizer states are non-linearly transformed before quantization so that their empirical spread fills the native range of the FP8 format; the transform is inverted after dequantization (Xi et al., 25 Oct 2024).
- Hybrid-Granularity Quantization (InfiR2): weights are quantized blockwise in E4M3, activations tokenwise in E5M2, with power-of-two rounding for scaling—yielding near-lossless training convergence (Wang et al., 26 Sep 2025).
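The sketch below illustrates the per-tensor selection idea, not the exact COAT/MoFQ procedure: each candidate format quantizes the tensor under a per-tensor scale, and the format with the lower mean-squared error wins. It assumes PyTorch's experimental float8 dtypes; `fake_quant_fp8` and `select_format` are our own helper names.

```python
# Per-tensor FP8 format selection by quantization MSE (illustrative only).
import torch

FP8_FORMATS = (torch.float8_e4m3fn, torch.float8_e5m2)

def fake_quant_fp8(x: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Scale to the format's max normal value, quantize, then dequantize."""
    fp8_max = torch.finfo(dtype).max
    scale = x.abs().amax().clamp(min=1e-12) / fp8_max
    return (x / scale).to(dtype).to(torch.float32) * scale

def select_format(x: torch.Tensor) -> torch.dtype:
    """Pick the candidate format with the lowest mean-squared error."""
    errors = {d: ((x - fake_quant_fp8(x, d)) ** 2).mean().item() for d in FP8_FORMATS}
    return min(errors, key=errors.get)

# Compare the chosen format for a roughly Gaussian tensor and a heavy-tailed one;
# the winner depends on the tensor's empirical spread.
print(select_format(torch.randn(4096)))
print(select_format(torch.empty(4096).cauchy_()))
```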
b. Fine-Grained/Blockwise Policies
FGMP (Hooper et al., 19 Apr 2025) employs Fisher-weighted per-block assignment (a simplified sketch follows this list):
- Blocks of weights and activations are allocated to FP8 or even lower-precision (e.g., NVFP4, FP4) based on the block’s estimated impact on model loss, leveraging the diagonal Fisher information matrix.
- Sensitivity-weighted clipping further reduces high-magnitude, low-importance errors.
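A simplified sketch of the Fisher-weighted block assignment, not FGMP's exact procedure: the diagonal Fisher information is approximated here by element-wise squared gradients, each block's sensitivity is the Fisher-weighted squared FP8 quantization error, and the most sensitive fraction of blocks (a hypothetical 10% budget) is kept in the higher-precision format. Helper names are ours.

```python
# Fisher-weighted per-block precision assignment (simplified sketch).
import torch

def block_sensitivity(weight: torch.Tensor, grad: torch.Tensor,
                      block_size: int = 128) -> torch.Tensor:
    """Per-block score: diagonal-Fisher-weighted squared FP8 quantization error."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    w = weight.flatten().view(-1, block_size)
    fisher = grad.flatten().view(-1, block_size) ** 2       # ~ diagonal Fisher
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / fp8_max
    w_hat = (w / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale
    return (fisher * (w - w_hat) ** 2).sum(dim=1)           # one score per block

def assign_precision(weight: torch.Tensor, grad: torch.Tensor,
                     hi_fraction: float = 0.10) -> torch.Tensor:
    """Boolean mask per block: True = keep the block in the higher-precision format."""
    scores = block_sensitivity(weight, grad)
    k = max(1, int(hi_fraction * scores.numel()))
    keep_hi = torch.zeros_like(scores, dtype=torch.bool)
    keep_hi[scores.topk(k).indices] = True
    return keep_hi

w = torch.randn(1024, 1024)
g = torch.randn_like(w) * 1e-3
print(assign_precision(w, g).float().mean())   # fraction of blocks kept high precision
```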
c. Outlier- and Architecture-Aware Schemes
- Spike-Aware Mixed-Precision (SAMPQ): Detects and isolates the rare spiking layers (e.g., initial/final projections) that require high-range FP8 or FP16, quantizing the bulk of the model in INT8/FP8 for large memory and compute savings (Maisonnave et al., 30 Apr 2025); a screening sketch follows this list.
- TWEO: Introduces a regularization loss that eliminates mechanically-induced extreme outliers, allowing for 100% FP8 coverage and enabling standard low-bit quantization schemes to perform at full-precision fidelity (Liang et al., 28 Nov 2025).
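The following is an illustrative screening pass in the spirit of spike-aware schemes such as SAMPQ; the amax-to-std ratio statistic and the threshold of 20 are our own choices, not the paper's criterion. Flagged layers would be routed to a wider format (E5M2 or FP16) while the rest stay in E4M3/INT8.

```python
# Illustrative spike screening via forward hooks (heuristic, not SAMPQ's exact rule).
import torch
import torch.nn as nn

def find_spiking_layers(model: nn.Module,
                        calib_batch: torch.Tensor,
                        ratio_threshold: float = 20.0) -> list[str]:
    """Flag layers whose output amax greatly exceeds its standard deviation."""
    stats: dict[str, float] = {}
    hooks = []

    def make_hook(name: str):
        def hook(_module, _inputs, output):
            out = output.detach().float()
            stats[name] = (out.abs().amax() / out.std().clamp(min=1e-12)).item()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(calib_batch)
    for h in hooks:
        h.remove()
    # Layers above the threshold would be kept in E5M2 / FP16 instead of E4M3 / INT8.
    return [name for name, ratio in stats.items() if ratio > ratio_threshold]

model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
print(find_spiking_layers(model, torch.randn(8, 64)))
```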
4. Mathematical Formulation and Quantization Workflow
The quantization workflow unrolls as follows (notation adheres to (Xi et al., 25 Oct 2024, Wang et al., 26 Sep 2025); a minimal end-to-end sketch follows the list):
- Collect Tensors: For each quantizable tensor $X$ (weight, activation, optimizer state), record its empirical maximum magnitude $\max|X|$ during calibration.
- Select Format: Choose the FP8 variant minimizing a quantization-error metric, e.g. $f^{*} = \arg\min_{f \in \{\text{E4M3},\, \text{E5M2}\}} \lVert X - Q_f(X) \rVert_2^2$, where $Q_f$ denotes rounding into format $f$.
- Compute Scale: Per tensor/group/channel, set $s = \max|X| / \text{FP8}_{\max}(f)$, possibly snapped to the nearest power of two (Wang et al., 26 Sep 2025).
- Quantization Mapping: $\hat{X} = Q_f(X / s)$, with dequantization $X \approx s \cdot \hat{X}$.
- Dynamic Range Expansion (when required, e.g., optimizer states): apply an expansion of the form $x \mapsto \operatorname{sign}(x)\,|x|^{k}$ before quantization and its inverse after dequantization, as detailed under COAT (Xi et al., 25 Oct 2024).
- Assignment and Kernel Routing: Pass the scale and format metadata forward for (re)quantization or dequantization within specialized GEMM kernels (e.g., NVIDIA FP8 tensor cores, Blackwell mxFP kernels).
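The sketch below strings these steps together at per-tensor granularity; the power-of-two snapping, clamping, and the optional expansion exponent `expand_k` are illustrative choices rather than any paper's exact recipe.

```python
# Per-tensor quantize / dequantize following the workflow above (illustrative).
import torch

def pow2_round(scale: torch.Tensor) -> torch.Tensor:
    """Snap a positive scale to the nearest power of two."""
    return torch.exp2(torch.round(torch.log2(scale)))

def quantize(x, dtype=torch.float8_e4m3fn, pow2_scale=True, expand_k=None):
    """Return (fp8 values, scale, expand_k) such that dequantize() recovers ~x."""
    if expand_k is not None:
        x = torch.sign(x) * x.abs() ** expand_k          # dynamic range expansion
    fp8_max = torch.finfo(dtype).max
    scale = x.abs().amax().clamp(min=1e-12) / fp8_max
    if pow2_scale:
        scale = pow2_round(scale)                        # cheaper scale metadata
    q = (x / scale).clamp(-fp8_max, fp8_max).to(dtype)
    return q, scale, expand_k

def dequantize(q, scale, expand_k=None):
    x = q.to(torch.float32) * scale
    if expand_k is not None:
        x = torch.sign(x) * x.abs() ** (1.0 / expand_k)  # invert the expansion
    return x

x = torch.randn(1024) * 0.05
q, s, k = quantize(x)
print((x - dequantize(q, s, k)).abs().max())             # small reconstruction error
```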
Granularity is highly implementation-dependent: per-tensor schemes simplify kernel dispatch and minimize metadata, while per-group/channel arrangements improve accuracy but increase memory for scale storage and control logic.
5. Empirical Benchmarks and Engineering Impact
Across architectures (LLMs, CNNs, VLMs, ViTs), mixed FP8 quantization demonstrates:
- Memory Savings: COAT reduces end-to-end training memory relative to BF16, including a substantial reduction in activation footprint (Xi et al., 25 Oct 2024). InfiR2 and FP8-LM report further reductions in model and optimizer-state memory (Wang et al., 26 Sep 2025, Peng et al., 2023).
- Throughput Gains: COAT reports a $1.43\times$ end-to-end training speedup compared to BF16, with further acceleration under full FP8 coverage (TWEO) (Xi et al., 25 Oct 2024, Liang et al., 28 Nov 2025).
- Accuracy/Convergence: Across Open LLM (Llama, OLMo), VLM (VILA), and reasoning (AIME24, GPQA) tasks, mixed FP8 matches BF16 within noise; uniform INT8 and even uniform FP8 can suffer large degradations in certain layers, which are avoided by mixed strategies (Xi et al., 25 Oct 2024, Dotzel et al., 2023, Zhang et al., 2023).
- Energy and TCO: FGMP delivers measurable end-to-end energy reductions in inference (Hooper et al., 19 Apr 2025); mixed FP8 on Gaudi 2 cuts inference TCO per token by reducing both power draw and time-to-completion (Kim et al., 3 Feb 2025).
6. Hardware and Implementation Considerations
Mixed FP8 quantization is most effective when coupled with hardware supporting both exponent/mantissa parameterizations and dynamic routing of kernel operations:
- Native Kernel Support: Modern architectures (e.g., NVIDIA Hopper, Blackwell) provide FP8 tensor cores—supporting E4M3/E5M2—and custom operators (e.g., MicroMix, FP8-LM) for blockwise/channelwise mapping (Liu et al., 4 Aug 2025, Peng et al., 2023).
- Control Logic and Overhead: Adding per-block mixed-precision support incurs minimal area overhead and negligible per-operation energy cost, provided kernel fusion and dataflow are optimized (Hooper et al., 19 Apr 2025, Zhang et al., 2023).
- Scale Metadata: The storage and update of scaling factors must be bandwidth- and memory-efficient; power-of-two rounding and quantized scale representations (e.g., E8M0) are frequently used (Wang et al., 26 Sep 2025, Liu et al., 4 Aug 2025); a one-byte encoding sketch follows this list.
- Software Integration: Model wrappers (PyTorch modules, custom quantize-dequantize operators), calibration routines, and format-selection heuristics are integrated into existing pipelines without hyperparameter tuning (Xi et al., 25 Oct 2024, Peng et al., 2023, Shen et al., 2023).
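As a concrete illustration of quantized scale metadata, the following sketch stores a power-of-two scale as a single exponent byte in the style of E8M0 (8 exponent bits, bias 127); the ceiling-based rounding and the reservation of 0xFF reflect our reading of the OCP microscaling convention and should be checked against the target kernel's expectations.

```python
# Sketch: store a power-of-two scale factor as a single exponent byte,
# in the style of the OCP E8M0 scale format (bias 127; rounding choice is ours).
import math

E8M0_BIAS = 127

def encode_scale_e8m0(scale: float) -> int:
    """Encode a positive scale as a biased power-of-two exponent byte."""
    exponent = math.ceil(math.log2(scale))         # round up to avoid overflow
    return max(0, min(254, exponent + E8M0_BIAS))  # 0xFF treated as reserved

def decode_scale_e8m0(byte: int) -> float:
    return 2.0 ** (byte - E8M0_BIAS)

s = 0.0123
b = encode_scale_e8m0(s)
print(b, decode_scale_e8m0(b))   # one byte of scale metadata per quantized block
```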
7. Best Practices and Deployment Guidelines
- For NLP and LLMs, default to E4M3 for weights and optimizer first moments; select E5M2 or higher range for activations/second moments exposed to outliers (Xi et al., 25 Oct 2024, Fishman et al., 19 Sep 2024, Wang et al., 26 Sep 2025, Shen et al., 2023).
- Apply per-tensor quantization for linear layers; for non-linear or outlier-prone tensors, use per-group or per-channel granularity (Xi et al., 25 Oct 2024, Wang et al., 26 Sep 2025).
- Employ stateless (delayed) scaling for maximum hardware utilization and numerical stability (Fishman et al., 19 Sep 2024, Liang et al., 28 Nov 2025).
- Integrate architecture-aware spike detection (SAMPQ, TWEO) where activation outliers create catastrophic quantization failures—use regularization if necessary to avoid collapse (Maisonnave et al., 30 Apr 2025, Liang et al., 28 Nov 2025).
- In hardware-native settings (Hopper, Blackwell), exploit fused mixed-precision tensors and preferred scale storage for minimal code and memory overhead (Liu et al., 4 Aug 2025, Zhang et al., 2023).
- For deployment, combine mixed FP8 quantization with hybrid INT/FP8 schemes at the layer level using MoFQ or similar algorithms, tailoring the format to minimize per-layer error (Zhang et al., 2023, Shen et al., 2023, Dotzel et al., 2023); a simple policy sketch follows this list.
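The guidelines above can be captured as a layer-wise format plan. The sketch below is a hypothetical policy table (names, dtypes, and the spiking-layer fallback are our own choices), not a recipe from any of the cited papers.

```python
# Illustrative layer-wise format policy combining the guidelines above.
import torch
import torch.nn as nn

DEFAULT_POLICY = {
    "weight": torch.float8_e4m3fn,    # finer mantissa for weights
    "activation": torch.float8_e5m2,  # wider range for activations
    "spike_layer": torch.float16,     # fallback for outlier-prone layers
}

def build_format_plan(model: nn.Module,
                      spiking_layers: set[str],
                      policy: dict = DEFAULT_POLICY) -> dict[str, dict]:
    """Map each Linear layer name to its weight and activation formats."""
    plan = {}
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if name in spiking_layers:
            plan[name] = {"weight": policy["spike_layer"],
                          "activation": policy["spike_layer"]}
        else:
            plan[name] = {"weight": policy["weight"],
                          "activation": policy["activation"]}
    return plan

model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
print(build_format_plan(model, spiking_layers={"2"}))
```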
Mixed FP8 quantization is a principled, empirically validated approach to reducing memory, compute, and energy usage in modern large-scale neural networks, yielding near-lossless accuracy and throughput on par with or better than uniform quantization on current accelerator hardware (Xi et al., 25 Oct 2024, Wang et al., 26 Sep 2025, Hooper et al., 19 Apr 2025, Peng et al., 2023, Liang et al., 28 Nov 2025, Dotzel et al., 2023, Shen et al., 2023, Zhang et al., 2023).