Mixed FP8 Formats
- Mixed FP8 formats are defined by assigning different 8‐bit configurations (e.g., E5M2, E4M3, E3M4, E2M5) to balance dynamic range and precision in quantization.
- Methodologies use calibration, empirical error analysis, and joint optimization of weight–activation pairs to select the optimal FP8 variant per tensor or layer.
- Hardware architectures support mixed FP8 execution with customized decoding, parallel multiplier paths, and high-precision accumulators, ensuring both efficiency and near full-precision performance.
Mixed FP8 formats refer to the systematic use and deployment of multiple floating-point 8-bit (FP8) number systems within a single machine learning or signal processing workflow, where the specific exponent and mantissa allocation may differ between tensors, channels, or even operations. Whereas traditional quantization approaches favored uniform formats (e.g., INT8 or a single FP8 variant), mixed FP8 quantization dynamically selects or assigns, per layer or per tensor, the most suitable FP8 format among a menu of alternatives—trading off dynamic range and resolution to minimize quantization error subject to hardware and bandwidth constraints. This paradigm exploits the observation that neural network weights, activations, and gradients exhibit diverse value distributions and dynamic range requirements that no single format can optimally accommodate. Mixed FP8 quantization thus aims to approach full-precision (FP16/FP32) fidelity while maintaining an 8-bit data path and minimizing area/power overheads (Zhang et al., 2023, Huang et al., 2021, Kuzmin et al., 2022).
1. FP8 Families and Representational Trade-Offs
FP8 number systems are parameterized by the number of exponent bits ($e$) and mantissa (fraction) bits ($m$), with the constraint $1 + e + m = 8$ (one sign bit, $e$ exponent bits, $m$ mantissa bits). The most commonly used variants are:
| Name | Exponent bits ($e$) | Mantissa bits ($m$) | Bias | Dynamic Range | Min Subnormal | Step near zero |
|---|---|---|---|---|---|---|
| E5M2 | 5 | 2 | 15 | $\approx 3.1\times 10^{-5}$ to $57344$ | $2^{-16}$ | $2^{-16}$ |
| E4M3 | 4 | 3 | 7 | $\approx 7.8\times 10^{-3}$ to $240$ | $2^{-9}$ | $2^{-9}$ |
| E3M4 | 3 | 4 | 3 | $0.125$ to $15.5$ | $2^{-6}$ | $2^{-6}$ |
| E2M5 | 2 | 5 | 1 | $0.5$ to $3.94$ | $2^{-5}$ | $2^{-5}$ |
E5M2 maximizes range at the cost of granularity; E4M3 and E3M4 provide finer steps near zero but reduced range (Zhang et al., 2023).
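To make these trade-offs concrete, the short sketch below derives the key parameters of each variant directly from its bit allocation. It assumes an IEEE-like layout (top exponent code reserved for Inf/NaN, exponent code 0 for subnormals); the function name `fp8_params` is illustrative rather than taken from any cited implementation.

```python
# Sketch: derive FP8 format parameters from (exponent bits, mantissa bits, bias),
# assuming an IEEE-like layout with the top exponent code reserved for Inf/NaN
# and exponent code 0 used for subnormals.

def fp8_params(e: int, m: int, bias: int) -> dict:
    assert 1 + e + m == 8, "one sign bit + exponent + mantissa must total 8 bits"
    max_exp = (2**e - 2) - bias                     # largest usable unbiased exponent
    return {
        "max_normal": (2 - 2**(-m)) * 2**max_exp,   # widest representable magnitude
        "min_subnormal": 2 ** (1 - bias - m),       # smallest nonzero magnitude
        "step_near_zero": 2 ** (1 - bias - m),      # spacing between values near zero
    }

for name, (e, m, bias) in {"E5M2": (5, 2, 15), "E4M3": (4, 3, 7),
                           "E3M4": (3, 4, 3), "E2M5": (2, 5, 1)}.items():
    print(name, fp8_params(e, m, bias))             # E5M2 -> max 57344; E2M5 -> max ~3.94
```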
Mixed FP8 frameworks, such as those described in (Zhang et al., 2023, Kuzmin et al., 2022), allow per-layer or per-tensor selection among these variants. For example, weights and activations can be encoded with higher precision (more mantissa bits, e.g., E4M3 or E3M4), whereas gradients requiring wider dynamic range use more exponent bits (E5M2).
Flexible FP8 quantization can extend beyond IEEE-recommended layouts; some methodologies select the exponent/mantissa split together with the exponent bias and even the presence of the sign bit, as in the FFP8 model (Huang et al., 2021).
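Such a flexible quantizer can be simulated in software. The sketch below rounds a tensor to the nearest value representable under an arbitrary $(e, m, \mathrm{bias})$ layout; the function name `fake_quant_fp8`, the saturating clamp in place of Inf/NaN, and round-to-nearest-even tie-breaking are assumptions for illustration, not the FFP8 reference implementation.

```python
import numpy as np

def fake_quant_fp8(x: np.ndarray, e: int, m: int, bias: int) -> np.ndarray:
    """Round x to the nearest value representable in a (sign | e | m) FP8 layout.

    Illustrative sketch: IEEE-like subnormals, top exponent code reserved,
    saturating clamp instead of Inf/NaN. Not a bit-exact reference."""
    max_normal = (2 - 2.0**(-m)) * 2.0**((2**e - 2) - bias)
    min_normal_exp = 1 - bias
    sign, mag = np.sign(x), np.abs(x)
    # Exponent of each element, floored at the subnormal range.
    exp = np.floor(np.log2(np.maximum(mag, np.finfo(np.float64).tiny)))
    exp = np.maximum(exp, min_normal_exp)
    # Quantization step at that exponent is 2**(exp - m); round to nearest.
    step = 2.0 ** (exp - m)
    q = np.round(mag / step) * step
    return sign * np.clip(q, 0.0, max_normal)

x = np.random.randn(8)
print(fake_quant_fp8(x, e=4, m=3, bias=7))   # E4M3-style rounding
```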
2. Mixed-Precision Quantization Methodologies
Modern mixed FP8 quantization algorithms deploy empirical and analytic search procedures to assign the optimal $(e, m)$ format per tensor or channel, leveraging a calibration set to estimate the expected error of each candidate format. The main stages are:
- Unified quantization and resolution analysis: Both integer (INT8) and floating-point FP8 quantizers are formalized in terms of effective resolution and error metrics (mean-squared error, together with analytic upper bounds) (Zhang et al., 2023).
- Format selection process: For each tensor, error statistics (MSE, clip fraction) are measured for the candidate FP8 formats, and the configuration minimizing error (while respecting clipping constraints) is chosen (Zhang et al., 2023, Huang et al., 2021, Kuzmin et al., 2022).
- Joint optimization for weight–activation pairs: For matrix multiplications, the weight/activation format pair is chosen jointly to minimize the output error on a small calibration set (a minimal sketch follows below) (Zhang et al., 2023).
- Hierarchies of search: Modes range from All Mixed (arbitrary mixture of INT8 and all FP8 variants, per layer), to Mixed FP8 (restriction to FP8 only), to Limited Mix (weights and activations matched) (Zhang et al., 2023, Huang et al., 2021).
The overall quantization workflow is strictly post-training, requiring no retraining or forward/backward modifications unless quantization-aware training is specifically invoked (Huang et al., 2021, Kuzmin et al., 2022).
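The selection loop below sketches this post-training procedure: for each candidate weight/activation format pair, the layer's output on a small calibration batch is compared against the full-precision result, and the pair with the lowest MSE is kept. The `quantize` helper, the candidate list, and the plain MSE criterion are illustrative simplifications (no clipping-fraction constraint, no per-tensor scaling).

```python
import itertools
import numpy as np

# Candidate FP8 layouts as (exponent bits, mantissa bits, bias).
FORMATS = {"E5M2": (5, 2, 15), "E4M3": (4, 3, 7), "E3M4": (3, 4, 3), "E2M5": (2, 5, 1)}

def quantize(x, e, m, bias):
    """Simplified round-to-nearest FP8 simulation (see the earlier sketch)."""
    max_normal = (2 - 2.0**(-m)) * 2.0**((2**e - 2) - bias)
    exp = np.maximum(np.floor(np.log2(np.maximum(np.abs(x), 1e-300))), 1 - bias)
    step = 2.0 ** (exp - m)
    return np.sign(x) * np.clip(np.round(np.abs(x) / step) * step, 0.0, max_normal)

def select_formats(weight, calib_acts):
    """Pick the (weight, activation) format pair minimizing output MSE
    against the full-precision matmul on a calibration batch."""
    ref = calib_acts @ weight                       # full-precision reference output
    best = None
    for (wf, (we, wm, wb)), (af, (ae, am, ab)) in itertools.product(FORMATS.items(), repeat=2):
        out = quantize(calib_acts, ae, am, ab) @ quantize(weight, we, wm, wb)
        mse = float(np.mean((out - ref) ** 2))
        if best is None or mse < best[0]:
            best = (mse, wf, af)
    return best

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))
a = np.abs(rng.normal(size=(200, 64)))              # small calibration batch of activations
print(select_formats(w, a))                         # -> (mse, weight format, activation format)
```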
3. Hardware Architectures for Mixed FP8 Execution
Mixed-precision FP8 workloads impose stringent requirements on hardware, especially with respect to support for multiple formats and seamless format switching. Architectures typically feature:
- Parametrized decoding and arithmetic paths: Decoders map 8-bit FP8 words into internal representations (e.g., FP11, FP19, INT9), feeding into multiplier trees sized for each variant (a software sketch of such a decoder follows this list) (Zhang et al., 2023, Rout et al., 19 Nov 2025).
- Parallel multiplier paths: INT8 and FP8 paths operate side-by-side (e.g., INT9 8×8, FP8 5×5), with minimal area overhead (<5%) relative to pure INT8 engines, enabling runtime selection of format without pipeline stalling or resource underutilization (Zhang et al., 2023, Rout et al., 19 Nov 2025, Tahmasebi et al., 27 Nov 2024).
- Accumulator design: Mixed-dot engines accumulate products in high-precision accumulators (FP32), maintaining numerical fidelity even as multipliers operate on low-precision FP8 (Rout et al., 19 Nov 2025).
- Microcoded control for on-the-fly reconfiguration: Emerging designs like FlexiBit use per-layer microcode control to instantaneously reformat PE datapaths for their assigned $(e, m)$ layout (Tahmasebi et al., 27 Nov 2024). This enables full area and bandwidth efficiency across arbitrary FP8 variants without hardware idling or padding.
- Scalable casting and conversion: Lightweight on-chip cast units unpack/repack between FP8 variants and higher-precision internal formats as inputs/outputs traverse the hardware pipeline, making per-layer mixed deployment feasible (Huang et al., 2021, Tortorella et al., 2023).
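The following software model illustrates the parametrized decoding step referenced above: it unpacks a raw FP8 byte into a float for a given $(e, m, \mathrm{bias})$ layout, handling sign, subnormal, and normal cases. A hardware decoder would emit a wider fixed-point or floating-point internal word instead, and decoding the top exponent code as an ordinary value (rather than Inf/NaN) is a simplifying assumption of this sketch.

```python
def decode_fp8(byte: int, e: int, m: int, bias: int) -> float:
    """Decode one 8-bit word laid out as [sign | e exponent bits | m mantissa bits].

    Illustrative software model of a parametrized decoder; real hardware maps the
    fields into a wider internal word (e.g., FP19 or INT9) rather than a Python float,
    and the top exponent code (Inf/NaN in IEEE-like layouts) is decoded as an
    ordinary value here for simplicity."""
    assert 0 <= byte <= 0xFF and 1 + e + m == 8
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_field = (byte >> m) & ((1 << e) - 1)
    mant_field = byte & ((1 << m) - 1)
    if exp_field == 0:                          # subnormal: no implicit leading one
        value = (mant_field / (1 << m)) * 2.0 ** (1 - bias)
    else:                                       # normal: implicit leading one
        value = (1 + mant_field / (1 << m)) * 2.0 ** (exp_field - bias)
    return sign * value

# The same byte decodes to different values under different layouts:
print(decode_fp8(0x44, e=4, m=3, bias=7))    # E4M3 reading -> 3.0
print(decode_fp8(0x44, e=5, m=2, bias=15))   # E5M2 reading -> 4.0
```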
4. Algorithmic Strategies and Adaptive Assignments
Mixed FP8 quantization leverages the statistics of each tensor, using both data-driven calibration and analytic modeling:
- Clipping minimization: Each tensor’s format is chosen so that the vast majority of its values fall inside the representable dynamic range, discarding candidate formats whose clipped fraction exceeds a user-set threshold (typically a few percent); a sketch of this rule follows at the end of this subsection (Huang et al., 2021, Kuzmin et al., 2022).
- Precision maximization: Among the candidates that cover the needed range, the format with the largest mantissa width $m$ is picked to minimize rounding error, unless specific application constraints (e.g., resilience to extreme outliers) dictate otherwise (Huang et al., 2021, Kuzmin et al., 2022).
- Empirical search vs. analytic profile: Layers with near-Gaussian, well-concentrated statistics can tolerate the coarse mantissa of a wide-range format such as E5M2, whereas heavy-tailed or outlier-prone layers (e.g., transformers, attention mechanisms, early convolutions in segmentation), once their range is handled by calibration-driven clipping and scaling, benefit from more mantissa bits and fewer exponent bits (E4M3, E3M4, or even E2M5) (Kuzmin et al., 2022, Zhang et al., 2023).
- Calibration-driven layer-wise mapping: Small calibration datasets (200–300 samples) suffice for stable format assignment across typical vision models and LLMs (Zhang et al., 2023, Huang et al., 2021).
State-of-the-art systems such as MicroMix further allow mixed FP8/FP6/FP4 “channel-wise” assignments with block scaling, dynamically grouping activations into sub-tensors that receive higher- or lower-precision MXFPx formats for each linear layer, based on analytically derived error upper bounds (Liu et al., 4 Aug 2025).
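A compact sketch of the clipping-threshold and precision-maximization rules described in the bullets above: formats whose representable range would clip more than a small fraction of the tensor are discarded, and the surviving format with the most mantissa bits wins. The threshold value, the candidate list, and the fallback choice are illustrative assumptions.

```python
import numpy as np

# Candidate layouts as (exponent bits, mantissa bits, bias).
FORMATS = {"E5M2": (5, 2, 15), "E4M3": (4, 3, 7), "E3M4": (3, 4, 3), "E2M5": (2, 5, 1)}

def max_normal(e, m, bias):
    return (2 - 2.0**(-m)) * 2.0**((2**e - 2) - bias)

def pick_format(tensor, clip_threshold=0.01):
    """Keep only formats whose range clips at most `clip_threshold` of the values,
    then return the admissible one with the most mantissa bits (finest steps)."""
    mags = np.abs(tensor).ravel()
    admissible = []
    for name, (e, m, bias) in FORMATS.items():
        clip_frac = float(np.mean(mags > max_normal(e, m, bias)))
        if clip_frac <= clip_threshold:
            admissible.append((m, name))
    if not admissible:              # nothing covers the range: fall back to the widest format
        return "E5M2"
    return max(admissible)[1]       # largest mantissa width wins

rng = np.random.default_rng(1)
print(pick_format(rng.normal(scale=0.5, size=10_000)))   # concentrated values -> high-m, narrow-range format
print(pick_format(rng.standard_cauchy(10_000)))          # heavy-tailed values -> wider-range format
```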
5. Practical Impact, Empirical Results, and Limits
The effects of mixed FP8 formats are quantitatively demonstrated across architectures and workloads:
- On ImageNet classification (ResNet, MobileNet, ViT), mixed FP8 recovers to within 0.2 points of FP32 top-1 accuracy, and the All Mixed mode closes the gap to roughly 0.1 points (Zhang et al., 2023). INT8 alone typically loses 0.3–5 points, especially in models with high activation variance.
- On COCO detection and Cityscapes segmentation, mixed FP8 closes nearly all of the gap to full-precision: e.g., Retina-FPN mAP 36.85 (Mixed FP8) vs. 37.00 (FP32) (Zhang et al., 2023).
- On natural language tasks (e.g., BERT on GLUE), FP8-based schemes incur far smaller accuracy losses relative to FP32 baselines than INT8 quantization, which yields an approximately 4.3% performance drop (Zhang et al., 2023).
- Ablation demonstrates that subnormal handling is critical; disabling subnormals can catastrophically degrade accuracy (ResNet-50 top-1 accuracy collapses when subnormals are disabled) (Zhang et al., 2023).
- Mixed FP8 remains effective at bit-widths down to 6, outperforming INT6, which suffers catastrophic accuracy loss (Zhang et al., 2023).
- In inference and training, mixed FP8 reduces memory footprint and bandwidth by 2× compared to FP16 (Micikevicius et al., 2022). In advanced LLM training pipelines, end-to-end FP8 schemes reduce HBM usage by up to 39% and accelerate model throughput by 21–75% versus BF16 (Peng et al., 2023).
6. Software, Integration, and Best Practices
- No retraining is necessary to benefit from mixed FP8 quantization in post-training quantization (PTQ); profiling and calibration suffice (Huang et al., 2021, Kuzmin et al., 2022).
- Per-tensor/per-layer format records or control registers are updated at runtime to instruct hardware on the correct FP8 decoding/encoding, with minimal overhead (Huang et al., 2021, Tahmasebi et al., 27 Nov 2024).
- In training scenarios, scaling factors per tensor and global auto-scaling procedures are deployed to mitigate overflow and underflow, with delayed scaling for distributed gradient communication, as sketched after this list (Peng et al., 2023).
- For fine control, joint learning of scaling and exponent bits can be incorporated into quantization-aware training (QAT) via straight-through estimators, enabling the layerwise format distribution to evolve adaptively (Kuzmin et al., 2022).
- Empirical studies recommend weights and activations in higher-precision FP8 (E4M3, E3M4), gradients in wide-range FP8 (E5M2), and always performing accumulation in at least FP16 to preserve numerical stability (Huang et al., 2021, Zhang et al., 2023, Peng et al., 2023).
- Hardware targets (H100, Blackwell, custom GPGPU, ASIC) should support both mainline and “non-standard” FP8 variants (E3M4, E2M5) to extract maximum benefit from mixed-precision mapping (Tahmasebi et al., 27 Nov 2024, Liu et al., 4 Aug 2025, Zhang et al., 2023).
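Illustrating the per-tensor scaling and the weight/gradient format split recommended above, the sketch below computes a delayed-scaling factor from the max-abs values observed on previous steps before casting. The helper names, the `margin` parameter, and the plain clip standing in for a real FP8 cast are assumptions for illustration, not any specific library's API.

```python
import numpy as np

E4M3_MAX, E5M2_MAX = 240.0, 57344.0   # max normals under the biases listed in Section 1

def compute_scale(amax_history, fp8_max, margin=0.0):
    """Delayed-scaling style factor: size the scale from the max-abs values recorded
    on previous steps so the scaled tensor just fits the FP8 range, minus an
    optional safety margin expressed in powers of two."""
    amax = max(amax_history) if amax_history else 1.0
    return fp8_max / (amax * 2.0**margin) if amax > 0 else 1.0

def scaled_cast(x, fp8_max, amax_history):
    """Scale x, record its amax for future steps, and saturate to the FP8 range.

    The clip is a stand-in for the real FP8 cast; downstream matmuls would consume
    the scaled FP8 tensors, accumulate in FP16/FP32, and divide the result by the
    product of the two operands' scales."""
    scale = compute_scale(amax_history, fp8_max)
    amax_history.append(float(np.max(np.abs(x))))   # record amax for the next step
    return np.clip(x * scale, -fp8_max, fp8_max), scale

rng = np.random.default_rng(0)
w_prev, g_prev = rng.normal(size=(256, 256)) * 0.02, rng.normal(size=(256, 256)) * 5.0
hist_w = [float(np.max(np.abs(w_prev)))]            # amax observed on an earlier step
hist_g = [float(np.max(np.abs(g_prev)))]
w, g = rng.normal(size=(256, 256)) * 0.02, rng.normal(size=(256, 256)) * 5.0
w8, sw = scaled_cast(w, E4M3_MAX, hist_w)           # weights -> precision-oriented E4M3 range
g8, sg = scaled_cast(g, E5M2_MAX, hist_g)           # gradients -> wide-range E5M2
print(sw, sg)
```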
7. Limitations, Future Directions, and Architectural Evolution
Despite their clear empirical benefits, the adoption of mixed FP8 formats presents architectural and design challenges:
- Static hardware can limit the effective range of supported FP8 formats to pre-defined sets (e.g., E4M3/E5M2 in NVIDIA products), constraining the full potential of adaptive quantization. Fully bit-parallel parametric designs (FlexiBit) remove this limitation at minimal area/power penalty, enabling truly arbitrary E, M assignments (Tahmasebi et al., 27 Nov 2024).
- Mixed-precision deployment requires fine-grained software–hardware co-design, including runtime control of format selection and on-the-fly format-specific computation graphs or microcode streams (Tahmasebi et al., 27 Nov 2024, Huang et al., 2021).
- As networks diversify (dense, sparse, LLMs, diffusers, Mixture-of-Experts), workload-specific optimal format signatures will require adaptive calibration and possibly in-training format learning (Kuzmin et al., 2022).
- The transparency and correctness of numerical results in extreme quantization regimes depend critically on correct handling of subnormals, rounding, and exceptional values—failure to implement these can result in silent loss of accuracy or instability (Zhang et al., 2023).
Further research aims to unify automatic quantization, hardware autotuning, and theoretical characterization of FP8 error propagation to enable robust deployment of mixed-precision FP8 across all advanced deep learning workloads.