Flexible FP8 Quantization
- Flexible FP8 quantization is a method that dynamically allocates exponent and mantissa bits in 8-bit floating-point formats to balance dynamic range and precision.
- It employs both static and dynamic scaling strategies with varying granularity (per-tensor, per-channel, block-wise) to optimize hardware throughput and memory usage.
- Empirical results demonstrate <1% accuracy degradation while achieving substantial speedups and reduced memory footprint in large-scale deep learning models.
Flexible FP8 quantization encompasses a broad family of procedures and numerical formats that exploit the flexibility in assigning exponent and mantissa bits within 8-bit floating-point representations for neural network inference and training. By tuning the assignment of dynamic range versus precision, scaling strategy, and quantization granularity, practitioners can maximize algorithmic accuracy, hardware throughput, and memory efficiency for diverse deep learning architectures, all while maintaining <1% accuracy degradation in state-of-the-art large model deployments (Lee et al., 13 Mar 2025).
1. FP8 Format Definitions and Parameterization
The core of flexible FP8 quantization is the selection and parameterization of 8-bit floating-point formats. Let an FP8 datum (for normal values) be specified as:

$$x = (-1)^{s} \cdot 2^{E - \mathrm{bias}} \cdot \left(1 + \frac{M}{2^{m}}\right)$$

where
- $s$: the 1-bit sign
- $e$: the number of exponent bits (stored exponent field $E$)
- $m$: the number of mantissa bits (stored mantissa field $M$), with $1 + e + m = 8$
- bias: the exponent bias, conventionally $2^{e-1} - 1$
Standardized formats include E4M3 ($e=4$, $m=3$, bias $=7$, maximum magnitude $240$ to $448$ on different hardware) and E5M2 ($e=5$, $m=2$, bias $=15$, range $\pm 57344$) (Lee et al., 13 Mar 2025). The allocation of $e$ and $m$ can be dynamically chosen per layer, channel, or tensor, responsive to each layer's dynamic range and precision requirements (Aggarwal et al., 2023, Kuzmin et al., 2022, Huang et al., 2021).
For further flexibility, formats such as E3M4 and E2M5 are supported in some frameworks, and even the bias can be varied as a tunable parameter—a crucial feature in FFP8 (Huang et al., 2021).
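As an illustration of this parameterization, the following sketch derives the headline properties of a candidate $(e, m, \text{bias})$ split. It assumes an IEEE-style minifloat layout in which the top exponent code is reserved for Inf/NaN; actual hardware encodings (e.g., the saturating E4M3 variant with a 448 maximum) deviate from this, so the derived maxima are approximate.

```python
def fp8_format_params(e_bits, m_bits, bias=None):
    """Derive key properties of an FP8 format from its (e, m, bias) split.

    Assumes an IEEE-754-style minifloat in which the all-ones exponent code
    is reserved, so hardware variants such as OCP E4M3 (max 448) differ.
    """
    assert 1 + e_bits + m_bits == 8, "sign + exponent + mantissa must total 8 bits"
    if bias is None:
        bias = 2 ** (e_bits - 1) - 1                 # conventional exponent bias
    max_exp = (2 ** e_bits - 2) - bias               # largest usable exponent, unbiased
    return {
        "bias": bias,
        "max_normal": 2.0 ** max_exp * (2 - 2.0 ** (-m_bits)),
        "min_normal": 2.0 ** (1 - bias),
        "min_subnormal": 2.0 ** (1 - bias - m_bits),
    }

# Compare dynamic range vs. precision of two standard splits.
print(fp8_format_params(5, 2))   # E5M2: wide range (max 57344), coarse steps
print(fp8_format_params(4, 3))   # E4M3: narrower range (max 240 here), finer steps
```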
2. Quantization Pipeline and Scaling Strategies
Flexible FP8 quantization proceeds via a two-phase process: scaling and casting to 8-bit floating-point. For any tensor $X$ (a minimal sketch follows this list):
- Compute the scale $s$ (chosen per-tensor, per-channel, per-block, or per-token), with $q_{\max}$ the largest representable FP8 magnitude:
  - Per-tensor: $s = \max|X| / q_{\max}$
  - Per-channel: $s_c = \max|X_{:,c}| / q_{\max}$ for each channel $c$
  - Block-wise: $s_b = \max|X_b| / q_{\max}$ for each block $X_b$
- Quantize: $X_s = X / s$, which maps the tensor into the representable FP8 range
- Cast to FP8 bit-pattern using the target encoding (e.g., E4M3, E5M2): $X_q = \mathrm{cast}_{\mathrm{FP8}}(X_s)$
- Dequantize at inference or in higher-precision accumulators: $\hat{X} = s \cdot X_q$ (Lee et al., 13 Mar 2025)
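A minimal per-tensor round-trip in PyTorch, assuming a build that exposes `torch.float8_e4m3fn`; production kernels would fuse the scaling into the GEMM rather than materializing the scaled tensor:

```python
import torch

FP8_E4M3_MAX = 448.0   # largest finite magnitude of the (OCP-style) E4M3 encoding

def quantize_fp8_per_tensor(x: torch.Tensor):
    """Scale-and-cast a tensor to FP8 (E4M3); returns (x_fp8, scale)."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX        # per-tensor scale s
    x_scaled = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)    # guard against overflow
    return x_scaled.to(torch.float8_e4m3fn), scale               # cast to the FP8 bit-pattern

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a higher-precision approximation: x_hat = s * x_q."""
    return x_fp8.to(torch.float32) * scale

# Round-trip example: the reconstruction error stays small relative to |x|.
x = torch.randn(4, 8) * 3.0
x_fp8, s = quantize_fp8_per_tensor(x)
print((x - dequantize_fp8(x_fp8, s)).abs().max())
```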
Two main scaling regimes are used:
- Static scaling: Calibration (collect max, percentiles, histograms) performed offline; scales are fixed during inference. Fast, but risks overflow/underflow on out-of-domain data.
- Dynamic scaling: Recompute scale per batch or per group at runtime. Ensures high quantization fidelity, especially for rare activation spikes, at the cost of minor overhead (Lee et al., 13 Mar 2025, Li et al., 2023).
Hardware acceleration often restricts $s$ to power-of-two values (e.g., on Intel Gaudi2/3), enabling exponent manipulation instead of multiplications (Lee et al., 13 Mar 2025).
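A small sketch of rounding a calibrated scale to a power of two; rounding the scale up (rather than down) keeps $X/s$ inside the representable range at the cost of leaving a little headroom unused. The helper below is illustrative, not an accelerator API:

```python
import math

def pow2_scale(amax: float, q_max: float = 448.0) -> float:
    """Round the per-tensor scale amax / q_max up to the nearest power of two."""
    s = max(amax, 1e-12) / q_max
    return 2.0 ** math.ceil(math.log2(s))

# Static scaling: amax is calibrated offline and fixed; dynamic scaling would
# recompute amax per batch (or per group) before calling this.
print(pow2_scale(310.0))   # -> 1.0, since 310/448 ≈ 0.69 and ceil(log2(0.69)) = 0
```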
3. Granularity and Mixed-Precision Heuristics
Granularity of scaling and quantization dictates the trade-off between hardware efficiency, memory, and accuracy (see the sketch after this list):
- Per-tensor: Cheapest in storage and fastest (best for large GEMMs), but may introduce up to ~0.3% loss (PPL, MMLU).
- Per-channel: Superior accuracy (<0.1% drop), at modest scale memory and latency cost (Lee et al., 13 Mar 2025, Aggarwal et al., 2023).
- Block-wise: Future hardware support expected; intermediate accuracy and efficiency (Lee et al., 13 Mar 2025, Aggarwal et al., 2023).
- Per-token: For activations in transformers, preserves small features/outliers at minimal accuracy loss (Wang et al., 26 Sep 2025).
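The options above differ mainly in which axes the absolute-maximum reduction runs over. A minimal NumPy sketch, with the layout convention (rows = tokens, columns = channels) and block size assumed for illustration:

```python
import numpy as np

FP8_E4M3_MAX = 448.0

def fp8_scales(x: np.ndarray, granularity: str, block: int = 128) -> np.ndarray:
    """Scales for a 2-D activation matrix x of shape (tokens, channels)."""
    amax = np.abs(x)
    if granularity == "per-tensor":
        return np.array(amax.max() / FP8_E4M3_MAX)
    if granularity == "per-channel":            # one scale per column
        return amax.max(axis=0) / FP8_E4M3_MAX
    if granularity == "per-token":              # one scale per row
        return amax.max(axis=1, keepdims=True) / FP8_E4M3_MAX
    if granularity == "block-wise":             # one scale per (block x block) tile
        t, c = x.shape
        tiles = amax[: t - t % block, : c - c % block]   # trailing remainder ignored for brevity
        tiles = tiles.reshape(t // block, block, c // block, block)
        return tiles.max(axis=(1, 3)) / FP8_E4M3_MAX
    raise ValueError(f"unknown granularity: {granularity}")

x = np.random.randn(256, 512).astype(np.float32)
print(fp8_scales(x, "per-channel").shape)   # (512,)
print(fp8_scales(x, "block-wise").shape)    # (2, 4)
```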
Mixed-precision workflows frequently deploy different FP8 formats for weights, activations, and gradients. Common heuristics include (see the sketch after this list):
- E4M3 for forward/inference, maximizing precision if dynamic range permits.
- E5M2 (or higher exponent-count) for gradients, moments, or activations with frequent outliers (Lee et al., 13 Mar 2025, Fishman et al., 19 Sep 2024).
- Per-layer or per-channel format adaptation, with grid search or gradient-based optimization to minimize quantization MSE or task loss (Aggarwal et al., 2023, Huang et al., 2021).
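A minimal sketch of such a role-based assignment using PyTorch's FP8 dtypes; the role names are assumptions for illustration, and scaling (Section 2) is assumed to have been applied before the cast:

```python
import torch

# Heuristic role-to-format map: E4M3 where precision matters and the dynamic
# range permits, E5M2 where outliers or gradient spikes demand extra range.
FP8_FORMAT_FOR_ROLE = {
    "weights":     torch.float8_e4m3fn,
    "activations": torch.float8_e4m3fn,
    "gradients":   torch.float8_e5m2,
}

def cast_for_role(t_scaled: torch.Tensor, role: str) -> torch.Tensor:
    """Cast an already-scaled tensor to the FP8 format chosen for its role."""
    return t_scaled.to(FP8_FORMAT_FOR_ROLE[role])
```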
4. Algorithmic Best Practices and Empirical Impact
Implementation should adhere to empirically substantiated best practices (a sketch combining several of them follows this list):
- Offline quantization: Quantize weights per-output-channel for best accuracy. Power-of-two rounded scaling is preferable for compatibility with fast hardware.
- Online quantization: Activations quantized per-tensor statically for throughput, per-channel or dynamic if accuracy is paramount.
- First and last layers (“embedding” and “lm-head” in transformers) may be left in higher precision or exempted if quantization induces >1% degradation (Lee et al., 13 Mar 2025, Li et al., 2023).
- Use a “backoff” factor on the scale to leave extra headroom and reduce overflow risk.
- For a transformer-layer proof of concept, see the pseudocode in (Lee et al., 13 Mar 2025).
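A minimal offline-weight sketch combining per-output-channel scales, power-of-two rounding, and a backoff factor; the backoff value, layer-name filter, and shapes are illustrative assumptions rather than prescriptions from the cited works:

```python
import torch

FP8_E4M3_MAX = 448.0
HIGH_PRECISION_LAYERS = ("embed_tokens", "lm_head")   # example names left unquantized

def weight_scales(w: torch.Tensor, backoff: float = 0.9) -> torch.Tensor:
    """Per-output-channel, power-of-two-rounded scales for a weight matrix
    of shape (out_features, in_features); backoff < 1 leaves headroom."""
    amax = w.abs().amax(dim=1).clamp(min=1e-12)        # one amax per output channel
    s = amax / (backoff * FP8_E4M3_MAX)
    return 2.0 ** torch.ceil(torch.log2(s))            # round up so w / s stays in range

def should_quantize(layer_name: str) -> bool:
    return not any(tag in layer_name for tag in HIGH_PRECISION_LAYERS)

w = torch.randn(4096, 4096)
s = weight_scales(w)                                   # shape: (4096,)
w_fp8 = (w / s[:, None]).to(torch.float8_e4m3fn)       # offline, per-output-channel
```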
The accuracy penalty is tightly bounded: in Llama2/3 and Mistral models (7B–70B parameters), per-tensor scaling costs <2% in PPL and <0.5% on MMLU, and per-channel scaling reduces this further to <0.3% (Lee et al., 13 Mar 2025). In vision and language tasks, FP8 quantization outperforms INT8 in both accuracy and stability, e.g., recovering to within 0.1–0.3% of FP32 accuracy (Zhang et al., 2023, Shen et al., 2023, Li et al., 2023, Kuzmin et al., 2022).
Large-matrix matmuls ($4096^3$–$8192^3$) reach up to 865 TFLOPs at >92–98% MFU (machine floating-point unit utilization) using hardware-accelerated scaling, compared to ~430 TFLOPs in BF16—a near 2× speedup. Memory footprint is halved, enabling large models (e.g., 70B LLMs) on single devices without tensor-parallel sharding (Lee et al., 13 Mar 2025).
5. Design Trade-Offs: Dynamic Range vs. Precision
Increasing the exponent bit count ($e$) expands the dynamic range, which is beneficial in layers with heavy-tailed or outlier-ridden distributions (e.g., transformer activations, normalized representations). Applications requiring tight MSE minimization on bell-shaped, light-tailed distributions (e.g., many CNNs) instead allocate more mantissa bits ($m$) to enhance precision (Kuzmin et al., 2022, Aggarwal et al., 2023, Huang et al., 2021).
Quantization MSE decomposes into rounding and clipping error: for heavy-tailed distributions, more exponent bits reduce clipping error by accommodating outliers; for tightly bounded weights/activations, fewer exponent bits can be spent, maximizing local precision (Kuzmin et al., 2022).
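A common way to write this decomposition, with $c$ the clipping threshold (the largest representable magnitude after scaling) and $Q(\cdot)$ the rounding operator; this expectation form is a standard formulation rather than a quotation from the cited work:

$$\mathbb{E}\big[(x-\hat{x})^2\big] = \underbrace{\mathbb{E}\big[(x - Q(x))^2\,\mathbf{1}\{|x|\le c\}\big]}_{\text{rounding error}} + \underbrace{\mathbb{E}\big[(x - c\,\mathrm{sign}(x))^2\,\mathbf{1}\{|x| > c\}\big]}_{\text{clipping error}}$$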
Selection of per-layer format may be grid-searched post hoc (PTQ) or even optimized “in the loop” via quantization-aware training (simulating different $(e, m)$ splits as differentiable variables) (Kuzmin et al., 2022, Aggarwal et al., 2023, Huang et al., 2021). In quantization-aware training, flexible division of exponent and mantissa bits loses some of its advantage, as networks learn to regularize the effect of outliers and adapt to hardware-imposed limits (Kuzmin et al., 2022).
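A minimal PTQ-style sketch of the grid search: for each tensor, pick the $(e, m)$ split minimizing quantization MSE under a simulated minifloat cast after per-tensor scaling. The simulation helper and candidate grid are assumptions for illustration:

```python
import numpy as np

CANDIDATES = [(5, 2), (4, 3), (3, 4), (2, 5)]   # (e, m) splits: E5M2 ... E2M5

def simulate_minifloat(x: np.ndarray, e: int, m: int) -> np.ndarray:
    """Round x to the nearest minifloat value with e exponent / m mantissa bits
    (IEEE-style bias, saturating at the max normal). A simulation, not a bit cast."""
    bias = 2 ** (e - 1) - 1
    max_normal = 2.0 ** ((2 ** e - 2) - bias) * (2 - 2.0 ** (-m))
    min_exp = 1 - bias
    xc = np.clip(x, -max_normal, max_normal)
    # Binade of each value; the subnormal region shares the minimum exponent.
    exp = np.floor(np.log2(np.maximum(np.abs(xc), 2.0 ** min_exp)))
    step = 2.0 ** (exp - m)                      # quantization step in that binade
    return np.round(xc / step) * step

def best_format(w: np.ndarray):
    """Grid-search the (e, m) split minimizing MSE after per-tensor scaling."""
    errs = {}
    for e, m in CANDIDATES:
        bias = 2 ** (e - 1) - 1
        max_normal = 2.0 ** ((2 ** e - 2) - bias) * (2 - 2.0 ** (-m))
        s = np.abs(w).max() / max_normal         # scale this tensor into the format's range
        errs[(e, m)] = np.mean((w - s * simulate_minifloat(w / s, e, m)) ** 2)
    return min(errs, key=errs.get)

w_heavy = np.random.standard_cauchy(10_000) * 0.01   # heavy-tailed tensor
w_light = np.random.randn(10_000) * 0.01             # light-tailed tensor
# Typically selects a higher-e format for the heavy-tailed tensor than the light-tailed one.
print(best_format(w_heavy), best_format(w_light))
```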
6. Hardware Implementation and Compatibility
Production-worthy implementations demand hardware with first-class FP8 arithmetic support, including decoder, mantissa/exponent arithmetic logic, and accumulation in BF16/FP32. Power-of-two scales can be implemented via exponent-bias adjustment in hardware, avoiding costly per-element multiplication (Lee et al., 13 Mar 2025).
Hardware support for multiple FP8 formats (E4M3, E5M2) is common (Intel Gaudi, NVIDIA H100), and efficient MACs with a pipeline depth of two stages and minimal LUT usage can be synthesized for custom minifloat formats (Aggarwal et al., 2023, Huang et al., 2021). INT8×FP8 multiplication is generally unsupported, to avoid hybrid datapaths (Zhang et al., 2023). Integration with model deployment pipelines is further simplified by supporting direct FP32-to-FP8 casting and FP8-to-FP32 converters with negligible area/latency overhead (Huang et al., 2021).
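To illustrate why power-of-two scales are cheap in hardware, the sketch below multiplies FP32 values by $2^{k}$ purely by adjusting the exponent field, mirroring the exponent-bias trick; it ignores zeros, subnormals, and overflow for brevity:

```python
import numpy as np

def scale_by_pow2_via_exponent(x: np.ndarray, k: int) -> np.ndarray:
    """Multiply float32 values by 2**k by adding k to the IEEE-754 exponent field.
    Illustrative only: zeros, subnormals, and exponent overflow are not handled."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    exp = (bits >> np.uint32(23)) & np.uint32(0xFF)                  # 8-bit exponent field
    new_exp = ((exp.astype(np.int64) + k) & 0xFF).astype(np.uint32)  # bias adjustment
    new_bits = (bits & np.uint32(0x807FFFFF)) | (new_exp << np.uint32(23))
    return new_bits.view(np.float32)

x = np.array([1.5, -3.0, 0.75], dtype=np.float32)
print(scale_by_pow2_via_exponent(x, 4))   # == x * 16 -> [ 24. -48.  12.]
```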
7. Future Directions and Limitations
Flexible FP8 quantization continues to expand its design envelope:
- Block-wise scaling and scaling-aware transpose operators eliminate double-quantization artifacts, enabling nearly cast-free dataflows (FP8-Flow-MoE) with only two Q/DQ boundaries, yielding 21% throughput gains and 16.5 GB memory savings in 671B-parameter MoE models (Wang et al., 4 Nov 2025).
- Layer-adaptive scaling, learned (shifted/squeezed) scaling, and quantization-aware training “learn” the optimal dynamic range/precision compromise in-situ (Cambier et al., 2020).
- Hybrid-granularity or token-wise scaling strategies are applied in transformer LLM pretraining/fine-tuning for near-lossless reasoning accuracy and 14–22% speedups (Wang et al., 26 Sep 2025).
- Power-law “dynamic range expansion” and mixed per-tensor/per-group quantization in COAT enable full-parameter FP8 training with 1.5× memory reduction and 1.4× speedup over BF16 (Xi et al., 25 Oct 2024).
- Efficient per-tensor FP8 quantization, unlocked by suppressing mechanical activation outliers (TWEO), enables W8A8 PTQ (static, symmetric) for state-of-the-art LLMs, fundamentally shifting quantization software/hardware co-design (Liang et al., 28 Nov 2025).
In summary, flexible FP8 quantization orchestrates format, scaling, and granularity choices across tensors and layers to attain near full-precision model accuracy with substantial memory and computational savings. These mechanisms are now robustly validated across vision, language, and generative architectures with both empirical and analytical support (Lee et al., 13 Mar 2025, Kuzmin et al., 2022, Li et al., 2023, Huang et al., 2021).