FP8 Floating-Point Quantization for Neural Networks

Updated 17 March 2026
  • Floating-Point Quantization (FP8) is an 8-bit format that uses a logarithmic grid to represent tensors, offering tunable precision and an extended dynamic range for accurate deep learning computations.
  • The approach scales, casts, and dequantizes data using per-tensor or per-channel calibration to minimize quantization errors and closely preserve full-precision performance.
  • FP8 quantization delivers practical improvements with up to 4× memory savings and 2-4× throughput gains on modern AI accelerators, making it critical for large-scale model inference and training.

Floating-Point Quantization (FP8) is the representation and computation of tensors (weights, activations, gradients) in neural networks using 8-bit floating-point number formats. Unlike fixed-point or integer quantization, which maps values to a uniform grid, FP8 employs a logarithmic quantization grid defined by a limited number of exponent and mantissa bits, offering both extended dynamic range and a degree of precision tunability. FP8 quantization has emerged as a critical enabler of efficient deep learning inference and training—especially for large-scale models—given the widespread adoption of native FP8 arithmetic in contemporary AI accelerators.

1. FP8 Number Formats and Properties

FP8 encodings are generally specified by the tuple (s, e, m), where s is the sign bit, e the number of exponent bits, and m the number of mantissa (fraction) bits, with s + e + m = 8. The two most widely deployed FP8 formats, now standard in both hardware and software stacks, are:

| Format | Sign | Exponent Bits | Mantissa Bits | Bias | Dynamic Range (approx.) | Min Normal | Max Normal |
|--------|------|---------------|---------------|------|-------------------------|------------|------------|
| E4M3   | 1    | 4             | 3             | 7    | [2⁻⁶, 448]              | 0.0156     | 448        |
| E5M2   | 1    | 5             | 2             | 15   | [2⁻¹⁴, 57 344]          | 6.10e-5    | 57 344     |

The value of an FP8-encoded scalar x with sign s, exponent E, and mantissa M is (for normalized values):

x = (-1)^s \cdot 2^{E - \text{bias}} \cdot \left(1 + \frac{M}{2^{m}}\right)

FP8 supports both normal and subnormal numbers; special-value handling (Inf, NaN) varies by encoding but is generally preserved in IEEE-compliant formats (e.g., E5M2), or repurposed to increase dynamic range (e.g., E4M3) (Micikevicius et al., 2022).
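The encodings above can be checked numerically. Below is a minimal sketch (the helper name `fp8_decode` is illustrative); E4M3's repurposed top encoding shows up as the all-ones exponent still yielding normal values, with only the all-ones mantissa at that exponent reserved for NaN:

```python
def fp8_decode(sign, exp, mant, exp_bits, mant_bits, bias):
    """Decode an FP8 bit pattern (split into its fields) to a real value."""
    if exp == 0:  # subnormal: no implicit leading 1, fixed exponent 1 - bias
        mag = (mant / 2**mant_bits) * 2.0**(1 - bias)
    else:         # normal: implicit leading 1
        mag = (1 + mant / 2**mant_bits) * 2.0**(exp - bias)
    return -mag if sign else mag

# E4M3 (bias 7): largest finite value is exponent 1111, mantissa 110 -> 448
print(fp8_decode(0, 15, 6, 4, 3, 7))    # 448.0
# E5M2 (bias 15): largest finite is exponent 11110, mantissa 11 -> 57344
print(fp8_decode(0, 30, 3, 5, 2, 15))   # 57344.0
# Minimum normals: 2**-6 and 2**-14, matching the table above
print(fp8_decode(0, 1, 0, 4, 3, 7), fp8_decode(0, 1, 0, 5, 2, 15))
```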

FP8’s key advantage over INT8 is its non-uniform, exponentially spaced grid, which enables a large dynamic range and higher resolution near zero, permitting better representation of the heavy-tailed and outlier-prone distributions encountered in deep models (Li et al., 2023, Kuzmin et al., 2022). The trade-off between range (driven by exponent bits) and precision (driven by mantissa bits) is critical: formats with more exponent bits perform better on outlier-heavy data, while more mantissa bits yield lower mean-squared error on "well-behaved" (e.g., Gaussian) distributions.

2. Quantization and Dequantization Algorithms

FP8 quantization typically involves the following workflow:

  1. Scaling: For a tensor r (weight or activation), determine a scale factor S > 0 (per-tensor, per-channel, or per-group) such that |r|/S fits the range of the chosen FP8 format. For E4M3, S = \max(|r|)/448.
  2. Casting: Each element is quantized using a rounding operator (to nearest FP8 value, typically ties to even) and clamped to the normalized FP8 range:

Q_{\text{fp8}}(r) = \text{Cast}_{\text{fp32}\to\text{fp8}}(r/S)

  3. Dequantization: Convert back to full precision by reversing the scale:

D_{\text{fp8}}(q) = S \cdot q
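The three steps above can be sketched with a simulated ("fake") E4M3 quantizer. This is illustrative only: the nearest-value lookup approximates round-to-nearest without implementing ties-to-even, and the grid follows the E4M3 layout described in Section 1:

```python
import numpy as np

def e4m3_grid():
    """All finite E4M3 values (bias 7, max 448; the all-ones exponent with
    mantissa 111 encodes NaN and is skipped)."""
    pos = []
    for E in range(16):
        for M in range(8):
            if E == 15 and M == 7:               # NaN encoding, not a value
                continue
            if E == 0:                            # subnormals
                pos.append((M / 8) * 2.0**-6)
            else:                                 # normals
                pos.append((1 + M / 8) * 2.0**(E - 7))
    pos = np.array(sorted(set(pos)))
    return np.concatenate([-pos[::-1], pos])

GRID = e4m3_grid()

def fake_quant_e4m3(r):
    S = np.abs(r).max() / 448.0                                        # 1. scaling
    q = GRID[np.abs(GRID[None, :] - (r / S)[:, None]).argmin(axis=1)]  # 2. casting
    return S * q                                                       # 3. dequantization

r = np.array([-1.0, 0.03, 0.5, 2.7, 100.0])
print(fake_quant_e4m3(r))
```

Note how the largest-magnitude element maps exactly onto the top of the grid, while mid-range values pick up a small relative rounding error set by the mantissa width.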

Best practice is to use channel-wise scales for weights and layer-wise for activations (Li et al., 2023, Shen et al., 2023). Quantization and dequantization are generally implemented with efficient vectorized mapping and hardware-friendly rounding (nearest-even or stochastic, depending on the system) (Micikevicius et al., 2022, Shen et al., 2023).
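These scale conventions can be sketched as follows; the tensor shapes are hypothetical, and 448 is the E4M3 max normal from Section 1:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))    # hypothetical weights: (out_channels, in_features)
A = rng.normal(size=(16, 8))   # hypothetical activation batch

# Channel-wise scales for weights: one scale per output channel
S_w = np.abs(W).max(axis=1, keepdims=True) / 448.0
# Layer-wise (per-tensor) scale for activations
S_a = np.abs(A).max() / 448.0

print(S_w.shape)   # (4, 1): broadcasts against W during quantization
print(float(S_a))
```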

For mixed-precision and quantization-aware workflows, the choice of formatting can be made per-tensor based on minimizing quantization MSE or other error criteria; this can be searched efficiently during calibration (Zhang et al., 2023, Dotzel et al., 2023).

3. Calibration and Format Selection

Calibration drives the selection of scale factors and, for flexible formats, format selection per tensor or layer. Typical PTQ (Post-Training Quantization) calibration involves:

  • Feeding a small calibration set (1k-8k samples) through the high-precision model.
  • Recording extrema (min/max) or statistical moments (mean, variance) per tensor.
  • Computing scale factors S as \max(|r|)/X_{\max,\text{fp8}} (for static scaling).
  • Optionally searching among candidate FP8 formats (e.g., E5M2, E4M3, E3M4) to minimize layer-wise MSE or maximize representation of the target value distribution (Zhang et al., 2023, Dotzel et al., 2023, Kuzmin et al., 2022).
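A per-tensor format search of this kind can be sketched as below. The names `minifloat_grid`, `fake_quant`, and `select_format` are illustrative, and special-value handling is simplified by treating the all-ones exponent as ordinary normals:

```python
import numpy as np

def minifloat_grid(exp_bits, mant_bits, bias):
    """Finite values of a generic sign+exponent+mantissa minifloat (the
    all-ones exponent is kept as ordinary normals -- a simplification)."""
    vals = {(M / 2**mant_bits) * 2.0**(1 - bias) for M in range(2**mant_bits)}
    for E in range(1, 2**exp_bits):
        vals |= {(1 + M / 2**mant_bits) * 2.0**(E - bias) for M in range(2**mant_bits)}
    pos = np.array(sorted(v for v in vals if v > 0))
    return np.concatenate([-pos[::-1], [0.0], pos])

def fake_quant(r, grid):
    S = np.abs(r).max() / grid.max()   # static max-calibrated scale
    q = grid[np.abs(grid[None, :] - (r / S)[:, None]).argmin(axis=1)]
    return S * q

FORMATS = {"E5M2": (5, 2, 15), "E4M3": (4, 3, 7), "E3M4": (3, 4, 3)}

def select_format(calib):
    mses = {name: float(np.mean((fake_quant(calib, minifloat_grid(*cfg)) - calib) ** 2))
            for name, cfg in FORMATS.items()}
    return min(mses, key=mses.get), mses

gaussian = np.random.default_rng(0).normal(size=4096)   # a well-behaved distribution
best, mses = select_format(gaussian)
print(best)   # more mantissa bits tend to win on Gaussian-like data
```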

Flexible mixed-precision frameworks select the format per layer that minimizes quantization error or maximizes empirical accuracy on a calibration set, enabling “mixture-of-formats” quantization for further accuracy recovery without increasing hardware cost (Zhang et al., 2023).

For aggressive quantization (e.g., INT4/FP4), learned rounding (with differentiable surrogates for rounding, e.g., AdaRound-style for FP formats) can further minimize quantization error (Chen et al., 2024, Aggarwal et al., 2023).
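The learned-rounding idea can be illustrated with the rectified-sigmoid relaxation used by AdaRound; this is the uniform-grid version as a sketch only, since adapting it to an FP grid (as the cited works do) requires an exponent-dependent step size. Here `alpha` is the per-weight learnable parameter:

```python
import numpy as np

def soft_round(x, alpha, zeta=1.1, gamma=-0.1):
    """AdaRound-style relaxation: floor(x) plus a learnable offset h in [0, 1].

    h is a rectified sigmoid of alpha; during optimization a regularizer
    pushes h toward {0, 1}, so the rounding direction is learned per weight.
    """
    h = np.clip(1.0 / (1.0 + np.exp(-alpha)) * (zeta - gamma) + gamma, 0.0, 1.0)
    return np.floor(x) + h

print(soft_round(2.3, alpha=10.0))   # -> 3.0 (learned to round up)
print(soft_round(2.3, alpha=-10.0))  # -> 2.0 (learned to round down)
```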

4. Empirical Accuracy, Performance, and Applications

FP8 quantization achieves near-baseline or full-precision accuracy across a wide range of models when properly calibrated and combined with mixed-precision best practices:

  • Transformer-based NLP Models: On BERT-Base, PTQ with E4M3 recovers full-precision GLUE accuracy, while INT8 PTQ catastrophically fails (drops of 50+ points on many GLUE tasks). FP8 also recovers nearly all SQuAD F1/EM (Li et al., 2023).
  • LLM Inference: W8A8 (8-bit weight and activation) and W4A8 pipelines with FP8 activations (plus INT4/FP4 weights and mixed-precision selection) match or exceed FP16 on Llama and OPT, with MoFQ recovering to within 0.1 perplexity of FP16 (Zhang et al., 2023, 2505.20839).
  • Diffusion/Image Models: FP8 quantization enables ≤0.1 FID loss on Stable Diffusion and LDM, while INT8 degrades sample quality noticeably (Chen et al., 2024, Shen et al., 2023).
  • Vision and ViT/Transformer Models: Mixed FP8 (E4M3/E3M4) architectures match or exceed INT8 top-1 accuracy on ResNet, ViT, and segmentation tasks (Zhang et al., 2023, Micikevicius et al., 2022).
  • Fine-tuning and Training: Shifted and Squeezed FP8 (S2FP8) enables stable, low-precision end-to-end training with no loss-scaling and minimal accuracy degradation (<1pt for ImageNet, 0pt for Transformer BLEU) (Cambier et al., 2020). FP8-Flow-MoE accelerates MoE training and reduces memory by >16GB/GPU at full stability (Wang et al., 4 Nov 2025). LoRA fine-tuning can be accelerated with FP8 via merged adapter techniques (Choi et al., 28 Oct 2025).

FP8 quantization yields 4× memory savings and significant throughput increase (2-4× on Hopper-class hardware), with speedup limited primarily by memory bandwidth and operator support rather than ALU capacity (Li et al., 2023, Kim et al., 3 Feb 2025).

5. Hardware Implementation and Efficiency

Modern accelerators, including NVIDIA Hopper (H100), Intel Gaudi2, AMD CDNA3, and custom FPGAs, support native FP8 arithmetic. Empirical and area models show that:

  • FP8/INT8 MAC units: 8-bit FP/INT multipliers with 32-bit accumulators are cost-equivalent in logic and area at 8 bits; FP8 area overhead is <5% on constrained designs (Zhang et al., 2023, Zhang et al., 2023).
  • Energy and Latency: On digital compute-in-memory (CIM) arrays, FP8 achieves up to 2.8× higher energy efficiency than previous FP8 macros; flexible, shift-aware bitwidth adaptation on FP8 mantissas (DSBP) allows for tunable tradeoff between accuracy and efficiency (Zhao et al., 5 Feb 2026).
  • Operator Support: Nonlinearities (LayerNorm, GELU, Softmax) are typically kept in BF16/FP16 to prevent accuracy degradation (Li et al., 2023, Shen et al., 2023).
  • FP8 in FPGAs: Minifloat multiply-accumulate chains (3–8 bits total) can approach INT8 accuracy-resource Pareto at higher bitwidths, with FP8/FP6 better suited for outlier-rich tasks (e.g., ViT), while INT8 is still optimal for tight resource envelopes (Aggarwal et al., 2023).

Emergent deployment paradigms include mixed-precision (MoFQ), dynamic runtime adaptation (NestedFP), continuous FP8 flows (FP8-Flow-MoE), and joint integer/FP search—each leveraging hardware support to minimize overhead (Lee et al., 29 May 2025, Wang et al., 4 Nov 2025, Dotzel et al., 2023).

6. Analysis: Advantages, Trade-offs, and Limitations

Advantages:

  • Dynamic Range: The exponent field enables accurate quantization of outlier-heavy or widely varying tensors. This is crucial for transformers, LLMs, diffusion models, and other architectures where INT8 PTQ fails (Li et al., 2023, Kuzmin et al., 2022).
  • Precision Tunability: Choice of mantissa/exponent bits allows tuning the trade-off: more mantissa for precision (E3M4, E2M5), more exponent for outlier robustness (E5M2, E4M3) (Kuzmin et al., 2022, Zhang et al., 2023).
  • Operator Coverage: FP8 can be applied to a broader set of operators—including elementwise, layernorm, and convolution—than uniform-grid INT8 (Shen et al., 2023).
  • Scalable Efficiency: 4× memory savings and up to 2× hardware throughput, with negligible area/power penalties at 8 bits (Li et al., 2023, Kim et al., 3 Feb 2025, Zhao et al., 5 Feb 2026).

Limitations:

  • Clipping/Saturation: Small or very large models or extremely quantization-sensitive pathways may lose >1pt accuracy; E5M2 formats or QAT can mitigate but not always eliminate this (Li et al., 2023, Micikevicius et al., 2022).
  • Non-uniform grid: Interpretation and error analysis are more complex than for uniform INT8 and can require empirical tuning.
  • On-device Inference: On edge hardware without native FP8, INT8/INT4 remains preferable due to hardware cost and efficiency; FP8 is most effective in datacenter settings (Baalen et al., 2023, Zhang et al., 2023).
  • Operator Coverage and Pipelines: Certain numerically sensitive ops (e.g., Softmax) must remain at higher precision in a mixed pipeline (Li et al., 2023, 2505.20839).
  • Calibration dependence: Performance is highly sensitive to scale calibration and selection of per-channel vs. per-tensor quantization (Li et al., 2023, Zhang et al., 2023).

7. Best Practices and Practical Guidelines

  • Format selection: Use E4M3 for NLP/LLMs and E3M4 for computer vision, based on empirical workload coverage (Shen et al., 2023).
  • Mixed precision: Consider mixed formats or per-layer format selection (as in MoFQ or FLIQS) for the best accuracy-efficiency trade-off (Zhang et al., 2023, Zhang et al., 2023, Dotzel et al., 2023).
  • Calibration: Apply per-channel scaling for weights and per-layer scaling for activations; 1k–8k calibration samples suffice (Li et al., 2023, Shen et al., 2023).
  • Algorithmic tuning: Employ learned bias, per-channel flexible bias (instead of global exponent bias), and, in challenging low-bitwidth scenarios, learned rounding (Kuzmin et al., 2022, Chen et al., 2024).
  • Operator policy: Retain BF16/FP16 for non-linearities such as LayerNorm/GELU/Softmax (Li et al., 2023).
  • Hardware targeting: Prioritize FP8 on platforms with native support (H100, Gaudi2, AMD CDNA3, custom FPGAs with CIM arrays), falling back to INT8 elsewhere (Zhao et al., 5 Feb 2026).
  • Task adaptation: Use E4M3/E5M2 in LLMs, E3M4/E4M3 in ViTs, aggressive minifloats or signed/unsigned FFP8 for distribution-matched quantization in resource-limited CPUs or FPGAs (Huang et al., 2021, Aggarwal et al., 2023).
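The operator and format guidelines above can be condensed into a per-operator precision policy. The dictionary below is a hypothetical sketch, with op names and format strings purely illustrative and not tied to any framework:

```python
# Hypothetical precision policy encoding the guidelines above:
# FP8 for matmul-heavy ops, BF16 for numerically sensitive non-linearities.
POLICY = {
    "linear.weight":     "fp8_e4m3",  # channel-wise scales (NLP/LLM workloads)
    "linear.activation": "fp8_e4m3",  # layer-wise scales
    "conv.weight":       "fp8_e3m4",  # vision workloads favor more mantissa bits
    "layernorm":         "bf16",      # keep non-linearities in higher precision
    "gelu":              "bf16",
    "softmax":           "bf16",
}

for op, fmt in POLICY.items():
    print(f"{op}: {fmt}")
```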

FP8 quantization now constitutes a foundational primitive in both research and production for efficient, accurate deep neural network computation, with the flexible deployment and tunable format design overcoming the accuracy and operator barriers encountered in prior INT8-driven quantization regimes (Li et al., 2023, Shen et al., 2023, Zhang et al., 2023, Micikevicius et al., 2022).
