FP8 Quantization Workflow
- The FP8 quantization workflow is a hardware/software pipeline that uses 8-bit floating-point formats (e.g., E4M3, E5M2) to efficiently represent neural network parameters and activations.
- It addresses numerical instability and scaling errors through custom regularization, dynamic scaling, and mixed-precision allocation, ensuring robust training and inference.
- The workflow leverages advanced hardware support from NVIDIA Hopper, Intel Gaudi, and FPGAs, and is applied in large-scale models such as LLMs, MoE, and video diffusion.
Floating-Point 8-Bit (FP8) Quantization Workflow
Floating-point 8-bit (FP8) quantization defines hardware and software pipelines for representing and computing neural network parameters and activations using 8-bit floating-point formats. The goal is to reduce memory and computational costs while retaining statistical integrity and performance, especially in large-scale settings such as LLM pretraining, MoE models, video diffusion, and vision/transformer architectures. Recent hardware platforms (NVIDIA Hopper, Intel Gaudi, RTX 4090, FPGAs) provide native FP8 support, making these workflows operationally relevant. Fundamental challenges addressed by modern FP8 workflows include numerical instability due to outliers, error amplification from scaling inconsistencies, and the need for efficiency without architecture-specific hacks. Solutions span custom regularization, dynamic granularity, mixed-precision allocation, specialized loss modifications, and hardware-aware pipelines for both training and post-training quantization.
1. FP8 Format Fundamentals and Quantization Principles
FP8 quantization relies on binary layouts with a 1-bit sign, various exponent (E) and mantissa (M) arrangements (e.g., E4M3, E5M2, E3M4), and explicit scale factors driving an affine mapping between high-precision tensors and FP8 codes. A normal FP8 value decodes as $(-1)^s \cdot 2^{E-\mathrm{bias}} \cdot (1 + M/2^{m})$, where $m$ is the mantissa width and the exponent bias is format-dependent. The common formats are:
- E4M3 (4 exponent, 3 mantissa): dynamic range up to ±448, typical for forward pass weights/activations (Liang et al., 28 Nov 2025, Li et al., 2023, Wang et al., 26 Sep 2025, Lee et al., 13 Mar 2025).
- E5M2 (5 exponent, 2 mantissa): extended range up to ±57344 (Intel Gaudi), used for gradients or optimizer moments (Choi et al., 28 Oct 2025, Wang et al., 26 Sep 2025).
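As a worked example under the standard OCP E4M3 convention (bias 7), chosen here purely for illustration:

```latex
% E4M3, bias 7: sign bit 0, exponent field 1011_2 = 11, mantissa field 010_2 = 2
x = (-1)^0 \cdot 2^{\,11-7} \cdot \left(1 + \tfrac{2}{8}\right) = 16 \cdot 1.25 = 20
% Largest finite E4M3 magnitude: exponent field 1111_2, mantissa field 110_2
x_{\max} = 2^{\,15-7} \cdot \left(1 + \tfrac{6}{8}\right) = 256 \cdot 1.75 = 448
```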
Key protocols:
- Compute a real-valued scale (per-tensor, per-block, or per-channel) to match the tensor's dynamic range to that of the chosen FP8 format.
- Quantize: $\hat{x} = \mathrm{round}(x / s)$, clamped to the FP8 format's code range.
- Dequantize: $x \approx s \cdot \hat{x}$.
Rounding is typically to nearest-even, though stochastic and learned rounding are also supported. All approaches require calibration to determine optimal scales, either by min/max, MSE minimization, or multi-batch statistics. Subnormals and bias representations vary between hardware.
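To make the quantize/dequantize protocol concrete, here is a minimal sketch of symmetric per-tensor E4M3 fake quantization, assuming a PyTorch build that exposes the torch.float8_e4m3fn dtype; it illustrates the steps above and is not any specific paper's kernel.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def fp8_e4m3_fake_quant(x: torch.Tensor):
    """Per-tensor amax calibration, scaling, cast to FP8, and dequantization."""
    amax = x.abs().max().clamp(min=1e-12)
    s = amax / FP8_E4M3_MAX                       # real-valued scale matching dynamic ranges
    x_fp8 = (x / s).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)  # RNE cast
    x_deq = x_fp8.to(x.dtype) * s                 # dequantize: x ~= s * x_hat
    return x_fp8, s, x_deq

if __name__ == "__main__":
    w = torch.randn(1024, 1024)
    _, _, w_deq = fp8_e4m3_fake_quant(w)
    print("max abs error:", (w - w_deq).abs().max().item())
```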
2. End-to-End Training in FP8: Outlier Regularization and Quantization Stability
Modern accelerators expose catastrophic numerical pathologies under unmodified FP8 training, most notably due to extremely large and infrequent activation outliers that force scale inflation and data collapse (Liang et al., 28 Nov 2025). Contrary to prior assumptions, recent work attributes these outliers to data-independent, weight-input colinearity artifacts. The Transformers Without Extreme Outliers (TWEO) framework introduces a generic norm-style regularization penalty on transformer block outputs, weighted by a small coefficient and averaged over the number of blocks (Liang et al., 28 Nov 2025).
The total loss combines the standard task loss and the TWEO penalty, optionally annealed. This approach sharply reduces the magnitude of activation outliers, enabling stable full-model FP8 pretraining and removing the need for invasive control-path splitting or module-by-module BF16 fallback.
TWEO and analogous techniques allow scale selection via simple online amax buffers without risk of field overflow. In large-scale LLMs (e.g., GPT-2 350M–1.6B), direct FP8 training with TWEO matches or exceeds BF16 perplexity, and supports fully quantized (W8A8) inference with negligible additional post-training accuracy loss (Liang et al., 28 Nov 2025).
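As a hedged illustration of the outlier-suppression idea (not TWEO's published loss), the sketch below penalizes block-output magnitudes beyond a soft threshold; the threshold, coefficient, and squared penalty are illustrative placeholders.

```python
import torch

def outlier_penalty(block_outputs: list,
                    threshold: float = 20.0,
                    coeff: float = 1e-4) -> torch.Tensor:
    """Illustrative outlier-suppression regularizer (not the exact TWEO formulation).

    Penalizes activation magnitudes above a soft threshold, averaged over
    transformer blocks, so rare extreme values no longer dominate the amax
    statistics used to pick FP8 scales.
    """
    penalty = block_outputs[0].new_zeros(())
    for h in block_outputs:
        excess = (h.abs() - threshold).clamp(min=0.0)   # only values past the threshold
        penalty = penalty + excess.pow(2).mean()        # squared soft-threshold penalty
    return coeff * penalty / len(block_outputs)

# Hypothetical usage inside the training loop:
# loss = task_loss + outlier_penalty(collected_block_outputs)
```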
3. Advanced Quantization Workflows: Hybrid Granularity, MoE, and Hardware Integration
Large-scale workflows integrate FP8 with further innovations to maximize both theoretical and realized efficiency. Key developments:
- Hybrid-Granularity Quantization: FP8 is applied per-token for activations and per-block for weights (with the block aligned to the GEMM tile size), balancing flexibility and hardware throughput (Wang et al., 26 Sep 2025). For example, in the InfiR2 workflow, scaling factors are calibrated per-token (activations) or per-block (weights), and rounded to powers of two for hardware compatibility. All master weights and optimizer states remain in FP32, minimizing cumulative error. A minimal sketch of this scaling scheme appears after this list.
- FP8-Flow-MoE: In MoE models, naive FP8 training accumulates double quantization error when layouts change. The FP8-Flow-MoE algorithm solves this via a scaling-aware transpose operator that exactly re-aligns FP8 scales in blockwise tensors, replacing explicit row→col dequant/requant with integer exponent manipulation (Wang et al., 4 Nov 2025).
- Fused Operators and Kernel Co-Design: Fused FP8 quantization, permutation, padding, and post-activation quantization enable minimal-cast dataflows and maximize arithmetic intensity on NVIDIA Hopper and similar architectures. All operations are orchestrated so that a single dequantize is performed only at the final nonlinearity or reduction stage, achieving BF16-level convergence with up to 21% higher throughput and >16.5 GB lower memory per GPU in trillion-token MoEs (Wang et al., 4 Nov 2025).
- Video Diffusion Acceleration: FPSAttention applies tilewise FP8 scaling (3D spatial-temporal blocks) in 3D attention for video diffusion models, with denoising-step-aware granularity adaptation and hardware orchestration using fused Triton-based kernels. Attention tile sizes and sparsity adapt to denoising schedule for optimal error allocation. Empirically, this enables >7× attention kernel and ~5× end-to-end generation speedup at 720p (Liu et al., 5 Jun 2025).
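The following is a minimal sketch of the hybrid-granularity scaling pattern referenced above: per-token scales for activations, per-block scales for weights, and power-of-two rounding of the scales. The 128-wide block and the E4M3 range are assumptions for illustration; production recipes such as InfiR2 fuse these steps into the GEMM pipeline.

```python
import torch

FP8_MAX = 448.0  # E4M3 largest finite magnitude

def round_scale_pow2(scale: torch.Tensor) -> torch.Tensor:
    # Round scales down to powers of two so rescaling becomes an exponent shift in hardware.
    return torch.exp2(torch.floor(torch.log2(scale)))

def per_token_scales(act: torch.Tensor) -> torch.Tensor:
    """Activations [tokens, hidden]: one scale per token (row)."""
    amax = act.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    return round_scale_pow2(amax / FP8_MAX)

def per_block_scales(weight: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Weights [out, in]: one scale per (block x block) tile, aligned to GEMM tiles."""
    o, i = weight.shape
    tiles = weight.reshape(o // block, block, i // block, block)
    amax = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12)
    return round_scale_pow2(amax / FP8_MAX)   # shape [out // block, in // block]
```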
4. Post-Training Quantization and Format Selection Strategies
FP8 post-training quantization (PTQ) generalizes across vision, language, and generative models. The workflow comprises:
- Calibration: Using O(100–5000) calibration samples, collect per-tensor or per-channel amax for activations and weights; this applies across BERT, ResNet, MobileNet, and LLMs (Li et al., 2023, Shen et al., 2023, Aggarwal et al., 2023).
- Scale and Quantization: For each tensor, compute the symmetric scale $s = \mathrm{amax}(x) / \max_{\mathrm{FP8}}$ with the zero-point fixed at zero (see the sketch after this list). For weights, per-channel calibration minimizes error; for activations, per-tensor (or per-layer) scaling is common.
- Mixed Precision and Heuristics: Use value distribution statistics to select the optimal format: E4M3 is preferred for outlier-heavy (NLP) activations; E3M4 for vision or zero-centered data; E5M2 where dynamic range limit is constraining (Shen et al., 2023, Li et al., 2023).
- Operator Coverage: Only GEMM kernels (or Conv for CV) are typically quantized, with LayerNorm/BatchNorm and softmax left unquantized. This approach recovers accuracy to within the FP32 tolerance for the large majority of the >75 network architectures evaluated, vastly exceeding INT8's coverage (Shen et al., 2023).
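The calibration-and-scale steps above can be summarized in a small, hedged sketch of an amax observer; the class name and structure are illustrative rather than any particular framework's API.

```python
import torch

FMT_MAX = {"e4m3": 448.0, "e5m2": 57344.0}  # largest finite magnitudes

class AmaxObserver:
    """Collects running amax over calibration batches (per-tensor or per-channel)."""

    def __init__(self, per_channel: bool = False, ch_dim: int = 0):
        self.per_channel, self.ch_dim, self.amax = per_channel, ch_dim, None

    def update(self, x: torch.Tensor) -> None:
        if self.per_channel:
            dims = [d for d in range(x.dim()) if d != self.ch_dim]
            cur = x.abs().amax(dim=dims)
        else:
            cur = x.abs().max()
        self.amax = cur if self.amax is None else torch.maximum(self.amax, cur)

    def scale(self, fmt: str = "e4m3") -> torch.Tensor:
        # Symmetric scale, zero-point fixed at zero.
        return self.amax.clamp(min=1e-12) / FMT_MAX[fmt]

# Usage: run ~100-5000 calibration samples through the model, updating one
# observer per quantized tensor, then freeze observer.scale() for deployment.
```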
Empirical results confirm FP8-E4M3 PTQ matches or surpasses FP32 on BERT (GLUE, SQuAD), ViT, and ResNet-50, and stabilizes quantization error versus INT8 in mobile/efficient nets. Mixed-precision frameworks (FLIQS, flexible 8-bit format search) automate per-layer format selection to minimize quantization error under hardware cost constraints (Dotzel et al., 2023, Zhang et al., 2023).
5. Hardware Pipelines and Implementation Details
FP8 workflows are tightly coupled to accelerator topology and operator support:
- NVIDIA Hopper and Tensor Cores: Native E4M3/E5M2 GEMM, blockwise scaling, fused quantization and communication paths (TransformerEngine, DeepEP, FP8-Flow-MoE) (Wang et al., 4 Nov 2025, Wang et al., 26 Sep 2025). Fused allgather, dynamic per-block scaling, and power-of-two scale rounding for efficient accumulation. Rowwise/blockwise scaling granularity is chosen to match hardware tile sizes (Or et al., 21 Jul 2025).
- Intel Gaudi: Dedicated "matrix fused units" supporting FP8-E4M3/E5M2 with per-tensor or per-channel scale. Best throughput with static per-tensor scale; dynamic (on-the-fly) scale recalibration supported. Power-of-two scaling enables exponent-bias shifting and reduces arithmetic overhead (Lee et al., 13 Mar 2025, Fishman et al., 19 Sep 2024).
- FPGAs: Bit-level unpacking, block-level normalization and rounding, pipelined MAC. Empirically, FP8 E4M3/E5M2 delivers Top-1 accuracy identical to INT8 on ImageNet and vision transformer workloads at roughly 20% higher LUT cost, making it Pareto-optimal for high-compression, moderate-area designs (Aggarwal et al., 2023).
- PyTorch Ecosystem: TorchAO provides tensor subclass abstractions for FP8, fake quantization for QAT, and end-to-end integration with distributed, serving, and mobile (ExecuTorch) frameworks (Or et al., 21 Jul 2025). All training and inference steps leverage scale-aware FP8 operations with negligible downstream accuracy drop.
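As an operational example of the PyTorch path, the hedged sketch below converts a model for FP8 training with torchao.float8.convert_to_float8_training; the entry point matches recent torchao releases, but exact names and defaults should be verified against the installed version, and actual FP8 GEMMs require supported hardware (e.g., Hopper).

```python
import torch
import torch.nn as nn

# Assumed API per recent torchao releases; verify against your installed version.
from torchao.float8 import convert_to_float8_training

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
).cuda().to(torch.bfloat16)

# Swap nn.Linear layers for FP8-aware equivalents: master weights stay in high
# precision while GEMMs run in E4M3/E5M2 with scale-aware accumulation.
convert_to_float8_training(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
```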
6. Domain-Specific Extensions and Best Practices
FP8 workflows are highly extensible, with specialized techniques developed for particular model classes and training regimes:
- LoRA and Adapter Fine-Tuning: FALQON merges low-rank LoRA updates into a single FP8-quantized backbone, using a row-wise proxy update buffer to accumulate and threshold the top-k most significant updates for hardware-efficient yet accurate training. This eliminates the quantization penalty for small adapters and enables 3× throughput vs. conventional QLoRA (Choi et al., 28 Oct 2025).
- Optimizer State Quantization: COAT extends FP8 quantization to AdamW optimizer states, using dynamic range expansion (power-law companding) to compress both moments to FP8 with negligible additional error (Xi et al., 25 Oct 2024, Fishman et al., 19 Sep 2024); see the sketch after this list.
- SwiGLU Outlier Mitigation: Large-scale LLM training in FP8 requires explicit mitigation of outlier amplification in SwiGLU activations through per-channel scale regularization (Smooth-SwiGLU), avoiding catastrophic overflow at trillion-token durations (Fishman et al., 19 Sep 2024). Delayed scale updates with tight clamping bounds are essential for numerical stability.
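A minimal sketch of the power-law companding idea behind optimizer-state compression follows; the exponent k, the E4M3 target, and per-tensor scaling are illustrative assumptions rather than COAT's exact per-group configuration.

```python
import torch

FP8_MAX = 448.0  # E4M3 largest finite magnitude

def compand_to_fp8(moment: torch.Tensor, k: float = 0.5):
    """Remap dynamic range with sign(x) * |x|^k, then quantize to FP8 (illustrative)."""
    remapped = moment.sign() * moment.abs().pow(k)
    s = remapped.abs().max().clamp(min=1e-12) / FP8_MAX
    q = (remapped / s).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, s

def decompand_from_fp8(q: torch.Tensor, s: torch.Tensor, k: float = 0.5) -> torch.Tensor:
    """Invert the remapping when the optimizer state is read for the update step."""
    x = q.to(torch.float32) * s
    return x.sign() * x.abs().pow(1.0 / k)
```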
Best practices throughout include:
- Avoid training collapse by enabling outlier regularization from the start of training; never disable outlier suppression during early epochs (Liang et al., 28 Nov 2025).
- Use per-block or per-token granularity for activations to preserve local statistics, which is especially important in transformers and video diffusion (Wang et al., 26 Sep 2025, Liu et al., 5 Jun 2025).
- Round scales to powers of two to maximize hardware utilization and throughput (Fishman et al., 19 Sep 2024, Lee et al., 13 Mar 2025).
- Fuse LayerNorm and bias where feasible, quantize residual streams end-to-end, and apply calibration on small, representative datasets.
7. Quantitative Benchmarks, Empirical Tradeoffs, and Limitations
Empirically, modern FP8 workflows consistently demonstrate:
- Consistent throughput gains over BF16/FP16 baselines (Liang et al., 28 Nov 2025, Fishman et al., 19 Sep 2024, Liu et al., 5 Jun 2025, Or et al., 21 Jul 2025).
- Training perplexity and accuracy on par with BF16 (GPT2-Medium: FP8+TWEO PPL 15.64 vs. BF16 16.77; Qwen2.5 LLMs: within 1–2 points, sometimes exceeding) (Liang et al., 28 Nov 2025, Wang et al., 26 Sep 2025).
- Substantial memory savings for weights and activations, with optimizer state memory further reduced by COAT (Xi et al., 25 Oct 2024).
- Markedly higher workload coverage than INT8 within the FP32 accuracy tolerance over 75 models (Shen et al., 2023).
- FP8 can be less area/power efficient than INT8 in custom ASICs (FP8-E4M3 incurs a higher MAC datapath cost than INT8), so tradeoffs should be carefully evaluated in edge deployments (Baalen et al., 2023).
A plausible implication is that while FP8 offers a robust, general-purpose solution for data-center model deployment and LLM-scale continual training, INT8 or mixed INT4/8 pipelines may remain preferred for strictly resource-constrained on-device scenarios. For model families with Gaussian (non-heavy-tailed) activations, INT8 actually matches or slightly exceeds FP8 after QAT or PTQ.
References
- TWEO: "Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies" (Liang et al., 28 Nov 2025)
- InfiR2: "A Comprehensive FP8 Training Recipe for Reasoning-Enhanced LLMs" (Wang et al., 26 Sep 2025)
- FP8-Flow-MoE: "A Casting-Free FP8 Recipe without Double Quantization Error" (Wang et al., 4 Nov 2025)
- FPSAttention: "Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion" (Liu et al., 5 Jun 2025)
- "Scaling FP8 training to trillion-token LLMs" (Fishman et al., 19 Sep 2024)
- COAT: "Compressing Optimizer States and Activation for Memory-Efficient FP8 Training" (Xi et al., 25 Oct 2024)
- FALQON: "Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic" (Choi et al., 28 Oct 2025)
- TorchAO: "PyTorch-Native Training-to-Serving Model Optimization" (Or et al., 21 Jul 2025)
- Efficient Post-training Quantization: (Shen et al., 2023)
- FP8-BERT: (Li et al., 2023)
- Gaudi: (Lee et al., 13 Mar 2025)
- Flexible 8-bit Format: (Zhang et al., 2023)
- FLIQS: (Dotzel et al., 2023)
- FPGA Minifloats: (Aggarwal et al., 2023)
- FP8 vs INT8: (Baalen et al., 2023)
- S2FP8: (Cambier et al., 2020)