FP8 Emulation Toolkit
- FP8 emulation toolkits are software libraries that simulate 8-bit floating-point formats to support efficient deep learning workflows.
- They implement standardized quantization routines, calibration protocols, and dynamic scaling strategies to maintain numerical stability and high accuracy.
- These toolkits integrate optimized operator support and format-specific handling to reduce memory footprint and boost throughput in large-scale models.
A floating-point 8-bit (FP8) emulation toolkit enables neural network workflows to simulate or deploy FP8 numerics for model weights, activations, gradients, and auxiliary states, supporting both training and inference. Such toolkits facilitate resource-efficient deep learning by leveraging the FP8 format’s favorable tradeoff between dynamic range, precision, and memory bandwidth, while offering mechanisms for numerically robust operation in representative architectures, including LLMs and Mixture-of-Experts (MoE) transformers. FP8 emulation toolkits are now central in both empirical studies and industrial pipelines, with open-source implementations enabling standardized benchmarking and method development.
1. FP8 Number Representations and Format Taxonomy
FP8 emulation toolkits typically support multiple 8-bit floating-point formats characterized by distinct exponent/mantissa allocations. The principal variants, in hardware and software, are E5M2, E4M3, and E3M4; UE8M0 is sometimes used to store scaling factors exactly as unsigned powers of two (Wang et al., 26 Sep 2025).
| Format | Exponent bits | Mantissa bits | Max Normal Value | Bias | Typical Application |
|---|---|---|---|---|---|
| E5M2 | 5 | 2 | 57344 | 15 | Activations (wide range) |
| E4M3 | 4 | 3 | 448 | 7 | Model weights (blockwise) |
| E3M4 | 3 | 4 | 30 | 3 | High-precision weights |
| UE8M0 | 8 | 0 | Power-of-two | 127 | Exact scale storage |
All FP8 values are encoded as $(-1)^S \cdot 2^{E-\text{bias}} \cdot (1 + M/2^{m})$, with $S$ the sign bit, $E$ and $M$ the exponent and mantissa fields, and $m$ the number of mantissa bits. For subnormals ($E = 0$), the implicit leading one is dropped and the exponent is fixed at $1 - \text{bias}$ (Shen et al., 2023, Kuzmin et al., 2022, Wang et al., 26 Sep 2025).
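As a concrete illustration of this encoding, the following minimal sketch decodes a single E4M3 byte according to the formula above. It assumes the OCP-style E4M3 variant (NaN at the all-ones exponent/mantissa pattern, no infinities) and is not part of any particular toolkit.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte via (-1)^S * 2^(E-bias) * (1 + M/2^m)."""
    s = (byte >> 7) & 0x1          # sign bit
    e = (byte >> 3) & 0xF          # 4 exponent bits
    m = byte & 0x7                 # 3 mantissa bits
    sign = -1.0 if s else 1.0
    bias = 7
    if e == 0:                     # subnormal: no implicit leading one, exponent fixed at 1 - bias
        return sign * (m / 8.0) * 2.0 ** (1 - bias)
    if e == 0xF and m == 0x7:      # all-ones pattern is NaN in the E4M3 "fn" variant
        return float("nan")
    return sign * (1.0 + m / 8.0) * 2.0 ** (e - bias)

assert decode_e4m3(0x7E) == 448.0  # 0 1111 110 -> largest finite E4M3 value
```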
Key format selection heuristics include:
- E4M3: default for NLP/LLM activations and weights unless extreme outliers are present
- E3M4: suitable for weights in vision models where precision is paramount
- E5M2: optimal for activations with large dynamic range in LLMs or deep attention stacks (Shen et al., 2023, Wang et al., 26 Sep 2025)
2. Quantization and Emulation Workflow
FP8 emulation toolkits provide standardized quantization and dequantization routines, calibration protocols, and integration with deep learning frameworks.
Quantization Mapping
Given a real tensor $X$, the scale is computed as
$$s = \frac{\max(|X|)}{q_{\max}},$$
where $q_{\max}$ is the positive peak representable value under the target FP8 encoding. A tensor can be quantized by
$$Q(X) = \mathrm{cast}_{\mathrm{FP8}}(X / s)$$
and dequantized by $\hat{X} = s \cdot Q(X)$ (Lee et al., 29 May 2024, Wang et al., 26 Sep 2025).
Calibration proceeds by per-tensor or per-channel (or per-block/per-token) max-abs statistics computed over a calibration set or mini-batch (Shen et al., 2023).
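A minimal PyTorch sketch of this per-tensor max-abs workflow is shown below; it assumes a build exposing the torch.float8_e4m3fn dtype, and the function names are illustrative rather than any toolkit's API.

```python
import torch

E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_fp8(x: torch.Tensor):
    """Per-tensor max-abs quantization: s = max|X| / q_max, Q(X) = cast_FP8(X / s)."""
    s = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    q = (x / s).to(torch.float8_e4m3fn)   # cast performs round-to-nearest into E4M3
    return q, s

def dequantize_fp8(q: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Reconstruct X_hat = s * Q(X) in higher precision."""
    return q.to(torch.float32) * s

x = torch.randn(4, 1024)
q, s = quantize_fp8(x)
x_hat = dequantize_fp8(q, s)
print((x - x_hat).abs().max())  # elementwise error is bounded by the FP8 rounding step times s
```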
Workflow Features
- Blockwise (weights) or groupwise (activations, optimizer states) quantization granularity
- Storage of scale factors in UE8M0 for exact conversion
- Recalibration of normalization statistics (e.g., BatchNorm/LayerNorm) for minimal accuracy loss
- Drop-in “fake quantization” with operator-level override for extended operator support (LayerNorm, EmbeddingBag, elementwise ops); see the sketch after this list
- Formatting for efficient kernel fusion (e.g., for AVX512-FP8 and NVIDIA Hopper Tensor Cores) (Wang et al., 4 Nov 2025, Xi et al., 25 Oct 2024)
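The drop-in fake-quantization pattern referenced above can be sketched as a module wrapper that round-trips weights and inputs through E4M3 while leaving the module interface unchanged. The class and helper below are hypothetical illustrations (again assuming torch.float8_e4m3fn), not a specific toolkit's API.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    """Wrap an nn.Linear so its forward pass emulates an FP8 GEMM via fake quantization."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear

    @staticmethod
    def _fake_quant(x: torch.Tensor, fp8_max: float = 448.0) -> torch.Tensor:
        s = x.abs().max().clamp(min=1e-12) / fp8_max
        return (x / s).to(torch.float8_e4m3fn).to(x.dtype) * s  # quantize-dequantize round trip

    def forward(self, x):
        w_q = self._fake_quant(self.linear.weight)   # per-tensor weight fake quantization
        x_q = self._fake_quant(x)                    # per-tensor activation fake quantization
        return nn.functional.linear(x_q, w_q, self.linear.bias)

def patch_linears(model: nn.Module) -> None:
    """Recursively replace every nn.Linear with its fake-quantized wrapper."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, FakeQuantLinear(child))
        else:
            patch_linears(child)
```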
3. Error Analysis, Scaling Strategies, and Stability
Reducing precision to FP8 magnifies error propagation. Recent toolkits employ several mechanisms to ensure stability and match BF16 convergence in practice.
Double Quantization and Scaling-Aware Operators
Operations involving transpositions between differently quantized FP8 tensors (e.g., row-wise to column-wise in MoE) are susceptible to “double quantization error.” A naive quantize-dequantize-requantize pipeline passes each element through two rounding steps, giving an elementwise error bound of roughly
$$|\hat{x} - x| \le (2u + u^2)\,|x| \approx 2u\,|x|,$$
where $u = 2^{-(m+1)}$ is the unit roundoff of the FP8 format, whereas a scaling-aware transpose that simply remaps the exponent fields without new rounding reduces this to
$$|\hat{x} - x| \le u\,|x|$$
(Wang et al., 4 Nov 2025). This approach is now used to guarantee numerical fidelity in FP8-centric MoE pipelines.
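The sketch below measures only the extra rounding error of the naive requantization path against simply reusing the stored FP8 codes; it does not reproduce the exponent-remapping kernel itself and assumes the torch.float8_e4m3fn dtype.

```python
import torch

def fq_rowwise(x: torch.Tensor, fp8_max: float = 448.0) -> torch.Tensor:
    """Fake-quantize along the last dimension with max-abs scales (round trip through E4M3)."""
    s = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    return (x / s).to(torch.float8_e4m3fn).to(torch.float32) * s

x = torch.randn(512, 512)
x_q = fq_rowwise(x)            # first quantization (row-wise): one rounding step

naive = fq_rowwise(x_q.t())    # dequantize, transpose, requantize: a second rounding step
aware = x_q.t()                # scaling-aware path keeps the stored codes: no new rounding

print((naive - x.t()).abs().max(), (aware - x.t()).abs().max())
# the naive path's worst-case error is roughly twice that of the rounding-free transpose
```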
Dynamic Range Expansion
When compressing optimizer states to FP8, direct quantization can underutilize the dynamic range, resulting in increased quantization noise. Dynamic range expansion applies a power-law map $f(x) = \operatorname{sign}(x)\,|x|^{k}$, choosing $k$ to map the group’s dynamic range onto the FP8 representable range, then quantizes the transformed state. On dequantization, the inverse $f^{-1}(y) = \operatorname{sign}(y)\,|y|^{1/k}$ is applied. This improves MSE for state updates, particularly for Adam-type optimizers (Xi et al., 25 Oct 2024).
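A sketch of the expansion/contraction round trip for a single optimizer-state group follows, assuming an E4M3 target and a simple log-ratio choice of $k$; the exact rule used in COAT may differ.

```python
import math
import torch

E4M3_MAX = 448.0           # largest finite E4M3 value
E4M3_MIN_POS = 2.0 ** -9   # smallest positive (subnormal) E4M3 magnitude

def expand_and_quantize(v: torch.Tensor):
    """Expand the group with f(x) = sign(x)|x|^k, then quantize to E4M3."""
    a = v.abs().clamp(min=1e-30)
    r_group = (a.max() / a.min()).item()                  # dynamic range of this group
    r_fp8 = E4M3_MAX / E4M3_MIN_POS                       # representable FP8 dynamic range
    k = math.log(r_fp8) / max(math.log(r_group), 1e-6)    # k > 1 when the range is under-used
    expanded = v.sign() * a.pow(k)
    s = expanded.abs().max() / E4M3_MAX                   # per-group max-abs scale
    q = (expanded / s).to(torch.float8_e4m3fn)
    return q, s, k

def dequantize_and_contract(q: torch.Tensor, s: torch.Tensor, k: float) -> torch.Tensor:
    """Invert the map: x = sign(y)|y|^(1/k) after dequantization."""
    y = q.to(torch.float32) * s
    return y.sign() * y.abs().pow(1.0 / k)
```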
Hybrid-Granularity Quantization and Scale Rounding
Contemporary toolkits decouple granularity:
- Blockwise for weights (e.g., blocks of 64 or 128 for GEMMs)
- Groupwise or per-token for activations (a group spanning one token’s hidden dimension, or group size 16 for non-linear ops)
Rounding scale factors to the nearest power of two guarantees no scale underflow and is often used with UE8M0 encoding for compactness (Wang et al., 26 Sep 2025).
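A hypothetical sketch combining blockwise weight quantization (128×128 tiles) with power-of-two scales stored as UE8M0 exponent bytes is given below; the function is illustrative and not a toolkit API.

```python
import torch

def quantize_blockwise_e4m3(w: torch.Tensor, block: int = 128, fp8_max: float = 448.0):
    """One power-of-two scale per (block x block) tile, encoded as a UE8M0 byte (bias 127)."""
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    tiles = w.reshape(rows // block, block, cols // block, block)      # (i, :, j, :) is tile (i, j)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    exp = (amax / fp8_max).log2().ceil()          # round each scale up to the next power of two
    scale = exp.exp2()
    q = (tiles / scale).to(torch.float8_e4m3fn)   # FP8 codes, still in tiled layout
    ue8m0 = (exp.squeeze(-1).squeeze(1) + 127).clamp(0, 255).to(torch.uint8)  # one byte per tile
    return q, ue8m0
```

Rounding each scale up, rather than to nearest, keeps every scaled value within the E4M3 range in this sketch, at the cost of a slightly coarser step size.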
Loss Scaling and Adaptive Precision
Toolkit recipes recommend maintaining master weights, gradients, and optimizer states in FP32, with only the forward/backward path quantized to FP8. Dynamic loss scaling and blockwise outlier monitoring mitigate catastrophic underflow/overflow (Wang et al., 26 Sep 2025, Lee et al., 29 May 2024).
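A minimal dynamic loss-scaling loop in the spirit of these recipes is sketched below; the class, growth interval, and backoff factor are illustrative assumptions rather than any toolkit's defaults.

```python
class DynamicLossScaler:
    """Grow the loss scale after a window of overflow-free steps; shrink it on overflow."""
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000, backoff=0.5, growth=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.backoff = backoff
        self.growth = growth
        self.good_steps = 0

    def update(self, found_overflow: bool) -> None:
        if found_overflow:
            self.scale *= self.backoff     # inf/NaN gradients: back off immediately and skip the step
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= self.growth  # stable window completed: cautiously increase the scale
```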
4. Toolkit Architectures: Pseudocode, APIs, and Framework Plug-Ins
FP8 emulation toolkits are implemented as Python modules or C++ extensions. They provide scalable APIs for quantization, operator patching, and compatibility with major deep learning frameworks.
Example API Integration
- TransformerEngine and Megatron-LM: Custom FP8 operators (quantize, dequantize, fused GEMM, scaling-aware transpose) registered automatically to replace reference kernels. Minimal user intervention is required apart from initial configuration (Wang et al., 4 Nov 2025).
- COAT: Adds optimizer and activation quantization wrappers, supporting PyTorch, FSDP, and DeepSpeed. Activations are quantized by operator type (per-tensor for GEMMs, per-group for others). Optimizer state quantization is transparent; scale management occurs in BF16 (Xi et al., 25 Oct 2024).
- Intel Neural Compressor (INC): Configuration-driven quantization flows, automatic insertion of fake-quant ops, static and dynamic quantization strategies, and hardware-matched rounding (Shen et al., 2023).
- Software-only Emulators: Provide mask-and-clamp wrappers for PyTorch Linear modules. Precision (exponent and mantissa bits) can be swept per-layer for ablation and stability studies (Lee et al., 29 May 2024).
Typical Usage Pseudocode
```python
from fp8_flow_moe import FP8MoEConfig, register_fp8_moe_kernels

fp8_cfg = FP8MoEConfig(
    fmt="E4M3",
    tile_size=128,
    fuse_swiglu=True,
    use_scaling_transpose=True,
)
register_fp8_moe_kernels(args, fp8_cfg)
main()
```
For optimizer and activation quantization:
```python
from coat import FP8OptimizerAdamW, fp8_autocast, initialize_activation_quant

optimizer = FP8OptimizerAdamW(model.parameters(), fp8_format='E4M3', dynamic_expand=True)
initialize_activation_quant(fp8_format='E4M3', per_tensor_linear=True, act_group_size=16)

with fp8_autocast():
    logits = model(inputs)
    loss = loss_fn(logits, targets)  # loss_fn and targets assumed defined elsewhere
optimizer.backward(loss)
optimizer.step()
```
5. Experimental Benchmarks and Performance Impact
FP8 emulation toolkits have been evaluated on large-scale pretraining and inference across LLMs, MoE, CV, and VLMs.
Memory and Speedup
- Throughput: Up to 21% higher than blockwise-FP8 and BF16 baselines on a 671B-parameter MoE model (Wang et al., 4 Nov 2025); end-to-end speedups reach 1.43× with COAT and 1.14× on OLMo-7B (Xi et al., 25 Oct 2024, Wang et al., 26 Sep 2025).
- Memory footprint: Peak GPU memory savings of 14–16.5 GB per device versus BF16, corresponding to up to 1.54× reduction (Wang et al., 4 Nov 2025, Xi et al., 25 Oct 2024).
- Operator overhead: The additional bit manipulation required for software-only emulation incurs a 1.5–2× throughput penalty when FP8 is emulated rather than executed natively (Lee et al., 29 May 2024).
Accuracy, Robustness, and Workload Coverage
- Loss and accuracy parity: On 16–160B token pretraining, loss curves for FP8 and BF16 overlap, with ≤2-point variation on reasoning/test benchmarks (Wang et al., 4 Nov 2025, Wang et al., 26 Sep 2025).
- Pass rate: >92% of tested architectures achieve ≤1% accuracy drop relative to FP32 in post-training quantization with FP8 (vs. 66% for INT8), especially for E4M3 and E3M4 (Shen et al., 2023).
- Model-type recommendations:
- E4M3 best for NLP, LLMs, and tasks with activation outliers.
- E3M4 advantageous for convolutional models and feature-bounded vision workloads.
- Dynamic quantization yields further error reduction for models with highly skewed activation distributions.
6. Best Practices, Domain Guidance, and Future Research
- Optimal format selection: Tailor the format to data statistics—use more exponent bits with heavy-tailed or outlier-rich activations (E5M2/E4M3 for NLP; E3M4 for CV) (Shen et al., 2023, Kuzmin et al., 2022, Wang et al., 26 Sep 2025).
- Granularity: Use blockwise quantization matching the hardware GEMM tile (64/128); groupwise for high-variance activations.
- Safeguards: Maintain master weights, gradients, and optimizer state in FP32, employ dynamic loss scaling, and monitor per-block maxima for stability (Wang et al., 26 Sep 2025).
- Toolkit selection: For hardware deployment, choose toolkits with hardware-matched rounding and scaling. For research and ablation, use reference software emulators supporting full (exponent, mantissa) parameter sweeps.
- Integration: Insert only minimal fake-quant boundaries (ideally only at precision-sensitive block borders) to leverage reduced memory traffic and maximize fusion potential (Wang et al., 4 Nov 2025).
- Future directions: FP8 toolkit development continues toward broader coverage (e.g., optimizer state mapping, extended operator support), adaptive quantization based on learned statistics, and automated selection routines for optimal exponent/mantissa splits in quantization-aware training (Xi et al., 25 Oct 2024, Wang et al., 26 Sep 2025, Kuzmin et al., 2022).
FP8 emulation toolkits now provide robust, reproducible, and hardware-efficient quantization support, bridging the gap between floating-point deep learning pipelines and the emerging capabilities of FP8-accelerated hardware backends.