FP8 & BF16 Representations
- FP8 and BF16 are low-precision numerical formats that optimize memory usage and computational efficiency in deep learning environments.
- BF16 preserves a dynamic range similar to FP32 with reduced precision, while FP8 offers significant speed and memory savings at the cost of increased training instability.
- Adaptive mixed-precision techniques, including dynamic loss scaling and block-floating scaling, enable effective integration of FP8 and BF16 in large-scale model training.
FP8 and BF16 are floating-point representations designed to optimize the memory and computational efficiency of deep learning systems, particularly for LLMs. BF16 is a 16-bit format with 8 exponent bits and 7 mantissa bits, originally introduced as a truncation of IEEE FP32, while FP8 encompasses several 8-bit formats, most notably E4M3 (4 exponent, 3 mantissa) and E5M2 (5 exponent, 2 mantissa). The technical differences between these formats yield distinct trade-offs in range, precision, hardware compatibility, and training behavior.
1. Numeric Format Properties and Floating-Point Encodings
The structural characteristics of BF16 and FP8 (E4M3/E5M2) formats dictate their suitability for different deep learning tasks.
Bit-Field Definitions
| Format | Exponent Bits | Mantissa Bits | Unit Roundoff (ε) | Dynamic Range |
|---|---|---|---|---|
| BF16 | 8 | 7 | $2^{-8} \approx 3.9\times10^{-3}$ | $\approx 1.2\times10^{-38}$ to $\approx 3.4\times10^{38}$ |
| FP8 E4M3 | 4 | 3 | $2^{-4} = 6.25\times10^{-2}$ | $\approx 2.0\times10^{-3}$ (subnormal) to $448$ |
| FP8 E5M2 | 5 | 2 | $2^{-3} = 1.25\times10^{-1}$ | $\approx 1.5\times10^{-5}$ (subnormal) to $57{,}344$ |
BF16: $(-1)^s \cdot 2^{\,e-127} \cdot 1.m$ with an 8-bit exponent (bias 127), covering a dynamic range similar to FP32 but at only 7 mantissa bits of precision (Fujii et al., 2024).
FP8 E4M3/E5M2: $(-1)^s \cdot 2^{\,e-b} \cdot 1.m$ with bias $b = 7$ (E4M3) or $b = 15$ (E5M2), offering much lower precision and a narrower dynamic range (Micikevicius et al., 2022).
Error and Quantization Analysis
- Quantization error for casting to FP8: for any real $x$ within the representable range, round-to-nearest-even satisfies $|\mathrm{fl}(x) - x| \le \epsilon\,|x|$, where $\epsilon$ is the target format's unit roundoff; block-floating scaling is used to keep values in range and avoid underflow/overflow.
- Unit roundoff: $\epsilon = 2^{-(m+1)}$ for $m$ explicit mantissa bits, giving $\epsilon_{\mathrm{BF16}} = 2^{-8} \approx 3.9\times10^{-3}$, $\epsilon_{\mathrm{E4M3}} = 2^{-4} = 6.25\times10^{-2}$, and $\epsilon_{\mathrm{E5M2}} = 2^{-3} = 1.25\times10^{-1}$ (the sketch after this list reproduces these values).
- Scaling: Block-floating scaling via hardware-specific libraries (e.g., NVIDIA TransformerEngine) maintains per-tensor scale factors.
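The following minimal Python sketch, not tied to any particular library, derives the table's properties directly from the bit-field definitions; the E4M3 maximum is overridden by hand because the FP8 specification drops infinities and reserves a single NaN encoding, which extends the largest finite value beyond the IEEE-style limit.

```python
# Minimal sketch: derive each format's properties from its bit widths alone.
# Illustrative only; E4M3's maximum is overridden because the FP8 spec
# (Micikevicius et al., 2022) drops infinities and keeps a single NaN encoding,
# extending the largest finite value from the IEEE-style 240 to 448.

def props(exp_bits: int, man_bits: int) -> dict:
    bias = 2 ** (exp_bits - 1) - 1
    return {
        "unit_roundoff": 2.0 ** -(man_bits + 1),                   # epsilon in the table
        "min_normal":    2.0 ** (1 - bias),
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
        "max_finite":    (2.0 - 2.0 ** -man_bits) * 2.0 ** bias,   # IEEE-style
    }

formats = {
    "BF16":     props(8, 7),                               # ~1.2e-38 .. ~3.4e38
    "FP8-E5M2": props(5, 2),                               # ~1.5e-5  .. 57344
    "FP8-E4M3": {**props(4, 3), "max_finite": 448.0},      # spec-specific extension
}

for name, p in formats.items():
    print(name, {k: f"{v:.3g}" for k, v in p.items()})
```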
2. Implementation Methodologies
FP8 and BF16 integration is enabled via specialized hardware and software stacks. The NVIDIA H100 GPU, supported by TransformerEngine, provides native FP8 execution alongside FP16 and BF16 (Fujii et al., 2024). Key implementation components include:
- Custom CUDA kernels for FP8 GEMMs and elementwise ops.
- Mixed-precision frameworks (e.g., MS-AMP, MoR) that dynamically select between FP8 and BF16 at the tensor or sub-tensor granularity (Su et al., 28 Dec 2025).
- Block-floating (delayed) scaling: per-tensor scale factors derived from a rolling amax (absolute-maximum) history, with history length and update interval tuned per model; a simplified sketch follows this list.
- Dynamic loss scaling: Used to absorb FP8 overflows/underflows in backpropagation.
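The snippet below is a simplified per-tensor delayed-scaling sketch in the spirit of the amax-history approach described above; the margin, history length, and the choice of E4M3 as the target format are illustrative assumptions, not any library's exact defaults.

```python
import numpy as np
from collections import deque

E4M3_MAX = 448.0   # largest finite E4M3 value

class DelayedScaler:
    """Toy per-tensor scaler: the scale is derived from a rolling amax history."""

    def __init__(self, history_len: int = 16, margin_bits: int = 0):
        self.amax_history = deque(maxlen=history_len)
        self.margin = 2.0 ** margin_bits      # headroom against sudden amax growth
        self.scale = 1.0

    def update(self, tensor: np.ndarray) -> None:
        # Record this iteration's amax, then refresh the scale from the history max.
        self.amax_history.append(float(np.abs(tensor).max()))
        amax = max(self.amax_history)
        if amax > 0.0:
            self.scale = E4M3_MAX / (amax * self.margin)

    def quantize(self, tensor: np.ndarray) -> np.ndarray:
        # Scale into the FP8 range and clamp; a real kernel would cast to E4M3 here.
        return np.clip(tensor * self.scale, -E4M3_MAX, E4M3_MAX)

scaler = DelayedScaler(history_len=16, margin_bits=1)
activations = np.random.randn(1024).astype(np.float32) * 50.0
scaler.update(activations)                    # once per iteration
fp8_ready = scaler.quantize(activations)      # dequantize later via division by scaler.scale
```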
Representative Pseudocode (Dynamic Loss Scaling)
```
scale ← initial_scale                         # e.g. 2^20
for each iteration:
    loss_fp32  ← forward(fp8_inputs) * scale
    grads_fp32 ← backward(loss_fp32)
    if any isnan(grads_fp32) or isinf(grads_fp32):
        scale ← scale / 2                     # back off after overflow/underflow
        skip update
    else:
        unscaled_grads ← grads_fp32 / scale
        optimizer.step(unscaled_grads)
        if iteration % window == 0:
            scale ← min(scale * 2, max_scale) # regrow after a stable window
```
3. Memory and Throughput: Practical Impacts
FP8's principal advantages—reduced memory footprint and increased throughput—are demonstrated across large-scale LLM and vision model training.
| Model Setting | BF16 Throughput | FP8 Throughput | Memory Saving |
|---|---|---|---|
| H100 Node | 415 TFLOPS | 570 TFLOPS (+37%) | ~25% |
| Llama2-7B | 7,730 tok/s | 11,257 tok/s | ~1.54× reduction |
- FP8 reduces activation storage by ~25% compared to BF16; COAT's FP8 compression of optimizer states and activations yields up to a 1.54× end-to-end memory reduction and a 1.43× speedup (Xi et al., 2024).
- End-to-end speedup measured on a 175B-parameter LLM: FP8 mixed-precision training runs up to 75% faster than the BF16 baseline in wall-clock time (Peng et al., 2023).
Memory reductions allow larger local batch sizes or deeper pipelines, and practical distributed setups frequently double batch size for a given GPU configuration.
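A back-of-the-envelope sketch of where a ~25% activation saving can come from is given below; the layer shapes, the per-token activation count, and the fraction of cached activations actually held in FP8 are assumptions chosen purely for illustration.

```python
# Illustrative arithmetic only: shows why realized activation savings (~25%) are
# smaller than the 2x per-value upper bound when only part of the cache is FP8.

batch, seq, hidden = 8, 4096, 4096
acts_per_token = 10 * hidden            # assumed count of cached values per token
fp8_fraction = 0.5                      # assumed share of the cache stored in FP8

values = batch * seq * acts_per_token
bf16_bytes  = values * 2                                          # all-BF16 baseline
mixed_bytes = values * (fp8_fraction * 1 + (1 - fp8_fraction) * 2)

print(f"BF16 baseline : {bf16_bytes / 2**30:.2f} GiB")
print(f"FP8-mixed     : {mixed_bytes / 2**30:.2f} GiB "
      f"({(1 - mixed_bytes / bf16_bytes) * 100:.0f}% smaller)")
```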
4. Training Stability and Convergence Properties
Training instability is the critical trade-off for aggressive FP8 utilization.
- BF16: Exhibits smooth, monotonic decrease in training loss, with low gradient-norm variance.
- FP8: Prone to loss spikes (NaN or 10× local jumps), higher gradient-norm variance (~1.5× BF16), and frequent scale-factor oscillations (Fujii et al., 2024).
- Instability is exacerbated for tasks with high numerical sensitivity (arithmetic reasoning, code generation, long-context dialogs), with statistically significant accuracy degradation in FP8 (e.g., –3 to –5 p.p. in specific benchmarks).
- Block-level dynamic loss scaling and aggressive back-off mitigate instability, reducing spike frequency at a minor throughput cost (~5%); a simple spike-detection sketch appears below.
Several works (e.g., TWEO (Liang et al., 28 Nov 2025), Smooth-SwiGLU (Fishman et al., 2024)) identify activation outliers arising from weight-matrix collinearity as a root cause of catastrophic FP8 failures, and propose regularization or architectural modifications that eliminate extreme spikes.
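The back-off behavior described above can be made concrete with a small monitor; the window length, warm-up count, and 10× threshold below are illustrative assumptions rather than any published recipe.

```python
import numpy as np
from collections import deque

class SpikeMonitor:
    """Flag iterations whose gradient norm is non-finite or far above the recent median."""

    def __init__(self, window: int = 100, factor: float = 10.0, warmup: int = 10):
        self.history = deque(maxlen=window)
        self.factor = factor
        self.warmup = warmup

    def is_spike(self, grad_norm: float) -> bool:
        spike = (not np.isfinite(grad_norm)
                 or (len(self.history) >= self.warmup
                     and grad_norm > self.factor * np.median(list(self.history))))
        if np.isfinite(grad_norm) and not spike:
            self.history.append(grad_norm)       # only track healthy iterations
        return spike

# In a training loop, a flagged step would halve the loss scale and skip the update,
# or temporarily route the offending tensors through BF16.
monitor = SpikeMonitor()
for step, g in enumerate([1.0, 1.1, 0.9] * 10 + [25.0]):
    if monitor.is_spike(g):
        print(f"step {step}: spike detected (grad norm {g}), backing off")
```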
5. Downstream Task Performance and Format Selection Trade-offs
Empirical studies demonstrate nuanced FP8 vs. BF16 performance across categories of downstream tasks; a brief significance-testing sketch follows the list below.
| Task Category | FP8 vs BF16 Accuracy Gap | p-value (significance) |
|---|---|---|
| General Language | –0.2 p.p. | p > 0.1 |
| Math / Code | –2.5 to –5 p.p. | p < 0.01 |
| QA / Translation | ≤ 0.5 p.p. | p > 0.1 |
- FP8 is resilient for general NLP and QA, incurring negligible degradation, and offers compelling speed/memory advantages in hardware-constrained scenarios.
- For math/code and other numerically sensitive tasks, BF16 remains preferred for stability and checkpoint-recovery efficiency.
- MoR quantizes >98% of tensors to FP8 via dynamic format selection, maintaining loss and validation metrics within 0.5% of BF16 even under coarse quantization partitioning strategies (Su et al., 28 Dec 2025).
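For context on the p-values above, a paired test over matched benchmark scores is one standard way such gaps are assessed; the numbers below are synthetic placeholders, and the exact test used in the cited studies may differ.

```python
import numpy as np
from scipy import stats

# Synthetic scores for matched benchmarks run under both formats (placeholders only).
bf16_acc = np.array([62.1, 58.4, 60.0, 61.3, 59.8])   # e.g. math/code suites, BF16 run
fp8_acc  = np.array([58.9, 55.1, 56.8, 58.0, 56.5])   # same suites, FP8 run

t_stat, p_value = stats.ttest_rel(fp8_acc, bf16_acc)  # paired t-test over suites
print(f"mean gap = {np.mean(fp8_acc - bf16_acc):+.1f} p.p., p = {p_value:.4f}")
```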
6. Advanced and Flexible Precision Schemes
Recent work explores dynamic or flexible precision assignment:
- Mixture-of-Representations (MoR): Tensor/block-level format selection, employing relative error and dynamic range metrics to choose among E4M3, E5M2, and BF16 (Su et al., 28 Dec 2025).
- Flexible Floating-Point 8 (FFP8): Format parameters (sign, exponent, fraction widths, bias) are selected per tensor, yielding accuracy within 0.1–0.3% of FP32 at half the bandwidth of BF16, with hardware cost <5% area/latency overhead (Huang et al., 2021).
- HiFloat8 (HiF8): Tapered mantissa across exponent regions and denormal extensions, combining hardware-efficient rounding modes. Training and inference closely mimic FP16/BF16 accuracy (<0.5% drop) (Luo et al., 2024).
Modular quantization and adaptive block-wise scaling strategies enable "FP8-all-the-way" and further ultra-low-precision training regimes without loss of model quality, contingent on robust error metric tracking and regularization.
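A hedged sketch of tensor-level format selection in the spirit of MoR-style schemes follows; it uses only a simple relative-error criterion (the published method also considers dynamic range), and the error model, threshold, and E4M3-before-E5M2 ordering are assumptions for illustration, not the published criteria.

```python
import numpy as np

E4M3_MAX, E5M2_MAX = 448.0, 57344.0

def simulate_cast(x: np.ndarray, man_bits: int, max_val: float) -> np.ndarray:
    """Crude relative-rounding model of an FP8 cast with per-tensor scaling."""
    scale = max_val / max(float(np.abs(x).max()), 1e-30)
    y = x * scale
    exp = np.floor(np.log2(np.abs(y) + 1e-45))
    step = 2.0 ** (exp - man_bits)
    return np.round(y / step) * step / scale

def choose_format(x: np.ndarray, rel_err_budget: float = 0.04) -> str:
    # Prefer the higher-precision FP8 variant; fall back to BF16 if neither fits the budget.
    for name, man_bits, max_val in [("E4M3", 3, E4M3_MAX), ("E5M2", 2, E5M2_MAX)]:
        q = simulate_cast(x, man_bits, max_val)
        rel_err = np.linalg.norm(q - x) / (np.linalg.norm(x) + 1e-30)
        if rel_err <= rel_err_budget:
            return name
    return "BF16"

weights = np.random.randn(4096) * 0.02
print(choose_format(weights))     # a well-behaved tensor usually fits E4M3's error budget
```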
7. Hardware, Scaling, and Future Work
FP8 deployment is closely aligned with hardware capabilities (NVIDIA H100, Intel Gaudi2), requiring library, kernel, and hardware support for fast GEMMs and mixed-precision accumulators. Optimal margin and interval settings, block/group scaling, auto-tuning, and possible stochastic rounding are areas of active investigation for maximizing stability. The effect of FP8 on ever-larger architectures (>405B parameters), and the potential for generalized static quantization paradigms enabled by architectural fixes (e.g., TWEO), remain open research frontiers.
A plausible implication is that future large-scale training will rely on hybrid schemes, where nearly all GEMM operations exploit FP8 formats, with fallback to higher bit depths only as dictated by runtime stability diagnostics.
In conclusion, FP8 and BF16 are numerically and operationally distinct low-precision representations whose adoption is governed by computational resource constraints, task sensitivity, stability requirements, and hardware support. FP8 offers substantial efficiency gains but necessitates careful scaling strategies and regularization against activation outliers to achieve stability and quality on par with BF16. Continued innovation in format adaptivity and stability assessment will further shape the landscape of mixed-precision deep learning (Fujii et al., 2024, Xi et al., 2024, Su et al., 28 Dec 2025, Luo et al., 2024, Fishman et al., 2024, Peng et al., 2023, Huang et al., 2021, Hernández-Cano et al., 26 May 2025, Liang et al., 28 Nov 2025, Lee et al., 2024).