
E6M2 Floating-Point Scale Overview

Updated 4 March 2026
  • E6M2 floating-point scale is an 8-9 bit numerical format using 1 sign bit, 6 exponent bits, and 2 mantissa bits to provide a wide dynamic range for LLM applications.
  • It implements microscaling and μnit scaling methods to reduce memory usage and computational load while maintaining accuracy.
  • Empirical studies demonstrate that E6M2 achieves significant memory savings with manageable quantization error, optimizing cost-effective model performance.

The E6M2 floating-point scale refers to a numerical format and associated quantization strategy tailored for efficient low-precision training and inference in LLMs. E6M2 denotes an 8- or 9-bit floating-point representation with 1 sign bit, 6 exponent bits, and 2 mantissa bits, offering a wide dynamic range with limited fractional resolution. E6M2 is employed in both standard floating-point quantization schemes and advanced block-wise quantization frameworks such as microscaling and μnit scaling, providing substantial savings in memory bandwidth and computational resources—particularly in contexts where traditional 16- or 32-bit formats are prohibitive.

1. Format Specification and Theoretical Properties

E6M2 allocates 1 bit for the sign, 6 bits for the exponent, and 2 bits for the mantissa. With a 6-bit exponent, the exponent bias is $b = 2^{6-1} - 1 = 31$. This yields the following representable ranges (Cococcioni et al., 2 Oct 2025, Narayan et al., 9 Feb 2025):

  • Smallest positive normal: $2^{-30}$
  • Largest finite normal: $(2 - 2^{-2}) \cdot 2^{31} = 1.75 \cdot 2^{31}$
  • Smallest positive subnormal: $2^{-32}$
  • Unit roundoff (machine $\epsilon$): $2^{-3} = 1/8$
  • Maximum relative error: $\lesssim 0.125$

The format's dynamic range is sufficient to accommodate activations, gradients, and weights encountered in LLM training under proper scaling.
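The figures above can be rechecked from the bit layout in a few lines. This is a sketch rather than a library call — no standard package exposes E6M2, so the constants are derived directly, assuming IEEE-style reservation of the top exponent code for infinities/NaN:

```python
E, M = 6, 2                                   # exponent and mantissa bits
bias = 2 ** (E - 1) - 1                       # exponent bias b = 31
max_normal = (2 - 2.0 ** -M) * 2.0 ** ((2 ** E - 2) - bias)  # 1.75 * 2^31
min_normal = 2.0 ** (1 - bias)                # 2^-30
min_subnormal = 2.0 ** (1 - bias - M)         # 2^-32
unit_roundoff = 2.0 ** -(M + 1)               # 1/8
```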

2. Quantization Scaling Laws and Loss Analysis

The unified scaling law for floating-point quantization, particularly relevant for E6M2, predicts LLM validation loss after training under FP quantization as a function of model size $N$, data size $D$, exponent bits $E$, mantissa bits $M$, and block size $B$ (Sun et al., 5 Jan 2025):

$$L(N,D,E,M,B) = \frac{n}{N^\alpha} + \frac{d}{D^\beta} + \epsilon + \frac{D^\beta}{N^\alpha} \cdot \frac{\log_2 B}{\gamma\,(E+0.5)^\delta (M+0.5)^\nu}$$

For E6M2, $(E, M) = (6, 2)$ and a typical block size $B = 128$ yield a slightly elevated loss relative to the optimal E4M3 layout (which is closer to the empirical optimum for 8-9 bit formats), but they remain within cost-performance sweet spots, especially for regimes with stringent compute budgets. Exponent bits have approximately 8% more impact on validation loss per bit than mantissa bits ($\delta \approx 3.19$ vs. $\nu \approx 2.95$). For fixed precision $P = E + M + 1$, the bit allocation minimizing loss is $E_{\rm opt} = \frac{\delta}{\delta + \nu}P - 0.5$, $M_{\rm opt} = \frac{\nu}{\delta + \nu}P - 0.5$ (Sun et al., 5 Jan 2025).
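The law above is straightforward to evaluate numerically. In the sketch below, $\delta$ and $\nu$ take the values quoted in the text, but the remaining fitted constants ($n$, $d$, $\alpha$, $\beta$, $\epsilon$, $\gamma$) are placeholders, not the published fits from Sun et al.:

```python
import math

def fp_quant_loss(N, D, E, M, B,
                  n=1.0, d=1.0, alpha=0.5, beta=0.5,
                  eps=0.0, gamma=1.0, delta=3.19, nu=2.95):
    """Predicted validation loss under (E, M) floating-point quantization."""
    quant = (D ** beta / N ** alpha) * math.log2(B) / (
        gamma * (E + 0.5) ** delta * (M + 0.5) ** nu)
    return n / N ** alpha + d / D ** beta + eps + quant
```

With everything else fixed, the predicted loss grows with block size $B$ (more values share one scale) and shrinks monotonically with model size $N$, since both the capacity term and the quantization term carry $N^{-\alpha}$.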

3. Microscaling and Shared-Exponent Quantization

Microscaling introduces blockwise quantization with a shared scale $S$ for each block of $B$ values. For E6M2, a block's scale is determined by the maximum absolute value within the block, scaled so the largest value maps to the largest normal representable value $f_{\max}$:

$$S = \frac{f_{\max}}{\max_i |x_i|}$$

where $f_{\max}(6,2) = 1.75 \cdot 2^{31}$ (Cococcioni et al., 2 Oct 2025, Narayan et al., 9 Feb 2025).

Each value $x_i$ in the block is then quantized as $q_i = \text{cast}_{\rm E6M2}(x_i \cdot S)$, and dequantization is performed as $\hat{x}_i = q_i / S$ (with $S = f_{\max}/\max_i |x_i|$, multiplying maps the block maximum onto $f_{\max}$). Handling of subnormals and overflow follows IEEE-inspired conventions: scaled values below $2^{-32}$ round to zero; values above $f_{\max}$ saturate to $\pm\infty$.
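A minimal pure-Python sketch of this block scheme, using the convention that the scale $S = f_{\max}/\max_i|x_i|$ multiplies raw values up into E6M2 range. The cast is emulated by snapping to the nearest representable value; production kernels would use integer bit manipulation, and the rounding mode here is Python's round-half-even rather than any particular hardware's choice:

```python
import math

F_MAX = 1.75 * 2.0 ** 31        # largest finite E6M2 normal
F_MIN_SUB = 2.0 ** -32          # smallest positive subnormal

def cast_e6m2(x):
    """Round x (|x| <= F_MAX assumed) to the nearest E6M2 value."""
    if x == 0.0:
        return 0.0
    sign, a = (-1.0 if x < 0 else 1.0), abs(x)
    if a < F_MIN_SUB:
        return 0.0              # below smallest subnormal: flush to zero
    e = max(min(math.floor(math.log2(a)), 31), -30)  # clamp to normal range
    step = 2.0 ** (e - 2)       # grid spacing with 2 mantissa bits
    return sign * round(a / step) * step

def quantize_block(xs):
    scale = F_MAX / max(abs(v) for v in xs)   # shared per-block scale S
    return [cast_e6m2(v * scale) for v in xs], scale

def dequantize_block(qs, scale):
    return [q / scale for q in qs]
```

Note that the block maximum always round-trips exactly, since it is mapped onto $f_{\max}$, which is itself representable.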

This methodology allows E6M2 to drastically reduce memory footprint: for block size $B = 32$, E6M2 microscaling requires 8.25 bits per value (including scale overhead), yielding approximately 48% memory reduction vs. bf16 (Cococcioni et al., 2 Oct 2025).

4. Training Implementation and Scaling Strategies

μnit scaling enables static scaling based on a calibration or initialization scan per layer (Narayan et al., 9 Feb 2025). For an arbitrary tensor $X$ and the E6M2 format, compute:

  • $X_{\max,i} = \max |X|$ for the relevant tensor $i$
  • $S_i = 1.75 \cdot 2^{31} / X_{\max,i}$

The scaled and clipped values $\tilde{X} = \text{clip}(S_i X, -f_{\max}, +f_{\max})$ are then cast to E6M2, used in the forward/backward passes, and dequantized as needed. For LLMs using unit scaling at initialization (e.g., $1/\sqrt{\text{fan-in}}$), the distribution of $X$ is stationary enough that $S_i$ may require only infrequent updates.
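The static variant reduces to one calibration pass and a cached scale per tensor. In this sketch the clip stands in for the saturating E6M2 cast, and the function names are illustrative rather than from any library:

```python
F_MAX = 1.75 * 2.0 ** 31   # largest finite E6M2 normal

def calibrate_scale(x_max):
    """Static per-tensor scale S_i = f_max / X_max,i from a calibration scan."""
    return F_MAX / x_max

def scale_and_clip(x, s):
    """Scale into E6M2 range, saturating at +/- f_max."""
    return max(-F_MAX, min(F_MAX, s * x))
```

Because $S_i$ is fixed between (infrequent) recalibrations, no per-step max reduction is required, unlike the dynamic block-wise scheme.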

Backpropagation commonly employs a high-precision master copy of weights, with gradients optionally stored in E6M2; forward and backward computation employ E6M2 or mixed-precision accumulators. Use of an exact or high-precision accumulator per block is necessary to minimize quantization error and avoid accumulation overflow (Cococcioni et al., 2 Oct 2025).
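The recommended accumulation pattern can be sketched as a blockwise dot product with one wide accumulator per block. Python floats stand in for the high-precision accumulator here; the per-block structure, not the arithmetic width, is the point:

```python
def block_dot(qa, sa, qb, sb, block=32):
    """Dot product of two block-quantized vectors with per-block scales."""
    total = 0.0
    for start in range(0, len(qa), block):
        acc = 0.0                                  # wide accumulator per block
        for j in range(start, min(start + block, len(qa))):
            acc += qa[j] * qb[j]                   # low-precision products
        b = start // block
        total += acc / (sa[b] * sb[b])             # undo both scales once
    return total
```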

5. Empirical Observations and Practical Trade-offs

Practical use of E6M2 in LLMs such as GPT-2 yields key trade-offs between range, memory, and accuracy (Cococcioni et al., 2 Oct 2025):

  • For block size $B = 32$, E6M2 microscaling achieves $\sim 48\%$ reduction in memory usage vs. bf16 at the cost of a modest relative-error increase (+0.112).
  • Full use of the dynamic range requires careful tuning or modest amplification of $S$ when the data distribution is tightly concentrated near zero.
  • Configurations mixing Microscaling E6M2 with master-copy bf16 storage for sensitive parameters (e.g., probabilities) provide stable training with only marginal accuracy degradation.
  • Lower mantissa bits induce coarser quantization, necessitating higher-precision accumulation or mixed-precision staging for dot-product computations to avoid overflow and excessive quantization loss.

Table: Example empirical results for GPT-2 (Tiny Shakespeare, 124M parameters, 100 steps; Cococcioni et al., 2 Oct 2025)

| Configuration | Memory Reduction vs. bf16 | Rel. Error Increase |
| --- | --- | --- |
| bf16 (all) | Baseline | +0.112 |
| E6M2 Microscaling (B=32) + bf16 master | 48% | +0.112 |
| E4M3 Microscaling (weights+grads) | >50% | +0.995 |

6. Critical Data Size and Limitations

A key finding for low-precision training is the existence of a critical data size $D_{\rm crit}$ beyond which further increases in token count can degrade model performance due to excessive quantization-induced information loss. For E6M2, $D_{\rm crit}$ is parametrized by model size $N$, block size $B$, and precision allocation, and is computed as (Sun et al., 5 Jan 2025):

$$D_{\rm crit} = \left[\frac{d\,\gamma\,N^\alpha (E+0.5)^\delta (M+0.5)^\nu}{\log_2 B}\right]^{1/(2\beta)}$$

For standard LLM regimes, E6M2 raises $D_{\rm crit}$ substantially above E4M3 or E2M1, making it better suited for longer training runs and larger datasets under fixed precision. Surpassing $D_{\rm crit}$ leads to increasing loss, limiting the utility of over-training under aggressively quantized settings.
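The formula translates directly to code. As before, $\delta$ and $\nu$ come from the text, while $d$, $\gamma$, $\alpha$, and $\beta$ are placeholders standing in for the published fits:

```python
import math

def d_crit(N, E, M, B, d=1.0, gamma=1.0, alpha=0.5, beta=0.5,
           delta=3.19, nu=2.95):
    """Critical data size beyond which added tokens raise predicted loss."""
    num = d * gamma * N ** alpha * (E + 0.5) ** delta * (M + 0.5) ** nu
    return (num / math.log2(B)) ** (1.0 / (2.0 * beta))
```

Even with these placeholder constants, $(6.5)^\delta (2.5)^\nu$ exceeds both $(4.5)^\delta (3.5)^\nu$ and $(2.5)^\delta (1.5)^\nu$, reproducing the ordering stated above: E6M2 tolerates more training tokens than E4M3 or E2M1 before quantization-induced loss growth sets in.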

7. Cost-Optimal Precision and Future Directions

For compute- and memory-bound regimes, the family of 4–8 bit floating-point formats remains cost-optimal according to floating-point scaling laws. E6M2 (8 or 9 bits) lies near the upper end of this tradeoff, offering increased range at the expense of reduced fractional accuracy compared to E4M3 or E3M4. Its principal application is in blockwise-quantized LLMs and transformers where large dynamic range is demanded by activation and gradient statistics, particularly in deeper models or those employing post-activation normalization.

A plausible implication is that further hardware support for E6M2-style blockwise quantization, exact per-block accumulation, and mixed-precision master copies would unlock additional efficiency gains while safeguarding model accuracy (Sun et al., 5 Jan 2025, Cococcioni et al., 2 Oct 2025, Narayan et al., 9 Feb 2025).
