
Hierarchical Block Floating-Point (BFP)

Updated 4 March 2026
  • Hierarchical block floating-point (BFP) uses the E6M2 format with 1 sign, 6 exponent, and 2 mantissa bits to achieve a wide dynamic range for low-precision LLM training.
  • It leverages microscaling quantization by partitioning tensors into blocks that share a power-of-two scale, ensuring values fit within the representable range with minimal error.
  • Empirical results show that E6M2 yields significant memory and compute reductions while maintaining output coherence, despite a modest increase in quantization loss.

E6M2 is an 8- or 9-bit floating-point format with 6 exponent bits and 2 mantissa bits, deployed primarily for low-precision training and inference of LLMs. It enables aggressive memory and compute reductions while providing a wide dynamic range, making it suitable for both microscaling block-quantization frameworks and static unit-scaling pipelines. Its exponent–mantissa tradeoff, scaling-law implications, quantization procedures, and empirical performance on transformer architectures have been the subject of recent in-depth arXiv studies.

1. Definition and Numerical Properties

E6M2 denotes a floating-point representation with 1 sign bit, 6 exponent bits ($E = 6$), and 2 mantissa bits ($M = 2$), for a total width of 8 (microscaling) or 9 (block-quantized) bits depending on context (Cococcioni et al., 2 Oct 2025, Sun et al., 5 Jan 2025). The exponent bias is $b = 2^{E-1} - 1 = 31$, with normalized exponent field $k = 1, \ldots, 62$, yielding real exponents from $e_{\min} = 1 - b = -30$ to $e_{\max} = (2^6 - 2) - b = +31$. The mantissa resolution is $1/4$.

Dynamic range and precision:

  • Minimum positive normal: $2^{-30}$
  • Maximum finite normal: $(2 - 2^{-2}) \cdot 2^{31} = 1.75 \cdot 2^{31}$
  • Minimum positive subnormal: $2^{-32}$
  • Unit roundoff: $u = 2^{-3} = 1/8$ (relative error $\lesssim 0.125$)

This dynamic range is sufficient to accommodate near-peak intermediate values in LLMs, though the 2-bit mantissa provides only four subdivisions within each power-of-two bin.
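
To make these constants concrete, the following minimal Python sketch (our construction, not code from the cited papers) rounds a double-precision value to the nearest E6M2-representable value, saturating overflow to $\pm\infty$ as described in Section 4:

```python
import math

# E6M2 constants from Section 1: 1 sign, 6 exponent, 2 mantissa bits.
E_BITS, M_BITS = 6, 2
BIAS = 2 ** (E_BITS - 1) - 1                 # 31
E_MIN = 1 - BIAS                             # -30
E_MAX = (2 ** E_BITS - 2) - BIAS             # +31
F_MAX = (2 - 2.0 ** -M_BITS) * 2.0 ** E_MAX  # 1.75 * 2**31

def round_e6m2(x: float) -> float:
    """Round x to the nearest E6M2-representable value (saturating)."""
    if x == 0.0 or math.isnan(x) or math.isinf(x):
        return x
    sign = math.copysign(1.0, x)
    mag = abs(x)
    if mag > F_MAX:
        return sign * math.inf               # overflow saturates
    # Clamping the exponent at E_MIN places small values on the
    # subnormal grid, whose spacing is 2**(E_MIN - M_BITS) = 2**-32.
    e = max(math.floor(math.log2(mag)), E_MIN)
    step = 2.0 ** (e - M_BITS)               # grid spacing in this binade
    return sign * round(mag / step) * step   # round-to-nearest(-even)
```

For example, `round_e6m2(1.3)` returns 1.25: within the binade $[1, 2)$ the grid spacing is $1/4$, matching the stated mantissa resolution.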

2. Quantization Methodologies and Scale Computation

Microscaling Quantization

Microscaling (Cococcioni et al., 2 Oct 2025) partitions tensors into blocks (typical size $B = 32$ or $B = 128$) that share a single power-of-two scale $S$ per block. The scale factor is computed to ensure all values fit within the representable E6M2 range. For a tensor block $X = \{x_1, \ldots, x_N\}$, $S$ is chosen as

$$S = \frac{\max_i |x_i|}{f_{\max}(E, M)},$$

rounded up to the nearest power of two, where $f_{\max}(6, 2) = 1.75 \cdot 2^{31}$. Quantized values are $q_i = \operatorname{round}(x_i / S)$; dequantization recovers $\hat{x}_i = q_i \cdot S$. Blockwise quantization facilitates memory efficiency and hardware alignment.
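
A minimal blockwise sketch under the same assumptions (NumPy tensors, the `round_e6m2` helper from the sketch in Section 1, and a scale rounded up to a power of two so the scaled peak never exceeds $f_{\max}$):

```python
import math
import numpy as np

F_MAX = 1.75 * 2.0 ** 31   # f_max(6, 2) from Section 1

def quantize_block(x: np.ndarray):
    """Quantize one block with a single shared power-of-two scale S."""
    peak = float(np.abs(x).max())
    if peak == 0.0:
        return np.zeros_like(x), 1.0
    # ceil keeps peak / S <= F_MAX, and a power-of-two S can be stored
    # compactly as one shared exponent byte per block.
    S = 2.0 ** math.ceil(math.log2(peak / F_MAX))
    q = np.array([round_e6m2(v) for v in x / S])  # elementwise E6M2 rounding
    return q, S

def dequantize_block(q: np.ndarray, S: float) -> np.ndarray:
    return q * S   # recovers x_hat_i = q_i * S
```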

Unit Scaling and Static Scaling

Unit scaling (Narayan et al., 9 Feb 2025) simplifies the pipeline by statically calibrating $S$ from empirical peak magnitudes (usually per layer or block). The static $S$ exploits the unit-variance normalization of transformer layers: typical signals then cluster in $|\cdot| \lesssim O(1)$, maximizing dynamic-range exploitation. For tensors with atypically narrow dynamic range, a calibrated amplification factor can be folded into $S$ to avoid underutilizing the exponent bits, though this must be balanced carefully so that outliers do not saturate to $\pm\infty$.
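
The calibration step can be sketched as follows (names and API are hypothetical, not from Narayan et al.): peak magnitudes are recorded per layer over a calibration set, and a power-of-two scale is then frozen per layer, with an optional safety margin guarding against outlier saturation:

```python
import math

F_MAX = 1.75 * 2.0 ** 31   # f_max(6, 2) from Section 1

def calibrate_static_scales(peaks: dict, margin: float = 2.0) -> dict:
    """peaks maps layer name -> max |value| seen during calibration."""
    scales = {}
    for name, peak in peaks.items():
        # margin > 1 leaves headroom so rare outliers do not saturate
        # to +/- infinity once the scale is frozen.
        scales[name] = 2.0 ** math.ceil(math.log2(margin * peak / F_MAX))
    return scales

print(calibrate_static_scales({"attn.qkv": 3.2, "mlp.fc1": 0.9}))
```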

3. Scaling Laws and Performance Theory

Validation loss $L$ for an LLM quantized in E6M2 may be predicted via the unified scaling law (Sun et al., 5 Jan 2025):

$$L(N, D, E, M, B) = \frac{n}{N^\alpha} + \frac{d}{D^\beta} + \epsilon + \frac{D^\beta}{N^\alpha} \cdot \frac{\log_2 B}{\gamma\, (E+0.5)^\delta\, (M+0.5)^\nu}$$

with empirically fitted constants $\alpha = 0.2368$, $\beta = 0.5162$, $n = 69.2343$, $d = 6.8973 \times 10^4$, $\epsilon = 1.9061$, $\gamma = 1.1335 \times 10^4$, $\delta = 3.1926$, $\nu = 2.9543$.
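
The law is straightforward to evaluate numerically; the sketch below is a direct transcription of the formula with the constants quoted above (the operating point in the example is ours, chosen only for illustration):

```python
import math

ALPHA, BETA = 0.2368, 0.5162
N_COEF, D_COEF, EPS = 69.2343, 6.8973e4, 1.9061
GAMMA, DELTA, NU = 1.1335e4, 3.1926, 2.9543

def predicted_loss(N, D, E, M, B):
    """Predicted validation loss for an N-parameter model on D tokens."""
    quant_penalty = (D ** BETA / N ** ALPHA) * math.log2(B) / (
        GAMMA * (E + 0.5) ** DELTA * (M + 0.5) ** NU)
    return N_COEF / N ** ALPHA + D_COEF / D ** BETA + EPS + quant_penalty

# The quantization penalty enters through the last term and grows
# with block size B via the log2(B) factor.
print(predicted_loss(124e6, 1e9, 6, 2, 32))    # E6M2, B = 32
print(predicted_loss(124e6, 1e9, 6, 2, 128))   # E6M2, B = 128
```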

Compared to E4M3 (the empirically optimal 8-bit exponent–mantissa split under this framework), E6M2 exhibits slightly higher quantization loss for a similar bit width due to its higher exponent-to-mantissa ratio.

Exponent vs. Mantissa Impact

Exponent bits have approximately 8% more effect on loss per bit than mantissa bits ($\delta \approx 3.19$ vs. $\nu \approx 2.95$), so increasing $E$ is marginally more beneficial for loss minimization at a fixed total bit budget. The optimal bit split for $P = E + M + 1$ total bits is:

$$M_{\rm opt} = \frac{\nu}{\delta + \nu} P - 0.5, \qquad E_{\rm opt} = \frac{\delta}{\delta + \nu} P - 0.5$$

For $P = 9$, these formulas give $E_{\rm opt} \approx 4.2$ and $M_{\rm opt} \approx 3.8$, so E6M2 ($E = 6$, $M = 2$) overweights the exponent relative to the optimum; this deviation results in modestly greater quantization error than E4M3 (Sun et al., 5 Jan 2025).
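
Evaluating the closed-form optimum with the fitted $\delta$ and $\nu$ reproduces these figures (a direct transcription of the formula above):

```python
DELTA, NU = 3.1926, 2.9543

def optimal_split(P: float):
    """Optimal exponent/mantissa allocation for P = E + M + 1 bits."""
    e_opt = DELTA / (DELTA + NU) * P - 0.5
    m_opt = NU / (DELTA + NU) * P - 0.5
    return e_opt, m_opt

print(optimal_split(9))   # ~(4.17, 3.83): E6M2's E = 6 overshoots the optimum
print(optimal_split(8))   # ~(3.65, 3.35): close to E4M3's split
```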

4. Implementation Details and Numerical Pathologies

Quantization Granularity

Block size $B$ mediates the trade-off between overhead and quantization error. Empirically, block sizes of $B = 128$ or channel-wise partitioning yield nearly indistinguishable loss profiles (Sun et al., 5 Jan 2025). Smaller blocks improve scale adaptivity at the cost of greater metadata and scale-storage overhead. Exact or integer accumulation within blocks is key to minimizing rounding error and suppressing in-block overflow.
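
The accumulation point can be illustrated with a blockwise dot product (our sketch, reusing the block layout from Section 2): the two shared scales are applied once, after an exact higher-precision accumulation, instead of rescaling every low-precision product:

```python
import numpy as np

def block_dot(qa: np.ndarray, Sa: float, qb: np.ndarray, Sb: float) -> float:
    """Dot product of two E6M2 blocks with shared scales Sa and Sb."""
    # Accumulating in float64 keeps the in-block sum effectively exact,
    # avoiding compounded rounding error and in-block overflow.
    acc = np.dot(qa.astype(np.float64), qb.astype(np.float64))
    return acc * Sa * Sb   # apply both scales once, at the end
```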

Subnormals, Overflow, and Underflow

Values outside the normal exponent range use the subnormal encoding where possible ($k = 0$, $m \ne 0$), with underflow truncating to zero and overflow saturating to $\pm\infty$. Static scales, or erroneously amplified scales, can lead to non-reversible quantization artifacts if data-distribution assumptions are violated.

Gradient Handling and Mixed Precision

Common LLM training schemes retain a full-precision “master copy” of parameters, updating with high-precision optimizers (e.g., AdamW), and re-quantizing for forward/backward propagation in E6M2. Storing gradients in E6M2 is feasible for memory reduction (Cococcioni et al., 2 Oct 2025), but dot-products and accumulations should use higher-precision or exact arithmetic to suppress blockwise sum errors.
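
A toy sketch of the master-copy loop (assumptions: a quadratic objective and plain SGD stand in for the LLM loss and AdamW, and `quantize_block` / `dequantize_block` are the microscaling helpers sketched in Section 2):

```python
import numpy as np

rng = np.random.default_rng(0)
master_w = rng.standard_normal(256)    # full-precision "master copy"
target = rng.standard_normal(256)
lr = 1e-2

for step in range(200):
    q, S = quantize_block(master_w)    # re-quantize the master each step
    w_lp = dequantize_block(q, S)      # E6M2 view used for fwd/bwd
    grad = 2.0 * (w_lp - target)       # gradient computed on the E6M2 view
    master_w -= lr * grad              # high-precision update of the master
```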

5. Empirical Results and Application to LLMs

Experiments with E6M2 microscaling quantization on GPT-2 (124M parameters, Tiny Shakespeare, $B = 32$) demonstrate a memory reduction of 48% relative to bf16 and 25% relative to full precision, with only moderate loss inflation (see the table below) (Cococcioni et al., 2 Oct 2025):

Configuration             Loss vs. baseline    Rel. error
bf16 + master copy        +0.09%               +0.0036
Microscaling/float16      +0.48%               +0.0187
E4M3 mats/grads + bf16    +25.7%               +0.995
Full E4M3 Microscaling    +122%                +4.745

Inference with E6M2 and similar formats is observed to preserve output coherence, whereas reducing the number of exponent bits (e.g., E3M4) quickly destabilizes model output through underflow and overflow artifacts.

6. Critical Data Size and Compute-Optimal Precision

Under low-precision quantization, exceeding a critical data size $D_{\rm crit}$ increases loss:

$$D_{\rm crit} = \left[ \frac{d\,\gamma\,N^\alpha\,(E+0.5)^\delta\,(M+0.5)^\nu}{\log_2 B} \right]^{1/(2\beta)}$$

For E6M2, $D_{\rm crit}$ is finite but significantly larger than for E4M3 or E2M1 at fixed $N$, meaning practical large-scale LLM training is typically below this threshold (Sun et al., 5 Jan 2025).
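
Plugging the Section 3 constants into the expression above gives a feel for the scale of $D_{\rm crit}$ (the 124M-parameter, $B = 32$ operating point is our illustrative choice):

```python
import math

ALPHA, BETA = 0.2368, 0.5162
D_COEF, GAMMA, DELTA, NU = 6.8973e4, 1.1335e4, 3.1926, 2.9543

def d_crit(N, E, M, B):
    """Critical data size beyond which added tokens increase loss."""
    num = D_COEF * GAMMA * N ** ALPHA * (E + 0.5) ** DELTA * (M + 0.5) ** NU
    return (num / math.log2(B)) ** (1.0 / (2.0 * BETA))

print(f"E6M2: {d_crit(124e6, 6, 2, 32):.3e} tokens")
print(f"E4M3: {d_crit(124e6, 4, 3, 32):.3e} tokens")
```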

In compute-constrained training, cost-optimal precision is dictated by total FLOPs ($C$), with 4–8 bits per value found to be optimal across $10^{21}$–$10^{31}$ FLOPs, a range inclusive of E6M2's 8- or 9-bit width. Using fewer bits permits training models of larger width or on more data at fixed compute, while higher bit widths confer diminishing returns.

7. Trade-offs and Practical Recommendations

  • Format trade-off: E6M2 maximizes dynamic range (large $E$) at the expense of resolution (small $M$), favoring stability for activation and gradient tensors in deep transformers.
  • Scaling methodology: Static scaling is favored when signal distributions are stable; dynamic or blockwise scaling is preferable for highly non-stationary data (Narayan et al., 9 Feb 2025).
  • Accumulation fidelity: Exact accumulators are essential for suppressing blockwise summation errors; in-block oversights can compound quantization loss.
  • Loss minimization: Exponent bits should be favored for overall model performance given a limited bit budget.
  • Application regime: For practical LLM training and inference, E6M2 offers an attractive compromise of memory efficiency, training stability, and ease of hardware implementation, providing near-optimal scaling loss at high compute efficiency when paired with proper master-copy and mixed-precision techniques (Sun et al., 5 Jan 2025, Cococcioni et al., 2 Oct 2025, Narayan et al., 9 Feb 2025).

The E6M2 format, as defined and characterized in current research, serves as a viable precision format for large-scale deep learning under modern quantization and memory-constrained training paradigms.
