
HiFloat8: Adaptive 8-Bit Deep Learning Format

Updated 8 March 2026
  • HiFloat8 is an 8-bit floating-point format with a tapered-precision design that dynamically adjusts the allocation between exponent and mantissa bits.
  • It supports both training and inference for CNNs, transformers, and LLMs by nearly matching FP16 dynamic range while reducing memory and computational overhead.
  • Hardware-optimized for Ascend NPUs, HiFloat8 offers robust quantization, efficient rounding modes, and seamless integration with mixed-precision workflows.

HiFloat8 (HiF8) is an 8-bit floating-point data format engineered for modern deep learning workloads. Its tapered-precision scheme trades precision against dynamic range depending on the magnitude of the value being encoded. HiF8 was initially proposed for the Ascend AI accelerator family, and has since been systematically evaluated across training and inference for convolutional neural networks (CNNs), transformers, and LLMs. The core principle underpinning HiF8 is its adaptive allocation of mantissa and exponent bits: precision is highest where it matters most, while the dynamic range nearly matches that of FP16. This addresses the limitations of static 8-bit floating-point alternatives and integer quantization schemes (Luo et al., 2024; Zhao et al., 13 Feb 2026; Ye et al., 2 Feb 2026).

1. Bit-Level Structure and Encoding

HiF8 uses 8 bits organized with non-uniform field allocation, separating it from classical IEEE-754 derivatives and block-scaled formats.

  • Fields and Mode Selection:
    • Sign (S/s): 1 bit (0 for positive, 1 for negative).
    • Taper ("Dot") field: 2–4 bits; uniquely prefix-coded to signal five distinct modes (D ∈ {4,3,2,1,0}) or a denormal (DML) mode.
    • Exponent (Em): D bits (sign-magnitude, no bias), with D chosen by the dot-code.
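    • Mantissa (M): the bits left over after the sign, dot, and exponent fields. This is (5–D) bits when the dot code is 2 bits wide and 3 bits for the longer dot codes, matching the table below. In DML (denormal) mode, the mantissa functions as an exponent extension.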

The general encoding for a normal number is:

$$X = (-1)^S \times 2^E \times \left(1 + \frac{M}{2^P}\right)$$

where $E$ is the decoded signed exponent and $P$ is the mantissa bit width ($P = 5 - D$ when the dot code is 2 bits wide). For denormals, the mantissa encodes a biased exponent, extending the range deeper into small magnitudes:

$$X = (-1)^S \times 2^{M - 23} \times 1.0 \quad (M \in [1,7])$$

  • Dynamic Range: $E$ covers $[-22, +15]$, for a total of $38$ exponent binades (versus FP16's $40$).
  • Zero/Infinity/NaN: All special values are supported, but there are no dual-encoded (signed) zeros. Zero has a single encoding; the largest normal codes yield $\pm\infty$ and NaN.
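The two value formulas above can be checked numerically with a minimal Python sketch. It assumes the fields have already been decoded from the byte (the dot-code bit patterns are not given here, so no bit-level parsing is attempted), and the function names `hif8_normal_value` and `hif8_denormal_value` are hypothetical.

```python
def hif8_normal_value(sign, exponent, mantissa, mantissa_bits):
    """Evaluate X = (-1)^S * 2^E * (1 + M / 2^P) for already-decoded fields.

    sign: 0 or 1; exponent: signed integer E; mantissa: integer M;
    mantissa_bits: mantissa width P. All names are illustrative.
    """
    if not 0 <= mantissa < 2 ** mantissa_bits:
        raise ValueError("mantissa does not fit in mantissa_bits")
    return (-1.0) ** sign * 2.0 ** exponent * (1.0 + mantissa / 2 ** mantissa_bits)


def hif8_denormal_value(sign, mantissa):
    """Evaluate X = (-1)^S * 2^(M - 23) for denormal mode, with M in [1, 7]."""
    if not 1 <= mantissa <= 7:
        raise ValueError("denormal mantissa must be in [1, 7]")
    return (-1.0) ** sign * 2.0 ** (mantissa - 23)
```

For example, `hif8_denormal_value(0, 1)` gives the smallest positive magnitude $2^{-22}$ and `hif8_denormal_value(0, 7)` gives $2^{-16}$, which together span the denormal exponent range $[-22, -16]$.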

The allocation of exponent and mantissa bits is summarized as follows:

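| Mode D | Exponent range (E) | Exponent bits | Mantissa bits | Description |
|---|---|---|---|---|
| DML | $-22 \leq E \leq -16$ | 0 | 0 (biased ext.) | Denormal |
| 4 | $8 \leq \lvert E \rvert \leq 15$ | 4 | 1 | Wide range |
| 3 | $4 \leq \lvert E \rvert \leq 7$ | 3 | 2 | Medium band |
| 2 | $2 \leq \lvert E \rvert \leq 3$ | 2 | 3 | Fine band |
| 1 | $-1 \leq E \leq +1$ | 1 | 3 | Center band |

Because the exponent is stored in sign-magnitude form, the bands for D ≥ 2 are symmetric about zero and are given here as magnitudes.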

Mantissa granularity thus increases as exponent magnitude decreases, implementing tapered precision (Luo et al., 2024, Zhao et al., 13 Feb 2026).

2. Tapered-Precision and Quantization

HiF8's variable-precision mechanism dynamically adjusts the division between exponent and mantissa, allocating more mantissa bits for small exponents where relative error is more impactful.

  • Tapered Precision: The number of fractional bits ($P$) allocated to the mantissa varies as a staircase with respect to the exponent binade $E$. Centered values ($E \approx 0$) have up to 3 mantissa bits, ensuring higher fidelity for most neural network activations and gradients.
  • Quantization/Dequantization (as implemented):
    • Given $x \in \mathbb{R}$:
    • Compute the exponent $e = \lfloor \log_2(|x|+\epsilon) \rfloor$.
    • Select $n_m(e)$ mantissa bits according to $|e|$:
      • $n_m = 3$ if $|e| \leq 3$,
      • $n_m = 2$ if $3 < |e| \leq 7$,
      • $n_m = 1$ if $7 < |e| \leq 15$,
      • $n_m = 0$ otherwise.
    • Quantize:

    $$\tilde x = \frac{|x|}{2^{e - n_m(e)}}, \qquad m_q = \mathrm{round}(\tilde x)$$

    • Dequantize:

    $$Q_{\mathrm{HiF8}}(x) = (-1)^{s} \cdot m_q \cdot 2^{e - n_m(e)}$$

Notably, for extreme exponents the representation degenerates to a pure power-of-two, maximizing range at the cost of fractional resolution (Luo et al., 2024, Zhao et al., 13 Feb 2026, Ye et al., 2 Feb 2026).
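The quantize/dequantize steps listed above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the reference implementation: `round` is taken to mean ties away from zero (the TA mode described in Section 3), the `eps` value is arbitrary, and range clamping and special values are left out. The function names are hypothetical.

```python
import math


def mantissa_bits(e):
    """Staircase n_m(e): fewer mantissa bits as |e| grows."""
    a = abs(e)
    if a <= 3:
        return 3
    if a <= 7:
        return 2
    if a <= 15:
        return 1
    return 0


def hif8_fake_quantize(x, eps=1e-30):
    """Quantize then dequantize x following the steps above.

    Assumptions: ties round away from zero; no overflow/underflow clamping;
    eps is an arbitrary small constant guarding log2(0).
    """
    if x == 0.0:
        return 0.0
    s = 1 if x < 0 else 0
    e = math.floor(math.log2(abs(x) + eps))
    step = 2.0 ** (e - mantissa_bits(e))   # spacing of representable values in this binade
    m_q = math.floor(abs(x) / step + 0.5)  # round(x~), ties away from zero on the magnitude
    return (-1) ** s * m_q * step
```

With this sketch, `hif8_fake_quantize(0.3)` returns `0.3125` (3 mantissa bits at $e = -2$), while `hif8_fake_quantize(-300.0)` returns `-256.0` (1 mantissa bit at $e = 8$), which illustrates how resolution coarsens away from the center band.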

3. Rounding Modes and Conversion Workflow

Casting from higher-precision floating-point (FP32/FP16/BF16) to HiF8 supports two principal rounding modes for optimal training and inference:

  • Round-Half-Away (TA): Nearest rounding with ties rounded away from zero; enables marginally better AI-training accuracy and simpler hardware than round-to-even.

  • Hybrid Rounding (HR): TA is applied when $|\text{source exponent}| < 4$; otherwise, a simplified, threshold-based stochastic rounding (using fixed-width thresholds rather than RNGs) approximates 1 ulp or 0.75 ulp accuracy.

  • Overflow Handling: Clamp to maximum representable value, with optional NaN-to-zero saturation.

For the forward pass, TA is used exclusively; the backward pass supports both TA and HR, chosen according to the characteristics of the gradient distribution (Luo et al., 2024).
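The difference between the two rounding modes can be illustrated with a small Python sketch. It operates on an already-scaled mantissa value `v` and takes the stochastic threshold as an explicit argument in [0, 1), which keeps the example deterministic. How the hardware derives its fixed-width thresholds is not described here, so the `threshold` parameter and all function names are assumptions.

```python
import math


def round_half_away(v):
    """TA: round to nearest, ties away from zero."""
    return math.copysign(math.floor(abs(v) + 0.5), v)


def threshold_round(v, threshold):
    """Stochastic-style rounding: round the magnitude up when its
    fractional part exceeds the supplied threshold in [0, 1)."""
    mag = abs(v)
    base = math.floor(mag)
    frac = mag - base
    return math.copysign(base + (1 if frac > threshold else 0), v)


def hybrid_round(v, source_exponent, threshold):
    """HR: TA for |source exponent| < 4, threshold rounding otherwise."""
    if abs(source_exponent) < 4:
        return round_half_away(v)
    return threshold_round(v, threshold)
```

Note that Python's built-in `round(2.5)` returns `2` (ties to even), whereas `round_half_away(2.5)` returns `3.0`. If the threshold is drawn uniformly from [0, 1), `threshold_round` rounds up with probability equal to the fractional part, which is the usual unbiased stochastic-rounding behavior.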

4. Training and Inference Protocols

HiF8 is compatible with both traditional deep networks and LLMs. Its protocols mirror established mixed-precision workflows while benefiting from improved dynamic range and quantization adaptability.

A. Standard Deep Network Training:

  • All core GEMM inputs (activations, weights, activation gradients) are stored in HiF8.

  • Accumulation is performed in FP16.

  • All other numeric operations use FP32 or FP16 as appropriate.

  • Gradient underflow is prevented via global backward loss-scaling.

B. LLM Training:

  • Backward Loss-Scaling (BLS): As above.

  • Adaptive Loss-Scaling (ALS): Dynamically adapts loss-scale window for gradient distribution.

  • Per-Tensor Scaling (PTS): Power-of-two scaling maintained per GEMM input, updated periodically to ensure optimal coverage of HiF8's range, similar to NVIDIA's Transformer Engine but with reduced update cost due to HiF8's broader exponent coverage.
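As a rough illustration of power-of-two per-tensor scaling, the sketch below picks the largest integer `k` such that the tensor's absolute maximum, multiplied by $2^k$, stays at or below $2^{15}$ (the maximum normal magnitude listed for HiF8 in Section 5). The target exponent, the update policy, and the function names are assumptions for illustration; the source does not specify how the scale is chosen.

```python
import math


def pow2_scale_exponent(amax, target_exp=15):
    """Largest k with amax * 2**k <= 2**target_exp (hypothetical policy)."""
    if amax <= 0.0:
        return 0  # nothing to scale for an all-zero tensor
    return target_exp - math.ceil(math.log2(amax))


def apply_pow2_scale(values, k):
    """Scale a flat list of values by 2**k; dividing by 2**k undoes it exactly."""
    factor = 2.0 ** k
    return [v * factor for v in values]
```

Because the scale is a power of two, applying and removing it only shifts exponents and introduces no rounding error of its own, barring overflow or underflow in the higher-precision type.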

C. Inference (Post-Training Quantization):

  • Direct casting of all tensors to HiF8; per-tensor scaling and SmoothQuant (outlier folding) are applied as needed for LLMs.

  • SVDQuant provides additional outlier handling without model retraining.
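The outlier folding performed by SmoothQuant can be sketched as follows. The per-channel factor $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ is the standard SmoothQuant formulation; the pure-Python list representation, the default `alpha=0.5`, and the function names are choices made for this example, and every channel is assumed to have a nonzero maximum.

```python
def smoothquant_fold(X, W, alpha=0.5):
    """Move activation outliers into the weights, channel by channel.

    X: list of activation rows (batch x in_features).
    W: list of weight rows (in_features x out_features).
    Returns (X / s, s * W), whose product equals X @ W.
    """
    n_in = len(W)
    s = []
    for j in range(n_in):
        a_max = max(abs(row[j]) for row in X)  # activation range of channel j
        w_max = max(abs(v) for v in W[j])      # weight range of channel j
        s.append(a_max ** alpha / w_max ** (1.0 - alpha))
    X_s = [[row[j] / s[j] for j in range(n_in)] for row in X]
    W_s = [[v * s[j] for v in W[j]] for j in range(n_in)]
    return X_s, W_s


def matmul(A, B):
    """Plain list-of-lists matrix product, used only to check equivalence."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
```

After folding, the activations have a much narrower range, so a direct cast to an 8-bit format loses less information, while `matmul(X_s, W_s)` matches `matmul(X, W)` up to floating-point error.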

HiF8’s one-format solution removes the need for dual-precision schemes such as IBM’s HFP8 (Luo et al., 2024, Zhao et al., 13 Feb 2026).

5. Comparative Analysis with Alternative 8-bit Formats

HiF8 is systematically compared with integer (INT8), block-scaled (MXFP8), and IEEE-style (E4M3/E5M2) quantization.

| Format | Mantissa (bits) | Exponent (bits) | Max Norm | Min Norm | Levels in $[-1,1]$ |
|---|---|---|---|---|---|
| INT8 | n/a | n/a | $\pm 127$ | $-128$ | 256 uniform |
| E4M3 | 3 | 4 | $1.75 \cdot 2^{8}$ | $2^{-6}$ | 113 log-spaced |
| E5M2 | 2 | 5 | $1.75 \cdot 2^{15}$ | $2^{-14}$ | 89 log-spaced |
| MXFP8 | 3 (block) | 4 | E4M3 equiv. | E4M3 equiv. | 113 per-block |
| HiF8 | 0–3 (dynamic) | 6–3 (dynamic) | $2^{15}$ | $2^{-15}$ | 101 log-spaced |

Key distinctions:

  • HiF8 achieves nearly FP16 dynamic range (38 vs 40 binades) and preserves up to 3 mantissa bits in the most frequently occupied magnitude bands.

  • INT8 achieves the highest SQNR on narrow, static weights but is inferior for activations/KV-cache with high variance and outliers, where HiF8's combination of log spacing and dynamic range yields superior performance.

  • For end-to-end low-bit inference tasks (W8A8+KV8), HiF8 avoids catastrophic failures observed with static log-FP8s and achieves a 0.3–0.5% average accuracy advantage (Zhao et al., 13 Feb 2026, Luo et al., 2024).

6. Hardware Integration and Empirical Performance

HiF8 is natively hardware-optimized for Ascend NPUs and supports direct drop-in replacement for FP8/INT8 FPGA and ASIC inference pipelines.

  • Pipeline Integration: HiF8 quantize/dequantize logic is mapped to dedicated "HIF8_Q"/"HIF8_DQ" compiler-supported instructions. Post-training quantization workflows remain unchanged apart from kernel invocation.

  • Efficiency: Achieves up to $2\times$ throughput relative to BF16 with a $50\%$ memory reduction. Latency matches optimized INT8 kernels, with zero software complexity overhead (Zhao et al., 13 Feb 2026).

  • Specialized Softmax: In BAPS attention (Ye et al., 2 Feb 2026), softmax exponentiation using HiF8 halves on-chip data bandwidth and reduces the area of the floating-point exponentiation unit by $\sim 4\times$, with empirical accuracy loss $<1\%$ across LLM and multimodal tasks. Average API-driven restart rates for FP32 recomputation remain $<5\%$.

7. Empirical Results Across Neural Architectures

  • Vision/NLP:

    • Training and convergence curves for ImageNet, COCO, WMT, and MRPC overlap between HiF8 and FP16, with final metric deviation within $\pm 0.4\%$.
    • Inference PTQ with per-tensor scaling yields ≤0.5% top-1 drop for ResNets/ViTs.
  • LLMs:
    • Training PPL (WikiText2): HiF8–FP16 differences fall in the $+0.03$ to $+0.13$ range.
    • With ALS+PTS, HiF8 occasionally exceeds FP16 (e.g., GPT-3 6.7B: 12.99 HiF8 vs 13.06 FP16).
    • Inference PPL (LLaMA-7B, direct-cast): +0.06 loss; with PTS or SmoothQuant: ≤+0.02. OPT-7B is only viable with PTS or SmoothQuant due to outlier distribution; otherwise, loss is catastrophic (+1.60).
  • Softmax with Block-Aware Precision Rescaling:
    • LLMs: <1 percentage-point drop on typical NLP benchmarks (Qwen-3 30B, Llama-3 8B).
    • Multimodal models: Similarity, SSIM, and PSNR metrics remain high, MSE increment under 15%.
    • Throughput: End-to-end inference throughput doubles without chip area penalty (Ye et al., 2 Feb 2026, Luo et al., 2024).

References

  • Ascend HiFloat8 Format for Deep Learning (Luo et al., 2024)
  • Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats (Zhao et al., 13 Feb 2026)
  • BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling (Ye et al., 2 Feb 2026)
