HiFloat8: Adaptive 8-Bit Deep Learning Format
- HiFloat8 is an 8-bit floating-point format with a tapered-precision design that dynamically adjusts the allocation between exponent and mantissa bits.
- It supports both training and inference for CNNs, transformers, and LLMs by nearly matching FP16 dynamic range while reducing memory and computational overhead.
- Hardware-optimized for Ascend NPUs, HiFloat8 offers robust quantization, efficient rounding modes, and seamless integration with mixed-precision workflows.
HiFloat8 (HiF8) is an 8-bit floating-point data format engineered for deep learning workloads. Its tapered-precision scheme trades precision against dynamic range according to the magnitude of the encoded value. HiF8 was initially proposed for the Ascend AI accelerator family and has since been evaluated across training and inference for convolutional neural networks (CNNs), transformers, and LLMs. The core principle is an adaptive allocation of mantissa and exponent bits: precision is highest near the center of the range (exponents close to zero), while the overall dynamic range nearly matches that of FP16. This addresses limitations of static 8-bit floating-point formats and integer quantization schemes (Luo et al., 2024, Zhao et al., 13 Feb 2026, Ye et al., 2 Feb 2026).
1. Bit-Level Structure and Encoding
HiF8 packs its 8 bits into fields of non-uniform width, which distinguishes it from classical IEEE-754 derivatives and block-scaled formats.
- Fields and Mode Selection:
- Sign (S/s): 1 bit (0 for positive, 1 for negative).
- Taper ("Dot") field: 2–4 bits; uniquely prefix-coded to signal five distinct modes (D ∈ {4,3,2,1,0}) or a denormal (DML) mode.
- Exponent (Em): D bits (sign-magnitude, no bias), with D chosen by the dot-code.
- Mantissa (M): the bits remaining after the sign, dot, and exponent fields, which gives 1 to 3 bits in normal modes (see the table below); in DML (denormal), the mantissa functions as an exponent extension.
The general encoding for a normal number is:
$$X = (-1)^{S} \times 2^{E} \times \left(1 + \frac{M}{2^{P}}\right)$$
where $E$ is the decoded signed exponent, and $P$ is the mantissa bit width. For denormals, the mantissa encodes a biased exponent, extending range deeper into small magnitudes:
$$X = (-1)^{S} \times 2^{M - 23}, \quad M \in \{1, \dots, 7\}$$
- Dynamic Range: covers $[2^{-22}, 2^{15}]$, for a total of $38$ exponent binades (versus FP16's $40$).
- Zero/Infinity/NaN: Zero, infinity, and NaN are all encoded, with no dual-encoded (signed) zero. Zero is unique; the largest normal codes yield $\pm\infty$ and NaN.
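As a quick check of the binade counts, taking the HiF8 limits as $2^{-22}$ and $2^{15}$ and counting integer exponents between them gives the figures quoted above. The FP16 limits used here (smallest subnormal $2^{-24}$, largest binade $2^{15}$) are the standard IEEE-754 half-precision values rather than numbers stated in this article.

$$
\begin{aligned}
\text{HiF8:}\quad & E \in [-22,\, 15] \;\Rightarrow\; 15 - (-22) + 1 = 38 \text{ binades},\\
\text{FP16:}\quad & E \in [-24,\, 15] \;\Rightarrow\; 15 - (-24) + 1 = 40 \text{ binades}.
\end{aligned}
$$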
The allocation of exponent and mantissa bits is summarized as follows:
| Mode D | Exponent value (E) | Exponent bits | Mantissa bits | Description |
|---|---|---|---|---|
| DML | [–22, –16] | 0 | 0 (field holds biased exponent) | Denormal |
| 4 | ±[8, 15] | 4 | 1 | Wide range |
| 3 | ±[4, 7] | 3 | 2 | Medium band |
| 2 | ±[2, 3] | 2 | 3 | Fine band |
| 1 | ±1 | 1 | 3 | Center band |
| 0 | 0 | 0 | 3 | Center band |
Mantissa granularity thus increases as exponent magnitude decreases, implementing tapered precision (Luo et al., 2024, Zhao et al., 13 Feb 2026).
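To make the field layout concrete, the following is a minimal decoding sketch in Python. Several details are assumptions drawn from one reading of Luo et al. (2024) rather than from this article: the dot-field prefix codes (`11`, `10`, `01`, `001`, `0001`, `0000`), the implicit leading one in the exponent magnitude, and the specific codes used for NaN and infinity. The helper name `decode_hif8` is hypothetical.

```python
import math


def decode_hif8(byte: int) -> float:
    """Decode one 8-bit HiF8 code (0..255) into a Python float.

    Assumed layout: sign | prefix-coded dot field | D exponent bits | mantissa.
    """
    sign = -1.0 if byte & 0x80 else 1.0
    body = byte & 0x7F  # the 7 bits after the sign

    # Assumed prefix codes for the dot field (longest codes for smallest D).
    if body >> 5 == 0b11:
        d, dot_width = 4, 2
    elif body >> 5 == 0b10:
        d, dot_width = 3, 2
    elif body >> 5 == 0b01:
        d, dot_width = 2, 2
    elif body >> 4 == 0b001:
        d, dot_width = 1, 3
    elif body >> 3 == 0b0001:
        d, dot_width = 0, 4
    else:
        # Dot code 0000: denormal mode, mantissa acts as a biased exponent.
        m = body & 0x7
        if m == 0:
            # Assumption: +0 is zero and the would-be -0 code is NaN.
            return 0.0 if sign > 0 else math.nan
        return sign * 2.0 ** (m - 23)

    mant_bits = 7 - dot_width - d
    mant = body & ((1 << mant_bits) - 1)
    exp_field = (body >> mant_bits) & ((1 << d) - 1)

    if d == 0:
        exp = 0
    else:
        # Sign-magnitude exponent; the magnitude has an implicit leading 1.
        exp_sign = -1 if exp_field >> (d - 1) else 1
        magnitude = (1 << (d - 1)) | (exp_field & ((1 << (d - 1)) - 1))
        exp = exp_sign * magnitude

    # Assumption: the largest-magnitude code is reserved for infinity.
    if d == 4 and exp == 15 and mant == 1:
        return sign * math.inf

    return sign * 2.0 ** exp * (1 + mant / (1 << mant_bits))
```

Under these assumptions, enumerating all codes from 1 to 127 yields 126 distinct finite positive values spanning $2^{-22}$ to $2^{15}$, with eight values in the binade starting at 1 and only two in the binade starting at 256, which is the tapering described by the table.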
2. Tapered-Precision and Quantization
HiF8's variable-precision mechanism dynamically adjusts the division between exponent and mantissa, allocating more mantissa bits for small exponents where relative error is more impactful.
- Tapered Precision: Fractional bits ($P$) allocated to the mantissa vary as a staircase with respect to the exponent binade $E$. Centered values ($E \approx 0$) have up to 3 mantissa bits, ensuring higher fidelity for most neural network activations and gradients.
- Quantization/Dequantization (as implemented):
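- Given an input value $x \neq 0$:
- Compute exponent $E = \lfloor \log_2 |x| \rfloor$.
- Select mantissa bits $P$ according to $|E|$:
- $P = 3$ if $|E| \le 3$,
- $P = 2$ if $4 \le |E| \le 7$,
- $P = 1$ if $8 \le |E| \le 15$,
- $P = 0$ otherwise.
- Quantize: $q = \operatorname{round}\left(x \cdot 2^{P - E}\right)$
- Dequantize: $\hat{x} = q \cdot 2^{E - P}$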
Notably, for extreme exponents the representation degenerates to a pure power-of-two, maximizing range at the cost of fractional resolution (Luo et al., 2024, Zhao et al., 13 Feb 2026, Ye et al., 2 Feb 2026).
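The quantize/dequantize steps above can be simulated in floating point ("fake quantization"). The sketch below is illustrative only: the staircase thresholds follow the exponent ranges in the Section 1 table, rounding is round-half-away on the magnitude, overflow saturates at $2^{15}$, and the handling of values below the smallest denormal (round to the nearer of 0 and $2^{-22}$) is an assumption. The names `mantissa_bits` and `fake_quantize_hif8` are hypothetical, and only finite inputs are handled.

```python
import math

HIF8_MAX = 2.0 ** 15    # largest finite magnitude
HIF8_MIN = 2.0 ** -22   # smallest denormal magnitude


def mantissa_bits(e: int) -> int:
    """Staircase: mantissa width as a function of the exponent binade."""
    a = abs(e)
    if a <= 3:
        return 3
    if a <= 7:
        return 2
    if a <= 15:
        return 1
    return 0  # denormal region: pure powers of two


def fake_quantize_hif8(x: float) -> float:
    """Quantize then dequantize a finite float onto the HiF8-like grid."""
    if x == 0.0:
        return 0.0
    mag = abs(x)
    if mag < HIF8_MIN:
        # Assumption: round to the nearer of 0 and the smallest denormal.
        return math.copysign(HIF8_MIN, x) if mag >= HIF8_MIN / 2 else 0.0
    e = math.frexp(mag)[1] - 1           # exact floor(log2(mag))
    p = mantissa_bits(e)
    step = 2.0 ** (e - p)                # spacing of grid points in this binade
    q = math.floor(mag / step + 0.5)     # round half away from zero
    return math.copysign(min(q * step, HIF8_MAX), x)
```

For example, `fake_quantize_hif8(1.3)` lands on the 3-bit grid near 1 and returns `1.25`, while `fake_quantize_hif8(300.0)` has only one mantissa bit available and returns `256.0`.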
3. Rounding Modes and Conversion Workflow
Casting from higher-precision floating-point (FP32/FP16/BF16) to HiF8 supports two principal rounding modes for optimal training and inference:
- Round-Half-Away (TA): Nearest rounding with ties rounded away from zero; enables marginally better AI-training accuracy and simpler hardware than round-to-even.
- Hybrid Rounding (HR): TA is applied to one subset of inputs; for all other inputs, a simplified, threshold-based stochastic rounding (using fixed-width thresholds rather than RNGs) approximates 1 ulp or 0.75 ulp accuracy.
- Overflow Handling: Clamp to maximum representable value, with optional NaN-to-zero saturation.

For the forward pass, TA is used exclusively; the backward pass supports both TA and HR, depending on the characteristics of the gradient distribution (Luo et al., 2024).
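The difference between the rounding rules is easiest to see on an integer grid. The following sketch contrasts round-half-away with Python's built-in `round` (which rounds ties to even) and shows a threshold-driven stochastic rounding step. It is not the HR scheme from Luo et al. (2024): the threshold set, its width, and the helper names `round_half_away` and `stochastic_round` are hypothetical and chosen only for illustration.

```python
import math


def round_half_away(v: float) -> int:
    """Round to nearest integer; ties go away from zero."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


def stochastic_round(v: float, threshold: float) -> int:
    """Round up when the fractional part exceeds a threshold in [0, 1).

    Cycling through a fixed set of thresholds stands in for a random number
    generator: averaged over the set, the result approximates v.
    """
    lower = math.floor(v)
    return lower + (1 if (v - lower) > threshold else 0)


# A 3-bit threshold table (hypothetical): 0, 1/8, ..., 7/8.
THRESHOLDS = [k / 8 for k in range(8)]


def mean_stochastic(v: float) -> float:
    """Average of the stochastic roundings of v over the threshold table."""
    return sum(stochastic_round(v, t) for t in THRESHOLDS) / len(THRESHOLDS)
```

With this table, `round_half_away(2.5)` gives `3` whereas `round(2.5)` gives `2`, and `mean_stochastic(2.25)` recovers `2.25`, illustrating why stochastic rounding preserves small gradient contributions on average where deterministic rounding would always discard them.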
4. Training and Inference Protocols
HiF8 is compatible with both traditional deep networks and LLMs. Its protocols mirror established mixed-precision workflows while benefiting from improved dynamic range and quantization adaptability.
A. Standard Deep Network Training:
- All core GEMM inputs (activations, weights, activation gradients) are stored in HiF8.
- Accumulation is performed in FP16.
- All other numeric operations use FP32 or FP16 as appropriate.
- Gradient underflow is prevented via global backward loss-scaling.

B. LLM Training:
- Backward Loss-Scaling (BLS): As above.
- Adaptive Loss-Scaling (ALS): Dynamically adapts loss-scale window for gradient distribution.
- Per-Tensor Scaling (PTS): Power-of-two scaling maintained per GEMM input, updated periodically to ensure optimal coverage of HiF8's range, similar to NVIDIA's Transformer Engine but with reduced update cost due to HiF8's broader exponent coverage.

C. Inference (Post-Training Quantization):
- Direct casting of all tensors to HiF8; per-tensor scaling and SmoothQuant (outlier folding) are applied as needed for LLMs.
- SVDQuant provides additional outlier handling without model retraining.
HiF8’s one-format solution removes the need for dual-precision schemes such as IBM’s HFP8 (Luo et al., 2024, Zhao et al., 13 Feb 2026).
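Per-tensor power-of-two scaling, as mentioned under PTS, can be sketched as follows. This is a minimal illustration, not the procedure from the cited papers: the choice of `target` (the magnitude the tensor's largest element is mapped just below) and the function names are hypothetical. Because the scale is an exact power of two, applying and removing it does not introduce rounding error of its own.

```python
import math


def pow2_scale(amax: float, target: float) -> float:
    """Largest power of two s such that amax * s <= target."""
    if amax <= 0.0:
        return 1.0
    exp = math.frexp(target / amax)[1] - 1   # exact floor(log2(target / amax))
    return math.ldexp(1.0, exp)


def scale_tensor(values, target=8.0):
    """Scale a list of floats so its largest magnitude sits in (target/2, target]."""
    s = pow2_scale(max(abs(v) for v in values), target)
    return [v * s for v in values], s


def unscale_tensor(scaled, s):
    """Undo the scaling after the low-precision operation."""
    return [v / s for v in scaled]
```

A typical flow would be to call `scale_tensor` before casting a GEMM input to HiF8 and to divide the GEMM output by the product of the input scales; recomputing `s` only periodically is what keeps the update cost low.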
5. Comparative Analysis with Alternative 8-bit Formats
HiF8 is systematically compared with integer (INT8), block-scaled (MXFP8), and IEEE-style (E4M3/E5M2) quantization.
| Format | Mantissa (bits) | Exponent (bits) | Max Norm | Min Norm | Levels in |
|---|---|---|---|---|---|
| INT8 | n/a | n/a | ±127 | –128 | 256 uniform |
| E4M3 | 3 | 4 | $1.75 \cdot 2^{8}$ | $2^{-6}$ | 113 log-spaced |
| E5M2 | 2 | 5 | $1.75 \cdot 2^{15}$ | $2^{-14}$ | 89 log-spaced |
| MXFP8 | 3 (block) | 4 | E4M3 equiv. | E4M3 equiv. | 113 per-block |
| HiF8 | 0–3 (dynamic) | 6–3 (dynamic) | $2^{15}$ | $2^{-15}$ | 101 log-spaced |
Key distinctions:
- HiF8 achieves nearly FP16 dynamic range (38 vs 40 binades) and preserves up to 3 mantissa bits in the most frequently occupied magnitude bands.
- INT8 achieves the highest SQNR on narrow, static weights but is inferior for activations/KV-cache with high variance and outliers, where HiF8's combination of log spacing and dynamic range yields superior performance.
- For end-to-end low-bit inference tasks (W8A8+KV8), HiF8 avoids catastrophic failures observed with static log-FP8s and achieves a 0.3–0.5% average accuracy advantage (Zhao et al., 13 Feb 2026, Luo et al., 2024).
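The SQNR trade-off between uniform INT8 and a tapered log-spaced grid can be reproduced on toy data. The sketch below is not the evaluation from Zhao et al. (2026): the data sets are synthetic, INT8 uses symmetric per-tensor scaling, the tapered quantizer is a simplified HiF8-like staircase without range clamping, and all function names are hypothetical.

```python
import math


def int8_fake_quant(xs):
    """Symmetric per-tensor INT8: one uniform step sized by the largest magnitude."""
    scale = max(abs(x) for x in xs) / 127.0
    return [max(-127, min(127, round(x / scale))) * scale for x in xs]


def tapered_fake_quant(xs):
    """HiF8-like staircase: 3/2/1/0 mantissa bits by |E| (range clamping omitted)."""
    out = []
    for x in xs:
        if x == 0.0:
            out.append(0.0)
            continue
        e = math.frexp(abs(x))[1] - 1
        a = abs(e)
        p = 3 if a <= 3 else 2 if a <= 7 else 1 if a <= 15 else 0
        step = 2.0 ** (e - p)
        out.append(math.copysign(math.floor(abs(x) / step + 0.5) * step, x))
    return out


def sqnr_db(ref, approx):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = sum(r * r for r in ref)
    noise = sum((r - a) ** 2 for r, a in zip(ref, approx))
    return 10.0 * math.log10(signal / noise)


# Narrow, well-behaved data: evenly spread over [-1, 1].
narrow = [k / 1000.0 for k in range(-1000, 1001)]
# Small-magnitude bulk plus a single large outlier sharing one tensor.
bulk = [k * 1e-5 for k in range(-1000, 1001)]
with_outlier = bulk + [100.0]
n = len(bulk)

int8_narrow = sqnr_db(narrow, int8_fake_quant(narrow))
hif8_narrow = sqnr_db(narrow, tapered_fake_quant(narrow))
# SQNR measured on the bulk only, after quantizing the tensor with the outlier.
int8_bulk = sqnr_db(bulk, int8_fake_quant(with_outlier)[:n])
hif8_bulk = sqnr_db(bulk, tapered_fake_quant(with_outlier)[:n])
```

On the narrow data the uniform INT8 grid gives the higher SQNR. Once the outlier sets the INT8 step, every bulk element rounds to zero and the bulk SQNR collapses to 0 dB, while the tapered grid keeps the bulk resolvable because its step size follows each value's own magnitude.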
6. Hardware Integration and Empirical Performance
HiF8 is natively hardware-optimized for Ascend NPUs and supports direct drop-in replacement for FP8/INT8 FPGA and ASIC inference pipelines.
- Pipeline Integration: HiF8 quantize/dequantize logic is mapped to dedicated "HIF8_Q"/"HIF8_DQ" compiler-supported instructions. Post-training quantization workflows remain unchanged apart from kernel invocation.
- Efficiency: Throughput improves relative to BF16 and the memory footprint shrinks. Latency matches optimized INT8 kernels, with no added software complexity (Zhao et al., 13 Feb 2026).
- Specialized Softmax: In BAPS attention (Ye et al., 2 Feb 2026), softmax exponentiation using HiF8 halves on-chip data bandwidth and reduces the area of the floating-point exponentiation unit, with accuracy loss measured across LLM and multimodal tasks. The average rate of API-driven restarts for FP32 recomputation is also reported.
7. Empirical Results Across Neural Architectures
- Vision/NLP:
- Training and convergence curves for ImageNet, COCO, WMT, and MRPC overlap between HiF8 and FP16, with final metric deviation ≤0.4%.
- Inference PTQ with per-tensor scaling yields ≤0.5% top-1 drop for ResNets/ViTs.
- LLMs:
- Training PPL (WikiText2): HiF8 and FP16 perplexities stay within a narrow range of each other.
- With ALS+PTS, HiF8 occasionally exceeds FP16 (e.g., GPT-3 6.7B: 12.99 HiF8 vs 13.06 FP16).
- Inference PPL (LLaMA-7B, direct-cast): +0.06 loss; with PTS or SmoothQuant: ≤+0.02. OPT-7B is only viable with PTS or SmoothQuant due to outlier distribution; otherwise, loss is catastrophic (+1.60).
- Softmax with Block-Aware Precision Rescaling:
- LLMs: <1 percentage-point drop on typical NLP benchmarks (Qwen-3 30B, Llama-3 8B).
- Multimodal models: Similarity, SSIM, and PSNR metrics remain high, MSE increment under 15%.
- Throughput: End-to-end inference throughput doubles without chip area penalty (Ye et al., 2 Feb 2026, Luo et al., 2024).
References
- Ascend HiFloat8 Format for Deep Learning (Luo et al., 2024)
- Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats (Zhao et al., 13 Feb 2026)
- BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling (Ye et al., 2 Feb 2026)