HiFloat4 Format for Language Model Pre-training on Ascend NPUs

Published 9 Apr 2026 in cs.LG, cs.AI, and cs.CL | (2604.08826v1)

Abstract: Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats--such as MXFP4 and NVFP4--can be successfully applied to linear GEMM operations in LLMs, achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.

Summary

  • The paper demonstrates that HiFloat4 significantly reduces computation and storage requirements in LLM pre-training with less than 1.5% loss relative to BF16.
  • It employs a hierarchical scaling design tailored to Ascend NPU architecture, enabling efficient dot-product execution with reduced stabilization overhead.
  • Empirical results confirm that HiFloat4 maintains a sub-1% relative error in large models like Llama3-8B and Qwen3-MoE-30B, proving its robustness across architectures.

HiFloat4 Format for Efficient LLM Pre-training on Ascend NPUs

Introduction

The computational and memory demands imposed by large-scale LLM pre-training create immediate pressures for efficient hardware utilization and aggressive numerical compression. This work systematically evaluates the HiFloat4 (HiF4) numerical format, a 4-bit floating-point representation customized for Huawei Ascend NPUs, and benchmarks it against MXFP4, the Open Compute Project's de facto FP4 standard. The study focuses on quantization strategies, stabilization mechanisms, and hardware alignment, reporting results from experiments with both dense Transformer-based LLMs and Mixture-of-Experts (MoE) architectures.

Strong empirical evidence is presented for the effectiveness of HiF4: approximately 90% of both computation and storage can be kept in FP4 with less than 1.5% loss degradation relative to BF16 pre-training, while requiring less stabilization overhead than MXFP4. Key insights include the criticality of format-specific stabilization and the influence of the scaling architecture on overall training dynamics and final accuracy.

FP4 Format Architecture and Stabilization Approaches

Format Synopsis and Hardware Mapping

MXFP4 and HiF4 differ fundamentally in their scaling methodology and metadata allocation. MXFP4 employs private E2M1 element coding within 32-element blocks and a shared E8M0 scale; HiF4 implements a hierarchical three-level scaling design, with a coarse E6M2 top-level scaler, two further micro-exponent modulations, and S1P2 data-element encoding across 64-element blocks. This hierarchy improves dynamic-range adaptation and local representational flexibility at the cost of a slight increase in metadata (Figure 1).

Figure 1: Structural contrast between MXFP4 (compact block scaling, minimal metadata) and HiF4 (hierarchical multi-level scaling, wider block size, and augmented metadata for improved dynamic range).
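
To make the block-scaling idea concrete, the following sketch quantizes a tensor into 32-element, MXFP4-style blocks with a shared power-of-two scale and E2M1 element values. It is an illustrative approximation rather than the vendor implementation: the scale-selection rule and the use of a float-valued grid (instead of packed 4-bit codes) are assumptions.

```python
# Illustrative sketch (assumptions, not the vendor implementation): MXFP4-style
# quantization of 32-element blocks with a shared power-of-two (E8M0-like) scale
# and E2M1 element values.
import numpy as np

# Non-negative values representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x, block=32):
    """Quantize x (size divisible by `block`) to a shared-scale E2M1 grid per block."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared per-block scale: a power of two chosen so the block maximum falls
    # inside the E2M1 range (top value 6.0). Only the exponent would be stored.
    exp = np.ceil(np.log2(np.maximum(amax, 1e-30) / E2M1_GRID[-1]))
    scale = 2.0 ** exp
    # Round each |x|/scale to the nearest E2M1 grid value and restore the sign.
    mag = np.minimum(np.abs(x) / scale, E2M1_GRID[-1])
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(x) * E2M1_GRID[idx] * scale

x = np.random.randn(4, 32).astype(np.float32)
x_q = quantize_mxfp4_block(x)
print("mean |x - q(x)|:", float(np.abs(x - x_q).mean()))
```

By the description above, HiF4 would instead span 64-element blocks and layer micro-exponent adjustments on top of a coarse E6M2 scaler, trading a little extra metadata for finer local range adaptation.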

Alignment with the Ascend NPU architecture is strong: the HiF4 block size is tailored to the 256-bit-wide input of the Cube Unit's processing element, permitting direct dot-product execution with minimal format-conversion or reduction overhead.
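
A quick sanity check of this alignment, using the 64-element block size and 4-bit element width stated above: 64 elements × 4 bits per element = 256 bits of element payload per block, matching the stated 256-bit input width of a Cube Unit processing element, so one HiF4 block can feed one dot-product operand without repacking.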

Stabilization Techniques and Operator Placement

Both formats employ aggressive quantization throughout, but stabilization is achieved via distinct strategies. MXFP4 requires truncation-free block scaling, stochastic rounding (SR), and random Hadamard transforms (RHT) applied during weight-gradient computation to mitigate quantization bias, outlier misrepresentation, and value truncation. For HiF4, the hierarchical scaling design is inherently more stable: training converges with only the RHT applied to weight gradients, and nearest rounding (NR) suffices for gradients, a notable reduction in algorithmic overhead (Figure 2).

Figure 2: Operational flow for quantized GEMMs highlighting the quantization mode and RHT/SR placements across both forward and backward passes, specific to each FP4 format.
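
As an illustration of the randomized Hadamard transform referenced above, the sketch below applies random sign flips followed by a fast Walsh-Hadamard transform along the block dimension of a toy weight-gradient tile before quantization. The tile shape, block size, and numpy implementation are assumptions for illustration, not the on-device placement described in Figure 2.

```python
# Illustrative sketch (assumptions, not the on-device implementation): a random
# Hadamard transform spreads outliers across a block before FP4 quantization,
# so no single element dominates the shared block scale.
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a power of two)."""
    n = x.shape[-1]
    y = x.copy()
    h = 1
    while h < n:
        y = y.reshape(*y.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = np.concatenate([a + b, a - b], axis=-1).reshape(*x.shape[:-1], n)
        h *= 2
    return y

def random_hadamard_transform(x, signs):
    """Orthonormal randomized rotation: random sign flips followed by a scaled FWHT."""
    return fwht(x * signs) / np.sqrt(x.shape[-1])

g = rng.standard_normal((8, 64))              # toy weight-gradient tile
g[0, 0] = 50.0                                # inject an outlier
signs = rng.choice([-1.0, 1.0], size=64)      # random signs shared across the tile
g_rot = random_hadamard_transform(g, signs)

# The rotation preserves norms but flattens the outlier, which helps block scaling.
print("norm preserved:", np.allclose(np.linalg.norm(g), np.linalg.norm(g_rot)))
print("max |g| before/after:", np.abs(g).max(), np.abs(g_rot).max())
```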

Analysis of Training Dynamics and Empirical Results

Model Coverage and Experimental Setup

Experiments were conducted with three architectures:

  • Dense Transformers: OpenPangu-1B and Llama3-8B.
  • Mixture-of-Experts: Qwen3-MoE-30B.

The range of models ensures that findings are robust across both dense and sparsely activated large-scale systems. The fraction of computation performed in FP4 is consistently high, reaching up to 96% in the MoE setting.

Training Loss Evolution and Scaling Effects

HiF4 consistently achieves a smaller loss gap than MXFP4 across all model scales, with the difference between the two formats narrowing at larger scales. For Llama3-8B and Qwen3-MoE-30B, the relative error for HiF4 remains below 1%, confirming its efficacy in large models (Figure 3).

Figure 3: Loss curves for HiF4 and MXFP4 versus BF16 baseline—relative loss gap narrows with increasing model size, and HiF4 remains systematically closer to full-precision performance.

Ablation on Stabilization Components

Analysis of individual and combined stabilization strategies shows that MXFP4 relies on all three techniques (RHT, SR, and truncation-free scaling) to keep the loss gap below 1.5%. HiF4, in contrast, is largely unaffected by SR, and its optimal configuration uses only RHT for weight gradients. This asymmetry is traced to HiF4's scaling architecture, which circumvents block-level value truncation and reduces block-wise representational collapse.
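
A toy numeric check helps explain why SR matters under coarse quantization: when many small, same-signed contributions fall below half the quantization step, nearest rounding collapses them to zero and the accumulated sum is systematically underestimated, whereas stochastic rounding is unbiased in expectation. The grid step and values below are illustrative assumptions, not the paper's configuration.

```python
# Toy demonstration (illustrative values only): bias of nearest rounding (NR)
# versus stochastic rounding (SR) on a coarse grid standing in for FP4 levels.
import numpy as np

rng = np.random.default_rng(1)
step = 0.5                                  # coarse quantization step
g = np.full(100_000, 0.1)                   # many small, same-signed contributions

nr = np.round(g / step) * step              # NR: every 0.1 rounds down to 0.0
frac = g / step - np.floor(g / step)        # fractional position within the grid cell
sr = (np.floor(g / step) + (rng.random(g.shape) < frac)) * step

print("true sum:", g.sum())                 # 10000.0
print("NR sum:  ", nr.sum())                # 0.0  -> systematic underestimate
print("SR sum:  ", sr.sum())                # ~10000.0 on average
```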

Practical and Theoretical Implications

Hardware Alignment

HiF4’s design parallels the width and data layout of Ascend NPU’s Cube Units, optimizing compute and memory traffic on domain-specific processors. This stands in contrast with NVFP4 and FP8, whose block sizes and scaling factors map less naturally to specialized hardware, resulting in higher energy and area costs.

Impact for Dense and Sparse Architectures

The demonstrated ability to train both dense and MoE models stably in near-end-to-end FP4 is immediately relevant to efficiency-constrained scenarios such as edge deployment and training at scale under power and cost restrictions. Because MoE models carry a large footprint of parameters that are idle at any given step, the storage benefits of FP4 are amplified even though inactive expert branches contribute no compute.

Future Trajectories

The robust results for FP4 pre-training point to natural next research directions: extending ultra-low-precision schemes to reinforcement learning (e.g., RLHF, GRPO), where optimization is highly sensitive to quantization noise, and examining stability in long-context or multimodal training regimes, where cross-modal variance and activation heterogeneity may present further challenges.

Conclusion

This study establishes HiF4 as a highly competitive FP4 format for LLM pre-training, especially on Ascend NPU platforms. HiF4’s hierarchical scaling inherently stabilizes optimization and obviates much of the overhead necessary for other FP4 formats. As model scales increase, HiF4 maintains minimal loss degradation, making aggressive low-precision quantization an increasingly tenable strategy for efficient large model training. These findings position hardware-tailored numerical formats combined with format-aware stabilization as key to practical, energy-efficient foundation model development.

