Pushing the Limits of Dense LLMs on Ascend NPUs
The paper under discussion presents a dense Transformer of 135 billion parameters trained entirely on Ascend Neural Processing Units (NPUs). The authors address the optimization and training-stability challenges that arise at this parameter scale, most notably through depth-scaled sandwich normalization, which suppresses the loss spikes that commonly destabilize very deep models.
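To make the idea concrete, here is a minimal PyTorch sketch of a sandwich-normalized Transformer sub-layer, i.e., RMSNorm applied both before and after the sub-layer, with the post-norm gain initialized smaller as depth grows. The depth-scaling formula below is an assumption chosen for illustration; the paper defines its own scaling rule.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization with a learnable gain."""
    def __init__(self, dim, init_gain=1.0, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.full((dim,), init_gain))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SandwichBlock(nn.Module):
    """A sub-layer (attention or FFN) wrapped in pre- and post-norm
    ("sandwich norm"). The post-norm gain starts small for deep stacks,
    so each layer's initial residual contribution is damped -- a rough
    stand-in for depth-scaled sandwich norm (exact formula assumed)."""
    def __init__(self, dim, sublayer, num_layers):
        super().__init__()
        self.pre_norm = RMSNorm(dim)
        # Hypothetical depth scaling: gain shrinks with total depth.
        depth_gain = (2.0 * num_layers) ** -0.5
        self.post_norm = RMSNorm(dim, init_gain=depth_gain)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```

The intuition is that damping each layer's initial output keeps the residual stream's scale roughly constant with depth, which is where loss spikes tend to originate.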
Training runs on a cluster of 8,192 Ascend NPUs, where systematic architecture and system-level optimizations deliver competitive efficiency. The authors combine data, tensor, sequence, and pipeline parallelism to keep the cluster well utilized. A key component of the implementation is a virtual pipeline scheduling mechanism that substantially reduces the pipeline bubble ratio, one of the recurring efficiency bottlenecks in large-scale dense-model training.
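The benefit of virtual (interleaved) pipeline stages can be seen from the standard analytical estimate of the 1F1B bubble fraction. The sketch below uses that textbook formula; the stage and microbatch counts are illustrative, not measurements from the paper.

```python
def pipeline_bubble_ratio(num_stages: int, num_microbatches: int,
                          virtual_stages: int = 1) -> float:
    """Approximate bubble fraction of a 1F1B pipeline schedule.

    With interleaved (virtual) pipelining, each device hosts
    `virtual_stages` model chunks, shrinking the warm-up/cool-down
    bubble by that factor (standard analytical estimate).
    """
    bubble = (num_stages - 1) / virtual_stages
    return bubble / (num_microbatches + bubble)

# Example: 8 pipeline stages, 64 microbatches.
print(pipeline_bubble_ratio(8, 64))                    # ~0.099 (plain 1F1B)
print(pipeline_bubble_ratio(8, 64, virtual_stages=4))  # ~0.027 (interleaved)
```

Splitting each device's layers into several virtual chunks lets downstream stages start earlier, which is why the idle fraction drops roughly in proportion to the number of chunks.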
Pre-training uses a diverse corpus of 13.2 trillion tokens, curated for quality and spanning general text, code, and mathematics. Notably, training proceeds in multiple phases that incrementally extend the context window, ultimately supporting long-context comprehension up to 128K tokens, as illustrated in the sketch below.
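The sketch below lays out a hypothetical staged schedule of this kind. Only the 13.2T total token count and the final 128K context length come from the paper; the phase splits and intermediate sequence lengths are invented for illustration.

```python
# Hypothetical staged schedule for context-window extension.
# Phase budgets are chosen to sum to the paper's 13.2T total;
# the splits and intermediate lengths are assumptions.
PHASES = [
    {"name": "main pre-training",    "seq_len": 4_096,   "tokens": 12.0e12},
    {"name": "long-context stage 1", "seq_len": 32_768,  "tokens": 0.8e12},
    {"name": "long-context stage 2", "seq_len": 131_072, "tokens": 0.4e12},
]

for p in PHASES:
    print(f"{p['name']}: seq_len={p['seq_len']:,}, tokens={p['tokens']:.2e}")
```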
Evaluations indicate that the model outperforms dense peers such as Llama 405B and Mistral Large 2 across a range of benchmarks and competes effectively with sparse (mixture-of-experts) architectures such as DeepSeek-R1, achieving state-of-the-art results among dense models. The depth-scaled sandwich norm and tiny-initialization techniques are credited with keeping gradients well behaved throughout training, a prerequisite for sustained performance in very deep networks.
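The "tiny initialization" idea can be sketched as a depth- and width-aware weight initializer that shrinks the usual variance so early updates stay small. The scaling rule below is an assumption chosen to illustrate the principle, not the paper's exact formula.

```python
import math
import torch.nn as nn

def tiny_init_(linear: nn.Linear, hidden_dim: int, num_layers: int) -> None:
    """Hypothetical 'tiny' initializer: shrink a width-based std
    by a depth factor so residual-branch outputs start small.
    The exact scaling rule is assumed, not taken from the paper."""
    std = math.sqrt(2.0 / (5.0 * hidden_dim)) / math.sqrt(2.0 * num_layers)
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```

Like the depth-scaled post-norm gain, the depth factor here limits how much any single layer can perturb the residual stream at the start of training.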
Xu et al.'s work demonstrates that Ascend NPUs can support large-scale dense LLM training with substantial computational efficiency, achieving a Model FLOPs Utilization (MFU) above 50%. The model's reasoning capabilities, refined post-training via supervised fine-tuning and reinforcement learning, further suggest that dense models can match or exceed their sparse counterparts in specific domains.
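MFU is straightforward to estimate from throughput: achieved training FLOPs divided by aggregate hardware peak. The sketch below uses the common ~6N FLOPs-per-token approximation for a dense model; the throughput and per-device peak numbers are placeholders, not figures from the paper.

```python
def model_flops_utilization(tokens_per_sec: float, n_params: float,
                            num_devices: int,
                            peak_flops_per_device: float) -> float:
    """Back-of-envelope MFU: achieved training FLOPs over hardware peak.

    Uses the common ~6*N FLOPs-per-token estimate for a dense model
    (attention FLOPs ignored)."""
    achieved_flops_per_sec = tokens_per_sec * 6.0 * n_params
    return achieved_flops_per_sec / (num_devices * peak_flops_per_device)

# Placeholder numbers for illustration only (not from the paper):
# 135B parameters, 8,192 devices, 300 TFLOP/s peak per device.
print(model_flops_utilization(1.6e6, 135e9, 8192, 3.0e14))  # ~0.53
```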
The advances outlined in this paper have practical implications for deploying capable LLMs on complex tasks across diverse application areas without sacrificing performance. Future work might build on these insights to further explore parameter scaling laws, architecture refinements, and broader adoption of the normalization techniques introduced here to address emerging challenges in training and deploying large models.