PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation (2104.12369v1)

Published 26 Apr 2021 in cs.CL

Abstract: Large-scale Pretrained Language Models (PLMs) have become the new paradigm for NLP. PLMs with hundreds of billions of parameters, such as GPT-3, have demonstrated strong performance on natural language understanding and generation with few-shot in-context learning. In this work, we present our practice on training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters. PanGu-$\alpha$ is developed under MindSpore and trained on a cluster of 2048 Ascend 910 AI processors. The training parallelism strategy is implemented based on MindSpore Auto-parallel, which composes five parallelism dimensions to scale the training task to 2048 processors efficiently, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism and rematerialization. To enhance the generalization ability of PanGu-$\alpha$, we collect 1.1TB of high-quality Chinese data from a wide range of domains to pretrain the model. We empirically test the generation ability of PanGu-$\alpha$ in various scenarios including text summarization, question answering, dialogue generation, etc. Moreover, we investigate the effect of model scales on the few-shot performances across a broad range of Chinese NLP tasks. The experimental results demonstrate the superior capabilities of PanGu-$\alpha$ in performing various tasks under few-shot or zero-shot settings.

Overview of PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation

The paper presents PanGu-α, a set of large-scale autoregressive pretrained Chinese LLMs built with an advanced auto-parallel computation framework. Designed with up to 200 billion parameters, PanGu-α showcases significant advancements in NLP, particularly in Chinese language understanding and generation. The work responds to the growing need for sophisticated LLMs that can handle Chinese well, given its complex semantics and syntactic structure.

Model Architecture and Training Strategy

The PanGu-α architecture is predicated on the principles of GPT-like autoregressive models, focusing on efficient training and scaling strategies. The model capitalizes on the MindSpore platform's auto-parallel capabilities, which compose five dimensions of parallelism (data, op-level model, pipeline, optimizer, and rematerialization) to distribute the computational workload across 2048 Ascend 910 AI processors. This parallelism is pivotal for managing the extensive compute and memory demands of training such a large-scale model.
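To make the device-level layout concrete, here is a minimal Python sketch that factorizes a 2048-processor cluster into data-parallel, op-level model-parallel, and pipeline groups and maps a flat device id to its grid coordinates. The 16 × 8 × 16 factorization and the `DeviceCoord` helper are illustrative assumptions, not the configuration reported in the paper; in practice MindSpore Auto-parallel derives and applies this layout (together with optimizer parallelism and rematerialization, which act on memory rather than device grouping) automatically.

```python
# Illustrative sketch: mapping 2048 devices onto a 3-D parallelism grid
# (data parallel x op-level model parallel x pipeline stages).
# The factorization below is assumed for illustration only.

from dataclasses import dataclass

@dataclass
class DeviceCoord:
    data_parallel_rank: int   # which data-parallel replica
    model_parallel_rank: int  # which shard of each operator
    pipeline_stage: int       # which group of layers

def device_coords(device_id: int, dp: int = 16, mp: int = 8, pp: int = 16) -> DeviceCoord:
    """Map a flat device id in [0, dp*mp*pp) to grid coordinates."""
    assert dp * mp * pp == 2048, "grid must cover the whole cluster"
    stage, rest = divmod(device_id, dp * mp)
    dp_rank, mp_rank = divmod(rest, mp)
    return DeviceCoord(dp_rank, mp_rank, stage)

if __name__ == "__main__":
    print(device_coords(0))     # DeviceCoord(0, 0, 0)
    print(device_coords(2047))  # last device: last replica, last shard, final stage
```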

Data Collection and Pretraining

The training of PanGu-α is underpinned by a 1.1TB corpus of high-quality Chinese text. This diverse dataset spans numerous domains, supporting the model's broad applicability and generalization. The corpus's breadth is vital during pretraining: exposure to a wide range of linguistic structures and semantics is what underpins the model's few-shot and zero-shot behavior downstream.
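Corpora at this scale are typically cleaned, filtered, and deduplicated before pretraining. As a rough illustration of one such step, here is a hedged sketch of exact deduplication by content hash combined with a crude length filter; the thresholds and rules are assumptions for illustration, not the authors' actual data pipeline.

```python
# Sketch of exact deduplication by content hash, plus a minimal length filter.
# Thresholds and rules are assumed, not taken from the paper.

import hashlib
from typing import Iterable, Iterator

def dedup_exact(docs: Iterable[str], min_chars: int = 100) -> Iterator[str]:
    """Drop documents that are too short or byte-identical to one already seen."""
    seen: set[bytes] = set()
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars:   # crude length filter (assumed threshold)
            continue
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest in seen:          # exact duplicate of an earlier document
            continue
        seen.add(digest)
        yield text

# Usage: clean_corpus = list(dedup_exact(raw_documents))
```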

Experiments and Results

The paper details empirical evaluations across a spectrum of NLP tasks, including text summarization, question answering, and dialogue generation. The model is benchmarked on its few-shot performance across these tasks, demonstrating superior capabilities compared to previous models. The results underscore the critical role of model scaling in enhancing LLM performance, particularly in low-resource settings, thereby facilitating effective handling of NLP tasks with minimal task-specific supervision.
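To illustrate what "few-shot" means operationally for an autoregressive model like PanGu-α, the sketch below builds a prompt from a handful of labelled demonstrations followed by an unlabelled query; the model's continuation is then read off as the prediction. The toy sentiment task, the prompt template, and the `generate` call are hypothetical, standing in for the task-specific templates the paper uses across its Chinese benchmarks.

```python
# Minimal sketch of few-shot in-context evaluation for an autoregressive LM.
# The prompt format and the toy sentiment task are illustrative assumptions.

def build_few_shot_prompt(examples, query):
    """Concatenate k labelled demonstrations followed by an unlabelled query."""
    lines = [f"评论：{text}\n情感：{label}" for text, label in examples]
    lines.append(f"评论：{query}\n情感：")
    return "\n\n".join(lines)

demos = [
    ("这家餐厅的菜很好吃", "正面"),
    ("物流太慢，包装也坏了", "负面"),
]
prompt = build_few_shot_prompt(demos, "服务态度非常好")
# prediction = generate(model, prompt, max_new_tokens=2)  # hypothetical decoding call
print(prompt)
```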

Implications and Future Directions

The introduction of PanGu-α has important implications for both practical applications and theoretical advancements within AI research. Practically, it offers a powerful toolset for developers working on Chinese NLP applications such as interactive AI, content generation, and language translation. Theoretically, it sets a precedent for future LLM architectures, especially in exploring strategies for large-scale model training and efficient parallelism.

Future research avenues may include exploring the limitations and potential biases inherent in such large-scale models, particularly in diverse linguistic contexts beyond Chinese. Furthermore, the integration of multimodal capabilities and extending model architectures to handle disparate data types could herald significant advancements in AI's interaction capabilities with human language and understanding.

In conclusion, PanGu-α represents a substantial contribution to the field of large-scale LLMs, pairing innovative auto-parallel training strategies with strong few-shot results, marking a milestone in Chinese language modeling and laying a foundation for future research.

Authors (38)
  1. Wei Zeng
  2. Xiaozhe Ren
  3. Teng Su
  4. Hui Wang
  5. Yi Liao
  6. Zhiwei Wang
  7. Xin Jiang
  8. ZhenZhang Yang
  9. Kaisheng Wang
  10. Xiaoda Zhang
  11. Chen Li
  12. Ziyan Gong
  13. Yifan Yao
  14. Xinjing Huang
  15. Jun Wang
  16. Jianfeng Yu
  17. Qi Guo
  18. Yue Yu
  19. Yan Zhang
  20. Jin Wang
Citations (190)