GLM-130B: An Open Bilingual Pre-trained Model (2210.02414v2)

Published 5 Oct 2022 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce GLM-130B, a bilingual (English and Chinese) pre-trained LLM with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese LLM -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.

Overview of GLM-130B: An Open Bilingual Pre-Trained Model

The paper introduces GLM-130B, a bilingual (English and Chinese) LLM with 130 billion parameters. The authors address both the technical and engineering challenges faced during its development, providing a comprehensive overview of the design decisions, training strategies, and evaluation methodologies. This model aims to rival existing large-scale models such as GPT-3, OPT-175B, and BLOOM-176B while offering additional support for Chinese.

Model Design

GLM-130B utilizes the bidirectional General Language Model (GLM) architecture, diverging from the unidirectional GPT-style decoders used by most 100B-scale models. Its autoregressive blank-infilling objective applies bidirectional attention over the unmasked context while generating masked spans autoregressively, and it uses two mask tokens, [MASK] for short in-sentence spans and [gMASK] for long spans at the end of a sequence, to distinguish the two settings. For training stability and performance, the authors adopt DeepNorm-based Post-LN and Rotary Positional Encoding (RoPE).
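To make the blank-infilling attention pattern concrete, the sketch below builds the kind of attention mask GLM uses: the context (including the [MASK]/[gMASK] placeholders) attends to itself bidirectionally, while the span being filled in attends to the full context and causally to earlier generated tokens. This is an illustrative reconstruction, not the released implementation; the function name and tensor layout are our own.

```python
import torch

def glm_attention_mask(context_len: int, span_len: int) -> torch.Tensor:
    """Boolean attention mask for autoregressive blank infilling (sketch).

    True means "query position (row) may attend to key position (column)".
    Positions [0, context_len) are the corrupted context; positions
    [context_len, context_len + span_len) are the span being generated.
    """
    total = context_len + span_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Context attends bidirectionally within itself.
    mask[:context_len, :context_len] = True
    # Generated tokens see the whole context...
    mask[context_len:, :context_len] = True
    # ...and earlier generated tokens only (causal within the span).
    mask[context_len:, context_len:] = torch.tril(
        torch.ones(span_len, span_len)
    ).bool()
    return mask
```

For example, `glm_attention_mask(5, 3)` yields an 8×8 mask whose top-left 5×5 block is fully True (bidirectional context) and whose bottom-right 3×3 block is lower-triangular (causal generation).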

Training Methodology

The model was trained on a diverse corpus that includes 1.2T English tokens from the Pile and 1.0T Chinese tokens from WudaoCorpora, among other sources. Training combined the self-supervised autoregressive blank-infilling objective with Multi-task Instruction Pre-training (MIP), in which 5% of the tokens come from 74 prompted datasets spanning natural language inference, closed-book QA, and text summarization. To stabilize training, the authors shrink the gradient flowing into the embedding layer, suppressing the gradient spikes that otherwise destabilize early training.
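The embedding gradient shrink described in the paper can be written as a one-line reparameterization that leaves the forward pass unchanged but scales the gradient reaching the embedding by a factor α (the paper reports α = 0.1). A minimal PyTorch sketch, with an illustrative function name of our own:

```python
import torch

def shrink_embedding_gradient(word_embedding: torch.Tensor,
                              alpha: float = 0.1) -> torch.Tensor:
    """Forward value is identical to `word_embedding`, but only a fraction
    `alpha` of the gradient flows back into the embedding parameters; the
    detached term contributes no gradient."""
    return word_embedding * alpha + word_embedding.detach() * (1.0 - alpha)
```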

Performance and Evaluation

GLM-130B consistently outperformed GPT-3 175B, OPT-175B, and BLOOM-176B across a wide range of English benchmarks, with notable improvements on MMLU and BIG-bench-lite. On LAMBADA, it reached 80.2% zero-shot accuracy, surpassing both GPT-3 and PaLM 540B.

On Chinese benchmarks, GLM-130B also excelled, outperforming ERNIE TITAN 3.0 260B, the largest Chinese LLM, by significant margins on the CLUE and FewCLUE datasets. This demonstrates GLM-130B's robust bilingual capabilities.

Quantization and Inference Efficiency

A noteworthy aspect of GLM-130B is its application of INT4 weight quantization, a first for 100B-scale models. This allows for efficient inference on more accessible GPUs, such as 4 × RTX 3090 (24G) or 8 × RTX 2080 Ti (11G), significantly lowering the hardware barrier for utilizing large-scale models. The quantization was achieved without post-training and resulted in minimal performance loss.
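As a rough illustration of weight-only INT4 quantization, the sketch below applies symmetric per-row absmax quantization to a linear layer's weight matrix. This is a simplified stand-in: the actual GLM-130B implementation quantizes only the linear-layer weights and relies on custom kernels that pack and dequantize on the fly, which this snippet does not attempt to reproduce.

```python
import torch

def quantize_int4_rowwise(weight: torch.Tensor):
    """Symmetric per-row absmax quantization to the signed 4-bit range [-7, 7].

    Returns (q, scale): `q` holds the 4-bit values in an int8 tensor
    (bit-packing omitted for clarity), `scale` is one scale per output row.
    """
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(weight / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize_int4_rowwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate weight matrix for use in a standard matmul."""
    return q.to(scale.dtype) * scale
```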

Implications and Future Directions

GLM-130B pushes the boundaries of what is achievable in open-source large-scale LLMs. By making the model and its training methodology publicly accessible, the authors address the opacity often associated with proprietary models like GPT-3 and PaLM. This openness can catalyze further advancements in the field, enabling more researchers to experiment with and build upon this model.

The implications of GLM-130B are multifaceted. It sets a precedent for future bilingual models, potentially paving the way for multilingual pre-trained models. The application of INT4 quantization opens new avenues for inference optimization, making high-performance models more accessible.

Future developments may include expanding the range of supported languages and further refining training stability techniques. Additionally, exploring other quantization strategies and optimizing the training process could yield even more efficient models.

Conclusion

GLM-130B represents a significant advancement in the development of large-scale, open-source LLMs. Its robust performance across diverse benchmarks, combined with its bilingual capabilities, makes it a valuable contribution to the field. The practical implications of its accessible training and inference methodologies hold the potential for widespread impact on both academic research and industry applications. GLM-130B underscores the importance of transparency and accessibility in advancing AI research, setting a strong foundation for future explorations in language modeling.

Authors (18)
  1. Aohan Zeng (19 papers)
  2. Xiao Liu (402 papers)
  3. Zhengxiao Du (22 papers)
  4. Zihan Wang (181 papers)
  5. Hanyu Lai (11 papers)
  6. Ming Ding (219 papers)
  7. Zhuoyi Yang (18 papers)
  8. Yifan Xu (92 papers)
  9. Wendi Zheng (12 papers)
  10. Xiao Xia (4 papers)
  11. Weng Lam Tam (8 papers)
  12. Zixuan Ma (6 papers)
  13. Yufei Xue (9 papers)
  14. Jidong Zhai (24 papers)
  15. Wenguang Chen (21 papers)
  16. Peng Zhang (641 papers)
  17. Yuxiao Dong (119 papers)
  18. Jie Tang (302 papers)
Citations (1,024)