Overview of GLM-130B: An Open Bilingual Pre-Trained Model
The paper introduces GLM-130B, a bilingual (English and Chinese) LLM with 130 billion parameters. The authors address both the technical and engineering challenges faced during its development, providing a comprehensive overview of the design decisions, training strategies, and evaluation methodologies. This model aims to rival existing large-scale models such as GPT-3, OPT-175B, and BLOOM-176B while offering additional support for Chinese.
Model Design
GLM-130B utilizes a bidirectional General Language Model (GLM) architecture, diverging from the traditional unidirectional GPT-style architectures. Bidirectional attention over the unmasked context, combined with autoregressive blank infilling of masked spans, allows for robust context comprehension. The model operates with both [MASK] and [gMASK] tokens to differentiate between short and long spans of masked text. The authors adopt DeepNorm-based Post-LN to stabilize training and Rotary Positional Encoding (RoPE) as the position encoding scheme.
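To make the blank-infilling setup concrete, here is a minimal PyTorch sketch of the attention mask it implies, assuming a single masked span placed at the end of the sequence (the [gMASK] case); the function name and shapes are illustrative rather than the authors' implementation. Context tokens attend to each other bidirectionally, while tokens of the span being filled in attend to the full context and, causally, to earlier span tokens.

```python
import torch

def blank_infilling_attention_mask(context_len: int, span_len: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed) for one sequence."""
    total = context_len + span_len
    mask = torch.ones(total, total, dtype=torch.bool)
    # Context tokens are bidirectional among themselves but cannot peek at the span.
    mask[:context_len, context_len:] = False
    # Span tokens see the whole context plus earlier span tokens only (causal).
    mask[context_len:, context_len:] = torch.tril(
        torch.ones(span_len, span_len)
    ).bool()
    return mask

# Example: a 5-token context followed by a 3-token span generated after [gMASK].
print(blank_infilling_attention_mask(5, 3).int())
```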
Training Methodology
The model was trained on a diverse dataset comprising 1.2T English tokens from the Pile and 1.0T Chinese tokens from WudaoCorpora, among other sources. Training combined self-supervised autoregressive blank infilling with Multi-task Instruction Pre-training (MIP), which accounted for 5% of the tokens and drew on 74 prompted datasets ranging from natural language inference and closed-book QA to text summarization. Importantly, the authors incorporated an embedding gradient shrink to mitigate gradient spikes in the embedding layer, a key factor in keeping training stable.
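The embedding gradient shrink can be expressed as a one-line reparameterization of the embedding output that leaves the forward pass unchanged while scaling the gradient reaching the embedding weights by a factor α (the paper reports α = 0.1). The sketch below uses an illustrative function name; it is not the released training code.

```python
import torch

def shrink_embedding_gradient(emb_out: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Keep the forward value of the embedding output but scale its gradient by alpha."""
    # Forward:  alpha * x + (1 - alpha) * x == x
    # Backward: only the first term carries gradient, so it is scaled by alpha.
    return emb_out * alpha + emb_out.detach() * (1.0 - alpha)

# Usage sketch (module names are hypothetical):
# hidden_states = shrink_embedding_gradient(word_embeddings(input_ids), alpha=0.1)
```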
Performance and Evaluation
GLM-130B outperformed OPT-175B and BLOOM-176B across a wide range of benchmarks and was competitive with, and in several cases better than, GPT-3 175B. For instance, on zero-shot LAMBADA it reached 80.2% accuracy, surpassing both GPT-3 and PaLM 540B, and it delivered strong results on MMLU and BIG-bench-lite.
On Chinese benchmarks, GLM-130B also excelled, outperforming the largest existing Chinese LLM, ERNIE TITAN 3.0 260B, by significant margins on the CLUE and FewCLUE datasets. These results demonstrate GLM-130B's robust bilingual capabilities.
Quantization and Inference Efficiency
A noteworthy aspect of GLM-130B is its use of INT4 weight quantization, a first for 100B-scale models. This allows efficient inference on more accessible GPUs, such as 4 × RTX 3090 (24G) or 8 × RTX 2080 Ti (11G), significantly lowering the hardware barrier to using large-scale models. The quantization required no additional quantization-aware training and resulted in almost no performance loss.
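To illustrate what weight-only quantization involves, the following is a minimal sketch of per-row symmetric absmax quantization; it stores values as int8 for simplicity and omits the nibble packing and custom GPU kernels that an actual INT4 deployment relies on, and the function names are illustrative rather than taken from the released code.

```python
import torch

def quantize_weight_int4(w: torch.Tensor):
    """Symmetric absmax quantization of a weight matrix, one scale per output row."""
    qmax = 7  # use the symmetric part of the signed 4-bit range [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate floating-point weight for matmuls at inference time."""
    return q.float() * scale

# Round-trip sanity check on a random matrix.
w = torch.randn(4, 8)
q, s = quantize_weight_int4(w)
print((dequantize_weight(q, s) - w).abs().max())
```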
Implications and Future Directions
GLM-130B pushes the boundaries of what is achievable in open-source large-scale LLMs. By making the model and its training methodology publicly accessible, the authors address the opacity often associated with proprietary models like GPT-3 and PaLM. This openness can catalyze further advancements in the field, enabling more researchers to experiment with and build upon this model.
The implications of GLM-130B are multifaceted. It sets a precedent for future bilingual models, potentially paving the way for multilingual pre-trained models. The application of INT4 quantization opens new avenues for inference optimization, making high-performance models more accessible.
Future developments may include expanding the range of supported languages and further refining training stability techniques. Additionally, exploring other quantization strategies and optimizing the training process could yield even more efficient models.
Conclusion
GLM-130B represents a significant advancement in the development of large-scale, open-source LLMs. Its robust performance across diverse benchmarks, combined with its bilingual capabilities, makes it a valuable contribution to the field. Its accessible training and inference methodologies hold the potential for widespread impact on both academic research and industry applications. GLM-130B underscores the importance of transparency and accessibility in advancing AI research, setting a strong foundation for future work in language modeling.