GLM: General Language Model Pretraining with Autoregressive Blank Infilling
The paper presents a novel approach to pretraining language models: the General Language Model (GLM), based on autoregressive blank infilling. The method aims to overcome the limitations of existing pretraining frameworks, which include autoencoding models like BERT, autoregressive models like GPT, and encoder-decoder models like T5. None of these frameworks performs best across all of the major families of NLP tasks: natural language understanding (NLU), unconditional generation, and conditional generation.
The proposed GLM framework is pretrained by autoregressively filling in continuous spans of text that have been blanked out of an input sequence, and it combines this objective with span shuffling and 2D positional encoding. By adopting these techniques, GLM aims to perform well across a broad spectrum of applications spanning both NLU and text generation.
Key Contributions and Methodology
- Autoregressive Blank Infilling:
- GLM is designed to fill in randomly masked continuous spans of text in an input sequence.
- It predicts the masked spans autoregressively, one span at a time, while 2D positional encoding keeps track of both inter- and intra-span positions (a minimal sketch of this corruption step follows the list below).
- Multi-Task Pretraining:
- Versatility comes from mixing pretraining objectives that vary the length and number of masked spans.
- A document-level objective (a single long masked span) targets long text generation, while a sentence-level objective (masking full sentences) targets sequence-to-sequence tasks.
- 2D Positional Encoding:
- Each token is assigned two positional IDs to capture its position in the corrupted text and its position within the masked span.
- This positional encoding approach ensures that the model can effectively handle variable-length spans without prior knowledge of their lengths.
- Unified Encoder and Decoder:
- GLM employs a single Transformer whose self-attention mask makes it act as both encoder and decoder: the corrupted text (Part A) attends to itself bidirectionally, while the tokens inside the blanks (Part B) are generated autoregressively, attending to Part A and to the preceding Part B tokens (see the attention-mask sketch after this list).
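To make the corruption step concrete, here is a minimal Python sketch of how an input sequence could be turned into Part A (the text with blanks) plus a shuffled Part B (the spans to predict), together with the two positional id sequences. It is an illustration under stated assumptions rather than the authors' implementation: the function name, the special-token strings, and the `(start, end)` span representation are invented for this example, and span sampling (Poisson-distributed span lengths for the base objective, one long span for the document-level objective, full sentences for the sentence-level objective) is assumed to happen upstream.

```python
import random

# Illustrative special-token strings; the real vocabulary entries will differ.
MASK, START = "[MASK]", "[S]"

def corrupt_for_blank_infilling(tokens, spans):
    """Build a GLM-style training input from `tokens` and a list of
    (start, end) spans to blank out.

    Returns the input token list, the first positional ids (index in the
    corrupted text), the second positional ids (index inside a span, 0 for
    Part A tokens), and the length of Part A.
    """
    spans = sorted(spans)

    # Part A: the original text with each chosen span replaced by one [MASK].
    part_a, cursor = [], 0
    for start, end in spans:
        part_a += tokens[cursor:start] + [MASK]
        cursor = end
    part_a += tokens[cursor:]

    # Position of each span's [MASK] placeholder in Part A, in span order.
    mask_positions = [i for i, tok in enumerate(part_a) if tok == MASK]

    # Part B: the blanked-out spans themselves, in shuffled order so the
    # model cannot rely on a fixed left-to-right fill order. Each span is
    # fed as [S] followed by its tokens; the training target would be the
    # span tokens followed by an end marker (omitted here).
    order = list(range(len(spans)))
    random.shuffle(order)

    input_tokens = list(part_a)
    pos_1 = list(range(len(part_a)))   # first id: position in corrupted text
    pos_2 = [0] * len(part_a)          # second id: 0 everywhere in Part A

    for k in order:
        start, end = spans[k]
        span_tokens = [START] + tokens[start:end]
        input_tokens += span_tokens
        # Every token of a span shares the position of its [MASK] in Part A...
        pos_1 += [mask_positions[k]] * len(span_tokens)
        # ...and counts 1, 2, 3, ... inside the span, so the model never needs
        # to know a span's length in advance.
        pos_2 += list(range(1, len(span_tokens) + 1))

    return input_tokens, pos_1, pos_2, len(part_a)


if __name__ == "__main__":
    tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
    inp, pos1, pos2, len_a = corrupt_for_blank_infilling(tokens, [(2, 3), (4, 6)])
    print(inp)   # e.g. ['x1', 'x2', '[MASK]', 'x4', '[MASK]', '[S]', 'x5', 'x6', '[S]', 'x3']
    print(pos1)  # e.g. [0, 1, 2, 3, 4, 4, 4, 4, 2, 2]
    print(pos2)  # e.g. [0, 0, 0, 0, 0, 1, 2, 3, 1, 2]
```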
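The unified encoder-decoder behavior likewise comes down to a single self-attention mask over the concatenated sequence rather than two separate networks. The sketch below only illustrates the masking rule described in the paper: Part A positions attend to each other bidirectionally, while Part B positions attend to all of Part A and, causally, to the Part B tokens before them; the function name is again an illustrative choice.

```python
def blank_infilling_attention_mask(len_a, len_b):
    """Self-attention mask for a sequence laid out as Part A followed by
    Part B. mask[i][j] == 1 means query position i may attend to key j."""
    n = len_a + len_b
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < len_a:
                # Every position sees the corrupted context, so Part A is
                # encoded bidirectionally (like an encoder).
                mask[i][j] = 1
            elif i >= len_a and j <= i:
                # Part B positions also see themselves and the Part B tokens
                # before them (causal, like a decoder).
                mask[i][j] = 1
    return mask
```

Because the spans in Part B are shuffled during corruption, this one mask gives the model bidirectional context for cloze-style prediction and left-to-right generation for filling blanks of arbitrary length, which is what allows a single pretrained model to serve both NLU and generation tasks.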
Empirical Results
The paper provides extensive experimental results demonstrating GLM's performance on several tasks. The most notable results include:
- SuperGLUE Benchmark:
- GLM-Base outperforms BERT-Base by 4.6% on average.
- GLM-Large surpasses BERT-Large by 5.0%, indicating significant gains on NLU tasks.
- GLM is also competitive with, and in some cases better than, state-of-the-art models such as T5, BART, and RoBERTa, while using fewer parameters.
- Sequence-to-Sequence Tasks:
- On tasks such as Gigaword summarization and SQuAD question generation, GLM matches or exceeds existing models like MASS and UniLM.
- Text Infilling:
- On the Yahoo Answers dataset, GLM significantly outperforms prior methods such as BERT and BLM, showcasing its effectiveness in generating coherent text spans.
- Language Modeling:
- GLM shows strong zero-shot results, both on language modeling perplexity and on LAMBADA accuracy, further validating its capacity to model long-range dependencies.
Implications and Future Developments
The introduction of GLM has several important implications:
- Unified Pretraining Framework: By effectively combining autoencoding and autoregressive objectives, GLM presents a versatile solution that can be pretrained once and applied across various downstream tasks, simplifying the model deployment process.
- Enhanced Flexibility: The use of 2D positional encoding and span shuffling provides GLM with the capability to handle arbitrary lengths and numbers of masked spans, which is crucial for practical applications involving variable-length text input and output.
- Parameter Efficiency: GLM achieves high performance with fewer parameters than models like T5, highlighting its efficiency.
Looking ahead, future research could explore further fine-tuning techniques and additional pretraining objectives to enhance model performance. GLM's autoregressive blank infilling mechanism also opens doors to potential improvements in areas like interactive text generation and dynamic knowledge integration.
By demonstrating significant improvements across multiple benchmarks and tasks, the GLM framework represents a noteworthy advancement in the development of general pretraining methods for LLMs. It sets a precedent for future models to integrate the strengths of different pretraining objectives into a single unified framework.