An Examination of UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
The paper "UniLMv2: Pseudo-Masked LLMs for Unified LLM Pre-Training" by Hangbo Bao et al. presents a novel approach for pre-training a unified LLM suitable for both natural language understanding (NLU) and generation (NLG) tasks. The proposed model, termed UniLMv2, introduces a pseudo-masked LLM (PMLM) which effectively harnesses both autoencoding (AE) and partially autoregressive (PAR) LLMing tasks in a single framework.
Key Contributions
The central innovation of this work lies in the pseudo-masked language model (PMLM) training procedure. The method conceptually bridges the autoencoding approach of models like BERT and the autoregressive approach of models like GPT by integrating a partially autoregressive factorization into the pre-training process. The essence of PMLM is the use of pseudo masks, which let the model learn long-distance dependencies efficiently without the redundant computation of encoding the same context separately for the autoencoding and autoregressive objectives; an illustrative sketch of the input construction follows.
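To make the pseudo-mask idea concrete, here is a minimal sketch of how such an input might be assembled. It uses the paper's [M]/[P] notation, but the function name, the ordering of the appended tokens, and the toy example are illustrative assumptions rather than the authors' exact implementation, and the self-attention masks that control which view sees which tokens are omitted.

```python
MASK, PSEUDO = "[M]", "[P]"

def build_pmlm_input(tokens, masked_spans):
    """Assemble one pseudo-masked training input (toy version).

    Masked positions in the original sequence are replaced by [M] tokens,
    which serve the autoencoding view. For each masked span, [P] pseudo
    tokens and the original tokens are appended, reusing the *same*
    position ids as the span they stand for, so a block can be predicted
    (partially) autoregressively in the same forward pass without
    re-encoding the context.
    """
    masked = {i for start, end in masked_spans for i in range(start, end)}

    # Corrupted view: context tokens kept, masked positions replaced by [M].
    input_tokens = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    position_ids = list(range(len(tokens)))

    # Appended pseudo masks and originals, sharing the masked positions' ids.
    for start, end in masked_spans:
        input_tokens += [PSEUDO] * (end - start) + tokens[start:end]
        position_ids += list(range(start, end)) * 2

    return input_tokens, position_ids


tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
print(build_pmlm_input(tokens, masked_spans=[(1, 2), (3, 5)]))
# (['x1', '[M]', 'x3', '[M]', '[M]', 'x6', '[P]', 'x2', '[P]', '[P]', 'x4', 'x5'],
#  [0, 1, 2, 3, 4, 5, 1, 1, 3, 4, 3, 4])
```

Because the pseudo tokens share position embeddings with the spans they stand for, one encoding of the context can serve both the AE prediction (via [M]) and the block-by-block PAR prediction (via [P]), which is the source of the efficiency claim.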
Robust Methodology
One of the strengths of this paper is the detailed comparison of different pre-training objectives: autoencoding, autoregressive, and partially autoregressive modeling. The authors show how jointly training the AE and PAR tasks yields a more comprehensive learning signal, capturing both the relations between masked tokens and their context and the relations among the masked tokens themselves. By using blockwise masking and a block-level factorization order (sketched below), the PMLM is structured to improve long-range dependency learning, a notable advance over token-by-token autoregressive prediction, which tends to rely heavily on local context.
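As a rough illustration of blockwise masking and the block-level factorization order, consider the sketch below. The 15% masking budget and the preference for multi-token blocks reflect the paper's general description, but the exact block-length range, the overlap handling, and the use of a simple shuffle to stand in for the sampled factorization order are simplifying assumptions.

```python
import random

def sample_masked_blocks(seq_len, mask_ratio=0.15, block_prob=0.4, max_block=5):
    """Sample non-overlapping masked blocks covering roughly mask_ratio tokens.

    With probability block_prob a whole span of 2..max_block tokens is masked
    as one block; otherwise a single token is masked. Shuffling the blocks at
    the end stands in for sampling a factorization order: blocks are predicted
    one after another, while tokens inside a block are predicted together.
    """
    budget = max(1, int(seq_len * mask_ratio))
    blocks, covered, attempts = [], set(), 0
    while budget > 0 and attempts < 10 * seq_len:
        attempts += 1
        length = random.randint(2, max_block) if random.random() < block_prob else 1
        length = min(length, budget)
        start = random.randrange(seq_len - length + 1)
        span = set(range(start, start + length))
        if span & covered:
            continue  # re-sample instead of overlapping an existing block
        covered |= span
        blocks.append((start, start + length))
        budget -= length
    random.shuffle(blocks)  # random block-level factorization order for PAR
    return blocks


print(sample_masked_blocks(seq_len=128))
```

Predicting whole blocks conditioned on distant context, rather than one next token at a time, is what pushes the model toward the long-range dependencies the authors emphasize.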
Experimental Validation
The empirical results are extensive and show that UniLMv2 achieves state-of-the-art performance on a variety of benchmarks. On extractive question answering with SQuAD 1.1 and 2.0 and on the General Language Understanding Evaluation (GLUE) benchmark, UniLMv2 outperforms comparable models such as BERT, XLNet, and RoBERTa. On generation tasks, including abstractive summarization on CNN/DailyMail and XSum as well as question generation, UniLMv2 also outperforms or matches competing models. These results underscore the effectiveness of pseudo-masked pre-training for both understanding and generation.
Practical and Theoretical Implications
Practically, this research presents a methodological advance that could improve both the efficiency and the performance of future pre-trained language models in industry applications and academic research. Theoretically, it encourages further exploration of unified modeling, opening avenues for new pre-training strategies that leverage both the AE and PAR paradigms.
Speculations for Future Work
The promising results of UniLMv2 suggest several directions for future research. Further exploration of the masking strategy and the factorization order could yield additional improvements. Investigating how different pre-training objectives interact within a unified model might also reveal deeper insights into language modeling itself.
Conclusion
In conclusion, "UniLMv2: Pseudo-Masked LLMs for Unified LLM Pre-Training" offers a valuable contribution to the growing field of LLM pre-training by demonstrating that a unified approach utilizing pseudo-masking can achieve new levels of efficiency and effectiveness. The results of this work provide a compelling case for the continued exploration of hybrid modeling strategies that employ both autoencoding and autoregressive techniques for comprehensive LLM training.