XLNet: Generalized Autoregressive Pretraining for Language Understanding
The paper "XLNet: Generalized Autoregressive Pretraining for Language Understanding," authored by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le, introduces a novel pretraining method in the field of NLP. The primary objective of this method, termed XLNet, is to address the limitations seen in previous pretraining models like BERT while leveraging the distinct advantages of both autoregressive (AR) and autoencoding (AE) LLMs. In this essay, we provide a concise yet detailed overview of the key contributions, methodologies, and implications of this paper.
Background and Motivation
Autoencoding pretrained models such as BERT and autoregressive language models like GPT have driven significant advances across NLP tasks, but each comes with inherent limitations. BERT's AE approach achieves bidirectional context modeling by reconstructing the original text from a corrupted input, yet it suffers a pretrain-finetune discrepancy because the artificial [MASK] tokens it relies on during pretraining never appear in downstream data, and it reconstructs the masked tokens independently of one another. Conversely, AR models like GPT are naturally aligned with density estimation and have no such discrepancy, but they fail to capture deep bidirectional context because they condition only on a fixed forward (or backward) factorization order. The two training objectives being contrasted are sketched below.
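In the paper's notation (with $\mathbf{x} = [x_1, \dots, x_T]$ the text sequence, $\hat{\mathbf{x}}$ the corrupted input, $\bar{\mathbf{x}}$ the masked tokens, and $m_t = 1$ indicating that $x_t$ is masked), the two objectives are, roughly:

$$
\text{AR:}\quad \max_{\theta} \; \log p_{\theta}(\mathbf{x}) \;=\; \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
$$

$$
\text{AE (BERT):}\quad \max_{\theta} \; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \;\approx\; \sum_{t=1}^{T} m_t \, \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
$$

The $\approx$ marks BERT's independence assumption: each masked token is reconstructed separately, conditioned only on the corrupted input.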
XLNet Proposal
XLNet is proposed to harness the benefits of both AR and AE while overcoming their respective limitations. It introduces a generalized autoregressive pretraining method, characterized by the following innovations:
- Permutation Language Modeling:
- Unlike traditional AR models that use a fixed forward or backward factorization order, XLNet maximizes the expected log-likelihood of a sequence over all possible permutations of the factorization order. This ensures that the model captures bidirectional context information by learning from all possible orderings of tokens (the objective is written out after this list).
- Integration with Transformer-XL:
- XLNet incorporates architectural advances from Transformer-XL, namely segment-level recurrence and relative positional encoding. These techniques improve the model's ability to handle long text sequences, enhancing its performance on tasks involving extended context (a minimal sketch of the recurrence pattern follows this list).
- Two-Stream Self-Attention Mechanism:
- The paper introduces a novel two-stream self-attention mechanism to make predictions target-aware under arbitrary factorization orders. The model maintains two sets of hidden representations: a content stream (standard self-attention, which can see the token itself) and a query stream (which sees the target position's location but not its content). This design lets the model condition on where the target sits without leaking the target token's identity, and it preserves target awareness across varying factorization orders during training (a NumPy sketch of the two masking patterns follows this list).
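For reference, the permutation language modeling objective mentioned above can be stated compactly. With $\mathcal{Z}_T$ the set of all permutations of the index sequence $[1, \dots, T]$, and $z_t$, $\mathbf{z}_{<t}$ the $t$-th element and first $t-1$ elements of a permutation $\mathbf{z}$:

$$
\max_{\theta} \;\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\!\left[\, \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
$$

Note that only the factorization order is permuted, not the input itself: tokens keep their original positions and positional encodings, and attention masks implement the chosen order.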
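The segment-level recurrence borrowed from Transformer-XL boils down to a caching pattern: attention for the current segment is computed over the current hidden states concatenated with a stored, gradient-free memory of the previous segment. A minimal NumPy sketch of that pattern, with a hypothetical `mem_len` cache size (the real model additionally applies relative positional encodings, omitted here):

```python
import numpy as np

def with_segment_memory(prev_mem, cur_hidden, mem_len=384):
    """Return (context for attention, memory to cache for the next segment)."""
    if prev_mem is None:
        context = cur_hidden                      # first segment: no memory yet
    else:
        # memory is treated as constant: no gradients flow into the previous segment
        context = np.concatenate([prev_mem, cur_hidden], axis=0)
    new_mem = context[-mem_len:]                  # keep only the most recent states
    return context, new_mem
```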
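To make the two masking patterns concrete, here is a minimal, single-head NumPy sketch of one two-stream update (no learned projections, no multi-head splitting, no Transformer-XL memory; the function names are illustrative and not taken from the released XLNet code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, mask):
    # q, k, v: (T, d); mask: (T, T) boolean, True where attention is allowed
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def two_stream_layer(h, g, perm):
    """One toy two-stream attention step under factorization order `perm`.

    h: (T, d) content stream -- may see the token itself (x_{z<=t})
    g: (T, d) query stream   -- may see strictly earlier tokens only (x_{z<t}),
                                i.e. the target position but not its content
    """
    T = h.shape[0]
    # rank[j] = position of token j in the factorization order
    rank = np.empty(T, dtype=int)
    rank[np.asarray(perm)] = np.arange(T)
    # content mask: query i attends to j iff j comes no later than i in the order
    content_mask = rank[None, :] <= rank[:, None]
    # query mask: strictly earlier only, so the target token's content stays hidden
    # (for the token that comes first in `perm` this row is empty; the full model
    # lets the query stream also attend to cached Transformer-XL memory)
    query_mask = rank[None, :] < rank[:, None]
    new_h = masked_attention(h, h, h, content_mask)  # content stream update
    new_g = masked_attention(g, h, h, query_mask)    # query stream reads content K/V
    return new_h, new_g

# Toy usage: 5 tokens, 8-dim states, one random factorization order.
rng = np.random.default_rng(0)
T, d = 5, 8
h = rng.normal(size=(T, d))
g = rng.normal(size=(T, d))  # in XLNet the query stream starts from a trainable vector w
new_h, new_g = two_stream_layer(h, g, rng.permutation(T))
```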
Experimental Results
Empirical evaluations demonstrate that XLNet significantly outperforms BERT across multiple NLP benchmarks, including question answering (SQuAD), natural language inference (MNLI), document ranking (ClueWeb09-B), and sentiment analysis (IMDB, Yelp).
- In fair comparisons trained on the same data as BERT, XLNet-Large consistently outperforms BERT-Large across diverse tasks, often by substantial margins.
- When scaled up and trained on additional data (e.g., Giga5, ClueWeb, Common Crawl), XLNet sets new state-of-the-art results, comparing favorably with contemporaries such as RoBERTa.
Implications and Future Directions
The implications of XLNet are both practical and theoretical:
- Practical Implications:
- The permutation-based autoregressive approach in XLNet addresses critical pain points such as the pretrain-finetune discrepancy seen in models like BERT. This enhances the robustness and generalization of pretrained models in real-world applications.
- XLNet’s improved handling of long sequences makes it particularly valuable for tasks involving extensive context, such as reading comprehension and document ranking.
- Theoretical Implications:
- The introduction of permutation language modeling bridges the gap between language modeling and effective pretraining, motivating further research into density estimation approaches and their applicability to NLP tasks.
- The two-stream attention mechanism provides a new direction for designing target-aware representations in transformer-based architectures.
Conclusion
XLNet represents an important methodological advancement in the domain of NLP by effectively merging the strengths of both autoregressive and autoencoding pretraining methods. Its innovative approaches, including the permutation language modeling objective and the integration of Transformer-XL, contribute to its superior performance across a range of tasks. Future research may explore further optimization of the permutation strategies and the development of even more efficient architectures inspired by XLNet’s framework. This paper marks a significant step forward in the ongoing evolution of pretraining methodologies in NLP.
The detailed architecture, novel mechanisms, and robust empirical performance presented in this paper make it a key reference point for future innovations in pretraining techniques for language understanding.