MPNet: Masked and Permuted Pre-training for Language Understanding
The paper "MPNet: Masked and Permuted Pre-training for Language Understanding" introduces an innovative pre-training method for LLMs called MPNet. The key goal of MPNet is to address the inherent limitations in the widely acclaimed masked LLMing (MLM) used in BERT and the permuted LLMing (PLM) utilized by XLNet. By leveraging the strengths of both MLM and PLM while mitigating their respective weaknesses, MPNet serves as a comprehensive approach to enhance the performance of LLMs on various NLP tasks.
Background and Motivation
MLM and PLM have significantly advanced NLP pre-training. BERT, which uses MLM, leverages bidirectional context efficiently but fails to capture dependencies among the masked tokens, a limitation XLNet aims to overcome with PLM. However, XLNet introduces its own challenge: during pre-training, each predicted token sees only part of the sentence's position information, creating a position discrepancy between pre-training and fine-tuning, where the full sentence and its positions are always visible.
Methodology
MPNet combines the advantages of MLM and PLM by:
- Modeling Token Dependency: Using a permuted language modeling scheme similar to PLM, MPNet predicts tokens autoregressively in a permuted order, so each predicted token conditions on previously predicted tokens and the dependencies among them are captured.
- Incorporating Full Position Information: MPNet feeds the position information of the full sentence as auxiliary input (mask tokens that carry the positions of the tokens to be predicted), aligning what the model sees during pre-training with what it sees during fine-tuning; a minimal sketch of this input construction follows the list.
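To make these two ideas concrete, the following is a minimal, framework-free sketch of how an MPNet-style training input could be constructed. It is illustrative only: the function name, the toy tokens, and the single content/position streams are simplifications (the actual model uses two-stream self-attention), not the authors' code.

```python
import random

MASK = "[M]"

def build_mpnet_input(tokens, num_predicted, seed=0):
    """Illustrative MPNet-style input construction (simplified).

    1. Permute the token positions.
    2. Treat the last `num_predicted` positions of the permutation as the
       tokens to predict; the rest are non-predicted (visible) tokens.
    3. Append mask tokens that carry the positions of the predicted tokens
       ("position compensation"), so the model sees position information
       for the full sentence, just as it does during fine-tuning.
    """
    rng = random.Random(seed)
    n = len(tokens)
    perm = list(range(n))
    rng.shuffle(perm)

    non_pred = perm[: n - num_predicted]  # visible tokens (content + position)
    pred = perm[n - num_predicted:]       # tokens to predict, in permuted order

    # Content stream: visible tokens followed by mask placeholders.
    content = [tokens[i] for i in non_pred] + [MASK] * num_predicted
    # Position stream: original positions for every slot, including the masks,
    # which is what gives the model full-sentence position information.
    positions = non_pred + pred

    targets = [(tokens[i], i) for i in pred]
    return content, positions, targets

content, positions, targets = build_mpnet_input(
    ["the", "task", "is", "sentence", "classification"], num_predicted=2
)
print(content)    # visible tokens followed by '[M]' placeholders
print(positions)  # original position index for every input slot
print(targets)    # (token, original position) pairs the model must predict
```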
Unified View of MLM and PLM
The paper presents a unified view of MLM and PLM, showing that both can be expressed as predictions over a rearranged (permuted) sequence. This perspective allows MPNet to combine the benefits of both approaches: each predicted token conditions on the non-predicted tokens, on the previously predicted tokens, and on the position information of the full sentence.
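Paraphrasing the paper's notation (so treat the exact symbols as an approximation): let z be a permutation of the positions 1..n, c the number of non-predicted tokens, and M the mask tokens placed at the predicted positions. Under the unified view, the three objectives differ only in what each predicted token may condition on:

```latex
% Unified view of the three objectives (paraphrased from the paper's notation).
\begin{align*}
\text{MLM:}   &\; \mathbb{E}_{z}\sum_{t=c+1}^{n} \log P\!\left(x_{z_t} \mid x_{z_{\le c}},\, M_{z_{>c}}\right) \\
\text{PLM:}   &\; \mathbb{E}_{z}\sum_{t=c+1}^{n} \log P\!\left(x_{z_t} \mid x_{z_{<t}}\right) \\
\text{MPNet:} &\; \mathbb{E}_{z}\sum_{t=c+1}^{n} \log P\!\left(x_{z_t} \mid x_{z_{<t}},\, M_{z_{>c}}\right)
\end{align*}
```

MLM sees mask tokens at all predicted positions but ignores dependencies among predictions; PLM models those dependencies but sees no positions beyond the current prediction step; MPNet conditions on both.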
Implementation Details
MPNet is trained on a large-scale text corpus exceeding 160GB, using a configuration comparable to other state-of-the-art models like RoBERTa. The fine-tuning process encompasses a variety of downstream tasks, including GLUE, SQuAD, RACE, and IMDB benchmarks, to demonstrate the efficacy of the pre-training method.
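For readers who want to experiment with the released model, the sketch below shows one way to fine-tune the Hugging Face Transformers port of MPNet (the microsoft/mpnet-base checkpoint) on a GLUE task. The task choice (SST-2) and the hyperparameters are illustrative placeholders, not the paper's exact fine-tuning recipe.

```python
# Illustrative fine-tuning sketch using the Hugging Face Transformers port of
# MPNet (microsoft/mpnet-base). Hyperparameters and the SST-2 task choice are
# placeholders, not the paper's exact setup.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/mpnet-base", num_labels=2
)

# SST-2 (a GLUE sentiment task) as a small, representative example.
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="mpnet-sst2",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```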
Experimental Results
The experimental results show consistent performance improvements over MPNet's predecessors. Notably:
- On the GLUE benchmark, MPNet exhibits an average improvement of 4.8, 3.4, and 1.5 points over BERT, XLNet, and RoBERTa, respectively.
- MPNet also demonstrates superior performance on the SQuAD datasets, with substantial improvements in both exact match (EM) and F1 scores.
Ablation Studies
The paper includes rigorous ablation studies to validate the contributions of different components of MPNet. Key findings include:
- Position compensation effectively reduces the discrepancy between pre-training and fine-tuning, enhancing performance across tasks.
- Incorporating the permutation operation to model dependencies among predicted tokens yields further gains over plain masked prediction.
Implications and Future Directions
The results of MPNet indicate meaningful advances in the pre-training of language models. Incorporating full position information and modeling token dependency can lead to models that are not only more accurate but also more robust across diverse NLP tasks. Future research could extend MPNet to larger or more complex architectures and investigate its applicability to a broader spectrum of language understanding and generation tasks.
Conclusion
MPNet represents a significant methodological advance by bridging the gap between MLM and PLM, providing a more holistic and effective pre-training approach. The empirical results underscore its potential, positioning MPNet as an important development for future language-model research and application. The paper's systematic combination of ideas and its thorough validation mark a clear step forward in the evolution of NLP pre-training.