- The paper introduces MASS, a pre-training method that jointly trains the encoder and decoder by predicting a masked sentence fragment.
- It achieves state-of-the-art BLEU scores in unsupervised translation and significantly outperforms baselines in low-resource summarization.
- Its design enhances contextual understanding and lowers perplexity in dialogue generation, demonstrating versatility across NLP tasks.
An Overview of "MASS: Masked Sequence to Sequence Pre-training for Language Generation"
The paper "MASS: Masked Sequence to Sequence Pre-training for Language Generation" presents a novel approach for pre-training sequence-to-sequence models specifically for language generation tasks. This approach, termed as MAsked Sequence to Sequence (MASS), integrates an encoder-decoder framework where the encoder receives a sentence with masked fragments, and the decoder predicts these masked fragments. This design aims to develop robust representation extraction and LLMing capabilities jointly within the encoder and decoder.
Methodology
The MASS method follows a two-stage process: models are first pre-trained on large-scale monolingual corpora and then fine-tuned on downstream tasks. During pre-training, the encoder ingests a sentence with a contiguous fragment masked out, and the decoder attempts to reconstruct the masked tokens. Unlike previous methods such as BERT or standard language modeling, which pre-train only the encoder or only the decoder, MASS ensures that both are trained cooperatively from the outset. This joint training serves two purposes (a minimal sketch of the masking scheme follows the list):
- Contextual Understanding: Forcing the encoder to comprehend the unmasked parts of the sentence so that the masked fragment can be predicted.
- Language Modeling: Training the decoder to leverage the encoder's representation rather than relying solely on the preceding tokens for next-token prediction.
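To make the masking scheme concrete, here is a minimal sketch in Python. It is illustrative only, not the authors' implementation: the `[MASK]` symbol, the function name, and the 50% fragment length (a ratio the paper reports as working well, treated here as a tunable parameter) are assumptions of this sketch.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol; real vocabulary handling is implementation-specific

def mass_mask(tokens, mask_ratio=0.5):
    """Build one MASS-style training example from a single sentence.

    The encoder sees the sentence with a contiguous fragment replaced by [MASK];
    the decoder is trained to predict exactly that fragment.
    """
    m = len(tokens)
    frag_len = max(1, int(m * mask_ratio))     # length of the masked fragment
    start = random.randint(0, m - frag_len)    # random start position of the fragment
    fragment = tokens[start:start + frag_len]  # the decoder's prediction target

    encoder_input = tokens[:start] + [MASK] * frag_len + tokens[start + frag_len:]
    return encoder_input, fragment, start

# One possible draw:
# tokens        -> ["the", "cat", "sat", "on", "the", "mat"]
# encoder_input -> ["the", "cat", "[MASK]", "[MASK]", "[MASK]", "mat"]
# fragment      -> ["sat", "on", "the"]
```

Masking one contiguous span, rather than scattered tokens, is what gives the decoder practice at generating fluent multi-token fragments.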
Experimental Setup
The authors conducted extensive experiments across three distinct language generation tasks: Neural Machine Translation (NMT), text summarization, and conversational response generation. The tasks were explored under low-resource and, for translation, fully unsupervised (zero-resource) scenarios, emphasizing MASS's capability to handle diverse data constraints.
Neural Machine Translation:
For NMT, MASS was tested on multiple language pairs, including English-French, English-German, and English-Romanian. The pre-trained MASS model showed significant improvements over baseline methods, achieving a state-of-the-art BLEU score of 37.5 on unsupervised English-French translation and surpassing the previous state-of-the-art by more than 4 BLEU points.
Text Summarization:
On the Gigaword corpus, MASS pre-training yielded substantial gains in ROUGE scores, especially in low-resource settings: a model fine-tuned on just 10,000 pairs outperformed the baseline by more than 10 points in ROUGE-1, ROUGE-2, and ROUGE-L.
Conversational Response Generation:
Experiments on the Cornell Movie Dialog corpus underscored the efficacy of MASS for dialogue generation. Notably, MASS demonstrated lower perplexity in generating conversational responses compared to non-pre-trained baselines, illustrating its potential in practical applications.
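For context, perplexity is the exponentiated average negative log-likelihood per token, so lower values mean the model assigns higher probability to the reference responses. A minimal sketch of the computation (the function and its inputs are illustrative, not taken from the paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token log-probabilities (natural log) of the reference responses."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)  # mean negative log-likelihood per token
    return math.exp(avg_nll)

# Example: perplexity([-1.2, -0.7, -2.3, -0.9]) ≈ math.exp(1.275) ≈ 3.6
```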
Comparative Analysis and Ablations
In direct comparisons with other pre-training paradigms such as BERT+LM and DAE (Denoising Auto-Encoder), MASS consistently achieved superior results. The ablation studies further confirmed the design choices of MASS:
- Predicting consecutive tokens: Predicting a contiguous fragment rather than scattered individual tokens gave the decoder stronger language modeling ability, improving fluency and coherence.
- Masking decoder inputs: Masking the decoder-side tokens that are left unmasked on the encoder side encouraged the decoder to derive richer information from the encoder, refining generation quality (see the sketch after this list).
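The sketch below illustrates the decoder-side masking choice under the same assumptions as the earlier sketch; the `<s>` start symbol, the `[MASK]` symbol, and whether non-fragment positions are masked or dropped entirely are illustrative choices here, not details taken from the paper.

```python
BOS = "<s>"      # illustrative start-of-sequence symbol
MASK = "[MASK]"  # illustrative mask symbol

def decoder_input(tokens, start, frag_len, mask_decoder_inputs=True):
    """Construct the decoder input for predicting tokens[start:start + frag_len].

    With mask_decoder_inputs=True (the MASS design), positions outside the
    fragment are masked, so the decoder must pull that context from the
    encoder's representation. With False (the ablated variant), the decoder
    also sees the unmasked context directly in its own input.
    """
    fragment = tokens[start:start + frag_len]
    shifted = [BOS] + fragment[:-1]                  # teacher forcing: target shifted right by one
    left, right = tokens[:start], tokens[start + frag_len:]
    if mask_decoder_inputs:
        return [MASK] * len(left) + shifted + [MASK] * len(right)
    return left + shifted + right
```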
Implications and Future Directions
The implications of MASS for the field of NLP are multifaceted:
- Enhanced Low-Resource Adaptability: Demonstrated improvements in scenarios with limited training data, suggesting MASS's utility for languages and tasks with scarce annotated resources.
- Broad Task Applicability: The versatility shown across NMT, text summarization, and conversational response generation underlines MASS’s generalizability for various sequence-to-sequence tasks.
Future developments may focus on extending MASS to other sequence generation tasks such as sentence paraphrasing, text style transfer, and post-editing. Additionally, advancing the theoretical frameworks underpinning MASS can provide deeper insights into its operational mechanisms and optimization.
Conclusion
The "MASS: Masked Sequence to Sequence Pre-training for Language Generation" paper introduces a robust and versatile pre-training method that has established new benchmarks in several NLP tasks. By integrating a novel encoder-decoder training mechanism, MASS exemplifies how targeted pre-training can significantly elevate model performance in both data-rich and data-scarce environments. The findings encourage further exploration and application of MASS to a broader range of sequence generation challenges in NLP.