- The paper presents a novel future n-gram prediction approach that improves Seq2Seq pre-training by predicting multiple tokens concurrently.
- It introduces an innovative n-stream self-attention mechanism that captures long-term dependencies while maintaining efficient training and inference.
- Experimental results show state-of-the-art performance on abstractive summarization and question generation benchmarks, outperforming models of comparable scale.
An Overview of ProphetNet: Advancements in Sequence-to-Sequence Pre-training
The paper "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training" introduces a novel approach to improve sequence-to-sequence (Seq2Seq) models through a pre-training method that predicts future n-grams, rather than relying on the conventional one-step-ahead prediction techniques. This work highlights the limitations of autoregressive LLMs in capturing long-term dependencies and addresses these limitations with future n-gram prediction and an innovative n-stream self-attention mechanism.
Key Contributions and Methodology
1. Future N-gram Prediction:
ProphetNet extends conventional Seq2Seq training with an n-step-ahead objective: at each time step the model predicts the next n future tokens simultaneously rather than only the immediate next token. This discourages overfitting on strong local correlations (e.g., bigram patterns) by forcing the model to plan for tokens beyond the immediate next one, as sketched in the loss function below.
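To make the objective concrete, here is a minimal sketch of how the per-stream losses might be combined. The function name, tensor shapes, and the per-stream weights `alphas` are illustrative assumptions; the paper likewise weights each stream's cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def future_ngram_loss(stream_logits, targets, alphas, pad_id=0):
    """Weighted sum of cross-entropy losses, one per predicting stream.

    stream_logits: list of n tensors [batch, seq_len, vocab]; stream j
                   is trained to predict the token j steps ahead.
    targets:       [batch, seq_len] gold output tokens (stream 0 targets).
    alphas:        list of n weights balancing the streams (assumed here).
    """
    total = 0.0
    for j, (logits, alpha) in enumerate(zip(stream_logits, alphas)):
        if j >= targets.size(1):
            break
        # Shift targets left by j so position t is scored against y[t + j];
        # positions that run past the end of the sequence are ignored.
        shifted = torch.full_like(targets, pad_id)
        shifted[:, : targets.size(1) - j] = targets[:, j:]
        total = total + alpha * F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            shifted.reshape(-1),
            ignore_index=pad_id,
        )
    return total
```

Setting `alphas = [1.0]` with a single stream recovers ordinary next-token training; the extra streams only add auxiliary supervision during pre-training.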
2. N-Stream Self-Attention Mechanism:
The authors generalize the two-stream self-attention mechanism (as used in XLNet) to n streams: a main stream encodes the prefix exactly as in a standard Transformer decoder, while dedicated predicting streams, one per future position, produce the n-gram predictions. Because the streams share parameters, the predicting streams can simply be disabled at fine-tuning or inference time, so the model falls back to ordinary next-token generation with no architectural change; a simplified sketch follows.
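A toy sketch of the stream layout is below. It is a simplification under several assumptions (shared attention weights across streams, learned placeholder queries, no positional embeddings, a single layer); it illustrates the data flow rather than reproducing ProphetNet's exact attention pattern.

```python
import torch
import torch.nn as nn

class NStreamSelfAttention(nn.Module):
    """Toy single-layer sketch of the n-stream idea (not the paper's exact layer).

    A main stream runs ordinary causal self-attention over the decoder inputs.
    Each predicting stream starts from a learned placeholder embedding and, at
    position t, attends to the main-stream states up to t, so its output can be
    trained to predict a token several steps ahead. All streams reuse the same
    attention weights, mirroring the parameter sharing described in the paper.
    """

    def __init__(self, d_model: int, n_heads: int, n_predict: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One learned placeholder per predicting stream (an assumption of this
        # sketch; positional information is omitted for brevity).
        self.placeholders = nn.Parameter(torch.randn(n_predict, d_model))

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        b, t, d = x.shape
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1
        )
        # Main stream: standard masked self-attention over the prefix.
        main, _ = self.attn(x, x, x, attn_mask=causal)
        streams = [main]
        for ph in self.placeholders:
            q = ph.expand(b, t, d)  # placeholder queries, one per position
            # A predicting-stream query at position t reads main-stream states of
            # positions <= t (the paper also lets it attend to its own placeholder).
            out, _ = self.attn(q, main, main, attn_mask=causal)
            streams.append(out)
        return streams  # streams[0] is the main stream; the rest feed n-gram heads
```

At inference time only `streams[0]` would be kept, which is exactly the output of standard masked self-attention.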
3. Pre-training on Large-Scale Data:
ProphetNet is pre-trained at two scales: a 16GB corpus comparable to BERT's pre-training data and a 160GB corpus comparable to BART's. The future n-gram objective is applied on top of a mask-based auto-encoder denoising task in the spirit of MASS and BART: contiguous spans of the input are masked and the decoder learns to reconstruct them. A toy example of how such training pairs can be constructed is sketched below.
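The following minimal sketch shows one way to build a masked-span example; the span length, masking rate, and `[MASK]` placeholder handling are illustrative assumptions rather than the paper's exact recipe.

```python
import random

MASK = "[MASK]"

def make_denoising_example(tokens, span_ratio=0.15, seed=None):
    """Build one (encoder_input, decoder_target) pair by masking a contiguous span.

    The encoder sees the sequence with the span replaced by [MASK] placeholders;
    the decoder is trained to regenerate the original span. Hyperparameters here
    are illustrative, not the paper's settings.
    """
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * span_ratio))
    start = rng.randint(0, len(tokens) - span_len)
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]
    return encoder_input, decoder_target
```

During pre-training, the decoder would apply the future n-gram loss from the earlier sketch to `decoder_target`, so the denoising and n-gram objectives combine naturally.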
Experimental Results
ProphetNet demonstrates significant improvements across several benchmarks, achieving state-of-the-art results for abstractive summarization on CNN/DailyMail and Gigaword, and for question generation on SQuAD 1.1. Compared to models of similar scale:
- CNN/DailyMail: outperformed BERTSUMABS, MASS, and UniLM, reaching up to 43.68 ROUGE-1.
- Gigaword: surpassed OpenNMT and Re3Sum, and edged out models such as PEGASUS even though PEGASUS was pre-trained on orders of magnitude more data.
Implications and Future Directions
ProphetNet's future n-gram prediction and n-stream self-attention offer a promising way to balance long-term dependency modeling with local coherence in Seq2Seq models. Its training efficiency, reaching superior performance with fewer pre-training epochs than comparable models, points toward more cost-effective pre-training paradigms.
Potential avenues of future research include exploring variations in n-gram lengths to fine-tune model performance across different tasks and extending this methodology to other domains beyond textual data, where sequence prediction is pivotal. Additionally, integrating ProphetNet's approach into multi-modal models might prove beneficial, given its strength in capturing sequential dependencies.
In summary, ProphetNet marks a substantial step forward in pre-training methodology for Seq2Seq models, offering a robust framework that improves prediction quality across natural language generation tasks.