PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization (1912.08777v3)

Published 18 Dec 2019 in cs.CL

Abstract: Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Introduction

Rapid advances in NLP driven by the Transformer architecture have significantly changed the landscape of text summarization. While extractive summarization has its merits, abstractive summarization, which generates novel and fluent text, remains challenging because it requires capturing the essence of an input document. PEGASUS addresses this challenge with a pre-training objective tailored to summarization, achieving state-of-the-art results across a diverse set of summarization tasks.

Methodology

Pre-training Objectives:

PEGASUS employs a pre-training objective called Gap Sentences Generation (GSG). Unlike standard masked language modeling (MLM), which masks individual tokens, GSG masks whole sentences deemed important and generates them from the remaining text as a single output sequence. Because this task closely resembles downstream summarization, it encourages whole-document understanding and summary-like generation.
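
As a rough illustration, the sketch below shows how a GSG training pair might be assembled from a sentence-split document. It is a minimal sketch based on the paper's description, not its implementation; the helper name and the use of a single [MASK1] placeholder per gap sentence are assumptions.

```python
def make_gsg_example(sentences, gap_indices, mask_token="[MASK1]"):
    """Build a GSG pre-training pair: each selected sentence is replaced by a
    mask token in the encoder input, and the selected sentences are
    concatenated in document order to form the decoder target."""
    gap = set(gap_indices)
    input_text = " ".join(
        mask_token if i in gap else sentence
        for i, sentence in enumerate(sentences)
    )
    target_text = " ".join(sentences[i] for i in sorted(gap))
    return input_text, target_text

# Example:
# make_gsg_example(["A B.", "C D.", "E F."], gap_indices=[1])
# -> ("A B. [MASK1] E F.", "C D.")
```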

Sentence Selection Strategies:

The critical innovation lies in how gap sentences are selected. Comprehensive ablations showed that choosing principal sentences by importance (the Ind-Orig strategy) outperformed alternatives such as random or lead (sequential) selection. Importance is computed heuristically as the ROUGE1-F1 score between a sentence and the rest of the document.
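
The following sketch illustrates the Ind-Orig idea under simplifying assumptions (whitespace tokenization, no stemming, an illustrative 30% gap-sentence ratio); the function names are hypothetical and not taken from the paper's codebase. The returned indices would then feed a GSG construction like the one sketched earlier.

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """Clipped unigram-overlap F1 between two token lists."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences, gap_ratio=0.3):
    """Ind-Orig style selection: score each sentence independently by
    ROUGE1-F1 against the rest of the document and keep the top fraction."""
    tokenized = [s.lower().split() for s in sentences]
    scored = []
    for i, tokens in enumerate(tokenized):
        rest = [t for j, toks in enumerate(tokenized) if j != i for t in toks]
        scored.append((rouge1_f1(tokens, rest), i))
    k = max(1, int(len(sentences) * gap_ratio))
    return sorted(i for _, i in sorted(scored, reverse=True)[:k])
```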

Pre-training Corpus:

PEGASUS was pre-trained on two large corpora: C4 and the newly introduced, news-focused HugeNews. The choice of corpus affects performance, with HugeNews proving particularly effective for news-related tasks and C4 offering broader domain coverage.

Experiments and Results

Empirical evaluations validated PEGASUS's efficacy across 12 datasets spanning diverse domains such as news, science, legislative bills, and stories. PEGASUS consistently outperformed or matched the existing state of the art on standard ROUGE metrics. Notably, it remained robust in low-resource settings, surpassing previous state-of-the-art results on six datasets with only 1,000 supervised examples.

A notable observation was that performance was strongest on datasets whose domain aligned with the pre-training corpus, suggesting that a pre-training corpus closely matched to the target application can significantly enhance results.

Implications and Future Directions

The advancements pioneered by PEGASUS have profound implications:

  1. Fundamental Advances in Abstractive Summarization: The GSG objective represents a significant leap, aligning pre-training closer to the actual task, thus enhancing model robustness and output quality. Traditional pre-training models such as BERT and GPT do not explicitly cater to such task-specific requirements.
  2. Real-world Applicability: With its strong performance even on low-resource benchmarks, PEGASUS is primed for deployments in real-world applications where supervised data might be scarce. Given its efficiency in such scenarios, PEGASUS can be pivotal in democratizing access to high-quality summarization.
  3. Human-like Summarization: Human evaluations confirm the quality of PEGASUS summaries, which reach human-level performance on multiple datasets, narrowing a critical gap in machine-generated content.

Future Directions:

An intriguing area for future research is refining sentence selection heuristics to adapt dynamically to varied document structures across domains. Augmenting GSG with complementary pre-training objectives, potentially incorporating controlled generation techniques, could yield even richer summarization capabilities. Finally, the need to balance summarization quality with computational efficiency motivates exploration of lighter-weight model variants suited for edge deployment.

Conclusion

PEGASUS illustrates the gains achievable in NLP through pre-training objectives tailored to the downstream task. By masking and generating entire gap sentences, it closely emulates summarization during pre-training, honing its abstractive summarization ability. The empirical results demonstrate its strength across diverse domains, including low-resource settings, making it a significant milestone and a promising foundation for future advances in summarization.

Authors (4)
  1. Jingqing Zhang (15 papers)
  2. Yao Zhao (272 papers)
  3. Mohammad Saleh (19 papers)
  4. Peter J. Liu (30 papers)
Citations (1,889)