Introduction
Transformer models have made significant advances in NLP, but their ability to process long input sequences effectively remains a substantial challenge, particularly in tasks like long-input summarization. This work explores how to extend pretrained Transformer models to handle long input sequences more efficiently.
Architectural and Pretraining Modifications
Through a series of experiments, the paper identifies key architectural changes and pretraining strategies that improve Transformer performance on long-input tasks. A Transformer variant using "block-local attention with staggered blocks and global tokens" was found to strike an effective balance between computational efficiency and performance on long-input summarization (a minimal sketch of this attention pattern follows the list of findings below). Key findings from the investigation include:
- Staggering block-local attention, i.e. shifting the block boundaries between layers, allows information to flow across blocks without significant computational overhead.
- Introducing a small number of global tokens that attend to, and are attended by, every token in the sequence improves the model's ability to aggregate information across the whole input, which benefits summarization.
- Sinusoidal position encodings remain a good choice for long-input Transformer models, balancing performance with computational efficiency.
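To make the combination concrete, below is a minimal, single-head NumPy sketch of block-local attention with staggered blocks and a handful of global tokens. It is an illustrative simplification rather than the paper's implementation: the function name, the shared projections for global tokens, and the in-function `stagger` flag are assumptions here, whereas PEGASUS-X uses multi-head attention with proper masking and staggers block boundaries across alternating encoder layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_local_global_attention(q, k, v, g_q, g_k, g_v, block_size, stagger=False):
    """Single-head sketch: local tokens attend within their block plus to all
    global tokens; global tokens attend to everything. Assumes seq_len is a
    multiple of block_size, and the stagger wraps around at the sequence ends,
    which a real implementation would handle with padding/masking.
    (Hypothetical simplification; not the PEGASUS-X code.)

    q, k, v:       [seq_len, d]   projections for the regular tokens
    g_q, g_k, g_v: [n_global, d]  projections for the global tokens
    """
    seq_len, d = q.shape
    shift = block_size // 2 if stagger else 0

    # Staggering: rotate the sequence by half a block so that block boundaries
    # fall in different places than in the neighbouring (unstaggered) layer.
    q_s, k_s, v_s = (np.roll(x, shift, axis=0) for x in (q, k, v))

    out = np.zeros_like(q)
    for b in range(seq_len // block_size):
        sl = slice(b * block_size, (b + 1) * block_size)
        # Each block's queries see that block's keys/values plus every global token.
        keys = np.concatenate([k_s[sl], g_k], axis=0)
        vals = np.concatenate([v_s[sl], g_v], axis=0)
        scores = q_s[sl] @ keys.T / np.sqrt(d)
        out[sl] = softmax(scores) @ vals

    # Undo the rotation so outputs line up with the original token order.
    out = np.roll(out, -shift, axis=0)

    # Global tokens attend to the full sequence plus themselves.
    g_scores = g_q @ np.concatenate([k, g_k], axis=0).T / np.sqrt(d)
    g_out = softmax(g_scores) @ np.concatenate([v, g_v], axis=0)
    return out, g_out

# Example: 512 tokens, block size 64, 16 global tokens, model dimension 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 32))
g = rng.normal(size=(16, 32))
local_out, global_out = block_local_global_attention(x, x, x, g, g, g, 64, stagger=True)
```

Alternating `stagger` between consecutive layers means a token that sits at a block edge in one layer sits in a block interior in the next, which is how information propagates across blocks without paying the cost of full attention.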
Adapting PEGASUS for Long-Input Summarization
Building on these insights, PEGASUS-X is presented as an extension of the existing PEGASUS model, designed to summarize documents of up to 16K input tokens. It adds only a small number of additional parameters and does not require complex model parallelism for training. PEGASUS-X combines a block-local encoder with global tokens and additional pretraining on long sequences, and it is evaluated on inputs well beyond the maximum lengths typically used during pretraining.
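Part of what keeps this extension lightweight is that sinusoidal position encodings, which the study found to remain a good choice, are parameter-free: going from a short pretraining length to 16K positions introduces no new position parameters. Below is a minimal sketch of the standard encoding; the function name and the dimensions used in the example are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Standard sinusoidal encodings (Vaswani et al., 2017); d_model must be even."""
    positions = np.arange(seq_len)[:, None]          # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]         # [1, d_model // 2]
    angles = positions / np.power(10000.0, dims / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                    # even dimensions
    enc[:, 1::2] = np.cos(angles)                    # odd dimensions
    return enc

# Extending the input length only lengthens the table; earlier rows are unchanged,
# so the encodings seen during short-input pretraining are preserved exactly.
pe_short = sinusoidal_position_encoding(512, 64)
pe_long = sinusoidal_position_encoding(16384, 64)
assert np.allclose(pe_short, pe_long[:512])
```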
Performance and Contributions
PEGASUS-X achieves strong results on long-input summarization tasks, matching or outperforming larger models while remaining far more parameter-efficient. It also maintains comparable performance on shorter-input summarization.
Summarizing the contributions:
- A systematic exploration of efficient Transformer architectures is conducted, yielding insights into architectural modifications and their effects on long-input summarization.
- A recipe for adapting pretrained Transformer encoders to longer inputs is proposed, improving long-document summarization with minimal impact on performance for shorter documents.
- PEGASUS-X is introduced, and the model weights are made available to the community for further research and application.
Conclusion
The paper concludes that carefully adjusting Transformer architectures and pretraining strategies can significantly boost performance on long-input tasks without resorting to overly complex or compute-intensive solutions. PEGASUS-X exemplifies how these adaptations can be applied to existing pretrained models to extend them to much longer input sequences.