Introduction
Transformer models have made significant advances in NLP, but their ability to process long input sequences effectively remains a substantial challenge, particularly in tasks like long-input summarization. This work explores how to extend pretrained Transformer models to handle long input sequences more efficiently.
Architectural and Pretraining Modifications
Through a series of experiments, the paper identifies key architectural changes and pretraining strategies that improve Transformer performance on long-input tasks. A Transformer variant using "block-local attention with staggered blocks and global tokens" was found to strike an effective balance between computational efficiency and performance on long-input summarization (a minimal sketch of this attention pattern follows the list of findings below). Key findings from the investigation include:
- Staggering block-local attention, i.e. shifting the block boundaries between layers, allows information to flow across blocks without significant computational overhead.
- Introducing a small number of global tokens that attend to, and are attended by, every token in the sequence improves the model's ability to aggregate information across the whole input, which benefits summarization.
- Sinusoidal position encodings remain a good choice for long-input Transformer models, balancing performance with computational efficiency.
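To make the combination concrete, below is a minimal, single-head NumPy sketch of block-local attention with staggered blocks and a handful of global tokens. It is an illustrative simplification rather than the paper's implementation: the function name, the shared projections for global tokens, and the in-function `stagger` flag are assumptions here, whereas PEGASUS-X uses multi-head attention with proper masking and staggers block boundaries across alternating encoder layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_local_global_attention(q, k, v, g_q, g_k, g_v, block_size, stagger=False):
    """Single-head sketch: local tokens attend within their block plus to all
    global tokens; global tokens attend to everything. Assumes seq_len is a
    multiple of block_size, and the stagger wraps around at the sequence ends,
    which a real implementation would handle with padding/masking.
    (Hypothetical simplification; not the PEGASUS-X code.)

    q, k, v:       [seq_len, d]   projections for the regular tokens
    g_q, g_k, g_v: [n_global, d]  projections for the global tokens
    """
    seq_len, d = q.shape
    shift = block_size // 2 if stagger else 0

    # Staggering: rotate the sequence by half a block so that block boundaries
    # fall in different places than in the neighbouring (unstaggered) layer.
    q_s, k_s, v_s = (np.roll(x, shift, axis=0) for x in (q, k, v))

    out = np.zeros_like(q)
    for b in range(seq_len // block_size):
        sl = slice(b * block_size, (b + 1) * block_size)
        # Each block's queries see that block's keys/values plus every global token.
        keys = np.concatenate([k_s[sl], g_k], axis=0)
        vals = np.concatenate([v_s[sl], g_v], axis=0)
        scores = q_s[sl] @ keys.T / np.sqrt(d)
        out[sl] = softmax(scores) @ vals

    # Undo the rotation so outputs line up with the original token order.
    out = np.roll(out, -shift, axis=0)

    # Global tokens attend to the full sequence plus themselves.
    g_scores = g_q @ np.concatenate([k, g_k], axis=0).T / np.sqrt(d)
    g_out = softmax(g_scores) @ np.concatenate([v, g_v], axis=0)
    return out, g_out

# Example: 512 tokens, block size 64, 16 global tokens, model dimension 32.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 32))
g = rng.normal(size=(16, 32))
local_out, global_out = block_local_global_attention(x, x, x, g, g, g, 64, stagger=True)
```

Alternating `stagger` between consecutive layers means a token that sits at a block edge in one layer sits in a block interior in the next, which is how information propagates across blocks without paying the cost of full attention.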
Adapting PEGASUS for Long-Input Summarization
Building on these insights, PEGASUS-X is presented as an extension of the existing PEGASUS model, designed to summarize documents of up to 16K input tokens. It adds only a small number of additional parameters and does not require complex model parallelism for training. PEGASUS-X combines a block-local encoder with global tokens and additional pretraining on long sequences, and it is evaluated on inputs well beyond the maximum lengths typically used during pretraining.
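Part of what keeps this extension lightweight is that sinusoidal position encodings, which the study found to remain a good choice, are parameter-free: going from a short pretraining length to 16K positions introduces no new position parameters. Below is a minimal sketch of the standard encoding; the function name and the dimensions used in the example are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Standard sinusoidal encodings (Vaswani et al., 2017); d_model must be even."""
    positions = np.arange(seq_len)[:, None]          # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]         # [1, d_model // 2]
    angles = positions / np.power(10000.0, dims / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                    # even dimensions
    enc[:, 1::2] = np.cos(angles)                    # odd dimensions
    return enc

# Extending the input length only lengthens the table; earlier rows are unchanged,
# so the encodings seen during short-input pretraining are preserved exactly.
pe_short = sinusoidal_position_encoding(512, 64)
pe_long = sinusoidal_position_encoding(16384, 64)
assert np.allclose(pe_short, pe_long[:512])
```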
Performance and Contributions
PEGASUS-X achieves strong results on long-input summarization tasks, matching or outperforming larger models while remaining far more parameter-efficient. It also maintains comparable performance on shorter-input summarization.
Summarizing the contributions:
- A systematic exploration of efficient Transformer architectures is conducted, yielding insights into architectural modifications and their effects on long-input summarization.
- A recipe for adapting pretrained Transformer encoders to longer inputs is proposed, improving long-document summarization with minimal impact on performance for shorter documents.
- PEGASUS-X is introduced, and the model weights are made available to the community for further research and application.
Conclusion
The paper concludes that carefully adjusting Transformer architectures and pretraining strategies can significantly boost performance on long-input tasks without resorting to overly complex or compute-intensive solutions. PEGASUS-X exemplifies how these adaptations can be applied to existing pretrained models to extend them to much longer input sequences.