Adapting Pretrained Text-to-Text Models for Long Text Sequences (2209.10052v2)

Published 21 Sep 2022 in cs.CL

Abstract: We present an empirical study of adapting an existing pretrained text-to-text model for long-sequence inputs. Through a comprehensive study along three axes of the pretraining pipeline -- model architecture, optimization objective, and pretraining corpus, we propose an effective recipe to build long-context models from existing short-context models. Specifically, we replace the full attention in transformers with pooling-augmented blockwise attention, and pretrain the model with a masked-span prediction task with spans of varying length. In terms of the pretraining corpus, we find that using randomly concatenated short-documents from a large open-domain corpus results in better performance than using existing long document corpora which are typically limited in their domain coverage. With these findings, we build a long-context model that achieves competitive performance on long-text QA tasks and establishes the new state of the art on five long-text summarization datasets, often outperforming previous methods with larger model sizes. Our code has been released at https://github.com/facebookresearch/bart_ls.

Adapting Pretrained Text-to-Text Models for Long Text Sequences

The paper "Adapting Pretrained Text-to-Text Models for Long Text Sequences" by Wenhan Xiong et al. explores strategies for adapting existing pretrained text-to-text models, which are typically optimized for short sequences, to processing and understanding long text inputs. The authors analyze three significant components of the pretraining pipeline: model architecture, optimization objective, and pretraining corpus, providing insights that culminate in a recipe for adapting models to long-sequence tasks without retraining from scratch.

Model Architecture Adjustments

To adapt the model architecture for long text sequences, the paper replaces the traditional full-attention mechanism in transformers with pooling-augmented blockwise attention. This modification addresses the quadratic complexity concern associated with full-attention mechanisms, making the model computationally efficient and improving its ability to capture long-range dependencies in the text. The authors also experimented with other long-range connection mechanisms, such as global tokens and overlapping attention windows, but determined that pooling-augmented blockwise attention provided the most consistent performance gains across a range of tasks.
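To make the architecture change concrete, the sketch below shows a minimal single-head reading of pooling-augmented blockwise attention in PyTorch: each block of queries attends to its own tokens plus mean-pooled summaries of every block. The block size, the choice of mean pooling, and the function name are illustrative assumptions, not the paper's released bart_ls implementation.

```python
import torch
import torch.nn.functional as F


def pooling_blockwise_attention(q, k, v, block_size=128):
    """Single-head sketch: each block attends to its own tokens plus
    mean-pooled summaries of all blocks (illustrative, not bart_ls code).
    q, k, v: (batch, seq_len, dim); seq_len must be divisible by block_size."""
    b, n, d = q.shape
    assert n % block_size == 0
    nb = n // block_size

    # Split the sequence into non-overlapping blocks.
    qb = q.reshape(b, nb, block_size, d)
    kb = k.reshape(b, nb, block_size, d)
    vb = v.reshape(b, nb, block_size, d)

    # Mean-pooled block summaries give every query a coarse global view.
    k_pool = kb.mean(dim=2)                                   # (b, nb, d)
    v_pool = vb.mean(dim=2)

    # Augment each block's keys/values with the pooled summaries of all blocks.
    k_aug = torch.cat([kb, k_pool.unsqueeze(1).expand(b, nb, nb, d)], dim=2)
    v_aug = torch.cat([vb, v_pool.unsqueeze(1).expand(b, nb, nb, d)], dim=2)

    # Local + pooled attention: scores are block_size x (block_size + nb)
    # per block instead of seq_len x seq_len for the whole sequence.
    scores = torch.einsum("bnqd,bnkd->bnqk", qb, k_aug) / d ** 0.5
    attn = F.softmax(scores, dim=-1)
    out = torch.einsum("bnqk,bnkd->bnqd", attn, v_aug)
    return out.reshape(b, n, d)
```

Because each query scores only block_size + n_blocks keys, the cost grows roughly linearly with sequence length rather than quadratically, which is what makes much longer inputs practical.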

Pretraining Objectives

The paper evaluates several pretraining objectives, including T5 span denoising, Pegasus's primary sentence prediction, and a novel model-based span prediction approach. Among these, the T5-style denoising task, especially with mixed span lengths, emerges as a favorable choice due to its simplicity and strong performance across tasks with varying output lengths. The paper underscores the importance of training models on longer sequences during pretraining to better prepare them for downstream tasks involving lengthy texts.
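The shape of this objective can be sketched with a small span-masking routine: random spans whose lengths are drawn from a mixed set are replaced by sentinel tokens in the input and reproduced in the target. The mask ratio, the specific span lengths, and the sentinel naming below are illustrative assumptions rather than the paper's exact configuration.

```python
import random


def mask_spans(tokens, mask_ratio=0.15, span_lengths=(3, 8, 64)):
    """Replace random variable-length spans with sentinels, T5-style.
    Returns (inputs, targets); hyperparameters here are illustrative."""
    n_to_mask = int(len(tokens) * mask_ratio)
    masked, attempts = set(), 0

    # Sample spans of mixed lengths until enough tokens are covered.
    while len(masked) < n_to_mask and attempts < 1000:
        attempts += 1
        length = random.choice(span_lengths)
        start = random.randrange(0, max(1, len(tokens) - length))
        masked.update(range(start, min(start + length, len(tokens))))

    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked:
            tag = f"<extra_id_{sentinel}>"
            inputs.append(tag)            # one sentinel replaces the whole span
            targets.append(tag)
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

Mixing short and long spans forces the decoder to produce both brief and extended outputs during pretraining, which matches the paper's observation that mixed span lengths transfer well to downstream tasks with varying output lengths.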

Pretraining Corpus

The choice of pretraining corpus plays a crucial role in the effectiveness of long-context models. Contrary to the intuitive approach of using long-document corpora, which often suffer from limited domain coverage, the authors find that concatenating randomly selected short documents from a large open-domain corpus yields better performance. This approach exposes the model to a broader range of language and domains, improving robustness across downstream tasks.
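A minimal sketch of this corpus-construction strategy is shown below: shuffled short documents are packed into long training sequences up to a maximum length, with a separator between documents. The maximum length, the separator token, and the packing heuristic are assumptions for illustration rather than details taken from the paper.

```python
import random


def build_long_sequences(documents, max_len=8192, sep_token="</s>"):
    """documents: list of token lists from an open-domain corpus.
    Yields concatenated sequences of at most max_len tokens (illustrative)."""
    docs = list(documents)
    random.shuffle(docs)              # random order breaks topical clustering
    current = []
    for doc in docs:
        # Start a new sequence if this document would overflow the current one.
        if current and len(current) + len(doc) + 1 > max_len:
            yield current
            current = []
        if current:
            current.append(sep_token)
        current.extend(doc[:max_len])  # truncate rare overlong documents
    if current:
        yield current
```

Packing unrelated documents into one sequence sacrifices some long-range coherence, but per the authors' findings, the breadth of an open-domain corpus outweighs the narrower coverage of naturally long documents.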

Empirical Results

The proposed adaptation strategy sets a new state of the art in long-text summarization across five datasets, delivering over 10% relative improvements in ROUGE-2 on three of them. It also achieves competitive performance on long-text question answering (QA) tasks despite using a smaller model than competitors such as LongT5. The result is significant because it shows that existing pretrained models can be efficiently adapted to resource-intensive long-text tasks without incurring the extensive cost of training entirely new models from scratch.

Implications and Future Directions

The findings have practical implications for efficiently broadening the applicability of state-of-the-art language models to long-form content in applications such as legal document processing, literature analysis, and extended discourse comprehension. On the theoretical side, they motivate further exploration of hybrid architectures that combine blockwise attention with approximate methods for scalable language modeling. Future research could explore pretraining techniques that dynamically adjust model biases to task-specific requirements, and further refine span lengths across corpora to balance learning efficiency and task performance.

In summary, this work provides a valuable contribution to the field of natural language processing by presenting a coherent and resource-efficient methodology for adapting existing text-to-text models to long-form inputs, broadening their applicability and maintaining competitive performance on complex downstream tasks.

Authors (5)
  1. Wenhan Xiong
  2. Anchit Gupta
  3. Shubham Toshniwal
  4. Yashar Mehdad
  5. Wen-tau Yih
Citations (28)