Shortformer: Better Language Modeling using Shorter Inputs (2012.15832v2)

Published 31 Dec 2020 in cs.CL

Abstract: Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.

Shortformer: Better Language Modeling Using Shorter Inputs

The paper "Shortformer: Better LLMing Using Shorter Inputs" challenges the prevailing assumption that longer input sequences invariably lead to superior performance in transformer-based LLMs. The researchers propose two primary innovations: staged training and position-infused attention, which allows transformers to effectively utilize shorter input lengths without compromising, and indeed sometimes improving, performance.

Key Contributions

This work introduces and rigorously tests methods for improving both the efficiency and performance of language models by leveraging shorter input sequences. The two key contributions are outlined below:

  1. Staged Training: The authors propose a two-stage training process in which models are first trained on shorter subsequences before progressing to longer ones. This approach is shown to reduce overall training time, decrease memory usage, and notably improve perplexity. The researchers attribute the improved perplexity to the reduced complexity the model faces during the initial stage of training, which may help it generalize better (a minimal schedule sketch appears after this list).
  2. Position-Infused Attention (PIA): The paper advances a novel approach to incorporating position information into the attention mechanism of transformers. By adding absolute position embeddings to the queries and keys rather than to the word embeddings, PIA enables the efficient reuse (caching) of representations from previous subsequences. This eschews the need for computationally expensive relative position embeddings while matching or improving perplexity (an attention sketch also follows this list).
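
A hedged illustration of the staged-training idea: train first on short subsequences, then switch to the full input length. The stage lengths, token budgets, and the train_step helper below are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of two-stage training on progressively longer subsequences.
# Lengths and budgets are placeholders; the paper's actual schedule may differ.

def staged_training(model, corpus_tokens, train_step, stages=None):
    """corpus_tokens: flat list of token ids (assumed longer than the largest
    subsequence length). train_step(model, subsequence): one optimizer update."""
    if stages is None:
        stages = [
            {"subseq_len": 128, "token_budget": 50_000_000},    # stage 1: short inputs
            {"subseq_len": 3072, "token_budget": 450_000_000},  # stage 2: full-length inputs
        ]
    for stage in stages:
        length, budget = stage["subseq_len"], stage["token_budget"]
        tokens_seen, cursor = 0, 0
        while tokens_seen < budget:
            if cursor + length > len(corpus_tokens):   # wrap around at the corpus end
                cursor = 0
            # Contiguous, non-overlapping chunks of the current stage's length.
            train_step(model, corpus_tokens[cursor:cursor + length])
            cursor += length
            tokens_seen += length
```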

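To make PIA concrete, here is a single-head NumPy sketch in which position embeddings are added to the queries and keys only, so cached, position-free representations from the previous subsequence can be concatenated and reused. The function names, dimensions, and the sinusoidal embedding choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinusoidal_positions(n_pos, d):
    """Standard sinusoidal position embeddings (assumed here for illustration)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))  # (n_pos, d)

def pia_attention(x_curr, cache, Wq, Wk, Wv):
    """Single-head attention where position embeddings are added to queries and
    keys only, so `cache` (position-free token states from the previous
    subsequence) can be reused directly."""
    d = x_curr.shape[-1]
    x_all = np.concatenate([cache, x_curr], axis=0) if cache is not None else x_curr
    # Positions are assigned within the current attention window:
    # cached tokens get the earliest positions, current tokens the latest.
    pos = sinusoidal_positions(x_all.shape[0], d)
    q = (x_curr + pos[-x_curr.shape[0]:]) @ Wq   # queries: current tokens only
    k = (x_all + pos) @ Wk                       # keys: cache + current tokens
    v = x_all @ Wv                               # values carry no position info
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: a current token attends to all cached tokens and to current
    # tokens up to and including itself.
    n_cache = 0 if cache is None else cache.shape[0]
    n_curr = x_curr.shape[0]
    mask = np.concatenate(
        [np.ones((n_curr, n_cache), dtype=bool),
         np.tril(np.ones((n_curr, n_curr), dtype=bool))], axis=1)
    scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

In the full model, this idea is applied inside every layer's self-attention; because the cached hidden states carry no position information, they can simply be assigned new positions relative to the current window rather than being recomputed.
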
Experimental Validation

The authors provide comprehensive empirical validation of their methods using the WikiText-103 dataset, a well-known benchmark in natural language processing. The quantitative results reveal several significant improvements:

  • The proposed techniques, in combination, speed up training by a factor of 1.65 compared to a baseline model trained with conventional methods.
  • Both staged training and position-infused attention reduce perplexity on the WikiText-103 dataset when compared to standard language models.
  • The Shortformer model, which combines staged training and position-infused attention, achieves a perplexity of approximately 17.47, outperforming the baseline (18.65) while also being more efficient, since it uses much smaller attention matrices (a rough size comparison follows this list).
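
To make the "smaller attention matrices" point concrete, here is a back-of-the-envelope comparison, assuming a hypothetical 3072-token baseline window versus 512-token subsequences attending to a 512-token cache; these lengths are illustrative assumptions, not figures quoted from the paper.

```python
# Rough attention-size arithmetic under the assumed lengths above.
baseline_len = 3072                     # tokens per window, no cache
short_len, cache_len = 512, 512         # current subsequence + cached subsequence

baseline_entries = baseline_len * baseline_len              # each token attends to the whole window
shortformer_entries = short_len * (short_len + cache_len)   # current tokens attend to cache + current

print(baseline_entries // baseline_len)        # 3072 attention entries per token processed
print(shortformer_entries // short_len)        # 1024 entries per token: ~3x fewer
print(baseline_entries / shortformer_entries)  # ~18x smaller attention matrix per forward pass
```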

Implications and Future Directions

The findings of this paper hold considerable implications for the design and implementation of large-scale language models. By demonstrating that shorter input sequences can not only match but sometimes exceed the performance of longer ones, this research opens opportunities for models that are more memory-efficient and faster to train. Moreover, with the growing demand for resource-efficient AI models, adopting such strategies may become particularly beneficial.

In terms of future developments, integrating the proposed methods with existing advanced models, such as the Compressive Transformer or the Routing Transformer, could yield even more robust models. Moreover, while the current research focuses on language modeling, applying these methods to other sequential tasks, such as video processing or time-series prediction, could further validate their applicability and effectiveness.

Conclusion

This paper successfully argues against the assumption that longer input subsequences are inherently beneficial for transformer-based language models. By adopting approaches such as staged training and position-infused attention, the researchers enhance model efficiency and effectiveness, pushing the boundaries of what can be achieved with shorter input lengths. The Shortformer serves as a testament to the potential of rethinking conventional strategies in language modeling, paving the way for more adaptable and resource-conscious AI technologies.

Authors (3)
  1. Ofir Press (21 papers)
  2. Noah A. Smith (224 papers)
  3. Mike Lewis (78 papers)
Citations (84)