
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (2108.12409v2)

Published 27 Aug 2021 in cs.CL

Abstract: Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We first show that extrapolation can be enabled by simply changing the position representation method, though we find that current methods do not allow for efficient extrapolation. We therefore introduce a simpler and more efficient position method, Attention with Linear Biases (ALiBi). ALiBi does not add positional embeddings to word embeddings; instead, it biases query-key attention scores with a penalty that is proportional to their distance. We show that this method trains a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048 but training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency also leads it to outperform multiple strong position methods on the WikiText-103 benchmark.

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

The paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" addresses a fundamental challenge in the deployment of transformer-based LLMs: achieving effective extrapolation to handle sequences longer than those encountered during training. This research introduces a novel position method, known as Attention with Linear Biases (ALiBi), that facilitates efficient extrapolation without compromising training or inference efficiency.

Introduction and Motivation

In transformer models, position is traditionally encoded through sinusoidal or learned embeddings added to the word embeddings. However, these methods perform poorly when a model must extrapolate to sequences longer than those seen during training, a limitation recognized since the introduction of the transformer architecture by Vaswani et al. (2017). The authors analyze existing position methods, including sinusoidal and rotary embeddings and the T5 relative-position bias, and show that none of them enables efficient extrapolation.
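For context, the sinusoidal baseline that ALiBi replaces adds fixed position vectors to the word embeddings before the first layer. Below is a minimal sketch, assuming PyTorch and an even model dimension; the function name `sinusoidal_embeddings` is illustrative, not from any reference implementation.

```python
import math
import torch

def sinusoidal_embeddings(seq_len: int, d_model: int) -> torch.Tensor:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes an even d_model.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # (d/2,)
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)
    emb = torch.zeros(seq_len, d_model)
    emb[:, 0::2] = torch.sin(pos * inv_freq)
    emb[:, 1::2] = torch.cos(pos * inv_freq)
    return emb  # added to the token embeddings before the first layer
```

Because these vectors are added once at the input and computed up to a chosen length, extending them to much longer sequences at inference time is exactly where the baseline degrades.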

ALiBi Position Method

ALiBi adds no positional embeddings to the word embeddings. Instead, it biases each query-key attention score with a penalty proportional to the distance between the query and the key, using a fixed, head-specific slope. Because position is encoded only through relative distance, the method imposes an inductive bias towards recent tokens and introduces no parameters tied to a maximum sequence length. A model can therefore be trained on short sequences yet extrapolate effectively to longer sequences at inference time, reducing computational cost.
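To make the mechanism concrete, the following is a minimal PyTorch sketch of how head-specific linear penalties can be added to causal attention scores. The function names (`alibi_slopes`, `alibi_bias`, `biased_attention`) are illustrative and not taken from the authors' reference implementation.

```python
import math
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence of head-specific slopes; for 8 heads this gives
    # 1/2, 1/4, ..., 1/256, as described in the paper for powers of two.
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    # distance[i, j] = i - j: how far key position j lies behind query i.
    pos = torch.arange(seq_len)
    distance = (pos.view(-1, 1) - pos.view(1, -1)).clamp(min=0).float()
    slopes = alibi_slopes(n_heads).view(-1, 1, 1)   # (n_heads, 1, 1)
    return -slopes * distance                       # (n_heads, L, L)

def biased_attention(q, k, v):
    # q, k, v: (n_heads, seq_len, head_dim). No positional embeddings are
    # added to the token embeddings; the linear bias alone encodes position.
    n_heads, seq_len, head_dim = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    scores = scores + alibi_bias(seq_len, n_heads)
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Because the slopes are fixed rather than learned, the bias adds essentially no parameters and negligible compute to the attention layer.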

Key Experimentation and Results

  • Performance Metrics: On the WikiText-103 benchmark, ALiBi achieves lower perplexity than the sinusoidal method and several other strong position methods, indicating better language modeling performance at no extra computational cost.
  • Efficiency Gains: ALiBi models trained on 1024-token sequences match the perplexity of sinusoidal models trained on 2048-token sequences when both are evaluated at length 2048, while training 11% faster and using 11% less memory (see the sketch after this list).
  • Robustness Across Domains: ALiBi's applicability extends beyond WikiText-103 to domains such as the Toronto BookCorpus, demonstrating the robustness of the chosen hyperparameters for various text genres.
  • Scalability: At scale, with a 1.3 billion parameter model trained on a 461 GB dataset, ALiBi maintains its advantages, achieving near-equivalent perplexity to the baseline while offering practical efficiency benefits.
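As referenced in the efficiency bullet above, the extrapolation hinges on the bias being a function of relative distance only, so nothing in the model is sized to a maximum training length. The following is a small self-contained sketch of that property; the function name and single-head slope value are illustrative.

```python
import torch

def linear_bias(seq_len: int, slope: float) -> torch.Tensor:
    # Penalty grows linearly with how far the key lies behind the query.
    pos = torch.arange(seq_len)
    distance = (pos.view(-1, 1) - pos.view(1, -1)).clamp(min=0)
    return -slope * distance.float()  # added to one head's attention scores

train_bias = linear_bias(1024, slope=0.5)   # shape used during training
eval_bias = linear_bias(2048, slope=0.5)    # longer inputs at inference time

# The top-left 1024x1024 block of the longer bias equals the training bias
# exactly, so no positional parameters need to be extended or retrained.
assert torch.equal(eval_bias[:1024, :1024], train_bias)
```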

Implications and Future Directions

The research contributes a compelling alternative to traditional position encoding methods, particularly as models are increasingly applied to tasks that require processing extensive contexts. Input-length extrapolation, predicting tokens over sequences longer than those seen in training, opens avenues for more powerful and efficient NLP models and facilitates longer-context understanding, which is crucial for applications such as document-level language modeling and broader context-aware AI systems.

Future research could focus on further optimization of ALiBi to harness longer contexts fully. Investigating the integration with other advanced transformer architectures or hybrid models could amplify gains across different AI tasks. Moreover, theoretical examination of the inductive biases introduced by ALiBi might reveal insights into enhanced model performance for diverse linguistic structures and scenarios. The findings point towards potential enhancements in extrapolation tasks beyond NLP, such as music and image generation, where sequence lengths can significantly vary.

In summary, the introduction and validation of ALiBi provide an efficient pathway to address the limitations of transformer models in extrapolating input sequences, with promising implications for efficient model deployment and execution across a variety of large-scale, context-dependent applications.

Authors (3)
  1. Ofir Press (21 papers)
  2. Noah A. Smith (224 papers)
  3. Mike Lewis (78 papers)
Citations (595)