Landmark Attention: Random-Access Infinite Context Length for Transformers (2305.16300v2)

Published 25 May 2023 in cs.CL and cs.LG

Abstract: While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity to over 32k tokens, allowing for inference at the context lengths of GPT-4. We release the implementation of landmark attention and the code to reproduce our experiments at https://github.com/epfml/landmark-attention/.

Overview of "Random-Access Infinite Context Length for Transformers"

The paper "Random-Access Infinite Context Length for Transformers" addresses the fundamental limitation faced by transformer models in handling extended context lengths due to their demanding memory requirements. The authors propose a novel methodology that maintains random-access capabilities within infinite context scenarios, enabling transformers to efficiently process longer sequences with reduced computational overhead.

Key Contributions

  1. Landmark Token Approach: The paper introduces a landmark token mechanism within the transformer architecture, allowing relevant context blocks to be selected and retrieved through the model's own attention mechanism (see the sketch after this list). This preserves random-access flexibility without resorting to a separate retrieval mechanism.
  2. Integration with Memory Hierarchies: The proposed method seamlessly integrates with existing data structures and memory systems, supporting the handling of arbitrary input lengths by significantly reducing the computational load at both training and inference phases.
  3. Empirical Validation: The authors validate their approach against Transformer-XL, demonstrating comparable performance while significantly reducing the number of tokens retrieved at each step. Notably, they fine-tune LLaMA 7B with their method, extending the model's context capacity to over 32k tokens, matching the scale of GPT-4's context length.
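
To make the retrieval idea concrete, the following is a minimal sketch of attention-based block selection, assuming a fixed block size and a top-k retrieval budget. It is illustrative rather than the authors' released implementation: the mean-of-keys landmark, the block size, and the top-k value are placeholder choices, and the paper's trained landmark tokens and grouped softmax gating are simplified away.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def landmark_block_retrieval(q, keys, values, block_size=4, top_k=2):
    """Score cached blocks via one representative key per block, then run
    ordinary attention over the tokens of the top-k scoring blocks only.

    Simplifications vs. the paper: the landmark here is just the mean of a
    block's keys (the paper inserts and trains a dedicated landmark token),
    and the grouped-softmax gating of per-token weights is omitted.
    """
    d = q.shape[0]
    n_blocks = len(keys) // block_size
    k_blocks = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    v_blocks = values[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    landmarks = k_blocks.mean(axis=1)            # (n_blocks, d) stand-in landmark keys
    block_scores = landmarks @ q                 # relevance of each block to the query
    chosen = np.argsort(block_scores)[-top_k:]   # indices of the top-k blocks

    k_sel = k_blocks[chosen].reshape(-1, d)      # keys of the retrieved blocks
    v_sel = v_blocks[chosen].reshape(-1, d)      # values of the retrieved blocks

    attn = softmax(k_sel @ q / np.sqrt(d))       # token-level attention over retrieved tokens
    return attn @ v_sel

# Toy usage: 16 cached tokens, blocks of 4, retrieve 2 blocks per query.
rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)
keys = rng.standard_normal((16, d))
values = rng.standard_normal((16, d))
print(landmark_block_retrieval(q, keys, values).shape)   # (8,)
```

Because block selection is itself an attention step, the same mechanism that weighs individual tokens also decides which blocks to fetch, which is what lets the retrieved cache live in slower memory or on disk without a separate retriever.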

Numerical Results

The experiments show that the landmark token approach handles longer contexts while attending to far fewer tokens per step. Models trained with landmark tokens reach perplexity comparable to Transformer-XL, yet the number of keys and values each query must access shrinks roughly in proportion to the block size, yielding a corresponding reduction in compute and memory traffic.
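
To see why the savings track the block size, the back-of-the-envelope count below compares the keys and values a single query must touch under full attention versus landmark-based retrieval; the context length, block size, and retrieval budget are illustrative assumptions, not the paper's experimental settings.

```python
# Per-query key/value accesses: full attention vs. landmark retrieval.
# All numbers are illustrative assumptions.
context_len = 32_768   # cached tokens
block_size = 64        # tokens per block (one landmark each)
retrieved = 2          # blocks retrieved per query

full_attention = context_len                                   # every cached token
landmark = context_len // block_size + retrieved * block_size  # landmarks + retrieved tokens

print(full_attention, landmark, full_attention / landmark)     # 32768 640 51.2
```

With these assumptions the per-query access count drops by a factor close to the block size, consistent with the reported reduction in attention operations.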

Theoretical Implications

This research shows how structural modifications to transformer networks can let them exceed traditional context-length restrictions. By embedding memory retrieval directly in the attention mechanism rather than attaching a separate retriever, the work suggests a shift in how transformers manage context and is likely to influence future research on scalable model architectures.

Practical Implications

In applied settings, particularly those requiring the handling of lengthy sequences, such as legal document analysis or genomic data interpretation, this method facilitates significant efficiency improvements. The reduction in memory and processing power required for inference presents immediate benefits for both academic and commercial applications, potentially lowering the resource barrier for deploying large-scale models in practice.

Future Research Directions

The landmark token approach invites further exploration of hierarchical attention mechanisms and alternative data structures for additional efficiency gains. The proposed positional augmentation strategy, which helps positional encodings extrapolate to longer sequences, also offers fertile ground for improving model generalization to unseen context lengths.

Conclusion

This paper presents a significant advancement in transformer scalability through its innovative use of landmark tokens for context retrieval. By focusing on efficient memory use and reusing the attention mechanism itself for retrieval, the work provides a robust framework for addressing the limitations of conventional transformers on extended contexts. The promising results from fine-tuning LLMs such as LLaMA highlight the method's practical applicability and set a benchmark for future research on transformer efficiency and scalability.

Authors (2)
  1. Amirkeivan Mohtashami (12 papers)
  2. Martin Jaggi (155 papers)
Citations (121)