
Unlimiformer: Long-Range Transformers with Unlimited Length Input (2305.01625v3)

Published 2 May 2023 in cs.CL

Abstract: Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing pretrained encoder-decoder transformer, and offloads the cross-attention computation to a single k-nearest-neighbor (kNN) index, while the returned kNN distances are the attention dot-product scores. This kNN index can be kept on either the GPU or CPU memory and queried in sub-linear time; this way, we can index practically unlimited input sequences, while every attention head in every decoder layer retrieves its top-k keys, instead of attending to every key. We evaluate Unlimiformer on several long-document and book-summarization benchmarks, showing that it can process even 500k token-long inputs from the BookSum dataset, without any input truncation at test time. We demonstrate that Unlimiformer improves pretrained models such as BART and Longformer by extending them to unlimited inputs without additional learned weights and without modifying their code. We make our code and models publicly available at https://github.com/abertsch72/unlimiformer .

Unlimiformer: Long-Range Transformers with Unlimited Length Input

The paper "Unlimiformer: Long-Range Transformers with Unlimited Length Input" introduces a novel approach for extending the length of input sequences that transformers can process, without the commonly associated computational cost increase. Concretely, the authors introduce Unlimiformer, a method that leverages k-nearest-neighbor (kNN) retrieval to offload cross-attention computation. This approach theoretically and empirically removes the limitation on input length for transformers while maintaining their performance.
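
The algebraic observation that lets a single index serve every attention head and decoder layer is a rearrangement of the cross-attention dot product, stated here in lightly simplified notation (bias terms omitted):

```latex
% Cross-attention rewriting (simplified): h_d is a decoder hidden state,
% h_e an encoder hidden state, W_q and W_k a head's query/key projections.
(h_d W_q)(h_e W_k)^{\top} = \left(h_d W_q W_k^{\top}\right) h_e^{\top}
```

Because each head can fold its projections into the query, only the raw encoder hidden states need to be stored, so one shared index suffices for all heads and all decoder layers.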

Key Contributions

The primary contribution of this paper is the Unlimiformer technique, which facilitates unlimited input length by incorporating a kNN index. Rather than modifying the underlying architecture of existing transformers or requiring additional learned parameters, the Unlimiformer adapts existing encoder-decoder transformers. It does this by:

  1. Retrieval-based Cross-Attention: Cross-attention in traditional transformers necessitates attending to all tokens, which results in a quadratic complexity with respect to the input length. Unlimiformer addresses this by using a kNN index to retrieve the top-k token embeddings relevant to the current decoding step. This reduces the complexity and allows attention over theoretically unlimited input sequences; the kNN distances serve as the attention dot-product scores.
  2. Encoding and Indexing: The approach begins by encoding overlapping chunks of the input sequence using the encoder, followed by constructing a kNN index over these encoded tokens. The attention heads in the decoder then utilize this index to perform sub-linear time queries, dynamically attending to different parts of the input as needed. A minimal sketch of these two steps appears after this list.
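
The following is a minimal, illustrative sketch of these two steps, assuming a Hugging Face BART encoder and a FAISS inner-product index. It is not the authors' released implementation (see the repository linked in the abstract); names such as `encode_long_input` and the placeholder `long_document` are ours.

```python
import faiss                      # pip install faiss-cpu
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base").eval()

def encode_long_input(text, chunk_size=1024, stride=512):
    """Encode overlapping chunks and return the concatenated encoder states."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    states = []
    for start in range(0, len(ids), stride):
        chunk = ids[start:start + chunk_size].unsqueeze(0)
        with torch.no_grad():
            out = model.get_encoder()(input_ids=chunk).last_hidden_state[0]
        # Unlimiformer keeps only the middle of each chunk to limit boundary
        # effects; for brevity we keep every position here.
        states.append(out)
        if start + chunk_size >= len(ids):
            break
    return torch.cat(states, dim=0)            # (num_input_tokens, hidden_dim)

# Stand-in for a very long document (far beyond BART's 1024-token limit).
long_document = " ".join(["An example sentence about long documents."] * 2000)

hidden_states = encode_long_input(long_document)
index = faiss.IndexFlatIP(hidden_states.shape[1])  # inner product = dot-product scores
index.add(hidden_states.numpy())

# At each decoding step, every cross-attention head forms a query and retrieves
# only its top-k keys instead of attending to all encoder tokens. In the paper
# the query is the decoder state projected by W_q @ W_k.T, so the raw encoder
# states indexed above can serve every head and layer.
query = torch.randn(1, hidden_states.shape[1])     # stand-in for h_d W_q W_k^T
scores, token_positions = index.search(query.numpy(), 16)
# `scores` play the role of attention dot products; attention is then a
# softmax over just the 16 retrieved keys.
```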

The paper substantiates the Unlimiformer's efficacy through comprehensive evaluation on long-document and book summarization tasks. Notably, the method is demonstrated to scale to input lengths of up to 500k tokens without input truncation during inference.

Numerical Results

Evaluations on various long-document summarization datasets, including the BookSum dataset, show that Unlimiformer significantly enhances the performance of base models such as BART and Longformer. Key numerical results include:

  • GovReport: Using Unlimiformer, the BART model achieves a ROUGE-1 score of 56.6 and a BERTScore of 68.2, outperforming both the base BART model and the Longformer-Encoder-Decoder (PRIMERA).
  • SummScreen: Unlimiformer improves the ROUGE-1 score of BART from 29.7 to 34.7.
  • BookSum: The method doubles the entity recall compared to the baseline, further emphasizing its capacity to retain and utilize extensive input contexts effectively.

Practical and Theoretical Implications

The practical implications of this research are multifaceted:

  • Compatibility with Pretrained Models: Unlimiformer can be wrapped around any pre-existing pretrained encoder-decoder model without adding learned weights or modifying its code. It can be applied purely at test time with no further training, and fine-tuning with Unlimiformer yields further gains at modest computational overhead.
  • Scalability: The ability to handle arbitrarily long input sequences opens up new possibilities for tasks requiring processing large documents, such as legal document review and comprehensive literature summarization.
  • Computational Efficiency: Because the kNN index is queried in sub-linear time and can be kept in either GPU or CPU memory, Unlimiformer remains efficient in both memory use and processing speed despite the extended input capability (a sketch of such a sub-linear index query follows this list).
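
As a rough illustration of that efficiency point, the sketch below builds an approximate FAISS index whose query cost grows sub-linearly with the number of stored encoder states. The choice of an IVF index and its parameters are our assumptions, not something the paper prescribes.

```python
import faiss
import numpy as np

d = 1024                                              # encoder hidden size
keys = np.random.rand(100_000, d).astype("float32")   # stand-in encoder states

# IVF index: each query scans only `nprobe` of the `nlist` clusters,
# so search cost grows sub-linearly with the number of stored keys.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 256, faiss.METRIC_INNER_PRODUCT)
index.train(keys)                                     # learn coarse clusters
index.add(keys)
index.nprobe = 8                                      # clusters scanned per query

query = np.random.rand(1, d).astype("float32")        # one head's query vector
scores, positions = index.search(query, 16)           # top-16 attention keys
# The index lives in CPU RAM here; with faiss-gpu installed,
# faiss.index_cpu_to_all_gpus(index) moves it to GPU memory.
```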

Theoretically, the use of nearest-neighbor retrieval in recalibrating cross-attention for large inputs provides a novel perspective on managing transformer scalability. Moreover, by showing that 99% of the attention mass can be preserved through top-k retrieval, the paper affirms that kNN indices are robust mechanisms for attention in expansive contexts.
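
One way to sanity-check that kind of coverage claim on any model is to measure how much softmax mass the top-k scores capture. The snippet below does this for a synthetic, deliberately peaked score vector; the 99% figure itself is the paper's empirical measurement on real attention distributions, not something this toy example establishes.

```python
import numpy as np

def topk_attention_mass(scores: np.ndarray, k: int) -> float:
    """Fraction of softmax attention mass carried by the k largest scores."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return float(np.sort(probs)[-k:].sum())

# Synthetic, deliberately peaked dot-product scores over a 100k-token input.
rng = np.random.default_rng(0)
scores = 10.0 * rng.standard_normal(100_000)
print(topk_attention_mass(scores, k=64))   # close to 1.0 for peaked scores
```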

Future Developments in AI

The use of retrieval mechanisms like kNN for attention in Unlimiformer highlights potential future directions in AI research:

  • Cross-modal Retrieval: The principles could be extended to attention in multimodal transformers, enabling efficient processing of extensive datasets that include text and images.
  • Adaptive Retrieval: Future models may explore adaptive retrieval strategies that dynamically adjust the value of k based on input characteristics, potentially optimizing performance further.
  • Memory-Augmented Models: Building on the efficiency of Unlimiformer, research could delve into hybrid models integrating both dense and sparse memory retrieval to encapsulate rich contextual information efficiently.

In conclusion, the Unlimiformer represents a significant advancement in the field of transformer models by enabling the use of practically unlimited length input sequences. This capability, combined with its compatibility with pretrained models and efficient computational requirements, marks a meaningful progression in addressing the limitations of transformer architectures for long-range contextual processing.

Authors (4)
  1. Amanda Bertsch (14 papers)
  2. Uri Alon (40 papers)
  3. Graham Neubig (342 papers)
  4. Matthew R. Gormley (22 papers)
Citations (105)