Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing (2401.04881v1)

Published 10 Jan 2024 in cs.CL

Abstract: As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend. However, this approach requires a large memory and/or takes into the consideration the specific LM architecture. Moreover, due to the causal nature between the key-values in prior context and the queries at present, this approach cannot be extended to bidirectional attention such as in an encoder-decoder or PrefixLM decoder-only architecture. In this paper, we propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures, and we also propose the Attendre layer, a wait-to-attend mechanism by retrieving the key-value memory (K/V memory) with evicted queries in the query memory (Q memory). As a first step, we evaluate this method in the context length extension setup using the TriviaQA reading comprehension task, and show the effectiveness of the approach.

References (49)
  1. Arij Al Adel. 2022. Global memory transformer for processing long documents. In International Conference on Neuroinformatics, pages 343–352. Springer.
  2. Dynamic context pruning for efficient and interpretable autoregressive transformers. arXiv preprint arXiv:2305.15805.
  3. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  4. With a little help from your own past: Prototypical memory networks for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3021–3031.
  5. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
  6. Optimizing retrieval-augmented reader models via token elimination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1506–1524.
  7. Unlimiformer: Long-range transformers with unlimited length input. In Conference on Neural Information Processing Systems (NeurIPS), New Orleans, USA.
  8. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426.
  9. Scaling Transformer to 1M tokens and beyond with RMT. arXiv preprint arXiv:2304.11062.
  10. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  11. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  13. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of ACL 2019: the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
  14. Pre-computed memory or on-the-fly encoding? a hybrid approach to retrieval augmentation makes the most of your compute. In International Conference on Machine Learning, pages 7329–7342. PMLR.
  15. LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
  16. A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502.
  17. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945.
  18. Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
  19. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations.
  20. LongT5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States. Association for Computational Linguistics.
  21. LM-Infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137.
  22. Xinting Huang and Nora Hollenstein. 2023. Long-range language modeling with selective cache. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4838–4858, Singapore. Association for Computational Linguistics.
  23. Advancing transformer architecture in long-context large language models: A comprehensive survey. arXiv preprint arXiv:2311.12351.
  24. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
  25. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466.
  26. MART: Memory-augmented recurrent transformer for coherent video paragraph captioning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2603–2614, Online. Association for Computational Linguistics.
  27. Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system. arXiv preprint arXiv:2304.13343.
  28. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  29. Focus your attention (with adaptive IIR filters). In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12538–12549, Singapore. Association for Computational Linguistics.
  30. Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations.
  31. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
  32. Alexander Peysakhovich and Adam Lerer. 2023. Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427.
  33. Investigating efficiently extending transformers for long input summarization. arXiv preprint arXiv:2208.04347.
  34. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.
  35. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  36. Token turing machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19070–19081.
  37. Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
  38. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063.
  39. Not all memories are created equal: Learning to forget by expiring. In International Conference on Machine Learning, pages 9902–9912. PMLR.
  40. UL2: Unifying language learning paradigms. In Proceedings of ICLR 2023: The Eleventh International Conference on Learning Representations.
  41. Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems.
  42. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  43. Augmenting language models with long-term memory. arXiv preprint arXiv:2306.07174.
  44. Memformer: A memory-augmented transformer for sequence modeling. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 308–318, Online only. Association for Computational Linguistics.
  45. Memorizing transformers. In Proceedings of ICLR 2022: The Tenth International Conference on Learning Representations.
  46. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
  47. Simple local attentions remain competitive for long-context tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1975–1986, Seattle, United States. Association for Computational Linguistics.
  48. TRAMS: Training-free memory selection for long-range language modeling. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4966–4972.
  49. Big Bird: Transformers for longer sequences. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 17283–17297.

Summary

  • The paper introduces the Attendre layer, a wait-to-attend mechanism based on evicted queries, together with tailored cache eviction policies that reduce memory usage while preserving long-context processing.
  • The proposed LRA and LFA eviction policies prioritize key-value pairs by the attention they actually receive, achieving competitive TriviaQA performance with significantly smaller memory sizes.
  • Evaluation on memory sizes as low as 128 positions demonstrates that the proposed approach can surpass baseline models, indicating promising efficiency gains for Transformer architectures.

Introduction

Transformer-based LLMs are increasingly used for understanding and generating complex text. However, long inputs pose a challenge because the computational cost of the Transformer's attention mechanism grows quadratically with sequence length. Recent research therefore segments input sequences into chunks and processes them incrementally, handling longer sequences without a proportional increase in compute. An influential method in this direction is the Memorizing Transformer, which stores past attention keys and values in a memory so that current queries can attend to them. The downside, though, is that this usually requires substantial memory, particularly to match the performance of a Transformer that reads the entire input at once.
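
To make the key/value memory idea concrete, here is a minimal sketch (in PyTorch) of chunked attention over a bounded FIFO K/V memory. The class and function names (`FIFOKVMemory`, `attend_with_memory`) and the single-head, unbatched shapes are illustrative assumptions, not the Memorizing Transformer's actual implementation; a real decoder would also apply a causal mask within each chunk and use separate query/key/value projections.

```python
import torch
import torch.nn.functional as F


class FIFOKVMemory:
    """FIFO memory holding attention keys/values from past chunks (sketch)."""

    def __init__(self, capacity: int, d_model: int):
        self.capacity = capacity
        self.keys = torch.empty(0, d_model)
        self.values = torch.empty(0, d_model)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Append the current chunk and drop the oldest entries once the
        # capacity is exceeded (first in, first out).
        self.keys = torch.cat([self.keys, k], dim=0)[-self.capacity:]
        self.values = torch.cat([self.values, v], dim=0)[-self.capacity:]


def attend_with_memory(q, k, v, memory: FIFOKVMemory):
    # Queries of the current chunk attend over the memory (past context)
    # concatenated with the current chunk's own keys/values.
    k_all = torch.cat([memory.keys, k], dim=0)
    v_all = torch.cat([memory.values, v], dim=0)
    scores = q @ k_all.T / k_all.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v_all
    memory.append(k, v)  # make this chunk visible to future chunks
    return out


# Example: process a long sequence in chunks of 64 positions.
mem = FIFOKVMemory(capacity=256, d_model=32)
for chunk in torch.randn(8, 64, 32):  # 8 chunks of 64 tokens, d_model = 32
    # In a real layer q, k, v would be separate linear projections of the chunk.
    out = attend_with_memory(chunk, chunk, chunk, mem)
```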

Proposed Solution

Addressing the memory bottleneck, the paper introduces methods that shrink the required memory while remaining adaptable to various model architectures. Specifically, it adapts classic cache eviction policies, LRU (Least Recently Used) and LFU (Least Frequently Used), to the context of Transformer memories, yielding LRA (Least Recently Attended) and LFA (Least Frequently Attended). These policies prioritize key-value pairs by the attention they actually receive, not merely by when or how often they were inserted.
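
Below is a hedged sketch of how an LRA rule could be implemented for a K/V memory: each entry records the last step at which it received a non-negligible attention weight, and when the memory overflows, the entries attended longest ago are evicted first. The class name, the attention threshold, and the bookkeeping details are assumptions for illustration, not the paper's implementation; an LFA variant would instead keep a running count of how often each entry was attended and evict the least frequently attended ones.

```python
import torch


class LRAKVMemory:
    """Key/value memory with least-recently-attended (LRA) eviction (sketch)."""

    def __init__(self, capacity: int, d_model: int, attn_threshold: float = 1e-3):
        self.capacity = capacity
        self.attn_threshold = attn_threshold
        self.keys = torch.empty(0, d_model)
        self.values = torch.empty(0, d_model)
        self.last_attended = torch.empty(0, dtype=torch.long)  # step of last use
        self.step = 0

    def update_usage(self, attn_weights: torch.Tensor) -> None:
        # attn_weights: (num_queries, num_memory_entries), the attention the
        # current queries paid to the existing memory entries. Call this after
        # each attention step and before append().
        attended = (attn_weights > self.attn_threshold).any(dim=0)
        self.last_attended[attended] = self.step
        self.step += 1

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # New entries count as "just attended"; then evict the entries whose
        # last attention is oldest until the capacity constraint holds.
        self.keys = torch.cat([self.keys, k], dim=0)
        self.values = torch.cat([self.values, v], dim=0)
        stamp = torch.full((k.shape[0],), self.step, dtype=torch.long)
        self.last_attended = torch.cat([self.last_attended, stamp], dim=0)
        if self.keys.shape[0] > self.capacity:
            keep = torch.topk(self.last_attended, self.capacity).indices
            keep = keep.sort().values  # preserve positional order of survivors
            self.keys = self.keys[keep]
            self.values = self.values[keep]
            self.last_attended = self.last_attended[keep]
```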

Additionally, the paper presents the Attendre layer, a novel component implementing a "wait-to-attend" mechanism: queries are held in a query (Q) memory and, once evicted, retrieve the key-value (K/V) memory, which by then also contains keys and values from later chunks. This gives queries access to future context and lets architectures that rely on bidirectional attention, such as encoder-decoder models or PrefixLM decoder-only models, process longer sequences efficiently.
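
The sketch below shows one way such a wait-to-attend step could be organized, assuming a simple FIFO Q memory and, for brevity, an unbounded K/V memory; `attendre_step`, the dictionary-based memories, and the shapes are hypothetical and do not reproduce the authors' Attendre layer. The point is only the ordering: a query produces its attention output after later chunks have contributed their keys and values.

```python
import torch
import torch.nn.functional as F


def attendre_step(q_chunk, k_chunk, v_chunk, q_mem, kv_mem, q_capacity):
    """One chunk step of a wait-to-attend layer (illustrative only)."""
    # 1) Add the current chunk's keys/values to the K/V memory. In the paper
    #    this memory would itself be bounded by an eviction policy (e.g. LRA).
    kv_mem["k"] = torch.cat([kv_mem["k"], k_chunk], dim=0)
    kv_mem["v"] = torch.cat([kv_mem["v"], v_chunk], dim=0)

    # 2) Park the current chunk's queries in the Q memory; they wait here.
    q_mem = torch.cat([q_mem, q_chunk], dim=0)

    # 3) Queries evicted from the (here: FIFO) Q memory finally attend. By the
    #    time they are evicted, the K/V memory also holds keys/values from
    #    chunks after their own position, so they see "future" context.
    evicted, q_mem = q_mem[:-q_capacity], q_mem[-q_capacity:]
    outputs = None
    if evicted.shape[0] > 0:
        d = kv_mem["k"].shape[-1]
        scores = evicted @ kv_mem["k"].T / d ** 0.5
        outputs = F.softmax(scores, dim=-1) @ kv_mem["v"]
    return outputs, q_mem


# Example: the first chunk produces no output until its queries are evicted,
# which happens only after the second chunk's keys/values are in memory.
d_model, q_capacity = 32, 64
q_mem = torch.empty(0, d_model)
kv_mem = {"k": torch.empty(0, d_model), "v": torch.empty(0, d_model)}
for chunk in torch.randn(4, 64, d_model):
    outputs, q_mem = attendre_step(chunk, chunk, chunk, q_mem, kv_mem, q_capacity)
```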

Implementation and Evaluation

The authors evaluated their methods in a context length extension setup on the TriviaQA reading comprehension task, testing memory sizes as small as 128 positions. The eviction policies yield results comparable to a baseline that uses a much larger memory (e.g., 2,048 positions) while requiring far less memory, and the model augmented with the Attendre layer can even surpass the original model that processes the entire long sequence at once.

Conclusion and Future Directions

The implications of this research extend beyond computational savings. By equipping LLMs with more capable memory mechanisms, the Attendre layer can enable more natural and effective bidirectional attention over extended contexts. Next steps include evaluating a broader spectrum of tasks, making the memory more task-adaptable, and addressing gradient propagation through the Q memory. Future work could also explore compressing the encoder output memory in encoder-decoder architectures for even greater efficiency.
