Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing (2401.04881v1)

Published 10 Jan 2024 in cs.CL

Abstract: As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks to allow subsequent queries to attend. However, this approach requires a large memory and/or takes into the consideration the specific LM architecture. Moreover, due to the causal nature between the key-values in prior context and the queries at present, this approach cannot be extended to bidirectional attention such as in an encoder-decoder or PrefixLM decoder-only architecture. In this paper, we propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures, and we also propose the Attendre layer, a wait-to-attend mechanism by retrieving the key-value memory (K/V memory) with evicted queries in the query memory (Q memory). As a first step, we evaluate this method in the context length extension setup using the TriviaQA reading comprehension task, and show the effectiveness of the approach.

References (49)
  1. Arij Al Adel. 2022. Global memory transformer for processing long documents. In International Conference on Neuroinformatics, pages 343–352. Springer.
  2. Dynamic context pruning for efficient and interpretable autoregressive transformers. arXiv preprint arXiv:2305.15805.
  3. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  4. With a little help from your own past: Prototypical memory networks for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3021–3031.
  5. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
  6. Optimizing retrieval-augmented reader models via token elimination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1506–1524.
  7. Unlimiformer: Long-range transformers with unlimited length input. In Conference on Neural Information Processing Systems (NeurIPS), New Orleans, USA.
  8. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426.
  9. Scaling Transformer to 1M tokens and beyond with RMT. arXiv preprint arXiv:2304.11062.
  10. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  11. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  13. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of ACL 2019: the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
  14. Pre-computed memory or on-the-fly encoding? a hybrid approach to retrieval augmentation makes the most of your compute. In International Conference on Machine Learning, pages 7329–7342. PMLR.
  15. LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
  16. A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502.
  17. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945.
  18. Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
  19. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations.
  20. LongT5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States. Association for Computational Linguistics.
  21. LM-Infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137.
  22. Xinting Huang and Nora Hollenstein. 2023. Long-range language modeling with selective cache. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4838–4858, Singapore. Association for Computational Linguistics.
  23. Advancing transformer architecture in long-context large language models: A comprehensive survey. arXiv preprint arXiv:2311.12351.
  24. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
  25. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466.
  26. MART: Memory-augmented recurrent transformer for coherent video paragraph captioning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2603–2614, Online. Association for Computational Linguistics.
  27. Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system. arXiv preprint arXiv:2304.13343.
  28. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  29. Focus your attention (with adaptive IIR filters). In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12538–12549, Singapore. Association for Computational Linguistics.
  30. Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations.
  31. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
  32. Alexander Peysakhovich and Adam Lerer. 2023. Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427.
  33. Investigating efficiently extending transformers for long input summarization. arXiv preprint arXiv:2208.04347.
  34. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.
  35. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  36. Token turing machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19070–19081.
  37. Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
  38. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063.
  39. Not all memories are created equal: Learning to forget by expiring. In International Conference on Machine Learning, pages 9902–9912. PMLR.
  40. UL2: Unifying language learning paradigms. In Proceedings of ICLR 2023: The Eleventh International Conference on Learning Representations.
  41. Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems.
  42. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  43. Augmenting language models with long-term memory. arXiv preprint arXiv:2306.07174.
  44. Memformer: A memory-augmented transformer for sequence modeling. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 308–318, Online only. Association for Computational Linguistics.
  45. Memorizing transformers. In Proceedings of ICLR 2022: The Tenth International Conference on Learning Representations.
  46. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
  47. Simple local attentions remain competitive for long-context tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1975–1986, Seattle, United States. Association for Computational Linguistics.
  48. TRAMS: Training-free memory selection for long-range language modeling. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4966–4972.
  49. Big Bird: Transformers for longer sequences. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 17283–17297.

Summary

  • The paper introduces the Attendre layer, a wait-to-attend mechanism based on evicted queries, together with tailored cache eviction policies that reduce memory usage while preserving long-context processing.
  • The proposed LRA and LFA eviction policies prioritize key-value pairs by the attention they actually receive, achieving competitive TriviaQA performance with significantly smaller memory sizes.
  • Evaluation on memory sizes as low as 128 positions demonstrates that the proposed approach can surpass baseline models, indicating promising efficiency gains for Transformer architectures.

Introduction

Transformer-based LLMs are increasingly used for understanding and generating complex text. However, long inputs pose a challenge because the computational cost of the Transformer's attention mechanism grows quadratically with sequence length. Recent research therefore segments input sequences into chunks and processes them incrementally, handling longer sequences without a proportional increase in compute. An influential method in this direction is the Memorizing Transformer, which stores past attention keys and values in a memory so that current queries can attend to them. The downside, though, is that this usually requires substantial memory, particularly to match the performance of a Transformer that reads the entire input at once.
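
To make the key/value memory idea concrete, here is a minimal sketch (in PyTorch) of chunked attention over a bounded FIFO K/V memory. The class and function names (`FIFOKVMemory`, `attend_with_memory`) and the single-head, unbatched shapes are illustrative assumptions, not the Memorizing Transformer's actual implementation; a real decoder would also apply a causal mask within each chunk and use separate query/key/value projections.

```python
import torch
import torch.nn.functional as F


class FIFOKVMemory:
    """FIFO memory holding attention keys/values from past chunks (sketch)."""

    def __init__(self, capacity: int, d_model: int):
        self.capacity = capacity
        self.keys = torch.empty(0, d_model)
        self.values = torch.empty(0, d_model)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Append the current chunk and drop the oldest entries once the
        # capacity is exceeded (first in, first out).
        self.keys = torch.cat([self.keys, k], dim=0)[-self.capacity:]
        self.values = torch.cat([self.values, v], dim=0)[-self.capacity:]


def attend_with_memory(q, k, v, memory: FIFOKVMemory):
    # Queries of the current chunk attend over the memory (past context)
    # concatenated with the current chunk's own keys/values.
    k_all = torch.cat([memory.keys, k], dim=0)
    v_all = torch.cat([memory.values, v], dim=0)
    scores = q @ k_all.T / k_all.shape[-1] ** 0.5
    out = F.softmax(scores, dim=-1) @ v_all
    memory.append(k, v)  # make this chunk visible to future chunks
    return out


# Example: process a long sequence in chunks of 64 positions.
mem = FIFOKVMemory(capacity=256, d_model=32)
for chunk in torch.randn(8, 64, 32):  # 8 chunks of 64 tokens, d_model = 32
    # In a real layer q, k, v would be separate linear projections of the chunk.
    out = attend_with_memory(chunk, chunk, chunk, mem)
```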

Proposed Solution

Addressing the memory bottleneck, the paper introduces methods that shrink the required memory while remaining adaptable to various model architectures. Specifically, it adapts classic cache eviction policies, LRU (Least Recently Used) and LFU (Least Frequently Used), to the context of Transformer memories, yielding LRA (Least Recently Attended) and LFA (Least Frequently Attended). These policies prioritize key-value pairs by the attention they actually receive, not merely by when or how often they were inserted.
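
Below is a hedged sketch of how an LRA rule could be implemented for a K/V memory: each entry records the last step at which it received a non-negligible attention weight, and when the memory overflows, the entries attended longest ago are evicted first. The class name, the attention threshold, and the bookkeeping details are assumptions for illustration, not the paper's implementation; an LFA variant would instead keep a running count of how often each entry was attended and evict the least frequently attended ones.

```python
import torch


class LRAKVMemory:
    """Key/value memory with least-recently-attended (LRA) eviction (sketch)."""

    def __init__(self, capacity: int, d_model: int, attn_threshold: float = 1e-3):
        self.capacity = capacity
        self.attn_threshold = attn_threshold
        self.keys = torch.empty(0, d_model)
        self.values = torch.empty(0, d_model)
        self.last_attended = torch.empty(0, dtype=torch.long)  # step of last use
        self.step = 0

    def update_usage(self, attn_weights: torch.Tensor) -> None:
        # attn_weights: (num_queries, num_memory_entries), the attention the
        # current queries paid to the existing memory entries. Call this after
        # each attention step and before append().
        attended = (attn_weights > self.attn_threshold).any(dim=0)
        self.last_attended[attended] = self.step
        self.step += 1

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # New entries count as "just attended"; then evict the entries whose
        # last attention is oldest until the capacity constraint holds.
        self.keys = torch.cat([self.keys, k], dim=0)
        self.values = torch.cat([self.values, v], dim=0)
        stamp = torch.full((k.shape[0],), self.step, dtype=torch.long)
        self.last_attended = torch.cat([self.last_attended, stamp], dim=0)
        if self.keys.shape[0] > self.capacity:
            keep = torch.topk(self.last_attended, self.capacity).indices
            keep = keep.sort().values  # preserve positional order of survivors
            self.keys = self.keys[keep]
            self.values = self.values[keep]
            self.last_attended = self.last_attended[keep]
```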

Additionally, the paper presents the Attendre layer, a novel component implementing a "wait-to-attend" mechanism: queries are held in a query (Q) memory and, once evicted, retrieve the key-value (K/V) memory, which by then also contains keys and values from later chunks. This gives queries access to future context and lets architectures that rely on bidirectional attention, such as encoder-decoder models or PrefixLM decoder-only models, process longer sequences efficiently.
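
The sketch below shows one way such a wait-to-attend step could be organized, assuming a simple FIFO Q memory and, for brevity, an unbounded K/V memory; `attendre_step`, the dictionary-based memories, and the shapes are hypothetical and do not reproduce the authors' Attendre layer. The point is only the ordering: a query produces its attention output after later chunks have contributed their keys and values.

```python
import torch
import torch.nn.functional as F


def attendre_step(q_chunk, k_chunk, v_chunk, q_mem, kv_mem, q_capacity):
    """One chunk step of a wait-to-attend layer (illustrative only)."""
    # 1) Add the current chunk's keys/values to the K/V memory. In the paper
    #    this memory would itself be bounded by an eviction policy (e.g. LRA).
    kv_mem["k"] = torch.cat([kv_mem["k"], k_chunk], dim=0)
    kv_mem["v"] = torch.cat([kv_mem["v"], v_chunk], dim=0)

    # 2) Park the current chunk's queries in the Q memory; they wait here.
    q_mem = torch.cat([q_mem, q_chunk], dim=0)

    # 3) Queries evicted from the (here: FIFO) Q memory finally attend. By the
    #    time they are evicted, the K/V memory also holds keys/values from
    #    chunks after their own position, so they see "future" context.
    evicted, q_mem = q_mem[:-q_capacity], q_mem[-q_capacity:]
    outputs = None
    if evicted.shape[0] > 0:
        d = kv_mem["k"].shape[-1]
        scores = evicted @ kv_mem["k"].T / d ** 0.5
        outputs = F.softmax(scores, dim=-1) @ kv_mem["v"]
    return outputs, q_mem


# Example: the first chunk produces no output until its queries are evicted,
# which happens only after the second chunk's keys/values are in memory.
d_model, q_capacity = 32, 64
q_mem = torch.empty(0, d_model)
kv_mem = {"k": torch.empty(0, d_model), "v": torch.empty(0, d_model)}
for chunk in torch.randn(4, 64, d_model):
    outputs, q_mem = attendre_step(chunk, chunk, chunk, q_mem, kv_mem, q_capacity)
```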

Implementation and Evaluation

The authors evaluated their methods in a context length extension setup on the TriviaQA reading comprehension task, testing memory sizes as small as 128 positions. The eviction policies yield results comparable to a baseline that uses a much larger memory (e.g., 2,048 positions) while requiring far less memory, and the model augmented with the Attendre layer can even surpass the original model that processes the entire long sequence at once.

Conclusion and Future Directions

The implications of this research extend beyond computational savings. By equipping LLMs with more capable memory mechanisms, the Attendre layer can enable more natural and effective bidirectional attention over extended contexts. Next steps include evaluating a broader spectrum of tasks, making the memory more task-adaptable, and addressing gradient propagation through the Q memory. Future work could also explore compressing the encoder output memory in encoder-decoder architectures for even greater efficiency.
