Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing (2401.04881v1)
Abstract: As LLMs have become capable of processing more complex types of inputs, researchers have recently studied how to efficiently and affordably process possibly arbitrarily long sequences. One effective approach is to use a FIFO memory to store keys and values of an attention sublayer from past chunks so that subsequent queries can attend to them. However, this approach requires a large memory and/or must be tailored to the specific LM architecture. Moreover, because keys and values from prior context strictly precede the queries of the current chunk, this approach cannot be extended to bidirectional attention, such as in an encoder-decoder or PrefixLM decoder-only architecture. In this paper, we propose to use eviction policies, such as LRA and LFA, to reduce the memory size and adapt to various architectures, and we also propose the Attendre layer, a wait-to-attend mechanism in which queries evicted from a query memory (Q memory) retrieve from the key-value memory (K/V memory). As a first step, we evaluate this method in the context length extension setup using the TriviaQA reading comprehension task and show the effectiveness of the approach.
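To make the mechanism described in the abstract concrete, below is a minimal sketch of the wait-to-attend idea: keys and values of each chunk go into a bounded K/V memory governed by an eviction policy, while queries are buffered in a FIFO Q memory and only attend, via retrieval over the K/V memory, once they are themselves evicted, by which point keys and values from later chunks are also present. This is an illustrative reconstruction, not the authors' implementation: the names (`KVMemory`, `QMemory`, `attendre_step`), the reading of LRA as "least recently attended" (by analogy with LRU), and the top-k retrieval step are all assumptions.

```python
# Minimal sketch of the wait-to-attend idea, reconstructed from the abstract.
# All names and the top-k retrieval detail are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVMemory:
    """Bounded key/value memory with a least-recently-attended (LRA) eviction policy."""
    def __init__(self, capacity, d):
        self.capacity, self.d = capacity, d
        self.keys = np.zeros((0, d))
        self.values = np.zeros((0, d))
        self.last_attended = np.zeros(0)  # step at which each entry was last attended
        self.step = 0

    def add(self, k, v):
        self.keys = np.concatenate([self.keys, k])
        self.values = np.concatenate([self.values, v])
        self.last_attended = np.concatenate(
            [self.last_attended, np.full(len(k), self.step)])
        if len(self.keys) > self.capacity:
            # Evict the entries attended to least recently (LRA).
            keep = np.argsort(self.last_attended)[len(self.keys) - self.capacity:]
            keep.sort()
            self.keys, self.values = self.keys[keep], self.values[keep]
            self.last_attended = self.last_attended[keep]

    def attend(self, q, top_k=32):
        scores = q @ self.keys.T / np.sqrt(self.d)      # (n_q, n_mem)
        idx = np.argsort(scores, axis=-1)[:, -top_k:]   # retrieve top-k entries per query
        self.last_attended[np.unique(idx)] = self.step  # recency update used by LRA
        self.step += 1
        w = softmax(np.take_along_axis(scores, idx, axis=-1), axis=-1)
        return np.einsum('qk,qkd->qd', w, self.values[idx])

class QMemory:
    """FIFO query memory: queries wait here before attending (wait-to-attend)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = np.zeros((0, 0))

    def push_and_evict(self, q):
        self.buffer = q if self.buffer.size == 0 else np.concatenate([self.buffer, q])
        n_evict = max(0, len(self.buffer) - self.capacity)
        evicted, self.buffer = self.buffer[:n_evict], self.buffer[n_evict:]
        return evicted

def attendre_step(q_mem, kv_mem, q, k, v):
    """Process one chunk: store its K/V, enqueue its queries, and let the
    evicted (older) queries attend to the now-larger K/V memory."""
    kv_mem.add(k, v)
    evicted_q = q_mem.push_and_evict(q)
    return kv_mem.attend(evicted_q) if len(evicted_q) else None

# Example: stream 4 chunks of 8 tokens with model dim 16.
d, chunk = 16, 8
kv_mem, q_mem = KVMemory(capacity=64, d=d), QMemory(capacity=16)
for _ in range(4):
    x = np.random.randn(chunk, d)
    out = attendre_step(q_mem, kv_mem, q=x, k=x, v=x)  # None until queries are evicted
```

Because an evicted query attends only after several subsequent chunks have been stored, it effectively sees keys and values on both sides of its position, which is how the abstract motivates extending the memory approach to encoder-decoder and PrefixLM-style bidirectional attention.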
- Arij Al Adel. 2022. Global memory transformer for processing long documents. In International Conference on Neuroinformatics, pages 343–352. Springer.
- Dynamic context pruning for efficient and interpretable autoregressive transformers. arXiv preprint arXiv:2305.15805.
- PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
- With a little help from your own past: Prototypical memory networks for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3021–3031.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- Optimizing retrieval-augmented reader models via token elimination. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1506–1524.
- Unlimiformer: Long-range transformers with unlimited length input. In Conference on Neural Information Processing Systems (NeurIPS), New Orleans, USA.
- Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426.
- Scaling transformer to 1M tokens and beyond with RMT. arXiv preprint arXiv:2304.11062.
- Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
- Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
- Pre-computed memory or on-the-fly encoding? a hybrid approach to retrieval augmentation makes the most of your compute. In International Conference on Machine Learning, pages 7329–7342. PMLR.
- LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
- A survey on long text modeling with transformers. arXiv preprint arXiv:2302.14502.
- In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945.
- Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations.
- LongT5: Efficient text-to-text transformer for long sequences. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 724–736, Seattle, United States. Association for Computational Linguistics.
- LM-Infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137.
- Xinting Huang and Nora Hollenstein. 2023. Long-range language modeling with selective cache. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4838–4858, Singapore. Association for Computational Linguistics.
- Advancing transformer architecture in long-context large language models: A comprehensive survey. arXiv preprint arXiv:2311.12351.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
- The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466.
- MART: Memory-augmented recurrent transformer for coherent video paragraph captioning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2603–2614, Online. Association for Computational Linguistics.
- Unleashing infinite-length input capacity for large-scale language models with self-controlled memory system. arXiv preprint arXiv:2304.13343.
- Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
- Focus your attention (with adaptive IIR filters). In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12538–12549, Singapore. Association for Computational Linguistics.
- Mega: Moving average equipped gated attention. In The Eleventh International Conference on Learning Representations.
- YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
- Alexander Peysakhovich and Adam Lerer. 2023. Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427.
- Investigating efficiently extending transformers for long input summarization. arXiv preprint arXiv:2208.04347.
- Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- Token turing machines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19070–19081.
- Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
- RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063.
- Not all memories are created equal: Learning to forget by expiring. In International Conference on Machine Learning, pages 9902–9912. PMLR.
- UL2: Unifying language learning paradigms. In Proceedings of ICLR 2023: The Eleventh International Conference on Learning Representations.
- Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems.
- Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Augmenting language models with long-term memory. arXiv preprint arXiv:2306.07174.
- Memformer: A memory-augmented transformer for sequence modeling. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 308–318, Online only. Association for Computational Linguistics.
- Memorizing transformers. In Proceedings of ICLR 2022: The Tenth International Conference on Learning Representations.
- Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
- Simple local attentions remain competitive for long-context tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1975–1986, Seattle, United States. Association for Computational Linguistics.
- TRAMS: Training-free memory selection for long-range language modeling. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4966–4972, Singapore. Association for Computational Linguistics.
- Big Bird: Transformers for longer sequences. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 17283–17297.