Equipping Transformer with Random-Access Reading for Long-Context Understanding (2405.13216v1)
Abstract: Long-context modeling presents a significant challenge for transformer-based LLMs due to the quadratic complexity of the self-attention mechanism and issues with length extrapolation caused by pretraining exclusively on short inputs. Existing methods address computational complexity through techniques such as text chunking, kernel approaches, and structured attention, and tackle length extrapolation problems through positional encoding, continued pretraining, and data engineering. These approaches typically require $\textbf{sequential access}$ to the document, necessitating reading from the first to the last token. We contend that for goal-oriented reading of long documents, such sequential access is not necessary, and a proficiently trained model can learn to omit hundreds of less pertinent tokens. Inspired by human reading behaviors and existing empirical observations, we propose $\textbf{random access}$, a novel reading strategy that enables transformers to efficiently process long documents without examining every token. Experimental results from pretraining, fine-tuning, and inference phases validate the efficacy of our method.
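The random-access idea described in the abstract can be pictured as a goal-conditioned filter over chunks of a long document: the reader only pays attention cost for spans judged relevant to the current goal. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's actual implementation; the names `random_access_read`, `split_into_chunks`, `relevance_fn`, `chunk_size`, and `skip_threshold` are assumptions introduced here, and the toy lexical-overlap scorer merely stands in for whatever learned signal decides which spans the transformer actually reads.

```python
# Hypothetical sketch of "random-access" reading: rather than feeding every token
# to the model, a lightweight relevance signal decides, chunk by chunk, whether a
# span is worth encoding for the current goal (e.g., a question) or can be skipped.
# All names and thresholds below are illustrative assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    start: int          # index of the chunk's first token in the document
    tokens: List[str]   # the tokens in this chunk


def split_into_chunks(tokens: Sequence[str], chunk_size: int = 128) -> List[Chunk]:
    """Split a long token sequence into fixed-size chunks."""
    return [Chunk(i, list(tokens[i:i + chunk_size]))
            for i in range(0, len(tokens), chunk_size)]


def random_access_read(
    tokens: Sequence[str],
    query: str,
    relevance_fn: Callable[[str, Sequence[str]], float],
    chunk_size: int = 128,
    skip_threshold: float = 0.5,
) -> List[str]:
    """Return only the tokens the reader chooses to attend to.

    `relevance_fn(query, chunk_tokens)` stands in for a learned scorer that
    estimates how useful a chunk is for the goal; chunks scoring below
    `skip_threshold` are omitted, so the downstream transformer never pays
    attention cost for them.
    """
    kept: List[str] = []
    for chunk in split_into_chunks(tokens, chunk_size):
        if relevance_fn(query, chunk.tokens) >= skip_threshold:
            kept.extend(chunk.tokens)
    return kept


if __name__ == "__main__":
    # Toy relevance signal: lexical overlap with the query (a real system would
    # use a learned scoring head instead).
    def overlap_score(query: str, chunk_tokens: Sequence[str]) -> float:
        q = set(query.lower().split())
        c = set(t.lower() for t in chunk_tokens)
        return len(q & c) / max(len(q), 1)

    doc = ("the report discusses revenue growth in 2021 . " * 50 +
           "unrelated appendix material follows here . " * 50).split()
    kept = random_access_read(doc, "revenue growth 2021", overlap_score,
                              chunk_size=16, skip_threshold=0.3)
    print(f"kept {len(kept)} of {len(doc)} tokens")
```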
Authors: Chenghao Yang, Zi Yang, Nan Hua