Augmenting Language Models with Long-Term Memory (2306.07174v1)
Abstract: Existing LLMs can only accept fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, LLMs Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long histories. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LongMem can thus memorize long past context and use long-term memory for language modeling. The proposed memory retrieval module can handle unlimited-length context in its memory bank to benefit various downstream tasks. In particular, LongMem can enlarge its long-form memory to 65k tokens and thus cache many-shot extra demonstration examples as long-form memory for in-context learning. Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements in memory-augmented in-context learning over LLMs. The results demonstrate that the proposed method is effective in helping LLMs memorize and utilize long-form content. Our code is open-sourced at https://aka.ms/LongMem.
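To make the decoupled memory design concrete, below is a minimal NumPy sketch of the underlying idea: key-value pairs produced by the frozen backbone for past inputs are cached in a memory bank, and a later query retrieves the most relevant cached chunks for the side-network to read. The class and parameter names (`MemoryBank`, `chunk_size`, `top_k`) are illustrative assumptions, not the LongMem implementation, and the brute-force dot-product scoring stands in for the approximate nearest-neighbor search (e.g., Faiss) a real system would use.

```python
# Sketch of the decoupled-memory idea described in the abstract:
# a frozen backbone caches attention key/value pairs into a memory bank,
# and a retrieval step later fetches the most relevant cached chunks for
# the current query. Names (MemoryBank, chunk_size, top_k) are illustrative.
import numpy as np

class MemoryBank:
    def __init__(self, dim: int, chunk_size: int = 4, max_tokens: int = 65_536):
        self.dim = dim
        self.chunk_size = chunk_size   # tokens per retrievable chunk
        self.max_tokens = max_tokens   # e.g. the 65k-token memory mentioned in the abstract
        self.keys = np.empty((0, dim), dtype=np.float32)
        self.values = np.empty((0, dim), dtype=np.float32)

    def cache(self, keys: np.ndarray, values: np.ndarray) -> None:
        """Append key/value pairs produced by the frozen backbone for past inputs."""
        self.keys = np.concatenate([self.keys, keys])[-self.max_tokens:]
        self.values = np.concatenate([self.values, values])[-self.max_tokens:]

    def retrieve(self, query: np.ndarray, top_k: int = 2):
        """Return the key/value chunks whose mean key is most similar to the query."""
        n_chunks = len(self.keys) // self.chunk_size
        k = self.keys[: n_chunks * self.chunk_size].reshape(n_chunks, self.chunk_size, self.dim)
        v = self.values[: n_chunks * self.chunk_size].reshape(n_chunks, self.chunk_size, self.dim)
        scores = k.mean(axis=1) @ query          # similarity of query to each chunk
        best = np.argsort(scores)[-top_k:]       # indices of the top-k chunks
        return k[best].reshape(-1, self.dim), v[best].reshape(-1, self.dim)

# Usage: cache "past" activations, then retrieve memory for a new query.
rng = np.random.default_rng(0)
bank = MemoryBank(dim=16)
bank.cache(rng.standard_normal((64, 16), dtype=np.float32),
           rng.standard_normal((64, 16), dtype=np.float32))
mem_k, mem_v = bank.retrieve(rng.standard_normal(16, dtype=np.float32))
print(mem_k.shape, mem_v.shape)   # (8, 16) (8, 16) with chunk_size=4, top_k=2
```

The sketch stops at caching and retrieval; the reading step, in which the residual side-network attends over the retrieved key-value pairs alongside the current context, is left out.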
- DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer, 2007.
- Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426, 2021.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7:535–547, 2021.
- Adam: A method for stochastic optimization. In ICLR, 2015.
- N-gram nearest neighbor machine translation. arXiv preprint arXiv:2301.12866, 2023.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058, 2004.
- Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
- Improving language understanding with unsupervised learning. 2018.
- Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2020.
- Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
- Language models are unsupervised multitask learners. 2019.
- SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
- LST: Ladder side-tuning for parameter and memory efficient transfer learning. arXiv preprint arXiv:2206.06522, 2022.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
- ChapterBreak: A challenge dataset for long-range language models. arXiv preprint arXiv:2204.10878, 2022.
- Attention is all you need. In NIPS, 2017.
- Visually-augmented language modeling. arXiv preprint arXiv:2205.10178, 2022.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210, 2005.
- Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
- Task-oriented dialogue system as natural language generation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2698–2703, 2022.
- XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
- Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Side-tuning: A baseline for network adaptation via additive side networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 698–714. Springer, 2020.
Authors: Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei