Augmenting Language Models with Long-Term Memory (2306.07174v1)

Published 12 Jun 2023 in cs.CL

Abstract: Existing LLMs can only afford fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models Augmented with Long-Term Memory (LongMem), which enables LLMs to memorize long history. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LongMem can thus memorize long past context and use long-term memory for language modeling. The proposed memory retrieval module can handle unlimited-length context in its memory bank to benefit various downstream tasks. Typically, LongMem can enlarge the long-form memory to 65k tokens and thus cache many-shot extra demonstration examples as long-form memory for in-context learning. Experiments show that our method outperforms strong long-context models on ChapterBreak, a challenging long-context modeling benchmark, and achieves remarkable improvements on memory-augmented in-context learning over LLMs. The results demonstrate that the proposed method is effective in helping LLMs to memorize and utilize long-form content. Our code is open-sourced at https://aka.ms/LongMem.
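
To make the decoupled memory design described in the abstract concrete, the snippet below is a minimal, hypothetical sketch (not the authors' released code) of the core cache-retrieve-attend pattern: a frozen backbone would encode past context into cached key-value pairs, and the current segment retrieves the top-k most similar cached entries and attends over them. The MemoryBank class, its methods, and all sizes are illustrative assumptions.

```python
# Illustrative sketch of a decoupled long-term memory, assuming cached
# key/value pairs from a frozen backbone and simple dot-product retrieval.
import torch


class MemoryBank:
    """Hypothetical cache of key/value pairs from past contexts."""

    def __init__(self, dim: int, capacity: int = 65_536):
        self.capacity = capacity          # e.g. ~65k tokens, as in the paper
        self.keys = torch.empty(0, dim)
        self.values = torch.empty(0, dim)

    def write(self, keys: torch.Tensor, values: torch.Tensor) -> None:
        """Append new key/value pairs, evicting the oldest beyond capacity."""
        self.keys = torch.cat([self.keys, keys])[-self.capacity:]
        self.values = torch.cat([self.values, values])[-self.capacity:]

    def retrieve(self, queries: torch.Tensor, k: int = 64):
        """Return the top-k cached keys/values most similar to each query."""
        scores = queries @ self.keys.T                         # (Q, N) similarities
        topk = scores.topk(min(k, self.keys.size(0)), dim=-1)
        return self.keys[topk.indices], self.values[topk.indices]


if __name__ == "__main__":
    dim = 64
    bank = MemoryBank(dim)
    # Pretend these came from the frozen backbone encoding earlier segments.
    bank.write(torch.randn(1024, dim), torch.randn(1024, dim))
    # Queries from the current segment attend over the retrieved memories.
    q = torch.randn(8, dim)
    mem_k, mem_v = bank.retrieve(q, k=16)
    attn = torch.softmax(q.unsqueeze(1) @ mem_k.transpose(1, 2) / dim ** 0.5, dim=-1)
    fused = attn @ mem_v                                       # (8, 1, dim) memory readout
    print(fused.squeeze(1).shape)
```

In the actual LongMem framework, per the abstract, the retrieved memories are read by a trainable residual side-network rather than this single attention step; the sketch only captures the idea that the memory bank can be cached and updated independently of the frozen backbone.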

Authors (7)
  1. Weizhi Wang (18 papers)
  2. Li Dong (154 papers)
  3. Hao Cheng (190 papers)
  4. Xiaodong Liu (162 papers)
  5. Xifeng Yan (52 papers)
  6. Jianfeng Gao (344 papers)
  7. Furu Wei (291 papers)
Citations (68)