Online Adaptation of Language Models with a Memory of Amortized Contexts (2403.04317v2)

Published 7 Mar 2024 in cs.LG and cs.CL

Abstract: Due to the rapid generation and dissemination of information, LLMs quickly run out of date despite enormous development costs. To address the crucial need to keep models updated, online learning has emerged as a critical tool when utilizing LLMs for real-world applications. However, given the ever-expanding corpus of unseen documents and the large parameter space of modern LLMs, efficient adaptation is essential. To address these challenges, we propose Memory of Amortized Contexts (MAC), an efficient and effective online adaptation framework for LLMs with strong knowledge retention. We propose a feature extraction and memory-augmentation approach to compress and extract information from new documents into compact modulations stored in a memory bank. When answering questions, our model attends to and extracts relevant knowledge from this memory bank. To learn informative modulations in an efficient manner, we utilize amortization-based meta-learning, which substitutes an otherwise required optimization process with a single forward pass of the encoder. Subsequently, we learn to choose from and aggregate selected documents into a single modulation by conditioning on the question, allowing us to adapt a frozen LLM during test time without requiring further gradient updates. Our experiment demonstrates the superiority of MAC in multiple aspects, including online adaptation performance, time, and memory efficiency. In addition, we show how MAC can be combined with and improve the performance of popular alternatives such as retrieval augmented generations (RAGs). Code is available at: https://github.com/jihoontack/MAC.


Summary

  • The paper introduces MAC, a framework that efficiently adapts LLMs online by compressing new information into memory modulations.
  • It leverages amortization-based meta-learning with backpropagation dropout to minimize training costs and manage memory constraints.
  • Empirical evaluations demonstrate MAC’s superior capability in preserving knowledge and reducing adaptation time and GPU memory usage compared to existing methods.

Online Adaptation of LLMs with a Memory of Amortized Contexts

Overview

LLMs have rapidly become a cornerstone of contemporary NLP, driving improvements across a wide range of tasks and applications. However, their static nature makes it difficult to keep their knowledge current as information evolves. To address this challenge, the paper introduces Memory of Amortized Contexts (MAC), an online adaptation framework designed to update LLMs with new information efficiently and effectively, without extensive retraining.

Addressing the Online Adaptation Challenge

Online adaptation of LLMs is a critical problem, especially for applications that require up-to-date information. Existing approaches, including retrieval-augmented models and gradient-based online fine-tuning, each carry limitations such as computational inefficiency, loss of previously acquired knowledge (catastrophic forgetting), or limited applicability in memory-constrained settings. MAC addresses these limitations through amortized feature extraction and memory augmentation, compressing new information into compact modulations that are then used to adapt a frozen LLM.

Methodology

At MAC's core is amortization-based meta-learning, which replaces a per-document optimization process with a single, far cheaper forward pass of an encoder. The encoder produces a compact modulation for each new document, capturing its relevant knowledge without touching the LLM's parameters. At inference time, relevance-driven selection and aggregation of these stored modulations, conditioned on the incoming query, let the frozen model adapt its responses to the preserved knowledge.
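The following is a minimal sketch of this pipeline, assuming a trained amortization encoder, a query encoder, and simple dot-product attention over the memory bank; the names (`amortize_document`, `aggregate_modulations`, `frozen_llm`, and so on) are illustrative assumptions rather than the repository's actual API.

```python
# Minimal sketch of MAC-style online adaptation (hypothetical names;
# the real implementation is at https://github.com/jihoontack/MAC).
import torch
import torch.nn.functional as F

def amortize_document(encoder, doc_tokens):
    """Single forward pass of the amortization encoder: compresses one
    document into a compact modulation vector, with no gradient updates."""
    with torch.no_grad():
        return encoder(doc_tokens)        # shape: (mod_dim,)

def aggregate_modulations(memory_bank, query_emb):
    """Query-conditioned attention over the memory bank of modulations,
    producing a single modulation used to condition the frozen LLM."""
    mods = torch.stack(memory_bank)       # (num_docs, mod_dim)
    scores = mods @ query_emb             # relevance of each stored document
    weights = F.softmax(scores, dim=0)    # (num_docs,)
    return weights @ mods                 # aggregated modulation, (mod_dim,)

# Usage sketch: populate the memory bank online, then answer questions
# with the frozen LLM conditioned on the aggregated modulation.
# memory_bank = [amortize_document(encoder, d) for d in new_documents]
# modulation = aggregate_modulations(memory_bank, query_encoder(question))
# answer = frozen_llm.generate(question_tokens, modulation=modulation)
```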

The framework introduces two memory-efficient techniques for both its training and inference phases:

  • Backpropagation Dropout: This technique mitigates memory demands during training by computing gradients for only a subset of the documents at each step, keeping backpropagation tractable.
  • Hierarchical Modulation Aggregation: This approach reduces memory usage during inference with a divide-and-conquer strategy, aggregating modulations in small groups and iteratively merging the results into a single query-relevant modulation, which significantly lowers peak GPU memory (see the sketch after this list).
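Below is a hedged sketch of both techniques, assuming modulations are plain vectors aggregated with simple dot-product attention; the keep ratio, group size, and helper names are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketches of the two memory-saving techniques.
import random
import torch
import torch.nn.functional as F

def encode_with_backprop_dropout(encoder, docs, keep_ratio=0.25):
    """Training-time backpropagation dropout: only a random subset of
    documents keeps gradients; the rest are encoded under no_grad."""
    keep = set(random.sample(range(len(docs)), max(1, int(keep_ratio * len(docs)))))
    mods = []
    for i, d in enumerate(docs):
        if i in keep:
            mods.append(encoder(d))            # gradients flow through these
        else:
            with torch.no_grad():
                mods.append(encoder(d))        # no activation memory kept
    return torch.stack(mods)

def hierarchical_aggregate(mods, query_emb, group_size=8):
    """Inference-time divide-and-conquer aggregation: attend within small
    groups, then merge the group outputs until one modulation remains."""
    while mods.shape[0] > 1:
        merged = []
        for group in torch.split(mods, group_size):   # chunks of <= group_size
            weights = F.softmax(group @ query_emb, dim=0)
            merged.append(weights @ group)             # one modulation per group
        mods = torch.stack(merged)
    return mods[0]                                     # final query-relevant modulation
```

Because each group-level attention only touches `group_size` modulations at a time, peak memory scales with the group size rather than with the full memory bank.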

Empirical Validation

MAC's efficacy is validated across multiple datasets and model architectures, showing stronger online adaptation performance than existing methods in both accuracy and efficiency. The experiments also highlight MAC's ability to retain knowledge throughout the adaptation process, underlining its practical utility for real-world applications.

Furthermore, efficiency evaluations demonstrate MAC's advantage in memory and computational resource usage, an essential consideration given the constraints of deploying large-scale models. These findings underscore MAC's potential to significantly reduce adaptation time and memory usage without compromising performance.

Conclusion and Future Directions

The paper's introduction of MAC underscores the importance of efficient and effective online adaptation for LLMs. By addressing the limitations of existing approaches with a robust framework, MAC paves the way for more dynamic, up-to-date, and efficient use of LLMs across applications. Future research directions include applying MAC in federated learning settings and adding privacy-preserving mechanisms for sensitive information in the document memory bank, further broadening its utility.