Online Adaptation of Language Models with a Memory of Amortized Contexts (2403.04317v2)
Abstract: Due to the rapid generation and dissemination of information, LLMs quickly become outdated despite their enormous development costs. To address the crucial need to keep models current, online learning has emerged as a critical tool for deploying LLMs in real-world applications. However, given the ever-expanding corpus of unseen documents and the large parameter space of modern LLMs, efficient adaptation is essential. To address these challenges, we propose Memory of Amortized Contexts (MAC), an efficient and effective online adaptation framework for LLMs with strong knowledge retention. MAC uses a feature-extraction and memory-augmentation approach that compresses the information in new documents into compact modulations stored in a memory bank; when answering questions, the model attends to this memory bank and extracts the relevant knowledge. To learn informative modulations efficiently, we use amortization-based meta-learning, which replaces an otherwise required optimization process with a single forward pass of the encoder. We then learn to select and aggregate document modulations into a single modulation conditioned on the question, allowing us to adapt a frozen LLM at test time without further gradient updates. Our experiments demonstrate the superiority of MAC in multiple aspects, including online adaptation performance as well as time and memory efficiency. In addition, we show how MAC can be combined with, and improve the performance of, popular alternatives such as retrieval-augmented generation (RAG). Code is available at: https://github.com/jihoontack/MAC.
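The abstract describes the core mechanism at a high level: each new document is amortized into a compact modulation by a single encoder forward pass, the modulations are stored in a memory bank, and at question time they are aggregated into one question-conditioned modulation that adapts a frozen LLM without gradient updates. Below is a minimal sketch of that flow in PyTorch. The class names, tensor shapes, and the simple dot-product aggregation are illustrative assumptions, not the authors' implementation; the released code at the linked repository is authoritative.

```python
# Minimal sketch of a MAC-style flow, under assumed shapes and module names.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AmortizedContextEncoder(nn.Module):
    """Compresses document features into a compact modulation with a single
    forward pass (amortization in place of per-document optimization)."""

    def __init__(self, doc_dim: int, mod_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(doc_dim, mod_dim), nn.GELU(), nn.Linear(mod_dim, mod_dim)
        )

    def forward(self, doc_features: torch.Tensor) -> torch.Tensor:
        return self.net(doc_features)  # (num_docs, mod_dim)


class MemoryOfAmortizedContexts:
    """Memory bank of per-document modulations plus question-conditioned aggregation."""

    def __init__(self, encoder: AmortizedContextEncoder):
        self.encoder = encoder
        self.bank: list[torch.Tensor] = []

    @torch.no_grad()
    def add_documents(self, doc_features: torch.Tensor) -> None:
        # Online adaptation step: amortize each incoming document into a modulation.
        self.bank.append(self.encoder(doc_features))

    def aggregate(self, question_feature: torch.Tensor) -> torch.Tensor:
        # Attend over stored modulations, conditioned on the question, and return
        # a single modulation to condition a frozen LLM (conditioning not shown).
        memory = torch.cat(self.bank, dim=0)       # (N, mod_dim)
        weights = F.softmax(memory @ question_feature, dim=0)  # (N,)
        return weights @ memory                     # (mod_dim,)


if __name__ == "__main__":
    torch.manual_seed(0)
    mac = MemoryOfAmortizedContexts(AmortizedContextEncoder(doc_dim=32, mod_dim=16))
    mac.add_documents(torch.randn(5, 32))   # five new documents arrive online
    mac.add_documents(torch.randn(3, 32))   # later, three more
    modulation = mac.aggregate(torch.randn(16))
    print(modulation.shape)                 # torch.Size([16]) -> fed to a frozen LLM
```

In the sketch, adding documents never touches the LLM's parameters; only the tiny encoder runs, and the question-time aggregation is a single attention-style weighting over the bank, which is what makes the approach attractive for streams of documents.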