DSI++: Updating Transformer Memory with New Documents (2212.09744v3)

Published 19 Dec 2022 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents ($+12\%$). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
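To make the second mitigation in the abstract concrete, below is a minimal sketch (not the authors' code) of generative-memory replay for continual indexing: pseudo-queries for previously indexed documents are sampled from a query generator and mixed into the batches used to index new documents, so the retrieval task keeps being rehearsed as the corpus grows. All class and function names here are illustrative assumptions; in DSI++ the generator would be a trained doc2query-style model and the batches would train a seq2seq DSI model.

```python
# Sketch of generative-memory replay for continual DSI indexing.
# Assumed names: Document, make_replay_batches, generate_pseudo_query.
from dataclasses import dataclass
from typing import Callable, List, Tuple
import random


@dataclass
class Document:
    docid: str  # identifier the DSI model learns to emit
    text: str   # document content used for indexing


def make_replay_batches(
    new_docs: List[Document],
    old_docs: List[Document],
    generate_pseudo_query: Callable[[Document], str],  # the "generative memory"
    replay_ratio: float = 0.5,
    batch_size: int = 8,
) -> List[List[Tuple[str, str]]]:
    """Build (input_text, target_docid) batches that interleave indexing
    examples for new documents with replayed pseudo-query retrieval
    examples for previously indexed documents."""
    examples: List[Tuple[str, str]] = []

    # Indexing task: map new document text -> its identifier.
    for doc in new_docs:
        examples.append((doc.text, doc.docid))

    # Replay task: map a sampled pseudo-query -> an old document's identifier.
    n_replay = int(replay_ratio * len(new_docs))
    for doc in random.sample(old_docs, k=min(n_replay, len(old_docs))):
        examples.append((generate_pseudo_query(doc), doc.docid))

    random.shuffle(examples)
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


if __name__ == "__main__":
    # Stand-in query generator; a real setup would sample from a trained model.
    toy_generator = lambda d: f"what does {d.docid} discuss?"
    old = [Document(f"D{i}", f"old document {i}") for i in range(10)]
    new = [Document(f"D{i}", f"new document {i}") for i in range(10, 14)]
    for batch in make_replay_batches(new, old, toy_generator):
        print(batch)
```

The key design choice this illustrates is that replay examples supervise the retrieval mapping (query to docid) rather than re-indexing old documents verbatim, which is what lets the model keep answering queries about earlier corpora without full re-training.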

Authors (9)
  1. Sanket Vaibhav Mehta (14 papers)
  2. Jai Gupta (16 papers)
  3. Yi Tay (94 papers)
  4. Mostafa Dehghani (64 papers)
  5. Vinh Q. Tran (19 papers)
  6. Jinfeng Rao (17 papers)
  7. Marc Najork (27 papers)
  8. Emma Strubell (60 papers)
  9. Donald Metzler (49 papers)
Citations (34)