A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks (2409.01890v1)
Abstract: In dense retrieval, deep encoders provide embeddings for both inputs and targets, and the softmax function is used to parameterize a distribution over a large number of candidate targets (e.g., textual passages for information retrieval). Significant challenges arise in training such encoders in the increasingly prevalent scenario of (1) a large number of targets, (2) a computationally expensive target encoder model, and (3) cached target embeddings that are out-of-date due to ongoing training of the target encoder parameters. This paper presents a simple and highly scalable response to these challenges: training a small parametric corrector network that adjusts stale cached target embeddings, enabling an accurate softmax approximation and thereby the sampling of up-to-date, high-scoring "hard negatives." We theoretically investigate the generalization properties of the proposed target corrector, relating the complexity of the network, the staleness of cached representations, and the amount of training data. We present experimental results on large benchmark dense retrieval datasets as well as on QA with retrieval-augmented LLMs. Our approach matches state-of-the-art results even when no target embedding updates are made during training beyond an initial cache from the unsupervised pre-trained model, providing a 4-80x reduction in re-embedding computational cost.
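To make the mechanism concrete, here is a minimal JAX sketch (not the authors' implementation) of a target corrector: a small residual MLP that maps stale cached target embeddings toward the embeddings the current encoder would produce, so the corrected cache can stand in for up-to-date targets when scoring and sampling hard negatives. The function names, the network shape, and the regression objective (fitting fresh embeddings on a small re-encoded subset) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a corrector network for stale cached embeddings.
# Assumption: the corrector is trained by regression onto "fresh" embeddings
# for a small subset of targets; the paper's exact objective may differ.
import jax
import jax.numpy as jnp

def init_params(key, dim, hidden):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (dim, hidden)) * 0.02,
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, dim)) * 0.02,
        "b2": jnp.zeros(dim),
    }

def correct(params, stale):
    # Residual MLP: corrected = stale + MLP(stale), so the identity map is
    # the starting point and the network only has to learn the drift.
    h = jax.nn.relu(stale @ params["w1"] + params["b1"])
    return stale + h @ params["w2"] + params["b2"]

def loss(params, stale, fresh):
    # Squared error against fresh embeddings for the targets that were
    # actually re-encoded this round (a small fraction of the cache).
    return jnp.mean(jnp.sum((correct(params, stale) - fresh) ** 2, axis=-1))

@jax.jit
def train_step(params, stale, fresh, lr=1e-3):
    grads = jax.grad(loss)(params, stale, fresh)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

# Toy usage: simulate a drifted encoder, fit the corrector, then use the
# corrected cache in place of re-embedding everything.
key = jax.random.PRNGKey(0)
dim, hidden, n = 64, 128, 1024
params = init_params(key, dim, hidden)
stale = jax.random.normal(key, (n, dim))                  # cached embeddings
fresh = stale + 0.1 * jax.random.normal(key, (n, dim))    # drifted encoder output
for _ in range(100):
    params = train_step(params, stale, fresh)
corrected_cache = correct(params, stale)  # scored against queries to sample hard negatives
```

Because the corrector is tiny relative to the target encoder, applying it to the whole cache is cheap compared with re-embedding every target, which is where the claimed 4-80x reduction in re-embedding cost comes from.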