Mitigating the Impact of False Negatives in Dense Retrieval with Contrastive Confidence Regularization (2401.00165v2)
Abstract: In open-domain Question Answering (QA), dense retrieval is crucial for finding relevant passages that support answer generation. Typically, contrastive learning is used to train a retrieval model that maps passages and queries into the same semantic space, with the objective of pulling similar pairs closer and pushing dissimilar ones apart. However, training such a system is hampered by the false negative issue: relevant passages may be missed during data annotation and then treated as negatives. Hard negative sampling, which is commonly used to improve contrastive learning, can introduce additional noise, because hard negatives are those closest to a given query and are therefore the most likely to be false negatives. To address this issue, we propose a novel contrastive confidence regularizer for the Noise Contrastive Estimation (NCE) loss commonly used in dense retrieval. Our analysis shows, with a theoretical guarantee, that the regularizer makes dense retrieval models more robust against false negatives. Additionally, we propose a model-agnostic method to filter noisy negative passages out of the dataset, which can improve any downstream dense retrieval model. Through experiments on three datasets, we demonstrate that our method outperforms existing state-of-the-art dense retrieval systems.
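The abstract does not give the regularizer's exact form, so the sketch below only illustrates the setup it describes: a standard NCE/InfoNCE loss over one annotated positive and sampled negatives, plus a placeholder entropy-based "confidence" penalty standing in for the paper's regularizer. The `alpha` weight and the entropy term are illustrative assumptions, not the published formulation.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nce_loss(pos_score, neg_scores, temperature=1.0):
    """NCE/InfoNCE loss: negative log-probability of the positive passage
    among the positive plus the sampled negatives."""
    probs = softmax([pos_score] + list(neg_scores), temperature)
    return -math.log(probs[0])

def regularized_nce_loss(pos_score, neg_scores, alpha=0.1, temperature=1.0):
    """NCE loss plus a hypothetical confidence regularizer: alpha times the
    entropy of the score distribution. This penalty is a stand-in chosen for
    illustration; the paper's actual regularizer may differ."""
    probs = softmax([pos_score] + list(neg_scores), temperature)
    entropy = -sum(p * math.log(p) for p in probs)
    return -math.log(probs[0]) + alpha * entropy
```

In this sketch, `pos_score` and `neg_scores` would come from dot products between a query embedding and passage embeddings; a hard (possibly false) negative simply appears as a high entry in `neg_scores`, which is what makes the plain NCE loss sensitive to annotation noise.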
Authors: Shiqi Wang, Yeqin Zhang, Cam-Tu Nguyen