Enhancing Dense Retrievers' Robustness with Group-level Reweighting (2310.16605v4)

Published 25 Oct 2023 in cs.IR

Abstract: The anchor-document data derived from web graphs offers a wealth of paired information for training dense retrieval models in an unsupervised manner. However, such unsupervised data contains diverse patterns across the web graph and is often significantly imbalanced, leading to suboptimal performance on underrepresented or difficult groups. In this paper, we introduce WebDRO, an efficient approach that clusters web graph data and optimizes group weights to enhance the robustness of dense retrieval models. We first build an embedding model for clustering anchor-document pairs: the embedding model is contrastively trained for link prediction, which guides it to capture the document features behind web graph links. We then employ group distributionally robust optimization (Group DRO) to recalibrate the weights across the clusters of anchor-document pairs while training the retrieval model. During training, the model assigns higher weights to clusters with higher loss, focusing on worst-case scenarios; this ensures strong generalization across all data patterns. Our experiments on MS MARCO and BEIR demonstrate that our method effectively improves retrieval performance in both unsupervised training and finetuning settings. Further analysis confirms the stability and validity of the group weights learned by WebDRO. The code of this paper can be obtained from https://github.com/Hanpx20/GroupDRO_Dense_Retrieval.
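To make the reweighting step concrete, below is a minimal sketch of group-DRO-style loss reweighting, assuming PyTorch, a per-example contrastive retrieval loss, and precomputed cluster ids for each anchor-document pair. The class name, the step size `eta`, and all helper names are illustrative and not taken from the paper's released code.

```python
import torch


class GroupDROLoss:
    """Sketch of group-DRO reweighting over anchor-document clusters."""

    def __init__(self, num_groups: int, eta: float = 0.01):
        # Group weights start uniform over the clusters.
        self.q = torch.ones(num_groups) / num_groups
        self.eta = eta  # step size for the group-weight update

    def __call__(self, per_example_loss: torch.Tensor,
                 group_ids: torch.Tensor) -> torch.Tensor:
        device = per_example_loss.device
        g = self.q.numel()

        # Mean loss per cluster present in this batch (kept differentiable).
        sums = torch.zeros(g, device=device).scatter_add_(
            0, group_ids, per_example_loss)
        counts = torch.zeros(g, device=device).scatter_add_(
            0, group_ids, torch.ones_like(per_example_loss))
        group_loss = sums / counts.clamp(min=1.0)

        # Exponentiated-gradient step: clusters with higher loss gain weight;
        # clusters absent from the batch keep their weight before normalization.
        self.q = self.q.to(device) * torch.exp(self.eta * group_loss.detach())
        self.q = self.q / self.q.sum()

        # Robust objective: weighted sum of per-cluster average losses.
        return (self.q * group_loss).sum()
```

At each step, the exponentiated-gradient update shifts weight toward the clusters with the highest current loss, so the retriever is trained to improve on its worst-performing groups rather than on the dominant data patterns alone.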

