Domain Adaptation for Dense Retrieval and Conversational Dense Retrieval through Self-Supervision by Meticulous Pseudo-Relevance Labeling (2403.08970v1)

Published 13 Mar 2024 in cs.IR

Abstract: Recent studies have demonstrated that the ability of dense retrieval models to generalize to target domains with different distributions is limited, which contrasts with the results obtained with interaction-based models. Prior attempts to mitigate this challenge leveraged adversarial learning and query generation approaches, but both yielded limited improvements. In this paper, we propose to combine the query-generation approach with a self-supervision approach in which pseudo-relevance labels are automatically generated on the target domain. To accomplish this, a T5-3B model is utilized for pseudo-positive labeling, and meticulous hard negatives are chosen. We also apply this strategy to a conversational dense retrieval model for conversational search. A similar pseudo-labeling approach is used, but with the addition of a query-rewriting module that rewrites conversational queries for subsequent labeling. The proposed approach enables a model's domain adaptation with real queries and documents from the target dataset. Experiments on standard dense retrieval and conversational dense retrieval models both demonstrate improvements over baseline models when they are fine-tuned on the pseudo-relevance labeled data.
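The pseudo-relevance labeling described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `cross_encoder_score` is a token-overlap stand-in for the T5-3B relevance scorer, and the function names and thresholds are assumptions introduced here for clarity.

```python
def cross_encoder_score(query, doc):
    """Stand-in for the T5-3B cross-encoder: token-overlap ratio.
    In the paper, a large sequence-to-sequence model scores (query, passage) pairs."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def build_pseudo_labels(queries, corpus, pos_threshold=0.5, n_negatives=2):
    """For each generated query, take the top-scoring passage as a
    pseudo-positive (if its score clears a confidence threshold) and the
    next-ranked passages as hard negatives, yielding training triples."""
    training_triples = []
    for query in queries:
        ranked = sorted(corpus,
                        key=lambda d: cross_encoder_score(query, d),
                        reverse=True)
        if cross_encoder_score(query, ranked[0]) < pos_threshold:
            continue  # no confident pseudo-positive for this query
        positive = ranked[0]
        # "Meticulous" hard negatives: the highest-ranked non-positive
        # passages, which score close to the positive and are therefore
        # informative for contrastive fine-tuning.
        negatives = ranked[1:1 + n_negatives]
        training_triples.append((query, positive, negatives))
    return training_triples

corpus = [
    "bm25 is a lexical ranking function",
    "dense retrieval encodes queries and passages into vectors",
    "the weather today is sunny",
]
queries = ["what is dense retrieval"]
triples = build_pseudo_labels(queries, corpus)
print(triples[0][1])  # the selected pseudo-positive passage
```

The resulting (query, positive, negatives) triples can then be used to fine-tune a dense retriever on the target domain; for the conversational variant, each query would first pass through a query-rewriting module before scoring.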

Authors (2)
  1. Minghan Li (38 papers)
  2. Eric Gaussier (27 papers)
Citations (2)