
Dense Retrieval Adaptation using Target Domain Description (2307.02740v1)

Published 6 Jul 2023 in cs.IR and cs.CL

Abstract: In information retrieval (IR), domain adaptation is the process of adapting a retrieval model to a new domain whose data distribution is different from the source domain. Existing methods in this area focus on unsupervised domain adaptation where they have access to the target document collection or supervised (often few-shot) domain adaptation where they additionally have access to (limited) labeled data in the target domain. There also exists research on improving zero-shot performance of retrieval models with no adaptation. This paper introduces a new category of domain adaptation in IR that is as-yet unexplored. Here, similar to the zero-shot setting, we assume the retrieval model does not have access to the target document collection. In contrast, it does have access to a brief textual description that explains the target domain. We define a taxonomy of domain attributes in retrieval tasks to understand different properties of a source domain that can be adapted to a target domain. We introduce a novel automatic data construction pipeline that produces a synthetic document collection, query set, and pseudo relevance labels, given a textual domain description. Extensive experiments on five diverse target domains show that adapting dense retrieval models using the constructed synthetic data leads to effective retrieval performance on the target domain.
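The abstract's central idea — turning a brief textual domain description into synthetic training data for a dense retriever — can be outlined in code. The sketch below is an illustrative assumption, not the paper's implementation: the generation steps are stubbed with placeholder functions (in the paper these would be prompted language models), and token overlap stands in for whatever ranking model produces the pseudo relevance labels. All function names here are hypothetical.

```python
# Hedged sketch of a description-driven data construction pipeline.
# Generation and labeling are stubbed; the real pipeline would call
# prompted LLMs and a trained relevance scorer at these points.

from dataclasses import dataclass
from typing import List


@dataclass
class SyntheticExample:
    query: str
    document: str
    label: float  # pseudo relevance label in [0, 1]


def generate_documents(domain_description: str, n: int) -> List[str]:
    # Assumption: an LLM prompted with the target domain description
    # would produce documents; placeholders are fabricated here.
    return [f"[{domain_description}] synthetic document {i}" for i in range(n)]


def generate_queries(document: str, n: int) -> List[str]:
    # Assumption: query generation is conditioned on each synthetic document.
    return [f"query {i} about: {document[:30]}" for i in range(n)]


def pseudo_label(query: str, document: str) -> float:
    # Assumption: the real pipeline scores query-document pairs with a
    # ranking model; token overlap is a toy stand-in for that scorer.
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / max(len(q), 1)


def build_synthetic_data(domain_description: str,
                         n_docs: int = 3,
                         queries_per_doc: int = 2) -> List[SyntheticExample]:
    # Produces the (document collection, query set, pseudo labels) triple
    # that the abstract describes, ready for fine-tuning a dense retriever.
    examples = []
    for doc in generate_documents(domain_description, n_docs):
        for q in generate_queries(doc, queries_per_doc):
            examples.append(SyntheticExample(q, doc, pseudo_label(q, doc)))
    return examples


data = build_synthetic_data("COVID-19 scientific literature")
print(len(data))  # 3 docs x 2 queries each
```

The resulting examples would then be used as weak supervision to fine-tune a dense retrieval model (e.g. a bi-encoder) for the target domain, without ever accessing the real target collection.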

