Unsupervised Domain Adaptation for Neural Information Retrieval (2310.09350v1)
Abstract: Neural information retrieval requires costly annotated data for each target domain to be competitive. Synthetic annotation by query generation using LLMs or rule-based string manipulation has been proposed as an alternative, but the relative merits of the two approaches have not been analysed. In this paper, we compare both methods head-to-head using the same neural IR architecture. We focus on the BEIR benchmark, which includes test datasets from several domains with no training data, and explore two scenarios: zero-shot, where the supervised system is trained on a large out-of-domain dataset (MS MARCO); and unsupervised domain adaptation, where, in addition to MS MARCO, the system is fine-tuned on synthetic data from the target domain. Our results indicate that LLMs outperform rule-based methods in all scenarios by a large margin and, more importantly, that unsupervised domain adaptation is effective compared to applying a supervised IR system in a zero-shot fashion. In addition, we explore several sizes of open LLMs for generating synthetic data and find that a medium-sized model suffices. Code and models are publicly available for reproducibility.
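To make the comparison concrete, the sketch below illustrates both synthetic-annotation strategies on a target-domain passage: few-shot query generation with an open LLM (in the spirit of InPars and Promptagator) and a rule-based random-cropping baseline. The model name, prompt wording, and decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the two synthetic-annotation strategies: LLM-based
# query generation (few-shot prompting) and a rule-based cropping baseline.
# Model, prompt, and decoding settings are assumptions for illustration.
import random

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-2.7B"  # stands in for a "medium-sized" open LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Few-shot prompt: passage -> query demonstrations, then the target passage.
PROMPT = """Example 1:
Document: The Eiffel Tower was completed in 1889 for the World's Fair.
Query: when was the eiffel tower built

Example 2:
Document: Aspirin inhibits the enzyme cyclooxygenase, which reduces inflammation.
Query: how does aspirin reduce inflammation

Example 3:
Document: {passage}
Query:"""


def llm_query(passage: str) -> str:
    """LLM-based: prompt the model to write a plausible query for the passage."""
    inputs = tokenizer(PROMPT.format(passage=passage), return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,  # sampling yields more diverse queries
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = outputs[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(generated, skip_special_tokens=True)
    return text.split("\n")[0].strip()  # keep only the first generated line


def cropped_query(passage: str, rng: random.Random,
                  min_words: int = 4, max_words: int = 12) -> str:
    """Rule-based: a random contiguous word span serves as a pseudo-query."""
    words = passage.split()
    n = rng.randint(min_words, max(min_words, min(max_words, len(words))))
    start = rng.randint(0, max(0, len(words) - n))
    return " ".join(words[start:start + n])
```

In either case, each resulting (query, passage) pair is treated as a positive example for fine-tuning the retriever on the target-domain corpus, on top of its MS MARCO training.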
- GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-TensorFlow.
- GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 95–136.
- InPars: Data augmentation for information retrieval using large language models.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 5482–5487. IEEE.
- Promptagator: Few-shot dense retrieval from 8 examples. In The Eleventh International Conference on Learning Representations.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2288–2292.
- The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964–971.
- PCRS: Personalized course recommender system based on hybrid approach. Procedia Computer Science, 125:518–524.
- CQADupStack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian Document Computing Symposium, ADCS '15, New York, NY, USA. Association for Computing Machinery.
- Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
- Gautier Izacard and Édouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880.
- Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781.
- Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48.
- Mei Kobayashi and Koichi Takeda. 2000. Information retrieval on the web. ACM Computing Surveys (CSUR), 32(2):144–173.
- Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096.
- MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268.
- Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708–718.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
- Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3:333–389.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.
- BLOOM: A 176B-parameter open-access multilingual language model.
- BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- CONQRR: Conversational query rewriting for retrieval with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10000–10014, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- LaPraDoR: Unsupervised pretrained dense retriever for zero-shot text retrieval. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3557–3569.
- DocChat: An information retrieval approach for chatbot engines using unstructured documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 516–525.
- OPT: Open pre-trained transformer language models.