
UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers (2303.00807v3)

Published 1 Mar 2023 in cs.IR and cs.CL

Abstract: Many information retrieval tasks require large labeled datasets for fine-tuning. However, such datasets are often unavailable, and their utility for real-world applications can diminish quickly due to domain shifts. To address this challenge, we develop and motivate a method for using LLMs to generate large numbers of synthetic queries cheaply. The method begins by generating a small number of synthetic queries using an expensive LLM. After that, a much less expensive one is used to create large numbers of synthetic queries, which are used to fine-tune a family of reranker models. These rerankers are then distilled into a single efficient retriever for use in the target domain. We show that this technique boosts zero-shot accuracy in long-tail domains and achieves substantially lower latency than standard reranking methods.

Overview of "UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers"

In information retrieval (IR), neural models have substantially advanced performance on tasks such as document retrieval and question answering. A persistent challenge, however, is adapting to domain shift, where the distribution of queries and documents in the target domain differs from the training data. The paper "UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers" introduces an approach that addresses this challenge by using LLMs to generate synthetic queries for unsupervised domain adaptation.

Methodology

The proposed method, UDAPDR, combines LLM prompting with a multi-stage distillation process to improve retrieval accuracy in zero-shot settings. The approach is structured into several key stages, with an illustrative code sketch following the list:

  1. Initial Synthetic Query Generation: A powerful LLM such as GPT-3 generates a small initial set of synthetic queries from target-domain passages. These serve as high-quality in-context examples for building prompts.
  2. Large-scale Query Generation: A more efficient LLM such as Flan-T5 XXL is then utilized to generate a much larger set of synthetic queries based on the prompts formed in the previous step. This step focuses on cost-effective query generation.
  3. Training of Rerankers: The synthetic queries are used to fine-tune multiple passage rerankers, each trained on the query set produced by a different prompt.
  4. Distillation into a Single Retriever: The outputs of these rerankers are distilled into a single ColBERTv2 retriever. This step aims to accumulate the knowledge from multiple sources into one efficient model that maintains retrieval accuracy while lowering computational costs.
  5. Evaluation and Deployment: The distilled retriever is evaluated in the target domain using standard retrieval metrics, confirming its readiness for deployment in real retrieval tasks.
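
To make the staging concrete, here is a minimal sketch of the pipeline. It is illustrative only: the model choice, prompt wording, and helper names are assumptions of this summary, and the final function shows a generic KL-divergence distillation update rather than the authors' exact ColBERTv2 training recipe.

```python
import torch.nn.functional as F
from transformers import pipeline

# Stages 1-2: synthetic query generation. The paper uses GPT-3 for seed
# queries and Flan-T5 XXL for bulk generation; a single smaller model
# stands in for both here, and the prompt text is an assumption.
generator = pipeline("text2text-generation", model="google/flan-t5-large")

def seed_queries(passages, k=5):
    # Stage 1: a few high-quality queries from sampled target-domain passages.
    prompt = "Write a search query that the following passage answers.\n\nPassage: {}\nQuery:"
    return [(p, generator(prompt.format(p), max_new_tokens=32)[0]["generated_text"])
            for p in passages[:k]]

def bulk_queries(passages, demos):
    # Stage 2: reuse the seed (passage, query) pairs as few-shot
    # demonstrations so a cheaper model can generate queries at scale.
    header = "".join(f"Passage: {p}\nQuery: {q}\n\n" for p, q in demos)
    return [(p, generator(header + f"Passage: {p}\nQuery:",
                          max_new_tokens=32)[0]["generated_text"])
            for p in passages]

def distill_step(retriever_scores, reranker_scores, optimizer):
    # Stages 3-4 (sketched): rerankers fine-tuned on the synthetic queries
    # provide soft targets for a single retriever. Shown is a generic
    # KL-divergence update over [batch, n_candidates] score matrices.
    loss = F.kl_div(F.log_softmax(retriever_scores, dim=-1),
                    F.softmax(reranker_scores, dim=-1),
                    reduction="batchmean")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Training one reranker per prompt and then distilling lets the retriever absorb the strengths of several synthetic query distributions while keeping inference cost to a single model.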

Experimental Results

The experiments demonstrate the efficacy of the UDAPDR approach on several challenging datasets, notably LoTTE and BEIR, as well as on well-known benchmarks such as Natural Questions and SQuAD. In both single- and multi-reranker configurations, the method yields significant improvements in Success@5 and nDCG@10 over zero-shot baselines and other contemporary domain adaptation techniques.

Notably, the comparisons include baselines such as SPLADEv2, RocketQAv2, and existing BM25-based reranking methods. UDAPDR consistently improves performance, often at lower cost, owing to its use of synthetic data for domain adaptation without requiring in-domain labeled data.
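
Both reported metrics are straightforward to compute from a ranked result list; the sketch below (function names are my own) shows them for a single query, with per-query scores averaged over the evaluation set in practice.

```python
import math

def success_at_5(ranked_ids, relevant_ids):
    # Success@5: did any relevant passage land in the top five results?
    return float(any(pid in relevant_ids for pid in ranked_ids[:5]))

def ndcg_at_10(ranked_ids, gains):
    # nDCG@10: rank-discounted gain, normalized by the ideal ordering.
    # `gains` maps passage id -> graded relevance (0 if absent).
    dcg = sum(gains.get(pid, 0.0) / math.log2(i + 2)
              for i, pid in enumerate(ranked_ids[:10]))
    ideal = sorted(gains.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```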

Implications and Future Directions

The research advances the understanding of unsupervised domain adaptation for IR by demonstrating that leveraging LLMs for synthetic data generation, combined with a thoughtful distillation process, can effectively mitigate domain shift challenges. Practically, this could lead to more robust IR systems capable of handling domain-specific retrieval tasks without hefty annotation costs.

Future work might apply the UDAPDR framework to other types of neural retrievers or investigate the effectiveness of different LLM configurations. Cross-lingual settings are another promising direction: extending the method to adapt retrievers across languages would further broaden its applicability in diverse data environments.

In conclusion, UDAPDR represents a meaningful advancement in IR, offering a pragmatic and effective solution for enhancing model robustness and accuracy in novel domains through unsupervised techniques. The methodology balances computational efficiency and model performance, which could inspire similar innovations in adjacent fields of AI and machine learning.

References (52)
  1. Do not have enough data? deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390.
  2. Task-aware retrieval with instructions. arXiv preprint arXiv:2211.09260.
  3. Inpars: Data augmentation for information retrieval using large language models. arXiv preprint arXiv:2202.05144.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  5. Pre-training Tasks for Embedding-based Large-scale Retrieval. In International Conference on Learning Representations.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  7. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
  8. Promptagator: Few-shot dense retrieval from 8 examples. arXiv preprint arXiv:2209.11755.
  9. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  10. To adapt or to annotate: Challenges and interventions for domain adaptation in open-domain question answering. arXiv preprint arXiv:2212.10381.
  11. Splade v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086.
  12. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938. PMLR.
  13. Neural Vector Spaces for Unsupervised Information Retrieval. ACM Trans. Inf. Syst., 36(4).
  14. DeBERTaV3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.
  15. Generate, Annotate, and Learn: NLP with Synthetic Text. Transactions of the Association for Computational Linguistics, 10:826–842.
  16. Improving efficient neural ranking models with cross-architecture knowledge distillation.
  17. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
  18. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.
  19. Unsupervised dense information retrieval with contrastive learning.
  20. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.
  21. Inpars-v2: Large language models as efficient dataset generators for information retrieval. arXiv preprint arXiv:2301.01820.
  22. Evaluating embedding apis for information retrieval.
  23. Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval. In Thirty-Fifth Conference on Neural Information Processing Systems.
  24. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024.
  25. Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, pages 39–48. ACM.
  26. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  27. Data Augmentation using Pre-trained Transformer Models. In Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pages 18–26, Suzhou, China. Association for Computational Linguistics.
  28. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  29. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy. Association for Computational Linguistics.
  30. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  31. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  32. Zero-shot neural passage retrieval via domain-targeted synthetic question generation. arXiv preprint arXiv:2004.14503.
  33. Unsupervised dense retrieval deserves better positive pairs: Scalable augmentation with query extraction and generation. arXiv preprint arXiv:2212.08841.
  34. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS.
  35. Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
  36. From doc2query to doctttttquery. Online preprint, 6.
  37. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  38. KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online. Association for Computational Linguistics.
  39. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  40. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
  41. PLAID: an efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 1747–1756.
  42. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3715–3734, Seattle, United States. Association for Computational Linguistics.
  43. Moving beyond downstream task accuracy for information retrieval benchmarking. arXiv preprint arXiv:2212.01340.
  44. Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks. arXiv preprint arXiv:2010.08240.
  45. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  46. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
  47. TREC-COVID: constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM New York, NY, USA.
  48. GPL: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2345–2360, Seattle, United States. Association for Computational Linguistics.
  49. CORD-19: The COVID-19 open research dataset. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Online. Association for Computational Linguistics.
  50. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140–2151, Online. Association for Computational Linguistics.
  51. Zero-shot dense retrieval with momentum adversarial domain invariant representations. In Findings of the Association for Computational Linguistics: ACL 2022, pages 4008–4020, Dublin, Ireland. Association for Computational Linguistics.
  52. Generative data augmentation for commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1008–1025, Online. Association for Computational Linguistics.
Authors (9)
  1. Jon Saad-Falcon (19 papers)
  2. Omar Khattab (34 papers)
  3. Keshav Santhanam (15 papers)
  4. Radu Florian (54 papers)
  5. Martin Franz (9 papers)
  6. Salim Roukos (41 papers)
  7. Avirup Sil (45 papers)
  8. Md Arafat Sultan (25 papers)
  9. Christopher Potts (113 papers)
Citations (28)