AugTriever: Unsupervised Dense Retrieval and Domain Adaptation by Scalable Data Augmentation (2212.08841v4)

Published 17 Dec 2022 in cs.CL and cs.IR

Abstract: Dense retrievers have made significant strides in text retrieval and open-domain question answering. However, most of these achievements have relied heavily on extensive human-annotated supervision. In this study, we aim to develop unsupervised methods for improving dense retrieval models. We propose two approaches that enable annotation-free and scalable training by creating pseudo query-document pairs: query extraction and transferred query generation. The query extraction method involves selecting salient spans from the original document to generate pseudo queries. The transferred query generation method, on the other hand, utilizes generation models trained for other NLP tasks, such as summarization, to produce pseudo queries. Through extensive experimentation, we demonstrate that models trained using these augmentation methods can achieve performance comparable to, if not better than, multiple strong dense baselines. Moreover, combining these strategies leads to further improvements, resulting in superior performance in unsupervised dense retrieval, unsupervised domain adaptation, and supervised fine-tuning, benchmarked on both BEIR and ODQA datasets. Code and datasets are publicly available at https://github.com/salesforce/AugTriever.
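
To make the two augmentation strategies concrete, below is a minimal Python sketch of building pseudo query-document pairs. This is not the authors' implementation (see the linked repository for that); the lexical-overlap span scorer and the facebook/bart-large-cnn summarization checkpoint are illustrative assumptions standing in for the paper's actual salient-span selection and transferred generation models.

```python
# Minimal sketch of AugTriever-style pseudo query-document pair creation.
# Assumptions: the overlap-based span scorer and the bart-large-cnn
# checkpoint are illustrative stand-ins, not the paper's exact choices.
import random
from transformers import pipeline


def extract_pseudo_query(doc: str, max_words: int = 10) -> str:
    """Query extraction: select a salient span from the document itself.

    'Salient' is approximated here by the sentence sharing the most
    vocabulary with the rest of the document (a hypothetical heuristic).
    """
    sentences = [s.strip() for s in doc.split(".") if s.strip()]

    def score(i: int) -> int:
        sent_vocab = set(sentences[i].lower().split())
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        return len(sent_vocab & set(rest.lower().split()))

    best = sentences[max(range(len(sentences)), key=score)]
    words = best.split()
    if len(words) <= max_words:
        return best
    # Crop a random contiguous span to keep the pseudo query short.
    start = random.randrange(len(words) - max_words + 1)
    return " ".join(words[start:start + max_words])


def generate_pseudo_query(doc: str, summarizer) -> str:
    """Transferred query generation: reuse a model trained for another
    task (here, summarization) to produce a query-like string."""
    out = summarizer(doc, max_length=32, min_length=5, do_sample=False)
    return out[0]["summary_text"]


if __name__ == "__main__":
    doc = (
        "Dense retrievers map queries and documents into a shared vector "
        "space. Training them usually requires large amounts of "
        "human-annotated query-document pairs."
    )
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    for query in (extract_pseudo_query(doc),
                  generate_pseudo_query(doc, summarizer)):
        print("pseudo query:", repr(query))
```

The resulting (pseudo query, document) pairs can then serve as positive examples for annotation-free training of a dense retriever, which is the role they play in the paper's training pipeline.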

Authors (9)
  1. Rui Meng (54 papers)
  2. Ye Liu (153 papers)
  3. Semih Yavuz (43 papers)
  4. Divyansh Agarwal (15 papers)
  5. Lifu Tu (19 papers)
  6. Ning Yu (78 papers)
  7. Jianguo Zhang (97 papers)
  8. Meghana Bhat (6 papers)
  9. Yingbo Zhou (81 papers)