
How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval (2302.07452v1)

Published 15 Feb 2023 in cs.IR and cs.CL

Abstract: Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs, under the framework of Data Augmentation (DA). Our study shows that common DA practices such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).

Diverse Augmentation Strategies for Training Generalizable Dense Retrievers

Introduction

In the field of information retrieval, dense retrievers (DRs) have gained prominence for their ability to efficiently sift through large datasets to find relevant information. Existing DR training methodologies, including unsupervised contrastive learning and pseudo-query generation, have shown promise but often at the expense of either supervised or zero-shot retrieval effectiveness. The common belief links this trade-off to limited model capacity. Challenging this notion, new research demonstrates that a generalizable dense retriever can be trained to achieve high accuracy across both tasks without necessarily increasing the model size. The key lies in a systematic examination of data augmentation (DA) practices within the contrastive learning framework for DRs.

Data Augmentation for Contrastive Learning

The paper identifies common DA practices, such as query augmentation with generative models and label creation using cross-encoders, as often inefficient and sub-optimal. It introduces a novel DA approach focused on developing diverse queries and leveraging multiple sources of supervision. This method enables the progressive training of a generalizable DR that achieves state-of-the-art effectiveness in both supervised and zero-shot evaluations, and even competes with models that rely on more complex late-interaction mechanisms (ColBERTv2 and SPLADE++).
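To make the contrastive-learning framing concrete, here is a minimal sketch of the standard dual-encoder objective with in-batch negatives that underlies most DR training; the function name, tensor shapes, and temperature are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor,
                              p_emb: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    """InfoNCE-style loss for a dual-encoder dense retriever.

    q_emb: (B, d) query embeddings; p_emb: (B, d) embeddings of each query's
    positive passage. Every other passage in the batch acts as a negative.
    """
    scores = (q_emb @ p_emb.T) / temperature                    # (B, B) similarities
    targets = torch.arange(q_emb.size(0), device=q_emb.device)  # diagonal entries are positives
    return F.cross_entropy(scores, targets)
```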

Empirical Insights

Through detailed empirical exploration, the research uncovers pivotal insights for DR training. In particular:

  • Relevance Label Augmentation: The challenge in training generalizable DRs lies in creating diverse relevance labels for each query. By employing multiple retrievers, as opposed to solely relying on a strong cross-encoder, the paper illustrates the effectiveness of leveraging a range of relevance signals.
  • Query Augmentation: The findings advocate using cheap, large-scale augmented queries (e.g., cropped sentences) rather than expensive neural generative queries, as sketched after this list. This approach not only reduces cost but also enhances the retriever's ability to generalize across domains.
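A minimal sketch of the cropped-sentence idea referenced above: sample short word spans from corpus passages and treat each span as a pseudo-query for the passage it was cropped from. The span lengths and tokenization heuristic here are illustrative assumptions.

```python
import random
import re

def crop_queries(passage: str, num_queries: int = 3,
                 min_words: int = 5, max_words: int = 12) -> list[str]:
    """Generate cheap pseudo-queries by cropping random word spans from a passage.

    Unlike neural query generation, this needs no model inference, so it scales
    to very large corpora at negligible cost.
    """
    words = re.findall(r"\w+", passage)
    queries = []
    for _ in range(num_queries):
        if len(words) <= min_words:
            queries.append(" ".join(words))
            continue
        span_len = random.randint(min_words, min(max_words, len(words)))
        start = random.randint(0, len(words) - span_len)
        queries.append(" ".join(words[start:start + span_len]))
    return queries

# Each cropped span is paired with its source passage as a (pseudo-query, positive) training example.
```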

Moreover, directly learning from the full mix of diverse relevance labels sourced from multiple retrievers is shown to be suboptimal. The paper instead proposes progressively augmenting the relevance labels during training, which facilitates more effective learning.
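The progressive idea can be pictured as a staged schedule that introduces relevance labels from one additional supervision source per stage, moving from easier to harder signals. The source names, ordering, and epoch counts below are assumptions for illustration, not the paper's exact recipe.

```python
from typing import Iterator, List, Tuple

# Illustrative supervision sources, ordered from "easier" to "harder" relevance signals.
LABEL_SOURCES = ["sparse_retriever", "dense_retriever", "cross_encoder"]

def progressive_label_schedule(epochs_per_stage: int = 3) -> Iterator[Tuple[int, List[str]]]:
    """Yield (epoch, active_label_sources): each stage adds one more label source
    instead of training on the full, potentially conflicting mix from the start."""
    epoch, active = 0, []
    for source in LABEL_SOURCES:
        active.append(source)
        for _ in range(epochs_per_stage):
            yield epoch, list(active)
            epoch += 1

if __name__ == "__main__":
    for epoch, sources in progressive_label_schedule():
        print(f"epoch {epoch}: training with labels from {sources}")
```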

Contributions and Practical Implications

The paper makes several notable contributions. It presents a systematic evaluation of DR training under the lens of data augmentation, shedding light on how to improve training methods for dense retrievers. The introduction of a progressive label augmentation strategy is particularly noteworthy for guiding the learning of complex relevance signals. Practically, the research showcases DRAGON, a BERT-base-sized dense retriever, which excels in retrieval effectiveness without increased model complexity. This advancement suggests the viability of employing DRAGON as a robust foundation model for domain adaptation tasks in retrieval systems.
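As a hedged illustration of using DRAGON as an off-the-shelf retrieval backbone, the snippet below assumes the publicly released DRAGON+ query and context encoders on the Hugging Face Hub (facebook/dragon-plus-query-encoder and facebook/dragon-plus-context-encoder); verify the exact model identifiers and pooling convention against the official release before relying on them.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint names for the released DRAGON+ dual encoder.
tokenizer = AutoTokenizer.from_pretrained("facebook/dragon-plus-query-encoder")
query_encoder = AutoModel.from_pretrained("facebook/dragon-plus-query-encoder")
context_encoder = AutoModel.from_pretrained("facebook/dragon-plus-context-encoder")

query = "how are dense retrievers trained?"
passages = [
    "Dense retrievers encode queries and passages into vectors and rank by similarity.",
    "Sparse retrievers such as BM25 match queries and documents by lexical overlap.",
]

with torch.no_grad():
    q_inputs = tokenizer(query, return_tensors="pt")
    p_inputs = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    # Use the [CLS] token embedding as the dense representation (assumed pooling).
    q_emb = query_encoder(**q_inputs).last_hidden_state[:, 0, :]
    p_emb = context_encoder(**p_inputs).last_hidden_state[:, 0, :]

scores = q_emb @ p_emb.T   # dot-product relevance scores, shape (1, 2)
print(scores)
```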

Speculations on Future Developments

Looking ahead, the findings prompt a reevaluation of the role of data augmentation and the training of dense retrievers. The remarkable performance of DRAGON—armed with a diverse augmentation strategy—hints at the untapped potential of existing model architectures when coupled with innovative training regimes. Future research may explore the integration of generative and contrastive pre-training or delve into domain-specific pre-training to address identified weaknesses in zero-shot retriever tasks. Such explorations could further diminish the gap between supervised and zero-shot effectiveness, paving the way for more versatile and efficient retrieval systems.

In sum, this research demonstrates the power of strategic data augmentation in enhancing the generalizability of dense retrievers. By rethinking conventional training paradigms, DRAGON shows how much headroom remains within existing architectures, pointing toward a new generation of information retrieval systems.

Authors (8)
  1. Sheng-Chieh Lin (31 papers)
  2. Akari Asai (35 papers)
  3. Minghan Li (38 papers)
  4. Barlas Oguz (36 papers)
  5. Jimmy Lin (208 papers)
  6. Yashar Mehdad (37 papers)
  7. Wen-tau Yih (84 papers)
  8. Xilun Chen (31 papers)
Citations (76)