
How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval (2302.07452v1)

Published 15 Feb 2023 in cs.IR and cs.CL

Abstract: Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs, under the framework of Data Augmentation (DA). Our study shows that common DA practices such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).

Diverse Augmentation Strategies for Training Generalizable Dense Retrievers

Introduction

In the field of information retrieval, dense retrievers (DRs) have gained prominence for their ability to efficiently sift through large datasets to find relevant information. Existing DR training methodologies, including unsupervised contrastive learning and pseudo-query generation, have shown promise but often at the expense of either supervised or zero-shot retrieval effectiveness. The common belief links this trade-off to limited model capacity. Challenging this notion, new research demonstrates that a generalizable dense retriever can be trained to achieve high accuracy across both tasks without necessarily increasing the model size. The key lies in a systematic examination of data augmentation (DA) practices within the contrastive learning framework for DRs.

Data Augmentation for Contrastive Learning

The paper identifies common DA practices, such as query augmentation with generative models and label creation using cross-encoders, as often inefficient and sub-optimal. It introduces a novel DA approach focused on developing diverse queries and leveraging multiple sources of supervision. This method enables the progressive training of a generalizable DR that achieves state-of-the-art effectiveness in both supervised and zero-shot evaluations, and even competes with models that rely on more complex late-interaction mechanisms (ColBERTv2 and SPLADE++).
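To make the contrastive-learning framing concrete, here is a minimal sketch of the standard dual-encoder objective with in-batch negatives that underlies most DR training; the function name, tensor shapes, and temperature are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor,
                              p_emb: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    """InfoNCE-style loss for a dual-encoder dense retriever.

    q_emb: (B, d) query embeddings; p_emb: (B, d) embeddings of each query's
    positive passage. Every other passage in the batch acts as a negative.
    """
    scores = (q_emb @ p_emb.T) / temperature                    # (B, B) similarities
    targets = torch.arange(q_emb.size(0), device=q_emb.device)  # diagonal entries are positives
    return F.cross_entropy(scores, targets)
```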

Empirical Insights

Through detailed empirical exploration, the research uncovers pivotal insights for DR training. In particular:

  • Relevance Label Augmentation: The challenge in training generalizable DRs lies in creating diverse relevance labels for each query. By employing multiple retrievers, as opposed to solely relying on a strong cross-encoder, the paper illustrates the effectiveness of leveraging a range of relevance signals.
  • Query Augmentation: The findings advocate using cheap, large-scale augmented queries (e.g., cropped sentences) rather than expensive neural generative queries, as sketched after this list. This approach not only reduces cost but also enhances the retriever's ability to generalize across domains.
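A minimal sketch of the cropped-sentence idea referenced above: sample short word spans from corpus passages and treat each span as a pseudo-query for the passage it was cropped from. The span lengths and tokenization heuristic here are illustrative assumptions.

```python
import random
import re

def crop_queries(passage: str, num_queries: int = 3,
                 min_words: int = 5, max_words: int = 12) -> list[str]:
    """Generate cheap pseudo-queries by cropping random word spans from a passage.

    Unlike neural query generation, this needs no model inference, so it scales
    to very large corpora at negligible cost.
    """
    words = re.findall(r"\w+", passage)
    queries = []
    for _ in range(num_queries):
        if len(words) <= min_words:
            queries.append(" ".join(words))
            continue
        span_len = random.randint(min_words, min(max_words, len(words)))
        start = random.randint(0, len(words) - span_len)
        queries.append(" ".join(words[start:start + span_len]))
    return queries

# Each cropped span is paired with its source passage as a (pseudo-query, positive) training example.
```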

Moreover, directly learning from the full mix of diverse relevance labels sourced from multiple retrievers is shown to be suboptimal. The paper instead proposes progressively augmenting the relevance labels during training, which facilitates more effective learning.
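The progressive idea can be pictured as a staged schedule that introduces relevance labels from one additional supervision source per stage, moving from easier to harder signals. The source names, ordering, and epoch counts below are assumptions for illustration, not the paper's exact recipe.

```python
from typing import Iterator, List, Tuple

# Illustrative supervision sources, ordered from "easier" to "harder" relevance signals.
LABEL_SOURCES = ["sparse_retriever", "dense_retriever", "cross_encoder"]

def progressive_label_schedule(epochs_per_stage: int = 3) -> Iterator[Tuple[int, List[str]]]:
    """Yield (epoch, active_label_sources): each stage adds one more label source
    instead of training on the full, potentially conflicting mix from the start."""
    epoch, active = 0, []
    for source in LABEL_SOURCES:
        active.append(source)
        for _ in range(epochs_per_stage):
            yield epoch, list(active)
            epoch += 1

if __name__ == "__main__":
    for epoch, sources in progressive_label_schedule():
        print(f"epoch {epoch}: training with labels from {sources}")
```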

Contributions and Practical Implications

The paper makes several notable contributions. It presents a systematic evaluation of DR training under the lens of data augmentation, shedding light on how to improve training methods for dense retrievers. The introduction of a progressive label augmentation strategy is particularly noteworthy for guiding the learning of complex relevance signals. Practically, the research showcases DRAGON, a BERT-base-sized dense retriever, which excels in retrieval effectiveness without increased model complexity. This advancement suggests the viability of employing DRAGON as a robust foundation model for domain adaptation tasks in retrieval systems.
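As a hedged illustration of using DRAGON as an off-the-shelf retrieval backbone, the snippet below assumes the publicly released DRAGON+ query and context encoders on the Hugging Face Hub (facebook/dragon-plus-query-encoder and facebook/dragon-plus-context-encoder); verify the exact model identifiers and pooling convention against the official release before relying on them.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint names for the released DRAGON+ dual encoder.
tokenizer = AutoTokenizer.from_pretrained("facebook/dragon-plus-query-encoder")
query_encoder = AutoModel.from_pretrained("facebook/dragon-plus-query-encoder")
context_encoder = AutoModel.from_pretrained("facebook/dragon-plus-context-encoder")

query = "how are dense retrievers trained?"
passages = [
    "Dense retrievers encode queries and passages into vectors and rank by similarity.",
    "Sparse retrievers such as BM25 match queries and documents by lexical overlap.",
]

with torch.no_grad():
    q_inputs = tokenizer(query, return_tensors="pt")
    p_inputs = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    # Use the [CLS] token embedding as the dense representation (assumed pooling).
    q_emb = query_encoder(**q_inputs).last_hidden_state[:, 0, :]
    p_emb = context_encoder(**p_inputs).last_hidden_state[:, 0, :]

scores = q_emb @ p_emb.T   # dot-product relevance scores, shape (1, 2)
print(scores)
```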

Speculations on Future Developments

Looking ahead, the findings prompt a reevaluation of the role of data augmentation and the training of dense retrievers. The remarkable performance of DRAGON—armed with a diverse augmentation strategy—hints at the untapped potential of existing model architectures when coupled with innovative training regimes. Future research may explore the integration of generative and contrastive pre-training or delve into domain-specific pre-training to address identified weaknesses in zero-shot retriever tasks. Such explorations could further diminish the gap between supervised and zero-shot effectiveness, paving the way for more versatile and efficient retrieval systems.

In sum, this research demonstrates the power of strategic data augmentation in enhancing the generalizability of dense retrievers. By rethinking conventional training paradigms, DRAGON shows how much headroom remains within existing architectures, pointing toward a new generation of information retrieval systems.

Authors (8)
  1. Sheng-Chieh Lin (31 papers)
  2. Akari Asai (35 papers)
  3. Minghan Li (38 papers)
  4. Barlas Oguz (36 papers)
  5. Jimmy Lin (208 papers)
  6. Yashar Mehdad (37 papers)
  7. Wen-tau Yih (84 papers)
  8. Xilun Chen (31 papers)
Citations (76)