
GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval (2112.07577v3)

Published 14 Dec 2021 in cs.CL and cs.IR

Abstract: Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers severely degrades under a domain shift. This limits the usage of dense retrieval approaches to only a few domains with large training datasets. In this paper, we propose the novel unsupervised domain adaptation method Generative Pseudo Labeling (GPL), which combines a query generator with pseudo labeling from a cross-encoder. On six representative domain-specialized datasets, we find the proposed GPL can outperform an out-of-the-box state-of-the-art dense retrieval approach by up to 9.3 points nDCG@10. GPL requires less (unlabeled) data from the target domain and is more robust in its training than previous methods. We further investigate the role of six recent pre-training methods in the scenario of domain adaptation for retrieval tasks, where only three could yield improved results. The best approach, TSDAE (Wang et al., 2021) can be combined with GPL, yielding another average improvement of 1.4 points nDCG@10 across the six tasks. The code and the models are available at https://github.com/UKPLab/gpl.

Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

The paper introduces Generative Pseudo Labeling (GPL), an unsupervised domain adaptation method for dense retrieval models. Dense retrieval methods are valued for their ability to bridge the lexical gap, yet their performance degrades markedly on domains they were not trained on, largely because domain-specific labeled data is scarce.

Methodology

GPL is structured to address these challenges by synergistically combining a query generator with pseudo labeling derived from a cross-encoder. The method operates through the following stages:

  1. Query Generation: Using a pre-trained T5 encoder-decoder, synthetic queries are produced for each passage in the target domain.
  2. Negative Passage Mining: An existing dense retrieval model is employed to extract negative passages for each generated query.
  3. Pseudo Labeling: A cross-encoder scores each (query, passage) pair, and a dense retrieval model is then trained on these soft labels with the MarginMSE loss.

This framework capitalizes on the robustness of cross-encoders as soft labelers of query-passage pairs, which dampens the impact of poorly generated queries and mitigates the false negatives that typically plague hard-negative mining.
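The pseudo-labeling step can be illustrated with a toy MarginMSE computation. This is a minimal sketch, not code from the GPL repository: the embeddings, scores, and function names below are invented for illustration. In the actual method, the teacher margin comes from a cross-encoder and the student scores from the dense retriever being trained.

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def margin_mse_loss(examples):
    """Mean squared error between student and teacher score margins.

    Each example is (q_emb, pos_emb, neg_emb, teacher_pos, teacher_neg),
    where teacher_* are cross-encoder scores for the (query, passage) pairs.
    """
    total = 0.0
    for q, pos, neg, t_pos, t_neg in examples:
        student_margin = dot(q, pos) - dot(q, neg)   # dense-retriever margin
        teacher_margin = t_pos - t_neg               # cross-encoder margin
        total += (student_margin - teacher_margin) ** 2
    return total / len(examples)

# Toy batch: one generated query, its source passage as the positive,
# and one mined hard negative (all values are made up).
batch = [
    (
        [0.5, 1.0],   # query embedding
        [0.6, 0.9],   # positive passage embedding
        [0.1, 0.2],   # hard-negative passage embedding
        8.0,          # cross-encoder score for (query, positive)
        2.0,          # cross-encoder score for (query, negative)
    ),
]

loss = margin_mse_loss(batch)
```

Note that the student is trained to match the *difference* between the teacher's positive and negative scores rather than the absolute scores, which is what makes the objective tolerant of noisy queries and mislabeled negatives.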

Experimental Analysis

The paper presents a comprehensive evaluation on six domain-specific datasets from the BeIR benchmark. Results demonstrate that GPL significantly surpasses state-of-the-art models trained on general-purpose datasets such as MS MARCO, improving scores by up to 9.3 nDCG@10 points. Integrating TSDAE-based pre-training yields a further average gain of 1.4 nDCG@10 points across the datasets.

Implications and Future Directions

The research underscores GPL's capability to adapt dense retrieval models to new domains effectively while requiring only modest amounts of unlabeled target-domain data. This opens dense retrieval to domain-specific applications that have traditionally been limited by data constraints.

Methodologically, GPL introduces a scalable mechanism for pseudo-label-based training, potentially setting a precedent for future unsupervised adaptation methods. Future work could refine the training pipeline for efficiency or extend GPL to multi-modal retrieval scenarios.

Conclusion

GPL represents a pivotal stride in unsupervised domain adaptation within dense retrieval frameworks, offering promising improvements in retrieval efficacy across varied domains. Its reliance on sophisticated pseudo-labeling exemplifies a notable shift towards more nuanced strategies in overcoming domain-specific challenges in information retrieval systems. The integration of pre-trained models alongside innovative pseudo-labeling techniques makes GPL a noteworthy contribution to the field.

Authors (4)
  1. Kexin Wang (41 papers)
  2. Nandan Thakur (24 papers)
  3. Nils Reimers (25 papers)
  4. Iryna Gurevych (264 papers)
Citations (134)