Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
The paper introduces Generative Pseudo Labeling (GPL), a method for unsupervised domain adaptation of dense retrieval models. Dense retrieval can bridge the lexical gap between queries and documents, but its performance degrades notably on domains it was not trained on, since domain-specific labeled data is scarce.
Methodology
GPL addresses these challenges by combining a query generator with pseudo labels derived from a cross-encoder. The method operates in three stages:
- Query Generation: A pre-trained T5 encoder-decoder model produces synthetic queries for each passage in the target domain.
- Negative Passage Mining: An existing dense retrieval model retrieves negative passages for each generated query.
- Pseudo Labeling: A cross-encoder scores the (query, passage) pairs, and a dense retrieval model is trained on these pseudo-labeled examples with the MarginMSE loss, which matches the student's score margin between positive and negative passages to the cross-encoder's margin.
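The three stages above can be sketched as follows. This is a minimal, stdlib-only illustration of the control flow, not the paper's implementation: the toy functions stand in for the real components (a T5 query generator, a dense bi-encoder retriever, and a cross-encoder teacher), and all names here are hypothetical.

```python
import random

def generate_queries(passage, n=3):
    # Stage 1 (toy stand-in for a T5 generator): synthetic queries
    # built from words of the target-domain passage.
    words = passage.split()
    return [" ".join(random.sample(words, min(3, len(words)))) for _ in range(n)]

def mine_negatives(query, corpus, positive, k=2):
    # Stage 2 (toy stand-in for dense retrieval): rank passages by
    # word overlap with the query and keep top-k other than the positive.
    scored = sorted(corpus, key=lambda p: -len(set(query.split()) & set(p.split())))
    return [p for p in scored if p != positive][:k]

def cross_encoder_score(query, passage):
    # Stage 3 (toy stand-in for a cross-encoder): relevance score
    # for one (query, passage) pair.
    return len(set(query.split()) & set(passage.split()))

corpus = ["dense retrieval bridges the lexical gap",
          "cross-encoders score query passage pairs",
          "pseudo labels supervise the student model"]

training_examples = []
for passage in corpus:
    for query in generate_queries(passage):
        for neg in mine_negatives(query, corpus, positive=passage):
            # The soft label is the teacher's score margin between the
            # positive passage and the mined negative passage.
            margin = cross_encoder_score(query, passage) - cross_encoder_score(query, neg)
            training_examples.append((query, passage, neg, margin))

print(len(training_examples))  # 3 passages x 3 queries x 2 negatives = 18
```

The output is a set of (query, positive, negative, margin) tuples; in the actual method these tuples supervise the dense retriever through the MarginMSE loss.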
This framework capitalizes on the robustness of cross-encoders: because they provide soft scores for query-passage pairs, they dampen the impact of poorly generated queries and mitigate the false negatives that typically plague hard-negative mining.
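The soft labeling can be made concrete with the MarginMSE objective: the student is not forced to reproduce the teacher's absolute scores, only the score *margin* between the positive and negative passage. A minimal per-example sketch (in practice the loss is averaged over a batch, and the scores come from the bi-encoder and cross-encoder respectively):

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    # MarginMSE: squared error between the student's score margin and
    # the cross-encoder teacher's score margin for the same triple.
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return (student_margin - teacher_margin) ** 2

# Example values (hypothetical): the teacher's margin is 8.5 - 4.5 = 4.0,
# the student's current margin is 6.0 - 3.5 = 2.5, so the loss is (2.5 - 4.0)^2.
loss = margin_mse(6.0, 3.5, 8.5, 4.5)
print(loss)  # 2.25
```

Note how a marginally relevant "negative" (a false negative) simply yields a small teacher margin and thus a small training signal, rather than being punished as if it were fully irrelevant.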
Experimental Analysis
The paper evaluates GPL on six domain-specific datasets from the BEIR benchmark. GPL outperforms state-of-the-art models trained on out-of-domain data such as MS MARCO by up to 9.3 nDCG@10 points. Adding TSDAE-based pre-training yields a further improvement of 1.4 nDCG@10 points on average across the datasets.
Implications and Future Directions
The research shows that GPL can effectively adapt dense retrieval models to new domains using only unlabeled target-domain data, a significant advance for domain-specific applications that have traditionally struggled with data constraints.
More broadly, GPL offers a scalable mechanism for pseudo-label-based training that may set a precedent for future unsupervised adaptation methods. Future work could refine the training pipeline for efficiency or extend GPL to multi-modal retrieval scenarios.
Conclusion
GPL is a notable step forward in unsupervised domain adaptation for dense retrieval, delivering consistent improvements in retrieval quality across varied domains. Its reliance on cross-encoder pseudo labeling exemplifies a shift toward more nuanced strategies for overcoming domain-specific challenges in information retrieval, and its combination of pre-trained models with pseudo-label distillation makes it a noteworthy contribution to the field.