RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering (2010.08191v2)

Published 16 Oct 2020 in cs.CL and cs.IR

Abstract: In open-domain question answering, dense passage retrieval has become a new paradigm to retrieve relevant passages for finding answers. Typically, the dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to effectively train a dual-encoder due to challenges including the discrepancy between training and inference, the existence of unlabeled positives, and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improve dense passage retrieval. We make three major technical contributions in RocketQA, namely cross-batch negatives, denoised hard negatives, and data augmentation. The experiment results show that RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions. We also conduct extensive experiments to examine the effectiveness of the three strategies in RocketQA. In addition, we demonstrate that the performance of end-to-end QA can be improved based on our RocketQA retriever.

Overview of RocketQA: An Optimized Approach for Dense Passage Retrieval

The paper "RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering" addresses the task of effectively training dense passage retrievers for open-domain question answering (QA). The authors aim to improve dual-encoder training with strategies targeting three well-known obstacles: the discrepancy between training and inference, the presence of unlabeled positives (false negatives) in the training data, and the limited amount of labeled data.
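As an illustration of the dual-encoder pattern the paper builds on (a toy sketch, not the authors' actual pretrained-LM encoders), questions and passages are embedded independently and matched by inner product:

```python
import hashlib

import numpy as np


def toy_encode(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a learned encoder (the paper fine-tunes pretrained
    LMs); returns a deterministic unit-norm vector for illustration."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)


def relevance(question: str, passage: str) -> float:
    """Dual-encoder matching: the two texts never see each other during
    encoding, so passage vectors can be pre-computed and indexed offline."""
    return float(toy_encode(question) @ toy_encode(passage))
```

Because the passage side is independent of the question, all passage embeddings can be computed once and searched with an approximate nearest-neighbor index; a cross-encoder, by contrast, must re-encode every (question, passage) pair jointly, which is why it is too slow for first-stage retrieval but useful, as below, for refining training data.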

Technical Contributions

  1. Cross-Batch Negatives: Traditional in-batch negative sampling is extended to cross-batch negatives: when training on multiple GPUs, passage embeddings computed on the other GPUs are reused as additional negatives, greatly increasing the number and diversity of negatives per question at negligible extra cost. With A GPUs and a per-GPU batch size of B, each question sees A×B−1 negatives rather than B−1, narrowing the gap between training (a handful of candidates) and inference (retrieval over the full corpus).
  2. Denoised Hard Negatives: A more powerful cross-encoder is used to filter candidate hard negatives, discarding top-retrieved passages that are likely to be unlabeled positives. This ensures the hard negatives used for training are genuinely non-relevant, improving the dual-encoder's discriminative ability by exploiting the cross-encoder's stronger semantic-matching capacity.
  3. Data Augmentation: The same cross-encoder is used to pseudo-label large amounts of unlabeled question-passage pairs, and only confidently scored pairs are added to the training set. This expands the training data with high-quality pseudo labels and improves the dual-encoder without additional manual annotation.
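The cross-batch idea can be sketched in a single process (a simplified emulation; in the actual multi-GPU setting the passage embeddings are exchanged with a collective such as torch.distributed.all_gather):

```python
import numpy as np

rng = np.random.default_rng(0)
B, K, d = 4, 3, 16  # per-GPU batch size, number of GPUs, embedding dim

# Hypothetical per-GPU question/passage embeddings (stand-ins for encoder output).
q = [rng.normal(size=(B, d)) for _ in range(K)]
p = [rng.normal(size=(B, d)) for _ in range(K)]


def contrastive_loss(Q, P):
    """Softmax cross-entropy where question i's positive is passage i and
    every other passage in P acts as a negative."""
    logits = Q @ P.T                                     # (num_q, num_p) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(Q)), np.arange(len(Q))].mean()


# In-batch negatives: GPU 0 trains against only its own B-1 negatives per question.
in_batch = contrastive_loss(q[0], p[0])

# Cross-batch negatives: first gather passage embeddings from every GPU
# (emulating an all_gather), so each question now sees B*K - 1 negatives at
# no extra encoding cost; the positives stay on the diagonal because GPU 0's
# own passages are concatenated first.
all_p = np.concatenate(p)
cross_batch = contrastive_loss(q[0], all_p)
```

Note that enlarging the negative pool can only increase the softmax denominator, so the loss provides a gradient signal over many more contrasts per step, which is the point of the technique.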

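Contributions 2 and 3 both hinge on the cross-encoder's confidence scores. A hedged sketch follows: the scoring function is a deterministic stand-in for a trained cross-encoder, and the 0.1/0.9 thresholds follow the confidence cutoffs of the kind the paper describes.

```python
import hashlib


def cross_encoder_score(question: str, passage: str) -> float:
    """Stand-in for a trained cross-encoder that jointly encodes the pair;
    here a deterministic toy score in [0, 1] for illustration only."""
    h = hashlib.md5(f"{question} [SEP] {passage}".encode()).digest()
    return int.from_bytes(h[:4], "little") / 2**32


def denoise_hard_negatives(question, retrieved, threshold=0.1):
    """Keep only top-retrieved passages the cross-encoder confidently scores
    as non-relevant; high-scoring ones are likely false negatives
    (unlabeled positives) and are dropped."""
    return [p for p in retrieved if cross_encoder_score(question, p) < threshold]


def pseudo_label(pairs, pos=0.9, neg=0.1):
    """Data augmentation: keep only unlabeled (question, passage) pairs the
    cross-encoder scores confidently, labeling them 1 (positive) or
    0 (negative); the uncertain middle band is discarded."""
    out = []
    for q, p in pairs:
        s = cross_encoder_score(q, p)
        if s >= pos:
            out.append((q, p, 1))
        elif s <= neg:
            out.append((q, p, 0))
    return out
```

The shared design choice is to spend the expensive cross-encoder offline, on cleaning and expanding training data, while keeping the cheap dual-encoder as the deployed retriever.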
Experimental Evaluation

The authors evaluate RocketQA on two well-established QA datasets, MSMARCO and Natural Questions (NQ). On the MSMARCO passage-ranking development set, RocketQA achieves an MRR@10 of 37.0, outperforming previous state-of-the-art dense retrievers. On NQ, it likewise delivers substantial gains in retrieval quality, reflected in higher recall at various top-k cutoffs, and ablation experiments confirm that each of the three strategies contributes to the final performance.

Implications and Future Directions

The proposed techniques advance the training of dual-encoder architectures by demonstrating effective remedies for common challenges in dense retrieval. Practically, the results suggest that careful treatment of negative sampling and data augmentation yields significant improvements in end-to-end QA capability. These strategies are not confined to passage retrieval and can transfer to other domains where similar train/inference disparities arise.

The integration of cross-encoder mechanisms in training dense retrievers opens an avenue for developing hybrid models that combine the strengths of different encoding architectures, potentially leading to even more refined and efficient models in future research.

In summary, RocketQA represents a significant advance in the optimization of dense passage retrieval systems, providing a robust framework that enhances both theoretical understanding and practical application in open-domain QA tasks.

Authors (9)
  1. Yingqi Qu
  2. Yuchen Ding
  3. Jing Liu
  4. Kai Liu
  5. Ruiyang Ren
  6. Wayne Xin Zhao
  7. Daxiang Dong
  8. Hua Wu
  9. Haifeng Wang
Citations (556)