DuReaderretrieval: A Comprehensive Chinese Dataset for Passage Retrieval
The paper introduces DuReaderretrieval, a large-scale Chinese dataset for evaluating and benchmarking passage retrieval systems. It contains over 90,000 queries and 8 million unique passages drawn from a commercial search engine, Baidu. The dataset responds to limitations in existing resources, particularly those for non-English languages. Unlike datasets that are small in scale or rely on machine-generated queries, DuReaderretrieval is human-annotated, which gives a more reliable basis for model training and evaluation.
Improvements and Features
DuReaderretrieval distinguishes itself through several key improvements over its predecessors:
- Manual Annotation for Quality Assurance: The development and test sets have been manually annotated to reduce false negatives, that is, relevant passages left unlabeled. This is a common problem in large-scale datasets, where annotators can review only a small fraction of candidate passages.
- Exclusion of Overlapping Queries: To prevent test information from leaking into training, queries that are semantically similar across the training and testing sets have been identified with query matching models and excluded.
- Cross-Domain and Cross-Lingual Evaluations: The dataset not only offers primary testing sets but also includes two domain-specific testing sets for out-of-domain evaluation, as well as a set of human-translated queries for assessing cross-lingual retrieval capabilities.
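The overlap-exclusion step above can be illustrated with a minimal sketch. The summary says only that query matching models were used, so the similarity function here is a deliberately simple stand-in (Jaccard overlap of character bigrams), not the authors' method. The function names and the 0.6 threshold are hypothetical choices for illustration.

```python
def char_bigrams(text):
    """Return the set of character bigrams in text, ignoring whitespace."""
    chars = "".join(text.split())
    if len(chars) < 2:
        return {chars} if chars else set()
    return {chars[i:i + 2] for i in range(len(chars) - 1)}


def jaccard(a, b):
    """Jaccard similarity between the bigram sets of two strings."""
    set_a, set_b = char_bigrams(a), char_bigrams(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def filter_overlapping(train_queries, test_queries, threshold=0.6):
    """Keep training queries whose similarity to every test query is below threshold.

    The threshold is a hypothetical value; a learned query matching model
    would replace the jaccard() call in a realistic pipeline.
    """
    kept = []
    for query in train_queries:
        if all(jaccard(query, test) < threshold for test in test_queries):
            kept.append(query)
    return kept


if __name__ == "__main__":
    train = ["如何学习编程", "北京今天天气", "如何学习编程语言"]
    test = ["如何学习编程"]
    # Exact and near-duplicate training queries are dropped.
    print(filter_overlapping(train, test))
```

Character bigrams are used because they need no word segmentation, which makes them convenient for Chinese text. They capture only surface overlap, so paraphrases with different wording would pass this filter; that gap is what a learned matching model is meant to close.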
Experimental Findings
Experiments on DuReaderretrieval expose significant weaknesses in current retrieval systems, including failures to match salient phrases and to handle syntactic variation between queries and passages. Dense retrievers, while effective in-domain, generalize poorly across domains and struggle with cross-lingual retrieval, which shows that flexible, general-purpose retrieval remains an open problem.
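To make the in-domain versus out-of-domain comparison concrete, the following is a minimal sketch of how a dense retriever might be scored on a test set. It assumes queries and passages are already encoded as vectors, ranks passages by inner product, and reports a hit-based Recall@k (1 if any relevant passage appears in the top k). The metric choice, the function names, and the toy vectors are assumptions for illustration, not details taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, passage_vecs, k):
    """Return the ids of the k passages with the highest inner product."""
    ranked = sorted(
        passage_vecs,
        key=lambda pid: dot(query_vec, passage_vecs[pid]),
        reverse=True,
    )
    return ranked[:k]


def hit_at_k(retrieved, relevant):
    """Return 1.0 if any relevant passage id was retrieved, else 0.0."""
    return 1.0 if set(retrieved) & set(relevant) else 0.0


def mean_recall_at_k(queries, passage_vecs, k):
    """Average hit-based Recall@k over (query_vector, relevant_ids) pairs."""
    scores = [
        hit_at_k(retrieve(vec, passage_vecs, k), relevant)
        for vec, relevant in queries
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy embeddings, made up purely to exercise the functions.
    passages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
    queries = [([1.0, 0.1], {"p1"}), ([0.1, 1.0], {"p1"})]
    print(mean_recall_at_k(queries, passages, k=1))
    print(mean_recall_at_k(queries, passages, k=3))
```

Running the same function on an in-domain test set and on a domain-specific one, with the same encoder, gives the kind of side-by-side comparison the findings above describe.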
Comparative Analysis
The dataset's scale and manual refinement make it a strong benchmark for Chinese-language passage retrieval, filling a gap left by prior datasets such as TianGong-PDR and Sogou-QCL, which either lack comprehensive human annotation or are limited in size. DuReaderretrieval resembles prominent English datasets such as MS MARCO and Natural Questions, while adding refinements aimed at Chinese retrieval.
Implications for Future Research
The implications of DuReaderretrieval are multifaceted. Practically, it offers a strong foundation for constructing more accurate and contextually aware Chinese retrievers. Theoretically, it challenges assumptions held in transfer learning and cross-lingual adaptation, pushing for advancements in these areas.
As models continue to evolve, DuReaderretrieval provides a platform for testing the limits of retrieval algorithms and for developing more versatile, domain-agnostic retrieval systems. It also supports future work on cross-lingual retrieval and domain adaptation, with potential applications in the many fields that depend on accurate information retrieval.
In conclusion, DuReaderretrieval is a pivotal contribution to the landscape of passage retrieval that encourages deeper inquiry into both the linguistic intricacies of Chinese and the broader challenges of multilingual and cross-domain retrieval tasks.