
DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine (2203.10232v4)

Published 19 Mar 2022 in cs.CL and cs.IR

Abstract: In this paper, we present DuReader_retrieval, a large-scale Chinese dataset for passage retrieval. DuReader_retrieval contains more than 90K queries and over 8M unique passages from a commercial search engine. To alleviate the shortcomings of other datasets and ensure the quality of our benchmark, we (1) reduce the false negatives in development and test sets by manually annotating results pooled from multiple retrievers, and (2) remove the training queries that are semantically similar to the development and testing queries. Additionally, we provide two out-of-domain testing sets for cross-domain evaluation, as well as a set of human translated queries for cross-lingual retrieval evaluation. The experiments demonstrate that DuReader_retrieval is challenging and a number of problems remain unsolved, such as the salient phrase mismatch and the syntactic mismatch between queries and paragraphs. These experiments also show that dense retrievers do not generalize well across domains, and cross-lingual retrieval is essentially challenging. DuReader_retrieval is publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval.

Authors (8)
  1. Yifu Qiu (12 papers)
  2. Hongyu Li (107 papers)
  3. Yingqi Qu (11 papers)
  4. Ying Chen (333 papers)
  5. Qiaoqiao She (9 papers)
  6. Jing Liu (526 papers)
  7. Hua Wu (191 papers)
  8. Haifeng Wang (194 papers)
Citations (12)

Summary

DuReader_retrieval: A Comprehensive Chinese Dataset for Passage Retrieval

The paper introduces DuReader_retrieval, a large-scale Chinese dataset for evaluating and benchmarking passage retrieval systems. It contains more than 90,000 queries and over 8 million unique passages sourced from a commercial search engine (Baidu). DuReader_retrieval responds to limitations of existing datasets, particularly those for non-English languages. Unlike datasets that are small in scale or rely on machine-generated queries, DuReader_retrieval is human-annotated, which gives a more reliable basis for model training and evaluation.

Improvements and Features

DuReader_retrieval distinguishes itself through several key improvements over its predecessors:

  1. Manual Annotation for Quality Assurance: For the development and test sets, results pooled from multiple retrievers were manually annotated to reduce false negatives, a prevalent issue in large-scale datasets where only a small share of candidate passages is ever judged by a human.
  2. Exclusion of Overlapping Queries: To prevent test information from leaking into training, training queries that are semantically similar to development or test queries were identified with a query matching model and removed.
  3. Cross-Domain and Cross-Lingual Evaluations: The dataset not only offers primary testing sets but also includes two domain-specific testing sets for out-of-domain evaluation, as well as a set of human-translated queries for assessing cross-lingual retrieval capabilities.
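The first two points can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the pooling depth, the threshold of 0.5, and the character-bigram Jaccard similarity are all assumptions made here for the example. The summary only says that a query matching model was used, so the lexical similarity below is a deliberately crude stand-in for it.

```python
from typing import Callable, Dict, List, Set


def char_bigrams(text: str) -> Set[str]:
    """Character bigrams of a query (falls back to the whole string if too short)."""
    text = text.strip()
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def jaccard_similarity(a: str, b: str) -> float:
    """Lexical overlap in [0, 1]; an assumed stand-in for a learned query matcher."""
    x, y = char_bigrams(a), char_bigrams(b)
    return len(x & y) / len(x | y)


def pool_candidates(runs: Dict[str, List[str]], depth: int) -> List[str]:
    """Union of the top-`depth` passage ids from each retriever, in first-seen order.

    The pooled list is what human annotators would then judge.
    """
    pooled: List[str] = []
    seen: Set[str] = set()
    for ranking in runs.values():
        for pid in ranking[:depth]:
            if pid not in seen:
                seen.add(pid)
                pooled.append(pid)
    return pooled


def filter_training_queries(
    train: List[str],
    held_out: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep only training queries that are dissimilar to every dev/test query."""
    return [
        q for q in train
        if all(similarity(q, h) < threshold for h in held_out)
    ]


if __name__ == "__main__":
    runs = {"bm25": ["p1", "p2", "p3"], "dense": ["p2", "p4", "p5"]}
    print(pool_candidates(runs, depth=2))

    train = ["如何学习编程", "北京今天天气", "怎么学习编程"]
    held_out = ["如何学习编程语言"]
    print(filter_training_queries(train, held_out))
```

With these toy inputs the near-duplicate "如何学习编程" is dropped, while the paraphrase "怎么学习编程" survives because its bigram overlap is low. That gap is one reason a semantic matching model is a better fit for this step than pure lexical overlap.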

Experimental Findings

Experiments conducted with DuReader_retrieval highlight significant challenges for current retrieval approaches, including mismatches of salient phrases and syntactic mismatches between queries and passages. Dense retrievers, while effective in-domain, generalize poorly across domains and struggle with cross-lingual retrieval, underscoring how far retrieval systems remain from being truly flexible.
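Comparisons like these rest on ranking metrics. The summary does not say which ones were used, so the sketch below simply shows two that are common in passage retrieval, MRR@k and Recall@k. The function names and data layout are assumptions for illustration, and Recall@k is defined here as the share of queries with at least one relevant passage in the top k, which is only one of several conventions.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             qrels: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranking in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                qrels: Dict[str, Set[str]],
                k: int) -> float:
    """Fraction of queries whose top k contains at least one relevant passage."""
    if not rankings:
        return 0.0
    hits = sum(
        1 for qid, ranking in rankings.items()
        if qrels.get(qid, set()) & set(ranking[:k])
    )
    return hits / len(rankings)


if __name__ == "__main__":
    # Toy data: q1 has its relevant passage at rank 2, q2 has none retrieved.
    rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
    qrels = {"q1": {"b"}, "q2": {"x"}}
    print(mrr_at_k(rankings, qrels, k=10))
    print(recall_at_k(rankings, qrels, k=1))
    print(recall_at_k(rankings, qrels, k=3))
```

Running the same functions on in-domain and out-of-domain test sets and comparing the scores is one simple way to quantify the generalization gap described above.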

Comparative Analysis

The dataset's scale and manual refinement position it as a strong benchmark for Chinese-language passage retrieval, filling a gap left by prior datasets such as TianGong-PDR and Sogou-QCL, which either lack comprehensive human annotation or are limited in size. DuReader_retrieval resembles prominent English datasets such as MS-MARCO and Natural Questions, while being tailored to the demands of Chinese retrieval.

Implications for Future Research

The implications of DuReader_retrieval are multifaceted. Practically, it offers a strong foundation for constructing more accurate and contextually aware Chinese retrievers. Theoretically, it challenges assumptions held in transfer learning and cross-lingual adaptation, pushing for advancements in these areas.

As models continue to evolve, DuReader_retrieval provides an essential platform for testing the limits and capabilities of retrieval algorithms, paving the way for more sophisticated, versatile, and domain-agnostic retrieval systems. This dataset sets the stage for future explorations in cross-lingual retrieval and domain adaptation, with potential applications in numerous fields reliant on accurate information retrieval.

In conclusion, DuReader_retrieval is a pivotal contribution to the landscape of passage retrieval that encourages deeper inquiry into both the linguistic intricacies of Chinese and the broader challenges of multilingual and cross-domain retrieval tasks.
