Overview of DuReader: A Chinese Machine Reading Comprehension Dataset
The paper presents DuReader, a large-scale Chinese Machine Reading Comprehension (MRC) dataset aimed at advancing research on real-world MRC applications. DuReader has three features that distinguish it from prior MRC datasets: diverse data sources, a wide range of question types, and large scale. The dataset comprises 200,000 questions, 420,000 answers, and 1 million documents, making it the largest Chinese MRC dataset at the time of its release.
Key Contributions
- Data Sources: DuReader leverages data from Baidu Search and Baidu Zhidao, effectively combining search logs and community-based question answering content. This combination provides a more realistic set of questions and documents compared to other datasets that rely heavily on synthetic data or crowdsourced questions from limited sources.
- Question Types: The dataset includes a broader array of question types, specifically incorporating yes-no and opinion-based questions, which prior datasets have often neglected. This variety presents a significant challenge for MRC systems, promoting the development of models capable of dealing with more complex question formats beyond simple fact-based queries.
- Scale: DuReader's 200,000 questions are paired with 420,000 human-generated answers, reinforcing its utility for training and evaluating models on a diverse range of queries.
Experimental Results and Insights
The paper reports the performance of baseline systems on DuReader built on well-established MRC models, Match-LSTM and BiDAF. While these models show clear improvements over simple baselines, they still fall short of human performance by a considerable margin. This gap indicates the substantial complexity and challenge posed by DuReader, especially compared with datasets on which state-of-the-art models have attained performance close to or surpassing human benchmarks.
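Both baseline models ultimately predict an answer by selecting a span of the document: they produce a start distribution and an end distribution over token positions and pick the highest-scoring valid pair. A minimal sketch of that final decoding step (the probabilities here are illustrative stand-ins, not outputs of an actual model):

```python
def best_span(start_probs, end_probs, max_span_len=20):
    """Return (start, end) maximizing start_probs[i] * end_probs[j]
    subject to j >= i and a maximum span length.

    This is the generic span-decoding step shared by extractive MRC
    models such as Match-LSTM and BiDAF; the probability lists would
    normally come from the model's pointer/output layers.
    """
    best_pair, best_score = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, min(i + max_span_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair

# Toy distributions over a 3-token passage: the decoder picks the
# span from token 1 to token 2.
span = best_span([0.1, 0.6, 0.3], [0.2, 0.2, 0.6])
```

The brute-force double loop is quadratic in passage length; real systems typically vectorize this, but the selection criterion is the same.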
In particular, the paper observes significant challenges in handling yes-no and opinion-based questions: current span-selection methods are inadequate for questions whose answers must be summarized across multiple documents. Furthermore, the long documents and the multiple paragraphs per document pose a paragraph-selection challenge that current systems have not fully addressed, demanding more sophisticated paragraph ranking and selection strategies.
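A common baseline strategy for the paragraph-selection problem is a simple overlap heuristic: score each paragraph by how much of the question (or, during training, the reference answer) it covers, and feed only the top-scoring paragraph to the reader. The sketch below, a simplified illustration rather than the paper's exact procedure, uses bag-of-words recall as the score:

```python
from collections import Counter

def recall_overlap(candidate_tokens, reference_tokens):
    """Fraction of reference tokens covered by the candidate
    (bag-of-words recall, counting duplicates)."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    covered = sum(min(cand[t], ref[t]) for t in ref)
    return covered / max(sum(ref.values()), 1)

def select_paragraph(paragraphs, question_tokens):
    """Pick the paragraph with the highest token recall w.r.t. the
    question -- a crude but common paragraph-ranking baseline."""
    return max(paragraphs, key=lambda p: recall_overlap(p, question_tokens))

# Toy pre-tokenized document with two paragraphs.
doc = [["北京", "是", "中国", "的", "首都"],
       ["今天", "天气", "很", "好"]]
question = ["中国", "的", "首都", "是", "哪里"]
best = select_paragraph(doc, question)
```

Because such heuristics ignore semantics, they are exactly the kind of component the paper argues needs replacing with learned ranking models.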
Future Implications
The paper suggests several directions for future research:
- Development of Novel Models: New algorithms and architectures could be devised to tackle the extensive range of question types and the comprehensive nature of the dataset. Particular attention should be paid to multi-document summarization and opinion recognition, which are less explored in existing MRC frameworks.
- Enhanced Evaluation Metrics: The paper proposes an innovative opinion-aware evaluation method for better assessing systems on yes-no questions, encouraging further exploration in developing metrics that capture the subtleties and complexities posed by real-world data.
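The core idea of opinion-aware evaluation is that a candidate answer is only matched against reference answers carrying the same opinion label (e.g. Yes, No, Depends), so a fluent answer with the wrong polarity scores zero. The sketch below illustrates that conditioning with a simple unigram F1 in place of the BLEU/ROUGE used in the paper; `unigram_f1` and the label names are illustrative assumptions:

```python
from collections import Counter

def unigram_f1(pred_tokens, ref_tokens):
    """Token-level F1 between a prediction and one reference
    (a stand-in for BLEU/ROUGE in this sketch)."""
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def opinion_aware_score(pred_tokens, pred_label, references):
    """references: list of (answer_tokens, opinion_label) pairs.
    The candidate is scored only against references whose opinion
    label matches its predicted label; a label mismatch scores 0,
    penalizing answers with the wrong polarity."""
    matching = [ans for ans, label in references if label == pred_label]
    if not matching:
        return 0.0
    return max(unigram_f1(pred_tokens, ans) for ans in matching)

refs = [(["是", "的"], "Yes"), (["不", "是"], "No")]
good = opinion_aware_score(["是", "的"], "Yes", refs)      # polarity agrees
bad = opinion_aware_score(["是", "的"], "Depends", refs)   # no matching label
```

The key design choice is that lexical similarity alone cannot compensate for a wrong opinion label, which is precisely what ordinary BLEU/ROUGE fails to capture on yes-no questions.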
- Dataset Expansion: Future iterations of DuReader could incorporate additional annotations, such as opinion tagging for all types of questions, further enriching the dataset's utility.
The paper also notes a shared task organized to foster community engagement and stimulate progress in MRC research. The significant improvements observed since the task was launched underscore the dataset's potential to drive advances in building robust, comprehensive Chinese MRC systems.
In conclusion, DuReader marks a pivotal step in MRC research. By offering a dataset with broad coverage and realistic queries, it gives researchers an invaluable resource for developing more capable and nuanced models, setting a new standard for MRC tasks in the Chinese language context.