An Analytical Overview of "A Sentence Cloze Dataset for Chinese Machine Reading Comprehension"
The paper "A Sentence Cloze Dataset for Chinese Machine Reading Comprehension" introduces a novel dataset and task formulation aimed at advancing machine reading comprehension (MRC) in Chinese. Its central contribution is a focus on sentence-level inference, a difficulty largely absent from existing datasets, which predominantly center on token-level or span-level answers.
Task Definition and Dataset Overview
The authors propose a Sentence Cloze-style Machine Reading Comprehension (SC-MRC) task, in which a system must fill designated blanks in a passage with the correct candidate sentences. The constructed dataset, CMRC 2019, contains over 100,000 blanks within more than 10,000 passages extracted from Chinese narrative stories. A critical feature of the dataset is the inclusion of fake candidates, sentences that are contextually similar to the correct options, which raises the difficulty of discriminating among candidates. This design requires models to perform genuine contextual reasoning and judgment rather than simple surface matching against the candidate pool.
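To make the task format concrete, the following is a minimal toy illustration of an SC-MRC instance and a helper that restores the gold passage. The field names and the English stand-in text are assumptions for exposition, not the official CMRC 2019 schema (the real data is Chinese narrative text):

```python
# Toy SC-MRC instance: a passage with blanks, a candidate pool that
# includes a "fake" distractor, and a gold mapping from blank to candidate.
example = {
    "passage": "Tom woke up late. [BLANK1] He ran all the way to school. [BLANK2]",
    "candidates": [
        "He skipped breakfast entirely.",     # gold answer for [BLANK1]
        "He arrived just as the bell rang.",  # gold answer for [BLANK2]
        "He decided to stay home all day.",   # fake: plausible style, wrong context
    ],
    "answers": {"[BLANK1]": 0, "[BLANK2]": 1},  # blank -> candidate index
}

def fill(instance):
    """Reconstruct the full passage by substituting each gold candidate."""
    text = instance["passage"]
    for blank, idx in instance["answers"].items():
        text = text.replace(blank, instance["candidates"][idx])
    return text
```

A model sees only the passage and the shuffled candidate pool; the fake candidate is never a correct answer for any blank, which is what forces discrimination beyond style matching.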
Methodological Approaches
To construct the dataset, passages are segmented into sentences using the Language Technology Platform (LTP), and blanks are chosen so that the removed sentences preserve contextual integrity and yield appropriate difficulty. Fake candidates are generated by selecting sentences from contiguous narrative context outside the examined passage, which keeps them topically relevant while still requiring careful discrimination to reject.
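One plausible way to realize this distractor strategy is sketched below. This is an illustrative simplification, not the authors' exact pipeline: it assumes the source story is already sentence-segmented (e.g., by LTP) and simply draws fakes from a window of sentences just outside the passage boundaries, so that distractors share topic and style with the true candidates:

```python
def fake_candidates(document_sentences, passage_start, passage_end,
                    n_fakes, window=5):
    """Draw distractor sentences from the narrative surrounding the passage.

    document_sentences: all sentences of the source story, in order.
    passage_start/passage_end: half-open index range of the examined passage.
    Returns up to n_fakes sentences from just before and just after it.
    """
    before = document_sentences[max(0, passage_start - window):passage_start]
    after = document_sentences[passage_end:passage_end + window]
    pool = before + after  # nearby context: relevant but outside the passage
    return pool[:n_fakes]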
Baseline Evaluation
Baseline systems were implemented using pre-trained language models, notably BERT and its variants, including models trained with whole word masking to enhance contextual understanding. The models were assessed using two metrics: Question-level Accuracy (QAC), the fraction of individual blanks filled correctly, and Passage-level Accuracy (PAC), the proportion of passages in which every blank is filled correctly.
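The two metrics follow directly from those definitions. A minimal sketch, assuming predictions and gold answers are stored as mappings from (passage id, blank id) to a candidate index:

```python
from collections import defaultdict

def qac(predictions, gold):
    """Question-level Accuracy: fraction of individual blanks filled correctly."""
    correct = sum(1 for key in gold if predictions.get(key) == gold[key])
    return correct / len(gold)

def pac(predictions, gold):
    """Passage-level Accuracy: fraction of passages with ALL blanks correct."""
    per_passage = defaultdict(list)
    for (pid, bid), answer in gold.items():
        per_passage[pid].append(predictions.get((pid, bid)) == answer)
    return sum(all(flags) for flags in per_passage.values()) / len(per_passage)
```

PAC is strictly harder than QAC: one wrong blank zeroes out the whole passage, which is why the paper's baselines trail humans most visibly on this metric.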
Results and Implications
Initial results show that these baseline models fall well short of human performance, particularly on PAC, underscoring the sentence-level comprehension the dataset demands. Even the best-performing model, RoBERTa-wwm-ext-large, leaves a sizeable gap to human accuracy, indicating the dataset's effectiveness at testing sophisticated reasoning in current models.
Future Directions
The introduction of CMRC 2019 sets a new benchmark for evaluating sentence-level comprehension and thereby opens a clear path for advancing Chinese MRC research. The dataset challenges both model architectures and training paradigms to incorporate more nuanced reasoning strategies. Future work might explore architectures that treat passage coherence as an explicit learning objective, or unsupervised pre-training strategies that better capture narrative structure.
In sum, the paper lays essential groundwork for future explorations into machine reading comprehension at the sentence level, stimulating avenues for innovation within the natural language processing community. The implications of such work are far-reaching, potentially improving applications ranging from intelligent tutoring systems to automated summarization and question-answering systems with enhanced inferential capabilities.