
A Span-Extraction Dataset for Chinese Machine Reading Comprehension (1810.07366v2)

Published 17 Oct 2018 in cs.CL

Abstract: Machine Reading Comprehension (MRC) has become enormously popular recently and has attracted a lot of attention. However, the existing reading comprehension datasets are mostly in English. In this paper, we introduce a Span-Extraction dataset for Chinese machine reading comprehension to add language diversities in this area. The dataset is composed by near 20,000 real questions annotated on Wikipedia paragraphs by human experts. We also annotated a challenge set which contains the questions that need comprehensive understanding and multi-sentence inference throughout the context. We present several baseline systems as well as anonymous submissions for demonstrating the difficulties in this dataset. With the release of the dataset, we hosted the Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018). We hope the release of the dataset could further accelerate the Chinese machine reading comprehension research. Resources are available: https://github.com/ymcui/cmrc2018

A Span-Extraction Dataset for Chinese Machine Reading Comprehension

The paper presents a novel span-extraction dataset for Chinese machine reading comprehension (MRC). Recognizing the predominance of English-language datasets in MRC research, this work aims to foster linguistic diversity by introducing a comprehensive, human-expert-annotated dataset for Chinese.

Key Contributions

The dataset comprises about 20,000 questions based on Chinese Wikipedia, annotated to ensure relevance and accuracy. Importantly, it includes a distinct challenge set designed to evaluate whether MRC systems can reason over multiple sentences within a passage. Unlike existing cloze-style Chinese datasets, it requires direct extraction of answer spans from passages, following the format of SQuAD.

Methodology

The dataset development involved:

  • Data Collection: Extracting Chinese text from Wikipedia and converting it into a structured, passage-question-answer format.
  • Human Annotation: Careful curation by annotators, who adhered to rules ensuring linguistic suitability and annotator consistency.
  • Challenge Set: Designed to test models beyond basic comprehension by requiring inference across sentences, a setting in which even BERT-based models show a marked performance drop.
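Since the dataset follows the SQuAD span-extraction format, each example pairs a passage with a question and a character-offset answer. A minimal sketch of such a record is shown below; the field names follow the SQuAD convention, and the exact CMRC 2018 schema in the released files may differ in detail.

```python
# A hypothetical SQuAD-style record: the answer is a span of the context,
# located by its character offset ("answer_start").
example = {
    "context": "北京是中华人民共和国的首都。",
    "question": "中华人民共和国的首都是哪里？",
    "answers": [{"text": "北京", "answer_start": 0}],
}

def extract_span(record: dict) -> str:
    """Recover the answer text from the context via its start offset."""
    ans = record["answers"][0]
    start = ans["answer_start"]
    return record["context"][start:start + len(ans["text"])]

print(extract_span(example))  # 北京
```

The offset-based representation is what distinguishes span extraction from cloze-style datasets: the model must point at a contiguous region of the passage rather than fill in a removed word.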

Evaluation and Results

The paper evaluates several baseline systems, including a Chinese BERT-based model. Notably, the BERT-based approaches achieved strong performance on the standard set, though F1 scores dropped noticeably on the more complex challenge questions.

Evaluation metrics included Exact Match (EM) and F1-score. EM scores were substantially lower than F1 across all splits, highlighting the difficulty of precise span-boundary detection in Chinese, which lacks explicit word delimiters.
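The two metrics can be sketched as follows. This is a simplified illustration, not the official CMRC 2018 evaluation script: F1 is computed here at the character level (a common choice for Chinese, since there are no word boundaries), and the official script additionally normalizes punctuation and whitespace.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted span is identical to the gold span, else 0.0."""
    return float(prediction == reference)

def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1: overlap between predicted and gold characters."""
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# A near-miss span gets partial F1 credit but zero EM:
print(exact_match("北京市", "北京"))  # 0.0
print(char_f1("北京市", "北京"))      # 0.8
```

The gap between the two metrics in this toy case mirrors the paper's observation: a system can locate roughly the right region (high F1) while still missing the exact span boundary (low EM).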

Implications and Future Directions

This dataset extends the scope of MRC research to include varied languages, offering both a resource for cross-linguistic studies and a basis for developing systems resilient to complex comprehension tasks. Bridging performance gaps observed in challenge sets could guide future model enhancements, focusing on multi-sentence reasoning and improved language understanding.

Conclusion

The CMRC 2018 dataset is a substantial resource for extending MRC research into a multilingual endeavor. Its structured annotation and challenge set establish a new benchmark for evaluating comprehension capabilities in Chinese. As researchers engage with the dataset, it may prompt developments in model architecture and training methodology, especially for languages that remain underrepresented in AI research.

Authors (8)
  1. Yiming Cui (80 papers)
  2. Ting Liu (329 papers)
  3. Wanxiang Che (152 papers)
  4. Li Xiao (85 papers)
  5. Zhipeng Chen (46 papers)
  6. Wentao Ma (35 papers)
  7. Shijin Wang (69 papers)
  8. Guoping Hu (39 papers)
Citations (174)