Examination of RACE: A Large-Scale Reading Comprehension Dataset
This essay provides an expert overview of the paper "RACE: Large-scale ReAding Comprehension Dataset From Examinations" by Guokun Lai et al. The paper introduces RACE, a large dataset curated for benchmarking reading comprehension methods in NLP. The dataset is compiled from English examinations designed for middle and high school students in China, written by English teachers to assess students' comprehension and reasoning abilities.
Introduction
Deep learning has driven rapid progress in NLP, raising the prospect of systems that match human performance on tasks requiring language comprehension. A key part of this effort is constructing datasets that accurately evaluate machine comprehension systems. Existing datasets, however, often suffer from limitations such as answers being recoverable through simple word matching and the variable quality of crowd-sourced or automatically generated questions.
Dataset Overview
RACE addresses these limitations by providing 27,933 passages and 97,687 questions collected from English exams for Chinese middle and high school students. Because the questions are designed for educational evaluation, they demand a higher level of reasoning than those in other datasets. This is reflected in the substantial gap the paper reports between state-of-the-art model accuracy (43%) and the ceiling human performance (95%).
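To make the format concrete, the sketch below loads RACE through the Hugging Face `datasets` hub and prints one passage-question-options triple. This assumes the publicly hosted `race` configuration ("middle", "high", "all") and its field names, which may differ across library versions; it is an illustration of the multiple-choice format rather than part of the paper itself.

```python
# Minimal sketch: inspect one RACE example via the Hugging Face datasets hub.
# Assumes the hosted "race" dataset (configs "middle", "high", "all") is available;
# the field names below reflect that hosted copy and may differ in other versions.
from datasets import load_dataset

race = load_dataset("race", "all")        # splits: train / validation / test
example = race["train"][0]

print(example["article"][:300], "...")    # the reading passage
print("Q:", example["question"])          # exam question written by a teacher
print("Options:", example["options"])     # four candidate answers
print("Gold answer:", example["answer"])  # a letter from "A" to "D"
```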
Unique Features and Contributions
Higher Proportion of Reasoning Questions
RACE distinguishes itself by containing a larger proportion of questions requiring multi-sentence reasoning than datasets such as CNN/Daily Mail, SQuAD, and NewsQA. Human annotation reported in the paper shows that 25.8% of RACE questions require multi-sentence reasoning, versus only 2.0% in CNN/Daily Mail.
Diverse Question Types
Another unique attribute is the variety of reasoning types covered by RACE, such as passage summarization and attitude analysis, which are underrepresented in existing large-scale datasets. These questions are designed to assess not just literal comprehension but more nuanced understanding, such as author viewpoint and holistic passage synthesis.
Broad Topic and Style Coverage
The passages in RACE span multiple domains and styles, diverging from datasets that focus predominantly on specific contexts or domains, such as news articles or fictional stories. This diversity makes RACE valuable for testing the general reading comprehension capabilities of systems.
Experimental Results
The paper's experimental section evaluates several prominent models on the RACE dataset:
- Sliding Window Algorithm: Achieved 32.2% accuracy, underscoring the limits of pure lexical matching on questions that require deeper reasoning (a simplified sketch of this baseline follows the list).
- Stanford Attentive Reader (Stanford AR): This model scored 43.3%, indicating some capability in comprehension but still far below human performance.
- Gated-Attention Reader (GA): Achieved 44.1%, marginally better than Stanford AR and the strongest machine result reported, yet still far below human ceiling performance.
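For orientation, the sliding-window baseline can be sketched as follows. This is a simplified rendition in the spirit of the MCTest baseline (Richardson et al., 2013) that the RACE paper adopts: it scores each candidate answer by the best weighted word overlap between the question-plus-option word set and any window of the passage, with rarer words weighted more heavily. Details such as the distance-based feature of the original baseline are omitted here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, punctuation-stripped whitespace tokenization (simplified)."""
    tokens = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    return [w for w in tokens if w]


def sliding_window_score(passage, question, option):
    """Best overlap between the question+option word set and any passage window.

    Words are weighted by inverse log-frequency, log(1 + 1/count), so rare
    matches count more, roughly following the MCTest sliding-window baseline.
    """
    p_tokens = tokenize(passage)
    counts = Counter(p_tokens)
    weight = {w: math.log(1.0 + 1.0 / counts[w]) for w in counts}
    target = set(tokenize(question)) | set(tokenize(option))
    size = max(len(target), 1)

    best = 0.0
    for start in range(max(len(p_tokens) - size + 1, 1)):
        window = p_tokens[start:start + size]
        best = max(best, sum(weight.get(w, 0.0) for w in window if w in target))
    return best


def predict(passage, question, options):
    """Return the index of the option with the highest window-overlap score."""
    scores = [sliding_window_score(passage, question, o) for o in options]
    return max(range(len(options)), key=lambda i: scores[i])
```

Because the scoring is purely lexical, such a baseline cannot handle summarization, attitude, or multi-sentence inference questions, which is consistent with its low accuracy on RACE.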
Human vs. Machine Performance
Human performance on RACE, as measured by qualified crowd-workers, is 73.3%, with an estimated ceiling (upper-bound) performance of 94.5%. This highlights substantial room for improvement before current NLP systems meet or exceed human levels on reading comprehension tasks.
Implications and Future Directions
The introduction of RACE as a benchmark dataset is pivotal for several reasons. It underscores the complexity of developing comprehension models that can address nuanced and inferential questions effectively. Given the significant gap between current model performance and human capabilities, future research can focus on:
- Reasoning Enhancement: Improving models' multi-sentence reasoning ability.
- Broader Comprehension Skills: Developing systems to handle diverse question types, such as attitude analysis and summarization.
- Robust Evaluation Mechanisms: Leveraging RACE's comprehensive coverage to stress-test models over varied text styles and domains.
Conclusion
The RACE dataset sets a new standard for the evaluation of machine comprehension tasks. Its emphasis on human-like reasoning and diversity in content provides a stringent benchmark for future NLP research. The significant performance gap identified by the authors between state-of-the-art models and human performance provides a clear directive for future advancements in AI reading comprehension systems.