Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning
The paper introduces Quoref, a dataset designed to advance the evaluation of reading comprehension models by focusing on coreferential reasoning. It addresses a gap in existing benchmarks, which largely test understanding of local predicate-argument structure and therefore do little to assess whether models can resolve coreferences, a capability vital for processing longer text passages.
Quoref comprises over 24,000 span-selection questions over Wikipedia paragraphs. Its distinguishing feature is that resolving coreferences is required to answer the questions correctly. To ensure the questions truly probe coreferential reasoning, an adversarial baseline model, built from a strong reading comprehension system, was incorporated into the crowdsourcing process: it pushed annotators to craft questions that cannot be answered through surface-level cues easily exploitable by such systems.
Evaluating current reading comprehension models on Quoref reveals significant shortcomings: the best-performing model reaches 70.5 F1, well below the estimated human performance of 93.4 F1. This gap underscores the challenge Quoref poses and highlights the need for methods that explicitly incorporate coreferential reasoning.
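For context on how these F1 numbers are typically computed for span-selection QA, the sketch below implements the SQuAD-style token-overlap F1 commonly used for such benchmarks. That Quoref uses exactly this formulation is an assumption here; the tokenization is deliberately simplified to whitespace splitting.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer span.

    Assumes simple lowercased whitespace tokenization; real evaluation
    scripts also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: how many tokens the two spans share.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "the red house" against the gold span "red house" yields precision 2/3 and recall 1, so F1 is 0.8; a dataset-level score such as 70.5 F1 is the average of these per-question scores.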
Dataset Construction and Analysis
Quoref was constructed by scraping English Wikipedia articles spanning diverse topics such as history, geography, and film. Crowdworkers were then instructed to identify co-referring spans in a paragraph and write questions about them. Crucially, each candidate question had to withstand the adversarial model in real time: questions the baseline could answer were sent back for revision, ensuring that accepted questions require genuine comprehension of the coreferential phenomena.
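The filtering loop just described can be sketched as follows. Here `propose_question` and `baseline_model` are hypothetical stand-ins for the crowd worker interface and the adversarial baseline, and the exact-match rejection criterion and retry limit are assumptions of this sketch, not details from the paper.

```python
def passes_adversarial_check(prediction: str, gold_answer: str) -> bool:
    """A question survives only if the baseline's answer misses the gold span.

    Exact string match is used here for simplicity; the actual acceptance
    criterion in the crowdsourcing interface is an assumption of this sketch.
    """
    return prediction.strip().lower() != gold_answer.strip().lower()


def collect_question(passage, propose_question, baseline_model, max_tries=3):
    """Hypothetical adversarial crowdsourcing loop.

    The worker (simulated by `propose_question`) revises the question until
    the adversarial baseline fails to answer it, or gives up after
    `max_tries` attempts.
    """
    for _ in range(max_tries):
        question, gold = propose_question(passage)
        prediction = baseline_model(question, passage)
        if passes_adversarial_check(prediction, gold):
            return question, gold  # accepted into the dataset
    return None  # no adversarially robust question found
```

The design point this illustrates is that rejection happens in real time, during writing, so workers learn to avoid lexical-overlap shortcuts rather than having questions filtered out after the fact.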
A manual analysis of a subset of Quoref questions confirmed that approximately 78% could not be answered without resolving coreferences. The analysis also identified the types of reasoning involved: pronominal resolution, nominal resolution, combinations of the two, and in some cases commonsense reasoning.
Implications and Path Forward
Quoref stands as an impactful contribution that paves the way for the next generation of reading comprehension benchmarks. Its focus on coreferential reasoning challenges present models and emphasizes areas where state-of-the-art systems fall short against human performance.
The implications for future AI development are substantial. Reading comprehension models must be refined to process coreferential information adeptly, a capability crucial for narrative text comprehension, dialogue systems, and information retrieval from unstructured text. As models evolve, richer context representation and deeper modeling of discourse phenomena will likely drive advances across natural language processing.
In conclusion, Quoref is poised to serve as a robust benchmark challenging current techniques and fostering future innovations in reading comprehension research. Researchers are encouraged to engage with Quoref to develop more sophisticated models capable of true textual comprehension, ultimately bridging the performance gap observed between current models and human baseline understanding.