
Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning (1908.05803v2)

Published 16 Aug 2019 in cs.CL

Abstract: Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark---the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.

Authors (5)
  1. Pradeep Dasigi (29 papers)
  2. Nelson F. Liu (19 papers)
  3. Ana Marasović (27 papers)
  4. Noah A. Smith (224 papers)
  5. Matt Gardner (57 papers)
Citations (168)

Summary

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

The paper introduces Quoref, a dataset aimed at evaluating reading comprehension models on coreferential reasoning. It addresses a gap in existing benchmarks, which largely emphasize understanding of local predicate-argument structure and therefore inadequately assess a model's ability to resolve coreference, a capability vital for comprehending text longer than a single sentence.

Quoref comprises over 24,000 span-selection questions grounded in more than 4,700 English Wikipedia paragraphs. Its distinctive property is that resolving coreference is required to answer the questions correctly. To ensure the questions truly probe coreferential reasoning, a strong baseline model was used as an adversary in the crowdsourcing loop, pushing annotators to write questions free of the surface-level cues that reading comprehension systems can easily exploit.

Evaluating current reading comprehension models on Quoref reveals significant shortcomings: the best-performing model achieves 70.5 F1, well below the estimated human performance of 93.4 F1. This gap underscores the challenge Quoref poses and highlights the need for methods that explicitly incorporate coreferential reasoning.
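The F1 scores above are the standard span-selection metric: token-level overlap between the predicted and gold answer spans. A minimal sketch of that computation (mirroring the common SQuAD-style formula; the paper's exact normalization of punctuation and articles may differ):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the queen of France", "queen of France"))  # ≈ 0.857
```

Dataset-level F1 is this quantity averaged over questions (taking the maximum over gold answers when several are annotated).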

Dataset Construction and Analysis

Quoref was constructed from diverse English Wikipedia articles spanning topics such as history, geography, and film. Crowdworkers were instructed to identify co-referring spans and write questions about them. Crucially, each candidate question had to resist being answered correctly by the adversarial model in real time, ensuring that accepted questions genuinely require comprehension of the coreferential phenomena rather than surface matching.
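The model-in-the-loop filter can be sketched as a simple acceptance check. Everything here is illustrative: `toy_adversary` is a hypothetical stand-in for the baseline QA model, and the exact-match rejection criterion is an assumption; the actual interface shows workers the model's prediction and may apply a softer overlap test.

```python
from typing import Callable

def accept_question(
    question: str,
    paragraph: str,
    gold_answer: str,
    adversary: Callable[[str, str], str],
) -> bool:
    """Accept a crowdsourced question only when the adversarial baseline
    fails to recover the gold answer (exact-match criterion is an assumption)."""
    prediction = adversary(question, paragraph)
    return prediction.strip().lower() != gold_answer.strip().lower()

def toy_adversary(question: str, paragraph: str) -> str:
    """Hypothetical stand-in for the baseline model: guesses the first
    capitalized token in the paragraph."""
    for tok in paragraph.split():
        if tok[:1].isupper():
            return tok
    return ""

para = "Mary met the director. She praised his latest film."
# The toy adversary guesses "Mary", so this question is rejected as too easy.
print(accept_question("Who praised the film?", para, "Mary", toy_adversary))
```

In the real pipeline the worker would revise a rejected question until the adversary fails, which is what drives out exploitable lexical cues.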

A manual analysis of a subset of Quoref questions confirmed that approximately 78% could not be answered without resolving coreferences. This analysis also identified types of reasoning involved: pronominal resolution, nominal resolution, combinations of the two, and in some cases, commonsense reasoning.

Implications and Path Forward

Quoref stands as an impactful contribution that paves the way for the next generation of reading comprehension benchmarks. Its focus on coreferential reasoning challenges present models and emphasizes areas where state-of-the-art systems fall short against human performance.

The implications for future AI developments are substantial. Reading comprehension models must be refined to adeptly process coreferential information, a development crucial for tasks involving narrative text comprehension, dialogue systems, and information retrieval from unstructured text. As models evolve, integrating enhanced context representation and deeper understanding of discourse phenomena will likely influence advancements in natural language processing and artificial intelligence.

In conclusion, Quoref is poised to serve as a robust benchmark challenging current techniques and fostering future innovations in reading comprehension research. Researchers are encouraged to engage with Quoref to develop more sophisticated models capable of true textual comprehension, ultimately bridging the performance gap observed between current models and human baseline understanding.