
Know What You Don't Know: Unanswerable Questions for SQuAD (1806.03822v1)

Published 11 Jun 2018 in cs.CL

Abstract: Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0.

Know What You Don't Know: Unanswerable Questions for SQuAD

Overview

The paper "Know What You Don't Know: Unanswerable Questions for SQuAD," authored by Pranav Rajpurkar, Robin Jia, and Percy Liang from Stanford University, addresses a critical issue in the domain of reading comprehension datasets—specifically, the challenge posed by unanswerable questions.

The Stanford Question Answering Dataset (SQuAD) has been a benchmark in reading comprehension research, but it contains only answerable questions, each of whose answers is a span of the accompanying paragraph. The authors identify this as a limitation, since real-world question-answering (QA) systems must also recognize and handle questions that the available text does not answer. The paper therefore introduces SQuAD 2.0, which combines the existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to closely resemble answerable ones, providing a more rigorous test for QA systems.
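
For reference, the released SQuAD 2.0 files follow the original SQuAD JSON layout, with each question additionally carrying an is_impossible flag and an empty answer list when no answer is supported. The following minimal Python sketch, assuming the official train-v2.0.json file is present locally, counts answerable versus unanswerable questions:

```python
import json

def count_questions(path):
    """Count answerable vs. unanswerable questions in a SQuAD 2.0 JSON file."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)["data"]
    answerable = unanswerable = 0
    for article in dataset:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # Unanswerable questions are marked with is_impossible=True
                # and carry no gold answer spans.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable

print(count_questions("train-v2.0.json"))
```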

Problem Statement

The primary problem addressed in the paper is that QA systems must not only extract correct answers but also recognize when the context supports no answer at all and abstain. This capability is crucial for practical applications, where the cost of providing incorrect or misleading information is high. Evaluation of such systems has predominantly focused on answerable questions, leaving a gap in how reliably their ability to decline to answer is assessed.

Methodology

To bridge this gap, the authors extend the SQuAD dataset with a substantial number of carefully designed unanswerable questions. Crowdworkers wrote these questions to be relevant to the paragraph and to closely mirror the structure and phrasing of answerable ones, even though the paragraph contains no supported answer. This poses a significant challenge to QA systems that rely on surface-level pattern recognition and context matching.

In terms of the experimental setup, the authors evaluate strong existing models on the extended dataset, including the BiDAF-No-Answer (BNA) model and DocumentQA No-Answer (DocQA), the latter also with ELMo contextual embeddings. The primary evaluation metrics are Exact Match (EM) and F1 score, adapted so that on unanswerable questions a system is credited only when it predicts no answer.
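
To make the adapted metric concrete, the following Python sketch, a simplified reimplementation rather than the official evaluation script, scores a single prediction against a single gold answer; an unanswerable question is represented by an empty gold string, so credit is given only when the system also returns the empty string:

```python
import collections
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: both must be empty (an abstention) to get credit.
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```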

Numerical Results

The empirical results demonstrate a marked drop in performance when QA systems are tested on SQuAD 2.0 rather than the original SQuAD. The strongest baseline reported, DocQA with ELMo, achieves 86% F1 on SQuAD 1.1 but only 66% F1 on SQuAD 2.0, while human performance on SQuAD 2.0 remains around 89% F1. These results illustrate the increased difficulty and the limitations of current QA models in handling the additional complexity introduced by unanswerable questions.

Implications and Future Work

The implications of this research are twofold. Practically, it underscores the necessity for QA systems deployed in real-world scenarios to incorporate mechanisms that can reliably identify unanswerable questions. This has direct applications in fields such as automated customer service, medical information retrieval, and legal document analysis, where inaccurate answers can have significant consequences.

Theoretically, the introduction of SQuAD 2.0 paves the way for novel research directions focusing on the robustness and reliability of QA models. Future developments could explore more sophisticated techniques for uncertainty estimation, ensemble methods to cross-verify answerability, and hybrid models integrating rule-based approaches with neural networks to enhance unanswerability detection.
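
As a baseline point of comparison for such techniques, a simple and widely used abstention rule on SQuAD 2.0 is to compare the model's best span score against a no-answer score and answer only when the margin exceeds a threshold tuned on development data. The sketch below illustrates this rule; the function, inputs, and example scores are hypothetical, and the rule reflects common practice rather than a method proposed in the paper.

```python
def predict_with_abstention(best_span_text, best_span_score,
                            no_answer_score, threshold=0.0):
    """Answer only when the best span beats the no-answer option by a margin
    larger than a threshold tuned on the development set; otherwise abstain
    by returning the empty string ("no answer")."""
    if best_span_score - no_answer_score > threshold:
        return best_span_text
    return ""

# Hypothetical usage: the scores would come from a span-extraction model.
print(predict_with_abstention("Denver Broncos", 7.2, 5.1, threshold=1.0))  # answers
print(predict_with_abstention("Denver Broncos", 5.3, 5.1, threshold=1.0))  # abstains
```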

The authors' contribution of SQuAD 2.0 represents a vital step towards more comprehensive evaluation frameworks in QA research, enabling the development of systems better suited to real-world complexities. Subsequent research efforts are anticipated to build upon this foundation, further advancing the state of the art in reliable and accurate question answering systems.

Authors (3)
  1. Pranav Rajpurkar (69 papers)
  2. Robin Jia (59 papers)
  3. Percy Liang (239 papers)
Citations (2,632)