- The paper presents BoolQ, a dataset of roughly 16,000 naturally occurring yes/no question-passage pairs, and highlights the inferential reasoning many of them demand.
- It shows through transfer learning experiments that intermediate pre-training on MultiNLI significantly boosts performance, with the best model reaching 80.43% accuracy on the test set.
- The research underscores the need for advanced pre-training methods to capture implicit reasoning, paving the way for future NLU improvements.
Exploring the Surprising Difficulty of Natural Yes/No Questions: An Examination of BoolQ
The paper "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions" introduces a novel dataset, BoolQ, which is designed to test machine learning models on their ability to answer naturally occurring, binary (yes/no) questions based on text passages. This dataset fills a notable gap in evaluating natural language understanding (NLU) systems, especially in tasks requiring a high degree of inferential reasoning. The authors provide a comprehensive analysis of the dataset's complexity and demonstrate its robustness through extensive experimentation with various transfer learning approaches.
Dataset Construction
BoolQ consists of roughly 16,000 question-passage pairs derived from natural queries issued to the Google search engine. These queries were manually filtered to keep only comprehensible, unambiguous yes/no questions. Each question is paired with a passage from Wikipedia that annotators identified as containing sufficient information to deduce the answer. The dataset is split into 9,427 training, 3,270 development, and 3,245 test examples. A minimal loading example follows.
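For readers who want to inspect the data directly, the following is a minimal sketch assuming the public release on the Hugging Face Hub (the dataset id "boolq" and the field names `question`, `passage`, and `answer` are assumptions about that distribution, which ships only the train and development splits with labels):

```python
# Minimal sketch, assuming the public BoolQ release on the Hugging Face Hub
# (dataset id "boolq"; fields: question, passage, answer). Only the train and
# development splits are distributed with labels; the test set is held out.
from datasets import load_dataset

boolq = load_dataset("boolq")
print(boolq)                        # split sizes should match the counts above

example = boolq["train"][0]
print(example["question"])          # naturally occurring yes/no question
print(example["passage"][:200])     # Wikipedia passage chosen by annotators
print(example["answer"])            # gold label as a boolean
```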
Analytical Insights
The paper's analysis reveals the intrinsic difficulty of BoolQ: many questions require more than surface-level matching. Only 38.7% of the questions can be answered by simple paraphrasing of content in the passage; the rest demand other kinds of inference, drawing on implicit information, world knowledge, and factual reasoning. This highlights the comprehension capabilities required of models.
Transfer Learning Approaches
Given the challenging nature of BoolQ, the authors explore a series of transfer learning methods to improve model performance. They evaluate transferring from several datasets and NLP tasks, including entailment (MultiNLI, SNLI), multiple-choice QA (RACE), and extractive QA (SQuAD 2.0, QNLI). Their results indicate that pre-training on MultiNLI yields the largest boost. Moreover, combining unsupervised pre-training (as in BERT and OpenAI GPT) with intermediate MultiNLI training leads to further gains; a sketch of this two-stage recipe is shown below.
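The sketch below approximates the two-stage recipe with the Hugging Face `transformers` and `datasets` libraries rather than the authors' original codebase; the model id `bert-base-uncased`, the dataset ids `multi_nli` and `boolq`, the checkpoint paths, and the epoch counts are all illustrative assumptions, not the paper's exact configuration:

```python
# Hedged sketch of the two-stage transfer recipe: fine-tune on MultiNLI first,
# then re-use the encoder weights for BoolQ as binary sentence-pair classification.
# Model ids, dataset ids, and hyperparameters are illustrative, not the authors' setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode_pairs(batch, first_key, second_key):
    # Encode (premise, hypothesis) or (question, passage) as one sentence pair.
    return tokenizer(batch[first_key], batch[second_key],
                     truncation=True, max_length=512)

# Stage 1: MultiNLI, a 3-way entailment task (entailment / neutral / contradiction).
mnli = load_dataset("multi_nli").map(
    lambda b: encode_pairs(b, "premise", "hypothesis"), batched=True)
nli_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
Trainer(model=nli_model, tokenizer=tokenizer,
        args=TrainingArguments("mnli_stage", num_train_epochs=1),
        train_dataset=mnli["train"]).train()
nli_model.save_pretrained("bert-mnli")
tokenizer.save_pretrained("bert-mnli")

# Stage 2: BoolQ, a binary yes/no task, initialised from the MultiNLI checkpoint.
boolq = load_dataset("boolq").map(
    lambda b: encode_pairs(b, "question", "passage"), batched=True)
boolq = boolq.map(lambda b: {"label": [int(a) for a in b["answer"]]}, batched=True)
qa_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-mnli", num_labels=2, ignore_mismatched_sizes=True)  # fresh 2-way head
Trainer(model=qa_model, tokenizer=tokenizer,
        args=TrainingArguments("boolq_stage", num_train_epochs=3),
        train_dataset=boolq["train"], eval_dataset=boolq["validation"]).train()
```

The design point this illustrates is simply that the BoolQ stage starts from the MultiNLI-adapted encoder while replacing the 3-way entailment head with a fresh 2-way yes/no head.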
Empirical Results
The best-performing model achieves 80.43% accuracy on the BoolQ test set, which underscores the difficulty of the task given that human annotators reach 90% accuracy. Notably, intermediate training on MultiNLI before fine-tuning on BoolQ proved significantly beneficial, even when starting from powerful pre-trained language models such as BERT. The strong showing of MultiNLI-trained models suggests that entailment datasets, especially those that include contradiction detection, transfer more effectively to yes/no question answering. A simple way to put the reported accuracy in context against the label distribution is sketched below.
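Since BoolQ's labels are not evenly balanced between yes and no, a natural reference point is the majority-class baseline; the following sketch computes it on the development split, under the same assumed `datasets` release as above:

```python
# Hedged sketch: compute the majority-class baseline on the BoolQ development
# split as a reference point for the reported model and human accuracies.
from collections import Counter
from datasets import load_dataset

dev = load_dataset("boolq", split="validation")
counts = Counter(dev["answer"])                        # True / False label counts
majority_acc = max(counts.values()) / len(dev)
print(f"majority-class baseline: {majority_acc:.1%}")

# A model's accuracy on the same split is computed analogously:
# accuracy = sum(p == g for p, g in zip(predictions, dev["answer"])) / len(dev)
```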
Practical and Theoretical Implications
The results underscore the importance of robust pre-training tasks in enhancing NLU models' capabilities, specifically for tasks requiring inferential understanding. BoolQ provides a valuable benchmark for assessing progress in this area and pushes for advances in capturing intricate logical and factual relationships in text. Future work may focus on scaling the dataset or incorporating more diverse sources of natural yes/no questions to further challenge and refine NLU models.
Conclusion
The introduction of BoolQ represents a significant step toward understanding and addressing the complexities of natural yes/no question answering. The paper convincingly demonstrates that transfer from existing supervised datasets, particularly entailment corpora, remains valuable for this challenging task even on top of unsupervised pre-training. The insights from this research pave the way for more sophisticated pre-training and fine-tuning methodologies, potentially advancing the frontier of NLU technologies.