Overview of ARC-DA: A Direct-Answer Dataset for AI2 Reasoning Challenge
The paper presents ARC-DA, a direct-answer version of the well-known AI2 Reasoning Challenge (ARC) dataset. While the original ARC questions are posed in a multiple-choice format, ARC-DA recasts them as direct-answer questions. The goal of this transformation is to better simulate real-world question answering, where questions typically do not come with predefined answer options.
Methodology
The conversion of multiple-choice ARC questions to a direct-answer format combined crowdsourcing with in-house expert review, yielding a dataset of 2,985 questions and 8,436 valid answers. This dual strategy was essential to the quality and variety of the dataset: crowdworkers filtered out questions unsuitable for a direct-answer format and gathered multiple valid answers per question, after which in-house experts further refined the dataset, ensuring that both questions and answers were clear and comprehensive.
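For concreteness, a released question-answering dataset of this kind is typically distributed as JSON Lines, with one question and its set of valid answers per line. The sketch below shows how such entries might be loaded and inspected; the file path and field names ("question_id", "question", "answers") are assumptions about the schema rather than details confirmed by the paper, so check the released files for the exact layout.

```python
import json

def load_arc_da(path):
    """Load ARC-DA-style examples from a JSON-Lines file, one question per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Hypothetical usage: inspect the first training question.
# Field names below are assumed, not taken from the official schema.
examples = load_arc_da("ARC-DA/train.jsonl")
first = examples[0]
print(first["question_id"], first["question"])
print("valid answers:", first["answers"])  # multiple gold answers per question
```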
Evaluation and Baseline Performance
The paper outlines evaluation metrics appropriate for the ARC-DA challenge. Because direct-answer questions are open-ended, evaluation relies on both human judgment via the GENIE framework and automated metrics such as ROUGE-L and F1. Baseline experiments with T5 and UnifiedQA models produced strong but far-from-saturated scores, with the best model, UnifiedQA + ARC-DA/MC, reaching a top score of 81% on GENIE and offering new evidence of the complementarity between direct-answer and multiple-choice question formats.
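To illustrate how an automated metric can handle multiple valid answers per question, here is a minimal token-overlap F1 sketch in the spirit of SQuAD-style scoring, where a prediction is credited with its best match against any gold answer. This is an illustrative assumption about how such scoring can work; the official ARC-DA and GENIE scoring scripts may normalize text and aggregate results differently.

```python
from collections import Counter
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a single reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, references):
    """Score a prediction against multiple gold answers by taking the best match."""
    return max(token_f1(prediction, ref) for ref in references)

# Example: partial credit against whichever gold answer matches best.
print(best_f1("evaporation of water", ["evaporation", "the water evaporates"]))
```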
Implications and Future Directions
ARC-DA presents significant implications for the field of natural language processing and AI. It offers a benchmark to test the reasoning prowess of AI models in a direct-question-answering paradigm, pushing beyond simplistic retrieval-based approaches to require deeper language understanding and logical reasoning. The dataset provides valuable opportunities for research into complex QA tasks involving explanation generation and knowledge synthesis.
Furthermore, the achievement of state-of-the-art results on the original ARC dataset using ARC-DA-trained models underscores the potential benefits of hybrid training approaches combining direct-answer and multiple-choice formats. This reciprocity hints at new methodologies for enhancing AI systems' capabilities in understanding and responding to natural language queries.
Conclusion
ARC-DA bridges a critical gap in QA research by providing a dataset that aligns QA model training more closely with real-world scenarios. It sets the stage for further advances in natural language understanding, reasoning, and explanation generation. Researchers are encouraged to engage with ARC-DA not only to tackle the challenges it presents but also to contribute to the evolution of robust, intelligent systems capable of human-like reasoning and understanding. The dataset, available via the Allen Institute for AI, promises to be a valuable resource for the continued development and assessment of sophisticated AI models.