Constructing a Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
The paper presents 2WikiMultiHopQA, a novel multi-hop question answering (QA) dataset designed to enable more rigorous evaluation of reasoning in machine reading comprehension systems. The dataset addresses two shortcomings of existing multi-hop datasets: many of their questions can be answered without genuine multi-hop reasoning, and they rarely provide comprehensive explanations of the reasoning process involved.
2WikiMultiHopQA distinguishes itself by integrating structured data from Wikidata with unstructured text from Wikipedia. This integration is used to create evidence paths that explicitly detail the reasoning steps from question to answer. The design offers two advantages: it makes model predictions transparent, and it provides a rigorous basis for evaluating reasoning skills.
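To make the idea of an evidence path concrete, here is a minimal sketch of what a single example might look like. The field names and placeholder entities are illustrative assumptions based on the paper's description, not the dataset's exact released schema.

```python
# Hypothetical 2WikiMultiHopQA-style example with an explicit evidence path.
# Field names and values are assumptions for illustration only.
example = {
    "question": "Where was the director of Film X born?",
    "answer": "City Z",
    # Sentence-level supporting facts: (article title, sentence index) pairs.
    "supporting_facts": [("Film X", 0), ("Person Y", 1)],
    # Evidence path: (subject, relation, object) triples from Wikidata that
    # connect the question to the answer across two hops.
    "evidences": [
        ("Film X", "director", "Person Y"),
        ("Person Y", "place of birth", "City Z"),
    ],
}

# The evidence path makes the reasoning chain auditable: each hop names
# the entity and relation a model should have used.
for subj, rel, obj in example["evidences"]:
    print(f"{subj} --{rel}--> {obj}")
```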
Dataset generation follows a systematic pipeline built on predefined templates, which maintains quality and ensures that the multi-hop requirement is actually met. Specifically, logical rules extracted from the knowledge graph are used to construct questions that inherently demand multi-hop reasoning, challenging models beyond simple factual retrieval.
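The following sketch shows how a logical rule over knowledge-graph triples could be combined with a template to compose a two-hop question. The specific rule, template wording, and toy graph are assumptions for exposition, not the paper's actual generation code.

```python
# Toy knowledge graph: (subject, relation) -> object.
kg = {
    ("Person A", "mother"): "Person B",
    ("Person B", "father"): "Person C",
}

def generate_inference_question(kg, entity):
    """Apply the rule mother(a, b) AND father(b, c) -> maternal_grandfather(a, c).

    Returns a question that can only be answered by chaining both triples.
    """
    mother = kg.get((entity, "mother"))
    if mother is None:
        return None
    grandfather = kg.get((mother, "father"))
    if grandfather is None:
        return None
    return {
        "question": f"Who is the maternal grandfather of {entity}?",
        "answer": grandfather,
        "evidences": [(entity, "mother", mother), (mother, "father", grandfather)],
    }

print(generate_inference_question(kg, "Person A"))
```

Because the rule body requires two distinct triples, any question it produces demands two hops by construction, which is how the pipeline guards against single-hop shortcuts.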
Experimental analysis underscores the dataset's difficulty: models score markedly lower on 2WikiMultiHopQA than on comparable datasets such as HotpotQA, while human performance remains similar across the two. This gap indicates that 2WikiMultiHopQA contains a higher proportion of genuinely challenging multi-hop questions.
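Such comparisons typically rest on exact-match and token-level F1 scores. As a reference point, here is a standard SQuAD-style token F1 implementation (not the paper's official evaluation script; answer normalization such as punctuation and article stripping is elided):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

assert token_f1("Person C", "person c") == 1.0
```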
The inclusion of a third task, evidence generation, extends the assessment of reasoning beyond answer prediction and sentence-level supporting-fact identification. It is particularly demanding in terms of inferential logic and relation extraction, because models must reconstruct the structured relations, as (subject, relation, object) triples, that form the correct reasoning path.
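One natural way to score this task is set-level F1 over predicted versus gold triples. The sketch below is a plausible scoring scheme under that assumption; the paper's official evaluation script may differ in details.

```python
def evidence_f1(predicted, gold):
    """Set-level F1 over (subject, relation, object) evidence triples."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)  # triples recovered exactly
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

gold = [("Film X", "director", "Person Y"),
        ("Person Y", "place of birth", "City Z")]
pred = [("Film X", "director", "Person Y")]
print(evidence_f1(pred, gold))  # ~0.667: only one of two gold hops recovered
```

Because a missed hop lowers recall even when the final answer happens to be right, this task penalizes models that guess answers without tracing the full reasoning path.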
Analysis of question types reveals deliberate balancing across comparison, inference, compositional, and bridge-comparison questions, fostering diverse reasoning challenges (illustrative templates for each type are sketched below). The curation process also filters out single-hop shortcuts and ambiguous questions to keep the focus on genuine multi-hop reasoning.
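The templates below illustrate what each of the four types asks of a model; the wording is an assumption for exposition rather than text drawn verbatim from the dataset.

```python
# Hypothetical templates for the four question types in 2WikiMultiHopQA.
question_types = {
    # Compare an attribute of two entities of the same type.
    "comparison": "Which film was released first, Film X or Film Y?",
    # Combine two triples via a logical rule (e.g., mother + father).
    "inference": "Who is the maternal grandfather of Person A?",
    # Chain two facts through a bridge entity.
    "compositional": "Where was the director of Film X born?",
    # Comparison that first requires a compositional hop for each entity.
    "bridge_comparison": "Which film has the director who was born first, Film X or Film Y?",
}

for qtype, template in question_types.items():
    print(f"{qtype}: {template}")
```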
Implications of this research are manifold. Practically, it provides a robust benchmark for developing advanced multi-hop reasoning models, potentially steering improvements in QA systems' ability to process and reason through complex data associations. Theoretically, it enriches our understanding of the limitations and capabilities of current AI models in extracting, synthesizing, and reasoning over multi-faceted information.
Looking forward, 2WikiMultiHopQA could serve as a foundational tool for future research exploring refined reasoning strategies and the integration of more nuanced logic within AI models. Its structured yet comprehensive approach serves as a template for constructing QA datasets that aspire to rigorously test and advance multi-hop reasoning capabilities in AI systems. The paper's contribution lies in setting a new standard for the complexity and transparency expected in multi-hop QA datasets.