Constructing Datasets for Multi-hop Reading Comprehension Across Documents
This paper by Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel presents a method for constructing datasets that evaluate multi-hop reading comprehension (RC) across multiple documents. It addresses a limitation of prevailing RC methodology, which predominantly tests answering queries from localized information in a single sentence or document. By requiring models to combine disjoint pieces of textual evidence, the work aims to broaden the scope of machine comprehension.
Introduction and Motivation
Traditional RC tasks typically confine the evidence for a query to a single source, often a single sentence. This research introduces a task emphasizing multi-hop inference, requiring models to find and combine evidence spread across multiple documents. The task is motivated by practical settings where critical information is distributed over many sources, such as scientific literature or encyclopedic knowledge bases.
Methodology and Dataset Construction
The paper presents a methodology for constructing datasets with the multi-hop property described above. The authors build them by attaching sets of thematically linked documents to existing collections of query-answer pairs. Two datasets are derived, one per domain (a sketch of the resulting sample layout follows the list):
- WikiHop: Combines interconnected Wikipedia articles with structured facts from Wikidata. Its queries concern entity relations for which no single article contains all the evidence needed to identify the answer.
- MedHop: Combines research abstracts from Medline with drug-drug interaction facts from DrugBank. Its queries ask which drugs interact, which requires chaining findings from multiple scientific abstracts.
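As a concrete reference point, each sample in both datasets pairs a query with a set of candidate answers and the support documents a model must read. The sketch below shows one plausible layout in Python; the field names, entities, and texts are illustrative assumptions, not a specification of the released files.

```python
import json

# One WikiHop-style sample (field names, entities, and texts are
# invented placeholders for illustration).
sample = {
    "id": "WH_train_0",
    "query": "country_of_origin some_album",         # (relation, subject) pair
    "candidates": ["australia", "france", "japan"],  # type-consistent options
    "answer": "france",                              # the gold answer entity
    "supports": [                                    # documents to combine
        "Some Album is a record by The Example Band ...",
        "The Example Band formed in Paris, France ...",
    ],
}

def load_samples(path):
    """Load a list of such samples from a JSON file of this layout."""
    with open(path) as f:
        return json.load(f)
```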
To build these datasets, the authors introduce a bipartite graph whose two node sets are documents and entities, with an edge wherever a document mentions an entity. Traversing this graph from the query subject toward the answer yields a chain of documents connected only through shared entity mentions, ensuring that the required evidence spans several documents and genuinely necessitates multi-hop reasoning.
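A minimal sketch of the traversal idea, assuming a hypothetical `doc_entities` mapping from each document to the entities it mentions; the published pipeline adds further constraints (such as bounding the size of the final document set) that are omitted here.

```python
from collections import deque

def find_support_chain(start_entity, answer, doc_entities, max_docs=3):
    """Breadth-first search over a bipartite document-entity graph.

    doc_entities: dict mapping doc_id -> set of entities mentioned in it.
    Returns a list of doc_ids chaining a document that mentions
    `start_entity` to one that mentions `answer`, or None if no chain
    of at most `max_docs` documents exists.
    """
    # Invert the mapping: entity -> documents that mention it.
    entity_docs = {}
    for doc, ents in doc_entities.items():
        for e in ents:
            entity_docs.setdefault(e, set()).add(doc)

    # Each queue item is the chain of documents visited so far.
    queue = deque([doc] for doc in entity_docs.get(start_entity, ()))
    seen = {chain[0] for chain in queue}
    while queue:
        chain = queue.popleft()
        if answer in doc_entities[chain[-1]]:
            return chain                        # answer reached
        if len(chain) == max_docs:
            continue                            # bound the chain length
        for e in doc_entities[chain[-1]]:       # hop: document -> entity -> documents
            for nxt in entity_docs.get(e, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(chain + [nxt])
    return None
```

Because the search is breadth-first and never revisits a document, the first chain found is a shortest one, which keeps the constructed support sets compact.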
Addressing Challenges and Biases
Dataset construction of this kind is susceptible to biases and artifacts. The paper highlights two in particular: candidate frequency imbalance, where some answers are so common that frequency alone is predictive, and spurious document-answer correlations, where the mere presence of certain documents gives the answer away. The authors introduce filtering strategies to mitigate both effects, so that models are tested on their inference capabilities rather than on dataset artifacts.
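One plausible implementation of the frequency-balancing idea, sketched under the sample fields assumed earlier and not claiming to be the paper's exact procedure, is to sub-sample so that no answer dominates:

```python
import random
from collections import Counter

def balance_answer_frequency(samples, max_share=0.05, seed=0):
    """Sub-sample so each answer is capped at `max_share` of the original
    dataset size, blunting the shortcut of always predicting the most
    frequent candidates. A sketch of the balancing idea only.
    """
    samples = list(samples)                # avoid mutating the caller's list
    random.Random(seed).shuffle(samples)   # deterministic, unbiased order
    cap = max(1, int(max_share * len(samples)))
    kept, counts = [], Counter()
    for s in samples:
        if counts[s["answer"]] < cap:
            kept.append(s)
            counts[s["answer"]] += 1
    return kept
```

A similar held-out check can flag document-answer correlations: if a simple classifier predicts the answer from the support documents alone, without reading the query, that sample is a candidate for removal.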
Evaluation and Results
The paper evaluates two competitive RC models, BiDAF and FastQA, on both datasets. The evaluations show that while the models can extract some cross-document information, a significant gap to human performance remains: 54.5% accuracy for the best model versus 85.0% for humans. The baselines are thus non-trivial, but there is substantial room for improving how models select and integrate the relevant evidence.
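Since both datasets are multiple-choice over a fixed candidate set, accuracy is the natural metric. A minimal scorer, where `predict` stands in for any model that maps a sample to one of its candidates:

```python
def accuracy(samples, predict):
    """Fraction of samples on which the predicted candidate equals the
    gold answer; `predict` is any callable from sample dict to candidate.
    """
    correct = sum(predict(s) == s["answer"] for s in samples)
    return correct / len(samples)

# A trivial reference baseline: always pick the lexically first candidate.
def first_candidate(sample):
    return sorted(sample["candidates"])[0]
```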
Implications and Future Research
By constructing datasets that simulate real-world scenarios of cross-document information retrieval, this research lays the groundwork for developing more sophisticated RC systems. The implications are significant for fields such as information extraction and search, where multi-document synthesis can lead to more robust knowledge discovery processes.
Future research could refine the document selection techniques and improve models' ability to handle the inherent complexities of cross-document reasoning. The work opens concrete avenues for building systems that understand and process dispersed information sources.
Conclusion
The authors provide a detailed approach to creating RC tasks that require multi-hop reasoning. The proposed datasets and construction methodology mark a clear step forward in evaluating current RC models, and they lay the groundwork for systems that answer complex queries by integrating scattered pieces of evidence across multiple documents.