Constructing Datasets for Multi-hop Reading Comprehension Across Documents (1710.06481v2)

Published 17 Oct 2017 in cs.CL and cs.AI

Abstract: Most Reading Comprehension methods limit themselves to queries which can be answered using a single sentence, paragraph, or document. Enabling models to combine disjoint pieces of textual evidence would extend the scope of machine comprehension methods, but currently there exist no resources to train and test this capability. We propose a novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods. In our task, a model learns to seek and combine evidence - effectively performing multi-hop (alias multi-step) inference. We devise a methodology to produce datasets for this task, given a collection of query-answer pairs and thematically linked documents. Two datasets from different domains are induced, and we identify potential pitfalls and devise circumvention strategies. We evaluate two previously proposed competitive models and find that one can integrate information across documents. However, both models struggle to select relevant information, as providing documents guaranteed to be relevant greatly improves their performance. While the models outperform several strong baselines, their best accuracy reaches 42.9% compared to human performance at 74.0% - leaving ample room for improvement.

Authors (3)
  1. Johannes Welbl (20 papers)
  2. Pontus Stenetorp (68 papers)
  3. Sebastian Riedel (140 papers)
Citations (487)

Summary

Constructing Datasets for Multi-hop Reading Comprehension Across Documents

This paper by Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel discusses the development of datasets tailored for evaluating multi-hop reading comprehension (RC) models across multiple documents. The research addresses the limitations of current RC methodologies, which predominantly focus on answering queries using localized information from a single sentence or document. By encouraging models to synthesize disjoint textual evidence, the paper aims to advance the scope of machine comprehension techniques.

Introduction and Motivation

Traditional RC tasks often limit the retrieval of information to single-source queries. This research introduces a novel task emphasizing multi-hop inference, requiring models to seek and combine evidence from multiple documents. This task is motivated by practical applications where critical information is distributed across various sources, such as scientific literature or encyclopedic knowledge bases.

Methodology and Dataset Construction

The paper presents a methodology for constructing datasets that require the aforementioned multi-hop capability. The authors generate these datasets from existing collections of query-answer pairs together with corpora of thematically linked documents. Specifically, two datasets are derived for different domains:

  1. WikiHop: Leveraging the interconnected articles within Wikipedia and the structured data from Wikidata. The dataset focuses on queries about entities where no single document holds the complete answer (a schematic instance is sketched after this list).
  2. MedHop: Using abstracts from Medline and facts from DrugBank, the dataset addresses drug-drug interaction queries requiring inferences drawn from multiple scientific findings.
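
To make the task format concrete, a WikiHop-style instance couples a structured (subject, relation) query with a set of candidate answers and a set of support documents. The sketch below uses the paper's running Hanging Gardens of Mumbai example; the field names only approximate the released data format and should be read as a schematic, not the exact schema:

```python
# Schematic WikiHop-style instance (field names are approximate).
instance = {
    "query": "country hanging_gardens_of_mumbai",  # relation + subject entity
    "candidates": ["india", "iran", "pakistan", "somalia"],
    "answer": "india",
    "supports": [
        "The Hanging Gardens, in Mumbai, also known as Pherozeshah "
        "Mehta Gardens, are terraced gardens ...",
        "Mumbai is the capital city of the Indian state of Maharashtra ...",
    ],
}
```

Neither support document alone states the answer; the model must bridge them via the intermediate entity (Mumbai).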

A bipartite graph methodology is introduced to build these datasets, linking documents to entities based on their mentions and thematic relevance. The traversal of this graph ensures that evidence is collected across documents, necessitating multi-hop reasoning.
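
The following is a minimal sketch of how such a traversal might be implemented. The data structures and function names (`build_bipartite_graph`, `find_support_docs`) are illustrative assumptions rather than the authors' code, and entity mention detection is abstracted away as precomputed sets:

```python
from collections import defaultdict, deque

def build_bipartite_graph(docs):
    """Map each entity to the documents mentioning it.

    `docs` is assumed to be a dict of doc_id -> set of entity mentions;
    how mentions are detected (e.g., string matching) is omitted here.
    """
    entity_to_docs = defaultdict(set)
    for doc_id, entities in docs.items():
        for ent in entities:
            entity_to_docs[ent].add(doc_id)
    return entity_to_docs

def find_support_docs(docs, entity_to_docs, start_entity, answer_entity, max_hops=3):
    """Breadth-first traversal over the entity-document bipartite graph.

    Starting from the query's subject entity, alternately expand to the
    documents mentioning the current entity and to the entities those
    documents mention, until the answer entity is reached. Returns the
    chain of documents connecting subject to answer, or None if no chain
    exists within `max_hops` document hops.
    """
    queue = deque([(start_entity, [])])  # (entity, document chain so far)
    visited = {start_entity}
    while queue:
        entity, chain = queue.popleft()
        if len(chain) >= max_hops:
            continue
        for doc_id in entity_to_docs.get(entity, ()):
            if doc_id in chain:
                continue
            new_chain = chain + [doc_id]
            if answer_entity in docs[doc_id]:
                return new_chain  # evidence spans len(new_chain) documents
            for ent in docs[doc_id]:
                if ent not in visited:
                    visited.add(ent)
                    queue.append((ent, new_chain))
    return None

# On the paper's example, the answer is only reachable via two documents:
docs = {
    "d1": {"Hanging Gardens of Mumbai", "Mumbai"},
    "d2": {"Mumbai", "India"},
}
graph = build_bipartite_graph(docs)
print(find_support_docs(docs, graph, "Hanging Gardens of Mumbai", "India"))
# ['d1', 'd2']
```

Because the traversal only returns chains that cross document boundaries, any instance it produces genuinely requires multi-hop reasoning rather than lookup within a single passage.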

Addressing Challenges and Biases

Dataset construction is susceptible to biases and inaccuracies. The paper highlights potential issues such as candidate frequency imbalance and document-answer correlations. The authors introduce filtering strategies to mitigate these effects, ensuring that models are tested on their inference capabilities rather than on dataset artifacts.
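
As one example of such a strategy, frequency-based bias can be reduced by capping how often any single answer appears as the gold answer, so that a model cannot score well simply by predicting globally common candidates. The sketch below is a hedged illustration of this idea; the cap value and grouping key are assumptions, not the paper's exact sub-sampling procedure:

```python
import random
from collections import defaultdict

def subsample_frequent_answers(samples, max_per_answer=50, seed=0):
    """Illustrative mitigation of candidate-frequency imbalance.

    Groups samples by their gold answer and keeps at most
    `max_per_answer` from each group, flattening the answer
    distribution so frequency alone is uninformative.
    """
    rng = random.Random(seed)
    by_answer = defaultdict(list)
    for sample in samples:
        by_answer[sample["answer"]].append(sample)
    kept = []
    for answer, group in by_answer.items():
        rng.shuffle(group)
        kept.extend(group[:max_per_answer])
    rng.shuffle(kept)
    return kept
```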

Evaluation and Results

The paper evaluates two competitive RC models, BiDAF and FastQA, on both datasets. The evaluations reveal that while the models can integrate information across documents, a significant gap to human performance remains: the best model reaches 42.9% accuracy versus 74.0% for humans on WikiHop. Both models struggle to select relevant information, as supplying only documents guaranteed to be relevant greatly improves their results. While the models outperform several strong baselines, there remains substantial scope for enhancing inference capabilities.

Implications and Future Research

By constructing datasets that simulate real-world scenarios of cross-document information retrieval, this research lays the groundwork for developing more sophisticated RC systems. The implications are significant for fields such as information extraction and search, where multi-document synthesis can lead to more robust knowledge discovery processes.

Future research could focus on refining document selection techniques and improving models' abilities to handle the inherent complexities of cross-document reasoning. This work has opened new avenues for exploring AI's potential in understanding and processing dispersed information sources.

Conclusion

The authors provide a detailed approach to creating RC tasks that require sophisticated multi-hop reasoning. The datasets and methodologies proposed mark a significant step forward in evaluating and enhancing current RC models' capabilities. The research serves as a foundational step toward more advanced systems that can answer complex queries by seamlessly integrating scattered pieces of evidence across multiple documents.