
HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data (2004.07347v3)

Published 15 Apr 2020 in cs.CL and cs.AI

Abstract: Existing question answering datasets focus on dealing with homogeneous information, based either only on text or KB/Table information alone. However, as human knowledge is distributed over heterogeneous forms, using homogeneous information alone might lead to severe coverage problems. To fill in the gap, we present HybridQA https://github.com/wenhuchen/HybridQA, a new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and multiple free-form corpora linked with the entities in the table. The questions are designed to aggregate both tabular information and text information, i.e., lack of either form would render the question unanswerable. We test with three different models: 1) a table-only model, 2) a text-only model, and 3) a hybrid model that combines heterogeneous information to find the answer. The experimental results show that the EM scores obtained by two baselines are below 20%, while the hybrid model can achieve an EM over 40%. This gap suggests the necessity to aggregate heterogeneous information in HybridQA. However, the hybrid model's score is still far behind human performance. Hence, HybridQA can serve as a challenging benchmark to study question answering with heterogeneous information.

Overview of "HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data"

The paper "HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data" contributes to the field of question answering (QA) by introducing a new dataset that combines the strengths of both tabular data and free-form text. The goal is to address the limitations of existing datasets that typically rely on either structured knowledge bases (KB) or unstructured text, thereby encountering coverage issues due to the homogeneity of information sources.

Key Contributions

The authors present a dataset, HybridQA, that uniquely requires multi-hop reasoning across heterogeneous data. Each question is linked to a Wikipedia table and associated textual data. The central challenge is that without integrating both data forms, the questions become unanswerable. This setup mirrors a more realistic scenario where information is distributed across different data types, thus necessitating the development of models capable of heterogeneous reasoning.
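
To make this setup concrete, the following is a minimal sketch of how one HybridQA example might be represented in code. The field names and toy content are illustrative assumptions, not the exact JSON schema distributed in the GitHub repository.

```python
# Hypothetical representation of one HybridQA example. Field names and toy
# content are illustrative assumptions, not the exact schema of the released data.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HybridQAExample:
    question: str                 # natural-language question
    table_id: str                 # identifier of the source Wikipedia table
    table: List[List[str]]        # table cells, row-major, header row first
    passages: Dict[str, str]      # entity mention -> linked free-form passage
    answer_text: str              # gold answer span
    answer_source: str            # "table" or "passage": where the span is found

example = HybridQAExample(
    question="In which year was the club with the largest stadium founded?",
    table_id="toy_football_clubs_0",
    table=[["Club", "Stadium", "Capacity"],
           ["Example FC", "Example Arena", "45,000"]],
    passages={"Example FC": "Example FC is a football club founded in 1899 ..."},
    answer_text="1899",
    answer_source="passage",
)
```

Note how answering the toy question requires reading the capacity from the table and the founding year from the linked passage, which is exactly the kind of cross-format hop the dataset enforces.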

  1. Dataset Characteristics: HybridQA comprises approximately 70,000 question-answer pairs grounded in roughly 13,000 Wikipedia tables, with each question aligned to one table and its entity-linked passages. Questions are categorized by their reasoning chains, spanning table-to-text, text-to-table, and longer hybrid hops, with a significant proportion demanding multi-step processing.
  2. Performance Benchmarking: The paper evaluates three models: a table-only model, a text-only model, and a hybrid model. The hybrid model achieves an Exact Match (EM) score exceeding 40%, while the two single-source baselines remain below 20% (a sketch of standard EM scoring follows this list). This gap demonstrates that aggregating heterogeneous data sources is necessary, although the hybrid model still falls well short of human performance.
  3. Annotation Process: The dataset was meticulously annotated to ensure questions require integrated data processing, minimizing biases such as table positioning or passage prominence. The authors implemented several debiasing strategies throughout the annotation process, emphasizing the creation of truly hybrid questions.
  4. Error Analysis and Model Performance: Despite the hybrid model outperforming others, a gap remains between model and human performance. Error analysis revealed areas for improvement, particularly in linking and reasoning phases. Advancements in this area could bridge the performance divide.
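
As referenced in item 2, Exact Match is the headline metric. The snippet below is a minimal sketch of SQuAD-style EM scoring (lowercasing, stripping punctuation and articles, collapsing whitespace before comparison); the official HybridQA evaluation script may differ in details.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace,
    mirroring the normalization commonly used for SQuAD-style EM."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """True if the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

predictions = {"q1": "1899", "q2": "the Example Arena"}
gold_answers = {"q1": "1899", "q2": "Example Stadium"}
em = sum(exact_match(predictions[q], gold_answers[q])
         for q in gold_answers) / len(gold_answers)
print(f"EM = {em:.1%}")  # 50.0% on this toy pair of questions
```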

Implications and Future Directions

HybridQA holds significant implications for the development of advanced QA systems capable of processing and reasoning over diverse data formats. This dataset challenges current models, prompting the design of architectures that can effectively aggregate information from multimodal sources. The paper suggests that QA systems, such as the proposed HYBRIDER model, must evolve to handle the complexity inherent in real-world data, where information is not neatly categorized into structured or unstructured forms.
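
For intuition only, here is a deliberately simplified, hypothetical sketch of the linking-then-reasoning decomposition described above; the heuristic cell linker and placeholder reader are stand-ins for illustration, not the paper's actual HYBRIDER implementation.

```python
# Hypothetical two-stage pipeline: link question tokens to table cells, then
# reason over the linked cells and their attached passages. Not the paper's code.
from typing import Dict, List, Optional, Tuple

def link_cells(question: str, table: List[List[str]]) -> List[Tuple[int, int]]:
    """Linking phase: pick cells whose tokens overlap with the question."""
    q_tokens = set(question.lower().split())
    linked = []
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            if q_tokens & set(cell.lower().split()):
                linked.append((r, c))
    return linked

def answer(question: str, table: List[List[str]],
           passages: Dict[str, str],
           linked: List[Tuple[int, int]]) -> Optional[str]:
    """Reasoning phase: inspect linked cells and any passages attached to them.
    A real system would run a trained reader here; this stub simply returns
    the first attached passage (or the cell text) as a placeholder answer."""
    for r, c in linked:
        cell = table[r][c]
        if cell in passages:
            return passages[cell]  # a reader would extract a span from this text
    return table[linked[0][0]][linked[0][1]] if linked else None
```

Splitting the pipeline into separate linking and reasoning stages also makes error attribution tractable, which is how the paper's error analysis localizes much of the remaining performance gap to those two phases.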

The introduction of HybridQA paves the way for more sophisticated approaches to question answering, potentially benefiting fields like AI-driven research tools, educational technologies, and automated data analysis. Future developments could focus on refining the hybrid model and addressing the identified areas of error propagation to achieve even closer human-competitive performance.

By expanding the scope of possible data environments that QA models can handle, HybridQA represents a critical step towards more flexible, knowledge-rich AI systems. Researchers are encouraged to utilize and build upon this dataset to push the boundaries of current QA capabilities and to address the open challenges highlighted by the hybrid setting.

Authors (6)
  1. Wenhu Chen (134 papers)
  2. Hanwen Zha (8 papers)
  3. Zhiyu Chen (60 papers)
  4. Wenhan Xiong (47 papers)
  5. Hong Wang (254 papers)
  6. William Wang (38 papers)
Citations (271)