Challenges in Generalization in Open Domain Question Answering
The paper "Challenges in Generalization in Open Domain Question Answering" explores the difficulties of achieving robust generalization in Open Domain Question Answering (ODQA) systems. Although recent advancements in machine learning models have demonstrated significant improvements on standard NLP benchmarks, these models often underperform when faced with novel or adversarial test samples that are not closely aligned with their training data. This paper aims to better understand these challenges through systematic categorization and evaluation.
Core Contributions and Methodology
The core contribution of this paper is the categorization of ODQA test questions into three distinct types of generalization challenge: training set overlap, compositional generalization (comp-gen), and novel-entity generalization. This categorization is pivotal for dissecting the nature of the challenges posed by different question types (a rough heuristic sketch of such a sorting follows the list below):
- Training Set Overlap: Test questions that also appear, possibly in paraphrased form, in the training data.
- Compositional Generalization (Comp-gen): Questions built from entities and relations that are individually seen in training, but combined in ways not observed in any training question.
- Novel-Entity Generalization: Questions that mention at least one entity not present in the training data.
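The paper's labels were assigned by manual annotation, but a rough automatic pre-sorting can illustrate what the categories mean. The sketch below is only illustrative and is not the authors' protocol: the placeholder entity extractor and the 0.8 paraphrase-similarity threshold are assumptions introduced here.

```python
# Illustrative pre-sorting of test questions into the three categories.
# This is NOT the paper's (manual) annotation procedure; the entity
# extractor and the 0.8 similarity threshold below are assumptions.
from difflib import SequenceMatcher

def extract_entities(question):
    """Placeholder entity extractor: capitalized tokens stand in for a real NER step."""
    return {tok.strip("?,.") for tok in question.split() if tok[:1].isupper()}

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def categorize(test_q, train_questions, train_entities):
    # 1) Training set overlap: a near-paraphrase already exists in training.
    if any(similarity(test_q, q) >= 0.8 for q in train_questions):
        return "overlap"
    # 2) Novel-entity: the question mentions an entity never seen in training.
    if extract_entities(test_q) - train_entities:
        return "novel-entity"
    # 3) Comp-gen: all entities are known, but their combination is new.
    return "comp-gen"
```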
The authors manually annotated subsets of three prominent ODQA datasets – Natural Questions, TriviaQA, and WebQuestions – according to the above categories, enabling an in-depth analysis of how current ODQA models handle each generalization type. On these annotated questions, the paper evaluates several popular ODQA models, including closed-book parametric models (T5-11B+SSM, BART) and retrieval-based non-parametric models (RAG, DPR, RePAQ, FiD).
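Results are reported separately for each annotated subset. As a minimal sketch of such a per-category evaluation, assuming predictions and gold answers are stored alongside a category label (the data layout and the SQuAD-style answer normalization are assumptions, not the paper's exact scoring script):

```python
# Per-category exact-match (EM) evaluation, as a sketch. The normalization
# follows the common SQuAD-style convention (lowercase, drop punctuation
# and articles); the paper's exact scoring script may differ.
import re
import string
from collections import defaultdict

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_by_category(examples):
    """examples: iterable of dicts with 'category', 'prediction', and 'answers' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        cat = ex["category"]  # "overlap", "comp-gen", or "novel-entity"
        total[cat] += 1
        if any(normalize(ex["prediction"]) == normalize(gold) for gold in ex["answers"]):
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```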
Key Findings
The results reveal clear performance discrepancies across the generalization categories:
- Non-parametric Models: These models generally perform better on novel-entity questions than on comp-gen questions, suggesting they are relatively adept at handling unseen entities but struggle to answer novel compositions of known entities. Retrieval accuracy on the comp-gen and novel-entity subsets is notably lower, pointing to the retriever as a critical bottleneck (see the retrieval-recall sketch after this list).
- Parametric Models: Closed-book models show substantial performance drops on both comp-gen and novel-entity questions, indicating difficulty generalizing to entities or compositions beyond what was directly observed during training. This is compounded by the models' tendency to fall back on answers memorized from the training data.
- Question Pattern and Entity Frequency: Test accuracy correlates strongly with how often a question's pattern and entities appear in the training data. Models perform well on questions involving frequent patterns and entities, and poorly on those that are rare or under-represented in the training data (a frequency-bucketing sketch follows this list).
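To make the retriever bottleneck concrete, a common proxy is answer recall@k per subset: the fraction of questions for which some gold answer string appears in the top-k retrieved passages. A minimal sketch, assuming retrieved passages are ranked plain-text strings (this is a generic metric, not necessarily the paper's exact retrieval measure):

```python
# Answer recall@k per generalization subset: the fraction of questions for
# which any gold answer string appears among the top-k retrieved passages.
# A common proxy for retrieval quality, not necessarily the paper's metric.
from collections import defaultdict

def recall_at_k(examples, k=20):
    """examples: dicts with 'category', 'answers', and 'retrieved' (ranked passage texts)."""
    hits, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        cat = ex["category"]
        total[cat] += 1
        top_k_text = " ".join(ex["retrieved"][:k]).lower()
        if any(ans.lower() in top_k_text for ans in ex["answers"]):
            hits[cat] += 1
    return {cat: hits[cat] / total[cat] for cat in total}
```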
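The frequency effect can likewise be probed by bucketing test questions by how often their entity (or question pattern) occurs in the training data and reporting accuracy per bucket. The bucket edges below are arbitrary illustrative choices, not values from the paper:

```python
# Accuracy as a function of training-set frequency. `train_counts` maps an
# entity (or question pattern) to its frequency in the training data; the
# bucket edges are illustrative choices, not taken from the paper.
from collections import defaultdict

def accuracy_by_frequency(examples, train_counts, edges=(0, 1, 5, 20, 100)):
    """examples: dicts with 'key' (entity or pattern) and 'correct' (bool)."""
    def bucket(count):
        for lo, hi in zip(edges, edges[1:]):
            if lo <= count < hi:
                return f"[{lo},{hi})"
        return f">={edges[-1]}"

    right, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        b = bucket(train_counts.get(ex["key"], 0))
        total[b] += 1
        right[b] += int(ex["correct"])
    return {b: right[b] / total[b] for b in total}
```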
Implications for Future Research
The paper demonstrates that while significant strides have been made in ODQA, generalization remains a core challenge. It points to several avenues for further research and development:
- Enhanced Retrieval Mechanisms: Improving retriever components to better capture novel compositions and entities is essential. Models must be able not only to identify individually relevant passages but also to assemble answers from evidence dispersed across them.
- Entity and Composition Embeddings: Developing representations that better capture novel entities and compositional semantics could improve generalization performance.
- Balancing Memorization with Understanding: Striking a balance between memorizing training data and extending inferential logic to handle novel instances remains a vital challenge.
In sum, this paper provides a structured framework to evaluate and improve the generalization capabilities of ODQA systems, steering future endeavors towards more robust and flexible models capable of handling a wider array of unseen question types.