Analysis of Test-Train Overlap in Open-Domain Question Answering
The paper "Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets" provides a meticulous examination of how open-domain question answering (ODQA) models interact with training and test data. The authors argue that existing evaluation metrics for ODQA might lack depth due to significant overlaps in test and train datasets. This paper focuses on dissecting these overlaps and highlighting their implications for model evaluation and generalization.
Key Insights and Findings
To characterize what ODQA models can actually do, the authors measure test-train overlap in three widely used ODQA datasets: WebQuestions, TriviaQA, and Open Natural Questions. They divide the behaviors needed to answer test questions into three categories (a bucketing sketched in code after this list):
- Question Memorization: Models recall answers for questions previously encountered during training.
- Answer Classification: Models answer novel questions whose answers appeared during training.
- QA Generalization: Models tackle entirely novel questions with unfamiliar answers.
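A minimal sketch of this bucketing is shown below, assuming simple normalized string matching as a stand-in for the paper's combination of automatic matching and manual paraphrase annotation; the function and variable names are illustrative, not taken from the paper.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def overlap_category(test_question, test_answers, train_questions, train_answers):
    """Assign one test item to a behavioral category.

    train_questions and train_answers are sets of pre-normalized strings
    built from the training split.
    """
    if normalize(test_question) in train_questions:
        return "question_memorization"
    if any(normalize(answer) in train_answers for answer in test_answers):
        return "answer_classification"
    return "qa_generalization"
```

Because exact normalized matching misses paraphrases, a bucketing like this would undercount question overlap relative to the paper's manually verified annotations.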
The authors find substantial overlap between the training and test sets: between 58% and 71% of test set answers also appear somewhere in the training data, and roughly 30% of test questions have a near-duplicate paraphrase in their training set. Much of the measured performance on these datasets is therefore attributable to memorization rather than to genuine understanding or generalization.
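Given such a categorizer, headline overlap statistics of this kind can be estimated with a single pass over the test set. The sketch below reuses normalize from the previous snippet and assumes the test set is an iterable of (question, answer-list) pairs; this is a rough approximation of the paper's measurement, not its exact procedure.

```python
def overlap_statistics(test_set, train_questions, train_answers):
    """Fraction of test items with an overlapping answer or a duplicate question.

    test_set: iterable of (question, answers) pairs; train_questions and
    train_answers: sets of normalized strings from the training split.
    """
    total = answer_hits = question_hits = 0
    for question, answers in test_set:
        total += 1
        if normalize(question) in train_questions:
            question_hits += 1
        if any(normalize(answer) in train_answers for answer in answers):
            answer_hits += 1
    return {
        "answer_overlap": answer_hits / total,      # paper reports roughly 58-71%
        "question_overlap": question_hits / total,  # paper reports roughly 30%
    }
```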
Implications for Model Performance
The paper evaluates several ODQA models: open-book models such as RAG, DPR, and FID; closed-book models such as T5-11B and BART; and simple nearest-neighbor baselines. Performance varies sharply across the overlap categories. Most models drop considerably on test questions and answers that do not overlap with the training set, underscoring how much of their apparent competence rests on memorization.
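The per-category breakdown behind these comparisons amounts to grouping standard exact-match accuracy by each test item's overlap label. A rough sketch, assuming a predict function that maps a question to an answer string and reusing normalize from the earlier snippet:

```python
from collections import Counter


def exact_match(prediction, gold_answers):
    """Standard ODQA exact-match after answer normalization."""
    return float(normalize(prediction) in {normalize(a) for a in gold_answers})


def accuracy_by_category(test_set, categories, predict):
    """Break exact-match accuracy down by overlap category.

    test_set: list of (question, answers) pairs; categories: parallel list
    of labels from overlap_category; predict: any question -> answer function.
    """
    totals, hits = Counter(), Counter()
    for (question, answers), category in zip(test_set, categories):
        totals[category] += 1
        hits[category] += exact_match(predict(question), answers)
    return {category: hits[category] / totals[category] for category in totals}
```

The concrete numbers the authors report make the resulting pattern explicit.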
For instance, while RAG achieves around 44.5% total accuracy on Open Natural Questions, its accuracy falls to 24.8% on non-overlapping questions. In the closed-book setting, BART struggles to generalize, managing only 0.8% accuracy on truly novel question-answer pairs from the same dataset. Nearest-neighbor models, by contrast, perform surprisingly well on question-memorization cases, illustrating how far simple retrieval over the training set can go when so many test questions are duplicates.
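Such a nearest-neighbor baseline can be approximated with off-the-shelf TF-IDF retrieval over the training questions. The sketch below is in the spirit of the paper's simple baselines rather than its exact implementation; dataset loading is omitted and scikit-learn is assumed to be available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class NearestNeighborQA:
    """Answer a test question by copying the answer of the closest training question."""

    def __init__(self, train_questions, train_answers):
        self.vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
        self.train_matrix = self.vectorizer.fit_transform(train_questions)
        self.train_answers = train_answers

    def predict(self, question):
        query = self.vectorizer.transform([question])
        scores = cosine_similarity(query, self.train_matrix)[0]
        return self.train_answers[scores.argmax()]
```

On test questions that duplicate a training question, such a baseline needs no parametric knowledge at all, which is precisely why its strong showing in that bucket is informative.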
Practical and Theoretical Implications
The findings suggest a need for more nuanced benchmarks for evaluating ODQA systems. Current test sets largely reward memorization and therefore give a poor measure of a model's ability to generalize. This prompts a reconsideration of how NLP models are evaluated and argues for datasets that clearly separate memorization from deeper question understanding.
The results also bear on the development of future ODQA systems. Critical examination of existing datasets can guide the creation of more challenging benchmarks that emphasize generalization and reasoning over novel questions and answers, in line with ongoing discussions in the AI community about the limitations of pre-trained models.
Future Directions
The exploration of dataset overlap opens several research avenues. Future work could design datasets intentionally crafted to probe the distinct reasoning capabilities of ODQA models. These findings also intersect with other areas of machine learning, such as algorithms better equipped to handle unseen data and the integration of external knowledge sources to improve contextual understanding.
Overall, this paper provides critical insight into the real drivers behind the performance of ODQA systems and serves as a call to action for more robust evaluation methods that move beyond rewarding memorization toward measuring genuine comprehension and reasoning.