ReQA: An Evaluation for End-to-End Answer Retrieval Models (1907.04780v2)

Published 10 Jul 2019 in cs.CL

Abstract: Popular QA benchmarks like SQuAD have driven progress on the task of identifying answer spans within a specific passage, with models now surpassing human performance. However, retrieving relevant answers from a huge corpus of documents is still a challenging problem, and places different requirements on the model architecture. There is growing interest in developing scalable answer retrieval models trained end-to-end, bypassing the typical document retrieval step. In this paper, we introduce Retrieval Question-Answering (ReQA), a benchmark for evaluating large-scale sentence-level answer retrieval models. We establish baselines using both neural encoding models as well as classical information retrieval techniques. We release our evaluation code to encourage further work on this challenging task.

An Evaluation for End-to-End Answer Retrieval Models

The paper "ReQA: An Evaluation for End-to-End Answer Retrieval Models" explores the development and evaluation of models that can retrieve answer sentences from a large corpus of documents in response to specific queries. This work introduces the Retrieval Question-Answering (ReQA) benchmark to test the efficacy of such models.

Background and Motivation

Traditional QA benchmarks like SQuAD have primarily focused on identifying answer spans within predefined passages, and models now routinely surpass human performance on that task. However, retrieving relevant answers from a large document corpus remains challenging and places different requirements on model architectures. There is growing interest in scalable, efficient models that bypass the initial document retrieval step and perform end-to-end answer retrieval.

ReQA Benchmark

The ReQA benchmark emphasizes sentence-level answer retrieval at scale. Existing QA datasets, such as SQuAD and Natural Questions (NQ), are converted into a format suitable for cross-document retrieval evaluation: each sentence within a document becomes an individual answer candidate, so models must combine efficient retrieval with accurate semantic understanding across documents.
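
To make the conversion concrete, here is a minimal sketch of how a SQuAD-style paragraph can be flattened into ReQA-style candidates. The field names mirror SQuAD's JSON layout, the regex sentence splitter is a stand-in, and the official ReQA evaluation code may differ in its details.

```python
import re

def sentences_with_context(context):
    """Split a paragraph into candidate sentences, each paired with its paragraph."""
    # Naive splitter for illustration; the official ReQA tooling may use a different tokenizer.
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    return [(s, context) for s in sents]

def squad_to_reqa(squad_data):
    """Flatten SQuAD-style data into a global candidate pool of (sentence, context)
    tuples plus (question, gold candidate index) pairs."""
    candidates, qa_pairs = [], []
    for article in squad_data["data"]:
        for paragraph in article["paragraphs"]:
            offset = len(candidates)
            sents = sentences_with_context(paragraph["context"])
            candidates.extend(sents)
            for qa in paragraph["qas"]:
                # Assumes answerable, SQuAD 1.1-style questions.
                answer_start = qa["answers"][0]["answer_start"]
                # The gold candidate is the sentence whose span covers the answer start.
                pos = 0
                for i, (sent, ctx) in enumerate(sents):
                    pos = ctx.find(sent, pos)
                    if pos <= answer_start < pos + len(sent):
                        qa_pairs.append((qa["question"], offset + i))
                        break
                    pos += len(sent)
    return candidates, qa_pairs
```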

Evaluation Metrics

The evaluation framework utilizes standard ranking metrics such as mean reciprocal rank (MRR) and recall at N (R@N). These metrics assess models based on their ability to rank and retrieve correct answers effectively from thousands of candidates.
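
As a reference point, both metrics reduce to a few lines once the rank of the correct answer has been computed for each question. This sketch assumes a single gold answer per question; the benchmark's exact handling of multiple correct answers may differ.

```python
import numpy as np

def mean_reciprocal_rank(ranks):
    """MRR over questions, where `ranks` holds the 1-based rank of the
    (highest-ranked) correct answer for each question."""
    return float(np.mean([1.0 / r for r in ranks]))

def recall_at_n(ranks, n):
    """Fraction of questions whose correct answer appears in the top n results."""
    return float(np.mean([r <= n for r in ranks]))

# Example: correct answers ranked 1st, 3rd, and 20th among all candidates.
ranks = [1, 3, 20]
print(mean_reciprocal_rank(ranks))  # ~0.461
print(recall_at_n(ranks, 10))       # ~0.667
```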

Key Contributions

  • End-to-End Retrieval Focus: The paper stresses direct retrieval of answers at the sentence level, seeking to bypass intermediate steps such as document retrieval that complicate the retrieval task.
  • Scalability: Models must encode questions and candidate sentences independently into a shared vector space, which enables techniques such as approximate nearest neighbor search (a minimal retrieval sketch follows this list).
  • Context Awareness: Retrieval models need to consider both the sentences themselves and their contextual paragraphs to discern relevance effectively.
  • General Purpose Capability: The framework promotes evaluation on datasets distinct from training data to validate model generalization across diverse domains.
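
The scalability requirement is easiest to see in code: because questions and candidates are encoded independently, candidate vectors can be precomputed and searched with nearest-neighbor techniques. The following is a minimal sketch with placeholder embeddings and a brute-force dot-product search standing in for an approximate nearest neighbor index.

```python
import numpy as np

def build_index(candidate_embeddings):
    """Normalize and store candidate embeddings once, offline. In production this
    dense matrix would typically be replaced by an approximate nearest neighbor
    index (e.g., Annoy, FAISS, or ScaNN)."""
    norms = np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
    return candidate_embeddings / norms

def retrieve(question_embedding, index, top_k=10):
    """Rank candidates by dot product in the shared vector space."""
    q = question_embedding / np.linalg.norm(question_embedding)
    scores = index @ q
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

# Placeholder embeddings; in practice they come from separately applied question
# and answer encoders, so the candidate side can be encoded and indexed ahead of time.
candidate_embeddings = np.random.randn(100_000, 512).astype(np.float32)
question_embedding = np.random.randn(512).astype(np.float32)
index = build_index(candidate_embeddings)
top_ids, top_scores = retrieve(question_embedding, index, top_k=10)
```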

Baselines and Results

The paper presents several baselines, notably the Universal Sentence Encoder for QA (USE-QA) from Google Research, a dual encoder model optimized for semantic retrieval. The empirical results show that USE-QA outperforms classical information retrieval baselines such as BM25 on sentence-level answer retrieval, indicating meaningful progress toward effective end-to-end retrieval without a pipelined document-identification step.
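
For contrast, a BM25-style term-matching baseline in the spirit of the paper's classical IR comparison can be sketched with the third-party rank_bm25 package; the paper's own baseline implementation and tokenization may differ.

```python
# BM25 term-matching baseline (pip install rank-bm25); tokenization here is a
# crude whitespace split purely for illustration.
from rank_bm25 import BM25Okapi

candidate_sentences = [
    "The Amazon is the largest rainforest in the world.",
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
]
bm25 = BM25Okapi([s.lower().split() for s in candidate_sentences])

query = "what is the capital of france"
scores = bm25.get_scores(query.lower().split())
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(candidate_sentences[ranking[0]])  # the best-scoring candidate sentence
```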

Implications and Future Directions

The ReQA benchmark marks a crucial step towards developing open-domain QA systems capable of efficiently retrieving relevant information directly in answer form, which holds substantial implications for real-world applications like conversational agents and information retrieval systems.

The research highlights several directions for future work: improving model accuracy without a proportional increase in computational cost, integrating contextual understanding more deeply into retrieval models, and refining retrieval strategies for different constraints and use cases.

By facilitating rigorous evaluation and pushing the boundaries of current retrieval models, the ReQA framework paves the way for advancements that may contribute substantially to the theoretical and practical evolution of AI-driven QA systems.

Authors (4)
  1. Amin Ahmad (5 papers)
  2. Noah Constant (32 papers)
  3. Yinfei Yang (73 papers)
  4. Daniel Cer (28 papers)
Citations (49)