
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them (2102.07033v1)

Published 13 Feb 2021 in cs.CL, cs.AI, and cs.LG

Abstract: Open-domain Question Answering models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models lack the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically-generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) whilst retaining high accuracy. Lastly, we demonstrate RePAQ's strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to ``back-off" to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.

Analyzing "PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them"

The paper, "PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them," presents a significant contribution to the field of Open-Domain Question Answering (ODQA). It introduces a novel resource named Probably Asked Questions (PAQ), which comprises a vast set of 65 million QA pairs generated to cover a wide spectrum of potential open-domain questions. The authors aim to explore the practical utility of this resource in augmenting existing ODQA systems, particularly focusing on efficiency and interpretability.

Models that leverage QA pairs directly, termed QA-pair retrievers, have been shown to be advantageous in speed and memory efficiency. These models differ from the more traditional retrieve-and-read systems, which must access a substantial corpus to identify and extract answers dynamically. The latter approach, while thorough, is often slower and more computationally demanding. In contrast, QA-pair retrievers can pre-compute answers for likely questions, potentially offering real-time responses. However, they have traditionally lacked the expansive coverage provided by text corpora such as Wikipedia, which limits their accuracy.
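To make the pre-compute-and-look-up idea concrete, the following is a minimal sketch of a QA-pair retriever. A toy hashed bag-of-words embedding stands in for RePAQ's trained dense question encoder, and the three cached pairs stand in for PAQ's 65 million; all names and data here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Hypothetical cached QA pairs; a real system like RePAQ stores ~65M of these.
QA_PAIRS = [
    ("who wrote the origin of species", "Charles Darwin"),
    ("what is the capital of france", "Paris"),
    ("when did world war two end", "1945"),
]

def embed(text, dim=64):
    """Toy deterministic embedding: hashed bag-of-words, L2-normalised.
    A real retriever would use a trained dense question encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Pre-compute question embeddings once; answering then reduces to a
# nearest-neighbour lookup over cached questions.
INDEX = np.stack([embed(q) for q, _ in QA_PAIRS])

def answer(question):
    scores = INDEX @ embed(question)  # cosine similarity (vectors are unit-norm)
    best = int(np.argmax(scores))
    matched_q, cached_answer = QA_PAIRS[best]
    return cached_answer, matched_q, float(scores[best])

ans, matched, score = answer("who wrote on the origin of species")
```

Because the index is built once, the per-question cost is a single similarity search, which is why such systems can answer at a high rate compared to retrieve-and-read pipelines.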

To overcome this limitation, PAQ is created using automated question generation combined with a filtering process that ensures relevance and correctness. The accompanying RePAQ model is a retrieval-based question answering system designed to exploit the PAQ dataset efficiently. RePAQ matches or even exceeds the accuracy of recent large-scale retrieve-and-read systems such as RAG and FiD, while maintaining superior speed and efficiency.
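The filtering step checks that an independent QA model, answering the generated question on its own, reproduces the answer the generator extracted. A hedged toy sketch of that consistency filter, with a hard-coded dictionary standing in for the filtering model, might look like this:

```python
# Sketch of PAQ-style consistency filtering (a toy version, not the paper's code):
# a generated QA pair is kept only if an independent QA model, given the
# question alone, reproduces the answer the generator extracted.

def mock_qa_model(question):
    """Stand-in for a trained open-domain QA model used as the filter."""
    knowledge = {
        "when was the eiffel tower completed": "1889",
        "who painted the mona lisa": "Leonardo da Vinci",
    }
    return knowledge.get(question.lower())

def filter_qa_pairs(generated_pairs, qa_model):
    kept = []
    for question, answer in generated_pairs:
        predicted = qa_model(question)
        # Keep the pair only when the filter model agrees with the generator.
        if predicted is not None and predicted.lower() == answer.lower():
            kept.append((question, answer))
    return kept

generated = [
    ("when was the eiffel tower completed", "1889"),     # consistent -> kept
    ("who painted the mona lisa", "Michelangelo"),       # inconsistent -> dropped
    ("what colour is the sky on mars", "butterscotch"),  # model abstains -> dropped
]
kept = filter_qa_pairs(generated, mock_qa_model)
```

The design choice is that precision matters more than recall here: dropping a noisy generated pair is cheap when 65 million candidates are available, while keeping a wrong one pollutes the cache RePAQ retrieves from.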

Key empirical results underscore RePAQ's proficiency. The PAQ dataset significantly enhances the generalization capacity of the RePAQ model: closed-book QA models trained on PAQ outperform comparable baselines by 5%, yet still trail RePAQ by over 15%, highlighting the value of explicit retrieval. RePAQ can answer over 1,000 questions per second and can be configured for a small footprint (under 500MB) or for speed without substantial losses in accuracy.
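The selective QA behaviour described in the abstract, where RePAQ abstains and "backs off" to a slower, more accurate model when its retrieval confidence is low, can be sketched as follows (the function names, cached data, and threshold value are all illustrative assumptions):

```python
# Toy sketch of the selective-QA back-off scheme: the fast retriever answers
# only when its match confidence clears a threshold; otherwise the question
# is deferred to an expensive retrieve-and-read model.

def fast_retriever(question):
    """Stand-in for RePAQ: returns (answer, confidence)."""
    cache = {"capital of france": ("Paris", 0.95)}
    return cache.get(question, ("unknown", 0.10))

def slow_reader(question):
    """Stand-in for an expensive retrieve-and-read model (e.g. FiD)."""
    return "answer from full retrieve-and-read pipeline"

def answer_with_backoff(question, threshold=0.5):
    answer, confidence = fast_retriever(question)
    if confidence >= threshold:
        return answer, "fast"
    return slow_reader(question), "slow"
```

If most test questions are confidently matched in the cache, the expensive model runs only on the residue, which is how the combined system can be both more accurate and roughly twice as fast as the expensive model alone.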

From a theoretical standpoint, PAQ represents an intriguing intersection of QA-pair and knowledge retrieval paradigms, challenging the community to rethink how LLM architectures might optimize the balance between model size, inference speed, and accuracy. The findings suggest a paradigm shift toward systems that intelligently pre-cache probable questions, reducing the need for intensive retrieval operations at inference time.

On a speculative note, as AI evolves, further improvements might involve dynamic augmentation of QA-pairs based on real-world trends, adaptive-learning mechanisms that tailor responses to user preferences, and even deploying these advancements across multi-modal scenarios, bridging the gap between textual and visual QA systems.

In summary, "PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them" presents a detailed exploration and validation of QA-pair retrieval methodologies in ODQA contexts, offering a valuable resource and a basis for future enhancements in stored knowledge access methodologies within artificial intelligence systems.

Authors (8)
  1. Patrick Lewis
  2. Yuxiang Wu
  3. Linqing Liu
  4. Pasquale Minervini
  5. Heinrich Küttler
  6. Aleksandra Piktus
  7. Pontus Stenetorp
  8. Sebastian Riedel
Citations (219)