
Reading Wikipedia to Answer Open-Domain Questions (1704.00051v2)

Published 31 Mar 2017 in cs.CL

Abstract: This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.

Reading Wikipedia to Answer Open-Domain Questions

The paper "Reading Wikipedia to Answer Open-Domain Questions," authored by Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes, addresses the ambitious task of open-domain question answering (QA) using Wikipedia as the singular knowledge source. The researchers confront the challenges inherent to machine reading at scale by introducing DrQA, a system that integrates document retrieval and machine comprehension to accurately answer factoid questions.

A distinctive aspect of this research is its exclusive reliance on Wikipedia, unlike QA systems such as IBM's DeepQA, which amalgamate varied sources including knowledge bases (KBs), news articles, and books. The authors argue that restricting the system to a single, comprehensive knowledge source forces the machine comprehension component to find precise answers in text, rather than relying on redundancy across many resources.

System Architecture

The DrQA system comprises two primary components: Document Retriever and Document Reader.

  1. Document Retriever: This module efficiently narrows the search space. It uses bigram hashing and TF-IDF matching to retrieve a small subset of relevant Wikipedia articles for the input question. Empirical results demonstrate this approach's superiority over Wikipedia's built-in search engine: for up to 86.0% of questions, a document containing the correct answer appears among the top 5 retrieved articles. A retrieval sketch follows this list.
  2. Document Reader: This module is a multi-layer recurrent neural network (RNN) for machine comprehension. It processes each paragraph of the retrieved articles to identify answer spans. The model combines several input features, including pre-trained word embeddings, exact-match indicators, and aligned question embeddings, and encodes paragraphs and questions with bidirectional long short-term memory (LSTM) networks. Its performance on the SQuAD dataset is competitive, achieving 70.0% exact match (EM) and 79.0% F1, surpassing several state-of-the-art models at the time; a span-scoring sketch follows below.
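
Here is a minimal sketch of the Document Retriever's core idea, not the released DrQA code (which hashes bigrams with murmur3 into 2^24 bins): n-grams are hashed into a fixed number of buckets (the "hashing trick") and articles are ranked by TF-IDF similarity to the question. The toy corpus, bucket count, and sklearn stand-ins are illustrative assumptions.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for Wikipedia articles.
docs = [
    "Alan Turing was a pioneer of theoretical computer science.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Paris is the capital and most populous city of France.",
]

# Hash uni- and bigrams into 2^20 buckets, keeping raw counts so the
# TF-IDF weighting happens in a separate step.
hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**20,
                           alternate_sign=False, norm=None)
tfidf = TfidfTransformer()
doc_vecs = tfidf.fit_transform(hasher.transform(docs))

def retrieve(question, k=2):
    """Return indices of the top-k articles by TF-IDF cosine similarity."""
    q_vec = tfidf.transform(hasher.transform([question]))
    scores = cosine_similarity(q_vec, doc_vecs).ravel()
    return scores.argsort()[::-1][:k]

print(retrieve("What is the capital of France?"))  # e.g. [2 1]
```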

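A companion sketch of the Document Reader's span scoring, under stated assumptions: the random tensors stand in for real BiLSTM paragraph encodings and the attention-weighted question summary, while the bilinear start/end scores and the 15-token cap on span length follow the paper's description.

```python
import torch
import torch.nn as nn

hidden = 128                    # size of each token's BiLSTM encoding
len_p = 40                      # paragraph length in tokens

torch.manual_seed(0)
P = torch.randn(len_p, hidden)  # placeholder BiLSTM outputs per token
q = torch.randn(hidden)         # placeholder question summary vector

W_start = nn.Linear(hidden, hidden, bias=False)
W_end = nn.Linear(hidden, hidden, bias=False)

# Bilinear scores: s_i = p_i . (W q), one per token, for start and end.
p_start = torch.softmax(P @ W_start(q), dim=0)
p_end = torch.softmax(P @ W_end(q), dim=0)

# Pick the span (i, j) maximizing P_start(i) * P_end(j), with i <= j <= i + 15.
score = p_start.unsqueeze(1) * p_end.unsqueeze(0)           # len_p x len_p
score = torch.triu(score) - torch.triu(score, diagonal=16)  # keep 0 <= j - i <= 15
i, j = divmod(score.argmax().item(), len_p)
print(f"answer span: tokens {i}..{j}")
```
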
Evaluation and Results

Document Retrieval

The Document Retriever component was evaluated across multiple QA datasets: SQuAD, CuratedTREC, WebQuestions, and WikiMovies. Top-5 retrieval accuracy ranged from 70.3% on WikiMovies to 86.0% on CuratedTREC, outperforming Wikipedia's built-in search on every dataset; bigram hashing was most valuable on WikiMovies, lifting accuracy from 54.4% to 70.3%.

Machine Comprehension

The Document Reader was specifically tested on the SQuAD dataset, yielding robust results. An ablation study further highlighted the importance of input features such as the aligned question embeddings, which contributed notably to comprehension accuracy (a sketch of this feature follows below).
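
To make the ablated feature concrete, here is a minimal sketch of the aligned question embedding, with random vectors standing in for real pre-trained word embeddings: each paragraph word receives a soft, attention-weighted mixture of question word embeddings, so that, for example, "car" in a paragraph can partially match "vehicle" in the question. The single ReLU projection alpha(.) follows the paper's description.

```python
import torch
import torch.nn as nn

emb_dim = 300
len_p, len_q = 30, 8

torch.manual_seed(0)
E_p = torch.randn(len_p, emb_dim)  # paragraph word embeddings E(p_i)
E_q = torch.randn(len_q, emb_dim)  # question word embeddings E(q_j)

alpha = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.ReLU())

# a_{i,j} = softmax_j( alpha(E(p_i)) . alpha(E(q_j)) )
attn = torch.softmax(alpha(E_p) @ alpha(E_q).T, dim=1)  # len_p x len_q

# f_align(p_i) = sum_j a_{i,j} E(q_j), one aligned vector per paragraph word.
f_align = attn @ E_q                                    # len_p x emb_dim
print(f_align.shape)  # torch.Size([30, 300])
```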

Full System Performance

DrQA was rigorously tested in an open-domain setting across the four datasets. Three training setups were compared: training only on SQuAD, fine-tuning with distant supervision (DS), and a multitask learning approach combining all datasets. The multitask model with DS demonstrated the best performance across all datasets, indicating the benefits of leveraging diverse data sources and training paradigms.
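
A minimal sketch of the distant-supervision step may help here: for a QA pair with no annotated span, paragraphs from the retrieved articles that contain the answer string are kept, and the string's location becomes a pseudo span label. The paper applies additional filtering heuristics (e.g., question/paragraph word overlap) that this sketch omits.

```python
import re

def make_ds_examples(question, answer, paragraphs):
    """Turn retrieved paragraphs into (paragraph, start, end) pseudo-labels."""
    examples = []
    pattern = re.compile(re.escape(answer), re.IGNORECASE)
    for para in paragraphs:
        m = pattern.search(para)
        if m:  # answer string found: use its offsets as a distant span label
            examples.append((para, m.start(), m.end()))
    return examples

paragraphs = [
    "Paris is the capital and most populous city of France.",
    "France is a country in Western Europe.",
]
print(make_ds_examples("What is the capital of France?", "Paris", paragraphs))
# [('Paris is the capital and most populous city of France.', 0, 5)]
```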

Implications and Future Directions

The implications of this work are substantial for the fields of NLP and information retrieval. By demonstrating that Wikipedia alone can serve as an effective knowledge source for open-domain QA, the paper paves the way for more streamlined, resource-efficient QA systems. The use of distant supervision and multitask learning highlights how integrating diverse datasets can enhance model robustness and generalization.

Future research could improve DrQA by integrating the Document Reader more tightly with the Document Retriever, for instance by training both components end-to-end. Further advances in fine-grained document retrieval and robust paragraph comprehension could also improve the system's handling of the broad diversity of queries encountered in real-world applications.

In summary, this paper effectively tackles the significant challenge of machine reading at scale. The proposed DrQA system marks a notable advancement, demonstrating the feasibility and efficacy of using Wikipedia as a singular, comprehensive knowledge source for open-domain question answering.
