MS MARCO: A Comprehensive Machine Reading Comprehension Dataset
The paper "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset" introduces a large-scale, realistic dataset aimed at advancing research in machine reading comprehension (MRC) and open domain question answering (QA). Authored by researchers from Microsoft AI Research, the dataset - MS MARCO - addresses several limitations prevalent in existing MRC datasets while proposing novel benchmark tasks to stimulate further progress in the field.
Dataset Composition and Characteristics
MS MARCO is distinguished by its extensive scale and the real-world origin of its questions and passages. Key features of the dataset include:
- 1,010,916 Questions: Derived from anonymized Bing search logs, ensuring they reflect true user queries.
- 8,841,823 Passages: Sourced from 3,563,535 web documents retrieved by Bing, representing authentic and diverse textual information.
- 182,669 Well-Formed Answers: Composed by editors who synthesize information from the retrieved passages and then rewrite the answer so that it reads naturally on its own.
Unlike prior datasets, which often rely on crowdsourced questions and curated text spans, MS MARCO uses actual search queries, so it captures the complexity and variability of natural information-seeking behavior. Furthermore, while many MRC datasets contain only answerable questions, MS MARCO includes questions that cannot be answered from the retrieved passages, enhancing the robustness of model evaluations. The record sketch below illustrates how these elements appear in the released data.
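To make the composition concrete, the sketch below shows roughly what a single MS MARCO question-answer record looks like and how answerability can be checked. The field names follow my understanding of the publicly released QnA JSON format (query, query_type, passages with is_selected flags, answers, wellFormedAnswers), and all values are invented for illustration; treat both the schema and the no-answer marker as assumptions rather than an authoritative specification.

```python
# Illustrative MS MARCO QnA record (field names reflect the released v2.1
# JSON format as commonly described; the values are invented for this example).
example = {
    "query_id": 1102432,                       # hypothetical identifier
    "query": "average temperature in seattle in june",
    "query_type": "NUMERIC",
    "passages": [
        {"is_selected": 1, "url": "https://example.com/weather",
         "passage_text": "In June, Seattle averages around 65 degrees F ..."},
        {"is_selected": 0, "url": "https://example.com/city",
         "passage_text": "Seattle is a city in the state of Washington ..."},
    ],
    # Editors write the answer from the selected passages; unanswerable
    # questions are conventionally marked with the string "No Answer Present.".
    "answers": ["The average temperature in Seattle in June is about 65 F."],
    # Only a subset of questions carries a fully rewritten well-formed answer.
    "wellFormedAnswers": [],
}

def is_answerable(record: dict) -> bool:
    """Treat a record as answerable unless it carries the no-answer marker."""
    answers = record.get("answers", [])
    return bool(answers) and answers != ["No Answer Present."]

print(is_answerable(example))  # True
```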
Proposed Tasks
The paper introduces three core tasks that leverage the unique characteristics of MS MARCO:
- Answerability Prediction and Answer Generation: Determine whether a question is answerable given a set of context passages and, if so, extract or synthesize a relevant answer.
- Well-formed Answer Generation: Generate answers that are coherent and contextually complete even when considered in isolation from the question.
- Passage Ranking: Rank retrieved passages by their relevance to the question, facilitating improvements in information retrieval models; a sketch of the MRR@10 metric commonly used to score this task follows below.
These tasks are designed to test various aspects of MRC systems, from understanding and synthesis to retrieval and reasoning under realistic conditions.
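For the passage ranking task, the metric commonly associated with the MS MARCO leaderboard is MRR@10: the mean reciprocal rank of the first relevant passage within each model's top 10 results. Below is a minimal, self-contained sketch of that metric; the query and passage identifiers are hypothetical, and the official evaluation scripts may differ in detail.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank at cutoff k.

    rankings: dict mapping query_id -> list of passage_ids ordered by the model.
    relevant: dict mapping query_id -> set of passage_ids judged relevant.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        judged = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in judged:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0

# Toy example: the relevant passage appears at rank 2 for q1 and rank 1 for q2.
print(mrr_at_k({"q1": ["p9", "p3", "p7"], "q2": ["p5"]},
               {"q1": {"p3"}, "q2": {"p5"}}))  # (1/2 + 1/1) / 2 = 0.75
```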
Comparative Analysis with Existing Datasets
The paper compares MS MARCO with several prominent MRC datasets, emphasizing its advantages in scale and realism (see the paper's table comparing MS MARCO with other MRC datasets). Unlike datasets such as SQuAD, NewsQA, and NarrativeQA, whose questions are written by crowdworkers or derived from existing text, MS MARCO's questions originate from actual user queries. Additionally, the diversity and authenticity of the web-sourced passages make it a more challenging and representative benchmark for MRC systems.
Experimental Validation
The paper details initial experiments on the v1.1 and v2.1 releases of the dataset. The results illustrate both the difficulty of MS MARCO and its suitability as a benchmark. Key findings include:
- Generative Model Performance: Sequence-to-Sequence models and Memory Networks showed varying levels of efficacy, with ROUGE-L scores leaving substantial headroom in answer generation (a minimal ROUGE-L sketch follows this list).
- Cloze-Style Test Performance: Models such as AS Reader and ReasoNet performed respectably on the cloze-style subset, yet left room for improvement on questions with numeric answers.
- Human Baseline Evaluation: An ensemble of answers from top-performing editors established a challenging reference point for computational models, underlining the dataset's difficulty.
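ROUGE-L, the answer-generation metric cited above, scores a candidate answer by the longest common subsequence (LCS) it shares with a reference answer. The following is a minimal sketch over whitespace tokens; the beta value mirrors the commonly used COCO-caption implementation, and the official MS MARCO evaluation script additionally normalizes text and handles multiple references, so treat this as illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score between a candidate answer and a reference answer."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)

print(round(rouge_l("about 65 degrees f", "the average is about 65 degrees f"), 3))
```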
Implications and Future Directions
MS MARCO's scale and real-world grounding make it well suited for developing and benchmarking advanced MRC and QA models. Its realistic questions and noisy, web-sourced passages require robust, versatile systems that can handle ambiguous inputs, driving innovation in model architectures and training methodology.
Future developments may include expanding the dataset to cover multilingual and cross-domain scenarios, refining evaluation metrics to better assess diverse answer generation, and fostering a collaborative environment wherein the research community can contribute to and benefit from this extensive dataset.
In conclusion, MS MARCO represents a pivotal step towards more realistic and effective MRC and QA systems, offering a comprehensive and challenging benchmark for current and future research in AI-driven reading comprehension.