A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (1704.05426v4)

Published 18 Apr 2017 in cs.CL

Abstract: This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding. In addition to being one of the largest corpora available for the task of NLI, at 433k examples, this corpus improves upon available resources in its coverage: it offers data from ten distinct genres of written and spoken English--making it possible to evaluate systems on nearly the full complexity of the language--and it offers an explicit setting for the evaluation of cross-genre domain adaptation.

Authors (3)
  1. Adina Williams (72 papers)
  2. Nikita Nangia (17 papers)
  3. Samuel R. Bowman (103 papers)
Citations (4,211)

Summary

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The paper "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference" by Adina Williams, Nikita Nangia, and Samuel R. Bowman introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset specifically designed to improve the development and evaluation of machine learning models for sentence understanding. This corpus, comprising 433,000 examples, builds upon existing resources, notably the Stanford NLI (SNLI) corpus, by expanding coverage and difficulty.

Introduction and Motivation

Natural Language Inference (NLI), also referred to as Recognizing Textual Entailment (RTE), is the task of determining whether a hypothesis sentence is entailed by, contradicts, or is neutral with respect to a premise sentence. The task serves as a key benchmark for assessing how well representation learning models capture sentence meaning. Existing corpora such as SNLI have enabled substantial progress in natural language understanding, but they fall short in coverage and complexity because they are drawn from a single genre (image captions); as a result, SNLI consists mostly of short, simple sentences.
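To make the three-way labeling concrete, the sketch below pairs one premise with three hypotheses, one per label. The sentences are hand-written for illustration and are not drawn from the corpus.

```python
# Illustrative NLI examples (hand-written, not taken from MultiNLI or SNLI):
# each example pairs a premise with a hypothesis and one of three labels.
examples = [
    {
        "premise": "A senator met with reporters after the vote.",
        "hypothesis": "A politician spoke to the press.",
        "label": "entailment",      # the hypothesis must be true given the premise
    },
    {
        "premise": "A senator met with reporters after the vote.",
        "hypothesis": "The senator announced her resignation.",
        "label": "neutral",         # the hypothesis may or may not be true
    },
    {
        "premise": "A senator met with reporters after the vote.",
        "hypothesis": "The senator refused to speak to anyone.",
        "label": "contradiction",   # the hypothesis cannot be true given the premise
    },
]

for ex in examples:
    print(f"{ex['label']:>13}: {ex['hypothesis']}")
```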

MultiNLI addresses these limitations by offering a larger and more varied dataset covering ten distinct genres of written and spoken English. Because only some of these genres appear in the training set, the corpus supports both matched evaluation (on genres seen during training) and mismatched evaluation (on genres held out from training), providing an explicit setting for studying cross-genre domain adaptation.

Corpus Construction

The MultiNLI dataset consists of premise-hypothesis sentence pairs whose premises are drawn from a variety of genres, including face-to-face conversations, government documents, letters, and fiction, with hypotheses written by crowdworkers. The corpus's size and collection methodology resemble those of SNLI, but the diversity of genres makes MultiNLI a significantly more challenging benchmark.
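For readers who want to inspect the genre structure directly, the following sketch loads the corpus and counts training examples per genre. It assumes the copy distributed on the Hugging Face Hub under the name multi_nli, which exposes premise, hypothesis, genre, and label fields along with train, validation_matched, and validation_mismatched splits.

```python
# A minimal sketch of inspecting MultiNLI, assuming the Hugging Face Hub copy
# named "multi_nli" (fields: premise, hypothesis, genre, label).
from collections import Counter

from datasets import load_dataset

mnli = load_dataset("multi_nli")

# Count how many training examples come from each genre.
genre_counts = Counter(mnli["train"]["genre"])
for genre, count in genre_counts.most_common():
    print(f"{genre:20s} {count:7d}")

# The matched split shares genres with training; the mismatched split does not.
example = mnli["validation_mismatched"][0]
print(example["genre"], "|", example["premise"], "=>", example["hypothesis"])
```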

The authors assessed the reliability of the dataset through a validation process similar to the one used for SNLI: development and test examples were relabeled by multiple additional annotators, confirming high inter-annotator agreement.

Empirical Findings

In comparing MultiNLI with SNLI, the authors found that MultiNLI exhibits higher linguistic complexity and diversity. The average length of premise sentences in MultiNLI is significantly longer than those in SNLI, and the genres encompass more intricate phenomena like temporal reasoning, belief, and modality. Consequently, existing machine learning models showed a notable performance drop when evaluated on MultiNLI as opposed to SNLI, highlighting the increased difficulty of the new corpus.
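As a rough illustration of the length difference, the sketch below compares average premise length (in whitespace tokens) between the two corpora, again assuming the Hugging Face Hub copies named snli and multi_nli; the exact numbers will differ from the paper's tokenization.

```python
# A rough comparison of average premise length, assuming the Hugging Face Hub
# datasets "snli" and "multi_nli" (both expose a "premise" field).
from datasets import load_dataset


def avg_premise_tokens(name: str, split: str) -> float:
    data = load_dataset(name, split=split)
    lengths = [len(premise.split()) for premise in data["premise"]]
    return sum(lengths) / len(lengths)


print("SNLI    :", round(avg_premise_tokens("snli", "validation"), 1))
print("MultiNLI:", round(avg_premise_tokens("multi_nli", "validation_matched"), 1))
```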

Baseline Models

The paper evaluates three neural network models: a Continuous Bag of Words (CBOW) sentence encoder, a bidirectional LSTM (BiLSTM) encoder, and the Enhanced Sequential Inference Model (ESIM). These models were trained and evaluated on both SNLI and MultiNLI. All three scored substantially lower on MultiNLI than on SNLI, indicating that techniques that work well on SNLI's comparatively simple sentence pairs do not transfer directly to the more complex, cross-genre examples in MultiNLI.
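The simplest of these baselines can be sketched compactly: each sentence is encoded as the average of its word embeddings, the two sentence vectors are combined, and a small classifier predicts the three-way label. The sketch below assumes PyTorch; the dimensions and the [u; v; |u - v|; u * v] feature combination are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch of a CBOW-style NLI baseline in PyTorch.
# Hyperparameters and the feature combination are illustrative assumptions,
# not the exact configuration used in the paper.
import torch
import torch.nn as nn


class CBOWClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 300):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.mlp = nn.Sequential(
            nn.Linear(4 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # entailment / neutral / contradiction
        )

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Average the word embeddings of non-padding tokens.
        mask = (token_ids != 0).unsqueeze(-1).float()
        embedded = self.embedding(token_ids) * mask
        return embedded.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    def forward(self, premise_ids: torch.Tensor, hypothesis_ids: torch.Tensor) -> torch.Tensor:
        u = self.encode(premise_ids)
        v = self.encode(hypothesis_ids)
        features = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.mlp(features)  # logits over the three labels


# Example forward pass with toy token ids.
model = CBOWClassifier(vocab_size=10_000)
premise = torch.randint(1, 10_000, (2, 12))     # batch of 2 premises, 12 tokens each
hypothesis = torch.randint(1, 10_000, (2, 8))
logits = model(premise, hypothesis)             # shape: (2, 3)
print(logits.shape)
```

The BiLSTM baseline replaces the averaging encoder with a recurrent one, and ESIM additionally attends between the two sentences before classification.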

Implications and Future Directions

The introduction of MultiNLI signifies an important step toward more robust and generalizable natural language understanding models. The dataset’s genre diversity and complexity create a more challenging environment for evaluating model performance, thereby pushing the boundaries of current NLI research. Beyond serving as a challenging benchmark, MultiNLI also provides a fertile ground for exploring domain adaptation and transfer learning techniques.

As future research progresses, MultiNLI can be utilized to test and refine models that promise better generalization across diverse linguistic phenomena. The implications of this work extend to various applications within NLP, such as question answering and dialogue systems, which require nuanced understanding and inference capabilities.

Conclusion

The MultiNLI corpus offers a substantial improvement over earlier NLI datasets by providing a richer, more diverse range of sentence pairs that better reflect the complexities of modern English. Its introduction invites the NLP community to develop more sophisticated models capable of handling real-world language variability. This work marks a significant stride in sentence understanding and domain adaptation research, and MultiNLI is poised to remain a cornerstone of NLI research for years to come.