Annotation Artifacts in Natural Language Inference Data (1803.02324v2)

Published 6 Mar 2018 in cs.CL and cs.AI

Abstract: Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2017). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.

Authors (6)
  1. Suchin Gururangan (29 papers)
  2. Swabha Swayamdipta (49 papers)
  3. Omer Levy (70 papers)
  4. Roy Schwartz (74 papers)
  5. Samuel R. Bowman (103 papers)
  6. Noah A. Smith (224 papers)
Citations (1,121)

Summary

Annotation Artifacts in Natural Language Inference Data: An Analytical Summary

The paper "Annotation Artifacts in Natural Language Inference Data" authored by Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith, presents a thorough examination of the presence and impact of annotation artifacts within the field of Natural Language Inference (NLI) datasets, specifically focusing on the Stanford Natural Language Inference (SNLI) and Multi-Genre Natural Language Inference (MultiNLI) datasets.

Key Findings

The primary observation is that the protocol used to generate these datasets leaves linguistic patterns, or artifacts, in the hypotheses that make the inference label predictable without the premise. Using an off-the-shelf text classification model (a minimal hypothesis-only baseline is sketched after the list below), the authors demonstrate that:

  • 67% of SNLI and 53% of MultiNLI examples can be categorized correctly based solely on the hypothesis.
  • The researchers identify that certain linguistic elements, such as negation, purpose clauses, and vague language, correlate highly with specific inference classes.
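
To make the hypothesis-only baseline concrete, the sketch below trains a fastText classifier on hypotheses alone. It assumes SNLI is available locally as JSONL files with "sentence2" (hypothesis) and "gold_label" fields; the file names, bigram features, and epoch count are illustrative choices, not necessarily the paper's exact configuration.

```python
# Hypothesis-only baseline: a minimal sketch using the fastText Python bindings.
# Assumes SNLI is available as JSONL files with "sentence2" (hypothesis) and
# "gold_label" fields; paths and hyperparameters are illustrative.
import json
import fasttext

def to_fasttext_format(jsonl_path, out_path):
    """Write one '__label__<gold_label> <hypothesis>' line per example."""
    with open(jsonl_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            ex = json.loads(line)
            if ex["gold_label"] == "-":  # skip examples with no consensus label
                continue
            hyp = ex["sentence2"].replace("\n", " ")
            fout.write(f"__label__{ex['gold_label']} {hyp}\n")

to_fasttext_format("snli_1.0_train.jsonl", "hyp_train.txt")
to_fasttext_format("snli_1.0_test.jsonl", "hyp_test.txt")

# Train a bag-of-words/bigram classifier on the hypotheses alone (no premise).
model = fasttext.train_supervised("hyp_train.txt", wordNgrams=2, epoch=10)

# For single-label classification, precision@1 equals accuracy.
n, precision, _ = model.test("hyp_test.txt")
print(f"Hypothesis-only accuracy on {n} examples: {precision:.3f}")
```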

Analysis and Re-evaluation

The analysis is twofold:

  1. Quantifying Annotation Artifacts:
    • A fastText classifier achieves about 67% accuracy on SNLI and about 53% on MultiNLI without ever seeing the premise.
    • The artifacts stem from strategies crowd workers adopt when writing hypotheses, such as removing specific details to create entailments and introducing negation to create contradictions.
  2. Impact on State-of-the-Art Models:
    • Re-evaluating high-performing NLI models (the Decomposable Attention Model, the Enhanced Sequential Inference Model, and the Densely Interactive Inference Network) on test sets split into "Easy" and "Hard" subsets, according to whether the hypothesis-only classifier succeeds or fails on each example, reveals a stark contrast (see the sketch after this list).
    • Performance on the "Hard" subsets is significantly lower, suggesting that current NLI models rely heavily on these artifacts to achieve high scores, thereby misrepresenting the true difficulty of the inference task.
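
The Easy/Hard partition can be approximated by reusing the hypothesis-only model from the previous sketch: examples it classifies correctly form the "Easy" subset, the rest the "Hard" subset. This is a minimal sketch under the same file-format assumptions, not the authors' released split.

```python
# Sketch of the Easy/Hard split: test examples the hypothesis-only classifier
# labels correctly are "Easy"; the rest are "Hard". Continues from the previous
# sketch (reuses `model` and the assumed JSONL fields).
import json

def split_easy_hard(jsonl_path, model):
    easy, hard = [], []
    with open(jsonl_path) as fin:
        for line in fin:
            ex = json.loads(line)
            if ex["gold_label"] == "-":
                continue
            hyp = ex["sentence2"].replace("\n", " ")
            (pred,), _ = model.predict(hyp)  # e.g. '__label__entailment'
            if pred == f"__label__{ex['gold_label']}":
                easy.append(ex)
            else:
                hard.append(ex)
    return easy, hard

easy, hard = split_easy_hard("snli_1.0_test.jsonl", model)
print(f"{len(easy)} Easy examples, {len(hard)} Hard examples")
# Premise-aware NLI models would then be evaluated separately on each subset.
```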

Characteristics of the Artifacts

A detailed analysis highlights how specific words and the length of a hypothesis are indicative of its label (a sketch of how such statistics can be computed follows this list):

  • Lexical Choice:
    • The use of words like "animal" and "outside" for entailments.
    • Frequent use of negations (e.g., "nobody", "no") for contradictions.
    • Higher prevalence of modifiers and superlatives in neutral hypotheses.
  • Sentence Length:
    • Neutral hypotheses are generally longer on average, possibly due to the addition of extraneous details or purpose clauses.
    • Short entailed hypotheses often result from removing specific details from the premise.
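
The lexical and length statistics above can be approximated with simple counting. The sketch below scores word-label association with PMI (the paper uses add-100 smoothing; the smoothing here is a simplified stand-in) and reports the mean hypothesis length per label, assuming naive whitespace tokenization and the same JSONL fields as before.

```python
# Sketch of the artifact analysis: PMI between words and labels (simplified
# smoothing, loosely following the paper's add-100 scheme) plus mean hypothesis
# length per label. Naive whitespace tokenization; same assumed JSONL fields.
import json
import math
from collections import Counter

SMOOTH = 100.0  # additive smoothing on the counts

def hypotheses(jsonl_path):
    with open(jsonl_path) as fin:
        for line in fin:
            ex = json.loads(line)
            if ex["gold_label"] != "-":
                yield ex["gold_label"], ex["sentence2"].lower().split()

joint = Counter()         # (word, label) document counts
word_counts = Counter()   # word document counts
label_counts = Counter()  # examples per label
token_totals = Counter()  # total tokens per label (for average length)

for label, tokens in hypotheses("snli_1.0_train.jsonl"):
    label_counts[label] += 1
    token_totals[label] += len(tokens)
    for tok in set(tokens):  # count each word at most once per hypothesis
        joint[(tok, label)] += 1
        word_counts[tok] += 1

total = sum(label_counts.values())

def pmi(word, label):
    p_joint = (joint[(word, label)] + SMOOTH) / (total + SMOOTH)
    p_word = (word_counts[word] + SMOOTH) / (total + SMOOTH)
    p_label = label_counts[label] / total
    return math.log(p_joint / (p_word * p_label))

for label in label_counts:
    top = sorted(word_counts, key=lambda w: pmi(w, label), reverse=True)[:10]
    avg_len = token_totals[label] / label_counts[label]
    print(f"{label}: avg length {avg_len:.1f}, top words {top}")
```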

Implications and Future Directions

The findings of this paper have significant implications for the development and evaluation of NLI models:

  • Theoretical Implications:
    • The current datasets may not provide a true test of a model's ability to perform natural language inference, which means progress in the field may have been overestimated.
    • There is a need to re-evaluate and possibly redesign NLI benchmarks to ensure they are not biased by the annotation artifacts.
  • Practical Implications:
    • Models trained on such datasets may not generalize well to real-world scenarios where the hypothesis does not contain these biases.
    • Future dataset collection methodologies might incorporate mechanisms to identify and minimize these artifacts during the annotation process.

Future Developments

The discovery of these artifacts opens several avenues for future research:

  • Balanced Datasets:
    • Creation of balanced datasets, where the biases are evenly distributed, could offer a more robust benchmark for assessing model performance.
  • Training Improvements:
    • Strategies to mitigate reliance on artifacts could involve adversarial training, where models are trained to be robust against such simplistic heuristics.
  • Crowdsourcing Refinements:
    • Developing new prompts and training for crowd workers to reduce the systematic patterns that lead to these annotation artifacts.

In conclusion, this paper underscores the importance of scrutinizing dataset creation processes and their impact on model evaluation, highlighting that the path to true natural language understanding remains fraught with challenges that go beyond mere model performance metrics.