
Adversarial NLI: A New Benchmark for Natural Language Understanding (1910.14599v2)

Published 31 Oct 2019 in cs.CL and cs.LG

Abstract: We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.

Authors (6)
  1. Yixin Nie (25 papers)
  2. Adina Williams (72 papers)
  3. Emily Dinan (28 papers)
  4. Mohit Bansal (304 papers)
  5. Jason Weston (130 papers)
  6. Douwe Kiela (85 papers)
Citations (926)

Summary

  • The paper presents the innovative HAMLET procedure that iteratively builds a challenging adversarial dataset.
  • It demonstrates that training on ANLI achieves state-of-the-art NLU performance and significantly improves model robustness.
  • Detailed analyses reveal linguistic complexities and model failure modes through adversarially curated examples.

Adversarial NLI: A New Benchmark for Natural Language Understanding

The paper, "Adversarial NLI: A New Benchmark for Natural Language Understanding," introduces the Adversarial NLI (ANLI) benchmark, which is designed to challenge the capabilities of state-of-the-art NLU models through an innovative human-and-model-in-the-loop data collection methodology called Human-And-Model-in-the-Loop Enabled Training (HAMLET). This paper makes significant contributions to the field by presenting a dataset that progressively increases in difficulty and incorporates iterative adversarial human involvement.

Key Contributions

The paper's key contributions are threefold:

  1. Novel Dataset Creation:
    • The ANLI dataset was collected through three rounds, each employing increasingly robust NLU models to ensure the data progressively challenges current state-of-the-art systems.
    • Each round involves human annotators ("white hat hackers") who generate hypotheses intended to trick the models. Examples are verified by additional human annotators to ensure correctness.
    • The dataset comprises diverse sources, including Wikipedia, news articles, fiction, formal spoken text, and procedural guides.
  2. State-of-the-Art Performance:
    • Training models on ANLI led to state-of-the-art performance on various NLI benchmarks, including SNLI and MNLI.
    • This confirms the hypothesis that adversarially trained models become more robust and generalize better across multiple benchmarks.
  3. Detailed Analysis and Insights:
    • The analysis reveals the specific failure modes and weaknesses of contemporary NLU models.
    • Linguistic and statistical characteristics of the dataset highlight the complexity and structure of the adversarial examples.
    • Hypothesis-only models perform poorly on ANLI, in sharp contrast with their performance on existing datasets, which mitigates concerns that the benchmark can be solved by exploiting spurious statistical patterns in the hypotheses (a minimal baseline of this kind is sketched after this list).
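
To make the hypothesis-only finding concrete, below is a minimal sketch of such a baseline: a bag-of-ngrams classifier that sees only the hypothesis string. The file paths and field names are illustrative assumptions about the released ANLI JSONL files, not details taken from the paper; dev accuracy near chance (roughly one third) is what indicates the absence of exploitable hypothesis artifacts.

```python
# Hypothesis-only baseline: predict the NLI label from the hypothesis alone.
# Paths and field names below are hypothetical; adjust to the actual ANLI release.
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

train = load_jsonl("anli_v1.0/R1/train.jsonl")  # hypothetical local paths
dev = load_jsonl("anli_v1.0/R1/dev.jsonl")

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram + bigram features of the hypothesis only
    LogisticRegression(max_iter=1000),
)
clf.fit([ex["hypothesis"] for ex in train], [ex["label"] for ex in train])

acc = clf.score([ex["hypothesis"] for ex in dev], [ex["label"] for ex in dev])
print(f"Hypothesis-only dev accuracy: {acc:.3f}")  # near 1/3 suggests few hypothesis artifacts
```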

Methodology

HAMLET Procedure

The HAMLET procedure is pivotal to the dataset's creation. It consists of three steps (a code sketch of the loop follows the list):

  1. Initial Model Deployment:
    • A base model, trained on existing NLI datasets like SNLI and MNLI, is used as a starting point.
    • Annotators create context-hypothesis pairs aimed at fooling the model.
  2. Example Verification:
    • Examples that fooled the model are reviewed by additional human annotators, who confirm that the intended label is correct and resolve any disagreements before an example is accepted.
  3. Iterative Training:
    • Verified examples from each round are added to the training set, and a more robust model is trained.
    • Subsequent rounds use stronger models and progressively more challenging contexts, creating a dynamic and continually evolving benchmark.
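
As referenced above, the loop can be summarized schematically. The sketch below illustrates only the control flow; the training pipeline and the annotation and verification steps are injected as hypothetical callables and do not represent the authors' implementation.

```python
# Schematic sketch of the HAMLET collection loop. The three callables passed in
# (train_nli_model, collect_adversarial_examples, verify) are hypothetical
# placeholders for a real training pipeline and the human annotation/verification
# interfaces; only the round-by-round control flow is shown.

def hamlet(base_training_data, context_pools,
           train_nli_model, collect_adversarial_examples, verify,
           num_rounds=3):
    """Run num_rounds of adversarial collection and return the examples gathered per round."""
    training_data = list(base_training_data)        # e.g. SNLI + MNLI for the first round
    collected_rounds = []
    for round_idx in range(num_rounds):
        model = train_nli_model(training_data)      # a stronger model each round
        # Annotators write hypotheses against this round's contexts; only pairs
        # the current model misclassifies come back as candidates.
        candidates = collect_adversarial_examples(model, context_pools[round_idx])
        # Additional annotators confirm the intended label and resolve disagreements.
        verified = [ex for ex in candidates if verify(ex)]
        collected_rounds.append(verified)
        training_data.extend(verified)              # verified examples feed the next round
    return collected_rounds
```

In the actual collection, later rounds also switched to stronger base models and broader context sources, which this sketch reflects only through the growing training set.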

Results and Implications

Benchmark Performance

The ANLI benchmark was extensively evaluated across several dimensions:

  • Dataset Complexity:
    • The dataset includes nuanced linguistic phenomena such as numerical reasoning (represented in 27% of examples), coreference resolution, lexical inference, and tricky wordplay.
    • Validation and test subsets showed strong inter-annotator agreement, underscoring the reliability of the dataset.
  • Model Robustness:
    • Training on ANLI's adversarial data alongside existing NLI corpora significantly improves robustness and performance on traditional NLI benchmarks (a data-combination sketch follows this list).
    • Models show a marked ability to generalize and perform well on stress tests designed to probe weaknesses like negation handling and numerical reasoning.
  • Linguistic Insight:
    • Detailed linguistic analysis highlights the interplay of various inference types within the ANLI dataset.
    • Annotators' creativity in constructing examples that exploit model weaknesses points to a rich avenue for exploring model improvement.
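
As a concrete illustration of the training setup referenced in the list above, the following sketch assembles a combined SNLI + MNLI + ANLI training set of the kind used when fine-tuning on adversarial data. The Hugging Face hub identifiers, split names, and column names are assumptions based on the publicly distributed versions of these corpora, not details specified in the paper.

```python
# Minimal sketch: build a combined NLI training set (SNLI + MNLI + all ANLI rounds).
# Dataset identifiers and split names are assumptions about the public Hugging Face
# versions of these corpora; verify them against the hub before relying on this.
from datasets import load_dataset, concatenate_datasets

snli = load_dataset("snli")["train"].filter(lambda ex: ex["label"] != -1)  # drop unlabeled pairs
mnli = load_dataset("multi_nli")["train"]
anli = load_dataset("anli")

# ANLI ships one training split per collection round (R1-R3).
anli_train = concatenate_datasets([anli["train_r1"], anli["train_r2"], anli["train_r3"]])

# Keep only the shared columns; if the label schemas differ across corpora,
# a Dataset.cast() to a common feature set may be needed before concatenating.
keep = {"premise", "hypothesis", "label"}
parts = [d.remove_columns([c for c in d.column_names if c not in keep])
         for d in (snli, mnli, anli_train)]

combined = concatenate_datasets(parts).shuffle(seed=42)
print(combined)  # a single training set ready for standard NLI fine-tuning
```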

Theoretical and Practical Implications

The ANLI benchmark and HAMLET procedure represent significant advances in understanding and improving NLU models' robustness. The dynamic nature of the adversarial collection process means that ANLI can remain a relevant benchmark for future models. This approach mitigates saturation issues seen in static benchmarks, providing a continually moving target for NLU research.

Future Directions

The ANLI benchmark opens several avenues for future research:

  • Continued Rounds:
    • Future rounds of data collection could further increase model robustness, addressing newly emerging weaknesses.
  • Application to Other Domains:
    • The HAMLET procedure could be adapted for various classification tasks beyond NLI, including ranking tasks and those necessitating hard-negative generation.
  • Cross-Model Comparison:
    • Exploration of how different model architectures handle adversarial data can offer insights into designing more resilient NLU systems.

In conclusion, "Adversarial NLI: A New Benchmark for Natural Language Understanding" makes a substantial contribution to the field with its dynamic benchmarking approach, its robust dataset, and its clear implications for improving NLU models.
