Breaking NLI Systems with Sentences that Require Simple Lexical Inferences (1805.02266v1)

Published 6 May 2018 in cs.CL

Abstract: We create a new NLI test set that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge. The new examples are simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set is substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences.

Authors (3)
  1. Max Glockner (9 papers)
  2. Vered Shwartz (49 papers)
  3. Yoav Goldberg (142 papers)
Citations (361)

Summary

  • The paper demonstrates that NLI systems experience a significant performance drop on simple lexical inferences despite high benchmark scores.
  • It shows that adding external lexical knowledge, as in the KIM model, reduces but does not eliminate the performance drop.
  • The study calls for complementary evaluation sets and modeling strategies that move toward genuine natural language comprehension.

Introduction

Natural Language Inference (NLI), historically termed Recognizing Textual Entailment (RTE), is a benchmark task in NLP: given a premise sentence and a hypothesis sentence, a system must decide whether the hypothesis can be inferred from the premise. NLI has seen considerable advances, particularly after the introduction of the Stanford Natural Language Inference (SNLI) dataset, which enabled the development of many end-to-end neural models with impressive reported performance. However, the investigation by Max Glockner, Vered Shwartz, and Yoav Goldberg raises questions about how well these models actually understand language, and in particular whether they can apply the lexical and world knowledge needed for simple inferences.
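To make the task format concrete, here is a minimal sketch of the SNLI-style three-way labeling scheme; the premise/hypothesis pairs are invented for illustration and are not items from the dataset.

```python
# Illustrative NLI examples in the SNLI three-way scheme.
# (premise, hypothesis, gold label) -- the sentences are made up for illustration.
nli_examples = [
    ("A man is playing a guitar on stage.", "A man is playing an instrument.", "entailment"),
    ("A man is playing a guitar on stage.", "A man is playing a piano.",       "contradiction"),
    ("A man is playing a guitar on stage.", "A man is performing for money.",  "neutral"),
]

for premise, hypothesis, label in nli_examples:
    print(f"P: {premise}\nH: {hypothesis}\n-> {label}\n")
```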

Lexical Inference in NLI Systems

The research underscores a critical vulnerability in current state-of-the-art NLI systems. Despite achieving high accuracy on benchmark datasets, these systems falter when faced with sentences that involve simple lexical inferences such as synonyms, antonyms, hypernyms, and co-hyponyms. To isolate this issue, the authors created a new NLI test set, carefully designed so sentences differ by at most one word from those in the training set, ensuring that the vocabulary remains unchanged. The results across various NLI systems demonstrated a substantial performance drop on this new test set, indicating a profound deficiency in capturing basic lexical relationships inherent in human language understanding.
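As a rough sketch of how such single-word substitutions could be generated, the snippet below swaps one word for a WordNet synonym or antonym and assigns the corresponding label. This is an illustration using WordNet via nltk, not the authors' actual construction pipeline; the example sentence and the labeling heuristic are assumptions.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def single_word_variants(sentence, target):
    """Yield (new_sentence, label) pairs that differ from `sentence` by one word,
    using WordNet synonyms (-> entailment) and antonyms (-> contradiction).
    An illustrative heuristic, not the paper's exact construction procedure."""
    for synset in wn.synsets(target):
        for lemma in synset.lemmas():
            word = lemma.name().replace("_", " ")
            if word.lower() != target.lower():
                yield sentence.replace(target, word), "entailment"
            for antonym in lemma.antonyms():
                yield sentence.replace(target, antonym.name().replace("_", " ")), "contradiction"

premise = "The man is very happy"
for hypothesis, label in sorted(set(single_word_variants(premise, "happy"))):
    print(f"{label:13s} {hypothesis}")
```

In the actual test set the substitutions were additionally validated by humans, so a heuristic like this should be read only as a sketch of the single-word-difference constraint.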

The Impact of Lexical Knowledge

The authors provide evidence that integrating external lexical knowledge into NLI models helps only partially. They examined KIM (Knowledge-based Inference Model), which incorporates such knowledge, and observed a markedly smaller, though still present, drop in performance on the new test set, implying that while lexical knowledge is crucial, current methods for incorporating it into NLI models remain insufficient. A striking outcome is that simple baselines built on WordNet performed remarkably well on the test set, further highlighting that NLI models trained on SNLI and similar datasets struggle with elementary lexical inferences even though the new examples contain no unseen words and are constructed to be simpler than the original test set.
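A minimal sketch of a WordNet-style baseline in the spirit of the ones reported (the exact decision rules below are assumptions for illustration, not the paper's implementation): given the single word pair that differs between premise and hypothesis, predict entailment for synonyms and hypernyms, contradiction for antonyms and co-hyponyms, and neutral otherwise.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def wordnet_baseline(premise_word, hypothesis_word):
    """Predict an NLI label from the single word pair that differs between
    premise and hypothesis. The decision rules are a simplified illustration."""
    h_synsets = wn.synsets(hypothesis_word)
    for ps in wn.synsets(premise_word):
        # Synonyms share a synset -> entailment
        if ps in h_synsets:
            return "entailment"
        # Hypothesis word is a hypernym of the premise word -> entailment
        if any(hs in ps.closure(lambda s: s.hypernyms()) for hs in h_synsets):
            return "entailment"
        # Antonyms -> contradiction
        for lemma in ps.lemmas():
            if any(a.name() == hypothesis_word for a in lemma.antonyms()):
                return "contradiction"
        # Co-hyponyms (shared direct hypernym, different synsets) -> contradiction
        if any(set(ps.hypernyms()) & set(hs.hypernyms()) for hs in h_synsets):
            return "contradiction"
    return "neutral"

print(wordnet_baseline("guitar", "instrument"))  # hypernym    -> entailment
print(wordnet_baseline("happy", "unhappy"))      # antonym     -> contradiction
print(wordnet_baseline("guitar", "piano"))       # co-hyponyms -> contradiction
```

That a rule-based lookup of this kind can rival trained neural systems on the new test set is the paper's point: the failing inferences are ones a lexical resource already encodes.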

Implications and Future Directions

The paper's findings have notable implications for the NLP community. They indicate that current benchmark datasets may not be adequate measures of true language understanding. This observation aligns with other research showing that models can often predict labels from hypothesis sentences alone by exploiting dataset artifacts such as word choice and sentence length. The proposed test set serves as a complementary evaluation resource that can be used alongside existing benchmarks to better assess a model's lexical inference ability. For future model development, these results underscore the need to go beyond current neural methods and develop strategies that effectively incorporate lexical and world knowledge.
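To illustrate the hypothesis-only artifact mentioned above, the toy sketch below trains a bag-of-words classifier that never sees the premise; the scikit-learn pipeline and the tiny invented training set are assumptions for illustration, not the setup of the cited studies.

```python
# Toy hypothesis-only baseline: predicts the NLI label from the hypothesis alone,
# illustrating how dataset artifacts (word choice, length) can leak label information.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up (hypothesis, label) pairs; real studies train on the full SNLI training set.
hypotheses = [
    "A man is outdoors.", "A person is sleeping.", "The woman is eating lunch.",
    "Someone is outside.", "Nobody is playing.",   "The man is at a restaurant.",
]
labels = ["entailment", "contradiction", "neutral",
          "entailment", "contradiction", "neutral"]

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(hypotheses, labels)

# Negation words such as "nobody" are a commonly cited contradiction cue in SNLI;
# a classifier can latch onto them without ever reading the premise.
print(clf.predict(["Nobody is outside."]))
```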

In conclusion, the authors present a compelling case for re-evaluating our assessment of NLI systems. While neural models have shown exceptional performance on existing benchmarks, their ability to understand and process simple lexical inferences remains limited. This paper stands as a call to action for the NLP community to place a greater emphasis on creating models that can genuinely comprehend language, as humans do, rather than merely overfitting to dataset-specific idiosyncrasies.