Introduction
Transformer-based large language models (LLMs) have shifted the landscape of Natural Language Understanding (NLU), with performance benchmarks suggesting a high capability for syntactic, logical, and semantic comprehension. This paper presents evidence that such claims may be overstated, as state-of-the-art Natural Language Inference (NLI) models demonstrate significant sensitivity to minor, semantics-preserving variations in surface form. This suggests that the models' apparent grasp of compositional semantics may be an artifact of strong benchmark performance rather than genuine understanding.
Semantic Sensitivity of NLI Models
The paper introduces a systematic framework for measuring semantic sensitivity: LLMs are used to generate minor variations of hypothesis statements that preserve semantic equivalence. When these generated statements are evaluated against the original premise, the models' predictions change significantly, even though the models had correctly identified the relation between the premise and the original hypothesis. Strikingly, model performance degrades by an average of 12.92% in in-domain and 23.71% in out-of-domain settings.
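A minimal sketch of this probe is given below. It assumes the Hugging Face transformers library, a public MNLI checkpoint (roberta-large-mnli), and hand-written variations in place of the LLM-generated ones used in the paper; the check is simply whether the predicted label survives a meaning-preserving rewrite of the hypothesis.

```python
# Sketch of the sensitivity check. The checkpoint and the example
# sentences/variations are illustrative assumptions, not the paper's data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def predict(premise: str, hypothesis: str) -> str:
    """Return the NLI label (CONTRADICTION / NEUTRAL / ENTAILMENT) for a pair."""
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax(-1).item()]

premise = "A man is playing a guitar on stage."
hypothesis = "A musician is performing."
# Semantics-preserving variations; the paper generates these with an LLM.
variations = [
    "A musician is giving a performance.",
    "There is a musician performing.",
]

original_label = predict(premise, hypothesis)
for var in variations:
    new_label = predict(premise, var)
    # A label flip under a meaning-preserving rewrite signals semantic sensitivity.
    if new_label != original_label:
        print(f"Prediction flipped: {original_label} -> {new_label} for '{var}'")
```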
Investigating Model Performance Across Datasets and Architectures
The study evaluates a range of transformer architectures, including RoBERTa, BART, DeBERTa, and DistilBart, across multiple NLI datasets. The findings point to a pervasive semantic sensitivity that is apparently independent of model size or training domain. Interestingly, distilled models exhibit higher sensitivity to semantic variation than their larger counterparts, suggesting that knowledge of compositional semantics is not robustly transferred during distillation.
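A rough illustration of such a cross-architecture comparison might look as follows; the checkpoint names are public Hugging Face MNLI models assumed here for illustration (not necessarily those evaluated in the paper), and the toy triples stand in for the generated evaluation data.

```python
# Compare a simple "flip rate" across several NLI checkpoints.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative public checkpoints; substitute the ones under study.
CHECKPOINTS = [
    "roberta-large-mnli",
    "facebook/bart-large-mnli",
    "microsoft/deberta-large-mnli",
    "valhalla/distilbart-mnli-12-1",
]

def label_for(model, tok, premise, hypothesis):
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax(-1).item()]

def flip_rate(checkpoint, triples):
    """Fraction of (premise, hypothesis, variation) triples whose label flips."""
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    flips = sum(
        label_for(model, tok, p, h) != label_for(model, tok, p, v)
        for p, h, v in triples
    )
    return flips / len(triples)

triples = [
    ("A man is playing a guitar on stage.",
     "A musician is performing.",
     "A musician is giving a performance."),
]
for ckpt in CHECKPOINTS:
    print(ckpt, flip_rate(ckpt, triples))
```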
Impact on Predictive Consistency and Implications
Further analysis indicates that semantic sensitivity leads not only to performance degradation but also to inconsistent predictions. Evaluations show that models exhibit fluctuating confidence and a tendency to make contradictory decisions when faced with semantically equivalent variations. This undermines the models' robustness and calls into question their reliability for tasks requiring an understanding of nuanced semantic structure.
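One way such inconsistency could be quantified is sketched below, under the assumption of a public roberta-large-mnli checkpoint and illustrative example sentences: compare softmax confidence across semantically equivalent hypotheses and flag cases where mutually exclusive labels are assigned.

```python
# Consistency probe: confidence spread and contradictory labels across
# equivalent hypotheses. Checkpoint and sentences are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def label_and_confidence(premise, hypothesis):
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(-1).squeeze(0)
    idx = probs.argmax().item()
    return model.config.id2label[idx], probs[idx].item()

premise = "A man is playing a guitar on stage."
equivalent_hypotheses = [
    "A musician is performing.",
    "A musician is giving a performance.",
    "There is a musician performing.",
]

results = [label_and_confidence(premise, h) for h in equivalent_hypotheses]
labels = {lab for lab, _ in results}
confidences = [conf for _, conf in results]

# Fluctuating confidence: a large spread across equivalent inputs.
print("confidence spread:", max(confidences) - min(confidences))
# Contradictory decisions: mutually exclusive labels for equivalent inputs.
if {"ENTAILMENT", "CONTRADICTION"} <= labels:
    print("Contradictory labels across equivalent hypotheses:", labels)
```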
Conclusion
This research positions itself as a critical reflection on the presumed comprehension abilities of transformer-based NLI models. While the models excel on standard benchmarks, their grasp of semantic subtleties proves to be more fragile and less robust than previously assumed. The paper is a call for more rigorous testing methods that engage the finer points of language comprehension, beyond the blunt instruments of current benchmarks, to truly ascertain the capabilities of LLMs in semantic understanding.