Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models (2401.14440v2)

Published 25 Jan 2024 in cs.CL, cs.AI, cs.CY, and cs.LG

Abstract: Recent studies of the emergent capabilities of transformer-based Natural Language Understanding (NLU) models have indicated that they have an understanding of lexical and compositional semantics. We provide evidence that suggests these claims should be taken with a grain of salt: we find that state-of-the-art Natural Language Inference (NLI) models are sensitive to minor, semantics-preserving surface-form variations, which lead to sizable inconsistencies in model decisions during inference. Notably, this behaviour differs from valid, in-depth comprehension of compositional semantics, yet it emerges neither when evaluating model accuracy on standard benchmarks nor when probing for syntactic, monotonic, and logically robust reasoning. We propose a novel framework to measure the extent of semantic sensitivity. To this end, we evaluate NLI models on adversarially generated examples containing minor, semantics-preserving surface-form input noise. This is achieved using conditional text generation, with the explicit condition that the NLI model predicts the relationship between the original and adversarial inputs as a symmetric equivalence entailment. We systematically study the effects of the phenomenon across NLI models for $\textbf{in-}$ and $\textbf{out-of-}$ domain settings. Our experiments show that semantic sensitivity causes performance degradations of $12.92\%$ and $23.71\%$ on average over $\textbf{in-}$ and $\textbf{out-of-}$ domain settings, respectively. We further perform ablation studies, analysing this phenomenon across models, datasets, and variations in inference, and show that semantic sensitivity can lead to major inconsistency within model predictions.

Introduction

Transformer-based language models (LMs) have shifted the landscape of Natural Language Understanding (NLU), with performance benchmarks suggesting a high capability for syntactic, logical, and semantic comprehension. This paper presents evidence that such claims may be overstated: state-of-the-art Natural Language Inference (NLI) models demonstrate significant sensitivity to minor, semantics-preserving variations in surface form. This suggests that the models' apparent deep comprehension of compositional semantics may be an illusion masked by their performance on standard benchmarks.

Semantic Sensitivity of NLI Models

The paper introduces a systematic framework for measuring semantic sensitivity: LLMs are used to generate minor variations of hypothesis statements that preserve semantic equivalence. When these generated statements are evaluated against the original premise, the models' predictions change significantly, even though the models had previously identified the correct relation between the premise and the original hypothesis. Strikingly, model performance degrades by an average of 12.92% in in-domain and 23.71% in out-of-domain settings.
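
The core evaluation loop can be sketched as follows. This is a minimal illustration, assuming the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint; the paper's actual paraphrase generator, prompts, and filtering thresholds may differ.

```python
# Minimal sketch of the semantic-sensitivity check. Assumes the Hugging Face
# `transformers` library and the public `roberta-large-mnli` checkpoint;
# the paper's own generator and filtering procedure may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def predict(premise: str, hypothesis: str):
    """Return the NLI label and class probabilities for (premise, hypothesis)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1).squeeze(0)
    return nli.config.id2label[int(probs.argmax())], probs

def symmetric_entailment(original: str, variant: str) -> bool:
    """Keep a variant only if the NLI model itself treats it as equivalent
    to the original hypothesis (entailment in both directions)."""
    fwd, _ = predict(original, variant)
    bwd, _ = predict(variant, original)
    return fwd == "ENTAILMENT" and bwd == "ENTAILMENT"

premise = "A man is playing a guitar on stage."
hypothesis = "A musician is performing."
# In the framework, `variants` come from a conditional text generator (an LLM)
# prompted to produce minor, semantics-preserving rewrites of the hypothesis.
variants = ["A musician is giving a performance.", "A performer is playing music."]

orig_label, _ = predict(premise, hypothesis)
for v in variants:
    if not symmetric_entailment(hypothesis, v):
        continue  # discard rewrites the model does not consider equivalent
    new_label, _ = predict(premise, v)
    if new_label != orig_label:
        print(f"Inconsistent prediction: {orig_label} -> {new_label} for {v!r}")
```

A prediction flip in this loop is exactly the inconsistency the framework counts: the model itself certified the two hypotheses as mutually entailing, yet it assigns them different relations to the same premise.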

Investigating Model Performance Across Datasets and Architectures

The paper investigates a spectrum of transformer architectures, including RoBERTa, BART, DeBERTa, and DistilBart, across multiple NLI datasets. The findings point to a pervasive semantic sensitivity that appears independent of model size and training domain. Interestingly, distilled models exhibit higher sensitivity to semantic variation than their larger counterparts, suggesting that knowledge of compositional semantics is not robustly transferred during distillation.
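
A cross-architecture comparison can be scripted along the following lines. The checkpoint names below are publicly available MNLI-finetuned models chosen for illustration; the exact checkpoints and evaluation datasets used in the paper may differ.

```python
# Illustrative cross-model comparison. The checkpoints are publicly available
# MNLI-finetuned models and are assumptions; the paper's exact checkpoints
# and datasets may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINTS = [
    "roberta-large-mnli",
    "facebook/bart-large-mnli",
    "microsoft/deberta-large-mnli",
    "valhalla/distilbart-mnli-12-1",  # a distilled BART-MNLI variant
]

premise = "A man is playing a guitar on stage."
original = "A musician is performing."
variant = "A musician is giving a performance."  # semantics-preserving rewrite

def label_for(model, tokenizer, premise, hypothesis):
    """Predicted NLI label for a (premise, hypothesis) pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred_id = int(model(**inputs).logits.argmax(dim=-1))
    return model.config.id2label[pred_id]

for name in CHECKPOINTS:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()
    p_orig = label_for(model, tok, premise, original)
    p_var = label_for(model, tok, premise, variant)
    print(f"{name:35s} original={p_orig:15s} variant={p_var:15s} "
          f"flipped={p_orig != p_var}")
```

Because labels are compared within each model, differing label conventions across checkpoints (e.g. upper- vs lower-case) do not affect the flip check.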

Impact on Predictive Consistency and Implications

Further analysis indicates that semantic sensitivity leads not only to performance degradation but also to inconsistencies within predictions. Evaluations show that models exhibit fluctuating confidence and a tendency to make contradictory decisions when faced with semantically equivalent variations. This undermines the models' robustness and calls into question their reliability for tasks requiring an understanding of nuanced semantic structure.
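
One simple way to quantify such inconsistency is to track the label flip rate, the shift in predicted confidence, and the divergence between the probability distributions assigned to the original and perturbed inputs (for example, via Jensen-Shannon divergence). The sketch below is an illustrative proxy, not necessarily the exact measures used in the paper.

```python
# Illustrative consistency metrics: label flip rate, confidence shift, and
# Jensen-Shannon divergence between prediction distributions. A simple proxy,
# not necessarily the exact measures reported in the paper.
import numpy as np
from scipy.spatial.distance import jensenshannon

def consistency_report(orig_probs, variant_probs):
    """orig_probs, variant_probs: arrays of shape (n_examples, n_classes)
    holding class probabilities for original and perturbed inputs."""
    orig_probs = np.asarray(orig_probs)
    variant_probs = np.asarray(variant_probs)
    flips = orig_probs.argmax(axis=1) != variant_probs.argmax(axis=1)
    # scipy returns the JS *distance*; squaring it gives the divergence.
    jsd = np.array([jensenshannon(p, q, base=2) ** 2
                    for p, q in zip(orig_probs, variant_probs)])
    return {
        "flip_rate": float(flips.mean()),
        "mean_js_divergence": float(jsd.mean()),
        "mean_confidence_shift": float(
            np.abs(orig_probs.max(axis=1) - variant_probs.max(axis=1)).mean()
        ),
    }

# Toy example: probabilities over (contradiction, neutral, entailment) for two
# premise-hypothesis pairs, before and after a semantics-preserving rewrite.
orig = [[0.05, 0.15, 0.80], [0.10, 0.60, 0.30]]
var = [[0.10, 0.55, 0.35], [0.12, 0.58, 0.30]]
print(consistency_report(orig, var))
```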

Conclusion

This research positions itself as a critical reflection on the presumed comprehension abilities of transformer-based NLI models. While the models excel on standard benchmarks, their grasp of semantic subtleties is shown to be less robust than previously thought. The paper is a call for more rigorous testing methods that engage the finer points of language comprehension, beyond the blunt instruments of current benchmarks, to truly ascertain the semantic capabilities of LMs.

Authors (3)
  1. Erik Arakelyan
  2. Zhaoqi Liu
  3. Isabelle Augenstein