
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds (2305.14785v2)

Published 24 May 2023 in cs.CL and cs.AI

Abstract: We evaluate LLMs' language understanding capacities on simple inference tasks that most humans find trivial. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. We design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, and with multiple prompts and LLMs. The models exhibit moderate to low performance on these evaluation sets. Subsequent experiments show that embedding the premise in syntactic constructions that should preserve the entailment relations (presupposition triggers) or change them (non-factives), further confuses the models, causing them to either under-predict or over-predict certain entailment labels regardless of the true relation, and often disregarding the nature of the embedding context. Overall these results suggest that, despite LLMs' celebrated language understanding capacity, even the strongest models have blindspots with respect to certain types of entailments, and certain information-packaging structures act as ``blinds'' overshadowing the semantics of the embedded premise.


Summary

  • The paper demonstrates that even advanced LLMs, including GPT-4, exhibit systematic blind spots in basic linguistic inference tasks.
  • The study employs zero-shot and chain-of-thought setups to isolate challenges in grammatically-specified entailments, evidential adverbs, and monotonicity entailments.
  • These findings suggest that current pre-training paradigms need refinement to capture essential linguistic nuances and improve model comprehension.

Analyzing Linguistic Inference Capabilities and Limitations of LLMs

The paper "Simple Linguistic Inferences of LLMs: Blind Spots and Blinds" conducts a thorough exploration of LLMs in terms of their ability to make simple linguistic inferences that are trivial for humans. With a focus on specific inference tasks, the authors dissect both the strengths and notable limitations of LLMs, thereby advancing our understanding of these models' linguistic competence.

Core Linguistic Inference Tasks and Methodology

The evaluation targets three types of linguistic inference: grammatically-specified entailments, premises containing evidential adverbs of uncertainty, and monotonicity entailments. Each represents a fundamental aspect of linguistic understanding that humans typically process without difficulty. The experiments cover several LLMs in zero-shot and chain-of-thought setups, with premises presented both in isolation and embedded in syntactic constructions designed either to preserve the entailment relation (presupposition triggers) or to cancel it (non-factives). A schematic sketch of the two prompting setups follows.
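To make the two setups concrete, here is a minimal sketch of how such an evaluation could be prompted. The prompt wording, label set, and example item are illustrative assumptions for exposition, not the paper's actual evaluation materials.

```python
# Illustrative sketch of zero-shot vs. chain-of-thought NLI prompting.
# Prompt wording, labels, and the example item are assumptions, not the
# paper's exact evaluation materials.

def zero_shot_prompt(premise: str, hypothesis: str) -> str:
    """Ask directly for a three-way entailment judgment."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis? "
        "Answer with one word: entailment, neutral, or contradiction."
    )


def chain_of_thought_prompt(premise: str, hypothesis: str) -> str:
    """Elicit step-by-step reasoning before the final label."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Let's think step by step about whether the premise entails the "
        "hypothesis, then answer with one word: entailment, neutral, or contradiction."
    )


if __name__ == "__main__":
    # A monotonicity-style item: "every" is downward-entailing in its
    # restrictor, so the premise entails the hypothesis.
    premise = "Every student who passed the exam celebrated."
    hypothesis = "Every student who passed the exam with a perfect score celebrated."
    print(zero_shot_prompt(premise, hypothesis))
    print()
    print(chain_of_thought_prompt(premise, hypothesis))
```

In the study itself, prompts of this general kind are sent to multiple models and the predicted labels are compared against gold annotations; only the prompt construction is sketched here.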

Results and Observations

The experimental results reveal moderate to low performance on the selected inference tasks, in stark contrast to human performance. Notably, even state-of-the-art models like GPT-4 fail to consistently reach human-level accuracy across all tasks. The models struggle in particular when premises are embedded in contexts such as presupposition triggers or non-factives: they tend to under-predict or over-predict certain entailment labels regardless of the true relation, often disregarding the nature of the embedding context. This reveals systematic blind spots in their comprehension abilities. The sketch below illustrates the kind of embedding manipulation involved.
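As an illustration of this manipulation, the following sketch wraps a base premise under factive verbs (presupposition triggers, which should leave its entailments intact) and under non-factive verbs (which should suspend them). The specific verbs and templates are assumptions chosen for exposition, not the paper's item templates.

```python
# Illustrative embedding of a base premise under presupposition triggers
# (factive verbs) versus non-factives. Templates are assumptions for
# exposition, not the paper's materials.

PRESUPPOSITION_TRIGGERS = [
    "Mary knows that {p}.",    # factive: the embedded clause is presupposed true
    "Mary regrets that {p}.",  # factive: the embedded clause is presupposed true
]

NON_FACTIVES = [
    "Mary believes that {p}.",  # non-factive: the embedded clause may be false
    "Mary claimed that {p}.",   # non-factive: the embedded clause may be false
]


def embed(premise: str, templates: list[str]) -> list[str]:
    """Wrap a base premise in each embedding template."""
    clause = premise[0].lower() + premise[1:].rstrip(".")
    return [t.format(p=clause) for t in templates]


base = "The cat is asleep on the sofa."
# Entailments of the base premise (e.g. "The cat is asleep.") should survive
# under the presupposition triggers but no longer follow under the non-factives.
print(embed(base, PRESUPPOSITION_TRIGGERS))
print(embed(base, NON_FACTIVES))
```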

While GPT-4 demonstrates some improvement over its predecessors, particularly on certain inference tasks, it still falls short of human-like performance, suggesting that fundamental limitations persist even in the most advanced models available today.

Practical and Theoretical Implications

The research highlights significant gaps in LLMs' ability to process natural language in a human-like manner. These gaps are particularly apparent in tasks involving evidential adverbs and logically straightforward entailments, phenomena that current pre-training data and methodologies do not adequately capture. The findings raise important questions about the models' linguistic competence, suggesting that current pre-training paradigms may not be sufficient to encode all the necessary linguistic nuances.

The persistence of these limitations indicates that future research should focus on developing techniques or architectures capable of overcoming these blind spots. This could involve refining training data, incorporating deeper linguistic theories into model development, or revisiting the models' interpretative frameworks.

Conclusion and Future Directions

The paper underscores the need for continued research to address these systematic deficiencies in LLMs. It highlights the importance of developing richer, more nuanced evaluation benchmarks and methodologies that capture the full extent of models' linguistic understanding. Consequently, this research not only sharpens our grasp of current LLM capacities but also sets the stage for future advances in artificial intelligence and natural language processing.

Overall, the paper provides invaluable insights into the nuanced domain of linguistic inferences and reaffirms the necessity for ongoing inquiry into the shortcomings of LLMs, encouraging a transition from superficial accuracy toward genuine semantic understanding.
