
Can large language models understand uncommon meanings of common words? (2405.05741v1)

Published 9 May 2024 in cs.CL and cs.AI

Abstract: LLMs like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, in the absence of widely acknowledged testing mechanisms, the question of whether LLMs are stochastic parrots or genuinely comprehend the world remains open, fueling numerous studies and heated debate. Prevailing research focuses mainly on surface-level NLU and neglects fine-grained exploration, which is nonetheless crucial for understanding LLMs' unique comprehension mechanisms, aligning them with human cognition, and ultimately enhancing their general NLU capacity. To address this gap, this study examines LLMs' nuanced semantic comprehension, particularly of common words with uncommon meanings. The idea stems from foundational principles of human communication in psychology, which emphasize that accurate shared understandings of word semantics are essential. Specifically, the paper presents the construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark to encompass both fine-grained and cross-lingual dimensions. Evaluating open-source and closed-source models of varied scales and architectures, extensive empirical experiments demonstrate the poor performance of existing models on this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Several advanced prompting techniques and retrieval-augmented generation are also introduced to mitigate these shortcomings, yet limitations persist. By highlighting these critical weaknesses, the work motivates further investigation and offers novel insights for developing more intelligent LLMs.
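The benchmark described above amounts to a multiple-choice test: given a sentence where a common word carries an uncommon sense, the model must pick the intended meaning, and accuracy is compared against human performance. The following is a minimal, hypothetical sketch of such an evaluation loop; the item schema, option labels, and the placeholder `predict` function are illustrative assumptions, not the paper's actual LeSC format.

```python
# Hypothetical LeSC-style item: a common word ("table") used in an
# uncommon sense. The schema below is an assumption for illustration.
ITEMS = [
    {
        "sentence": "The committee decided to table the proposal until spring.",
        "word": "table",
        "options": {
            "A": "a piece of furniture",
            "B": "to postpone discussion of",
            "C": "a grid of rows and columns",
            "D": "to serve food",
        },
        "answer": "B",
    },
]

def predict(item):
    """Placeholder for a real LLM call. This stub always picks option 'A',
    mimicking a model biased toward the word's most frequent sense."""
    return "A"

def accuracy(items, model):
    """Fraction of items where the model selects the intended sense."""
    correct = sum(model(item) == item["answer"] for item in items)
    return correct / len(items)
```

A model that defaults to the dominant sense of each word scores zero here, which is exactly the failure mode the paper probes; a real evaluation would replace `predict` with a prompted LLM call and aggregate accuracy over the full dataset.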

References (55)
  1. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 .
  2. Self-RAG: Learning to retrieve, generate, and critique through self-reflection, in: The Twelfth International Conference on Learning Representations. URL: https://openreview.net/forum?id=hSyW5go0v8.
  3. Qwen technical report. arXiv preprint arXiv:2309.16609 .
  4. Baichuan, 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 URL: https://arxiv.org/abs/2309.10305.
  5. On the dangers of stochastic parrots: Can language models be too big?, in: Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 610–623.
  6. Stochastic parrots or intelligent systems? a perspective on true depth of understanding in llms. A Perspective on True Depth of Understanding in LLMs (July 11, 2023) .
  7. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901.
  8. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 .
  9. Overview of text visualization techniques. Introduction to Text Visualization , 11–40.
  10. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 .
  11. Reading Wikipedia to answer open-domain questions, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada. pp. 1870–1879. URL: https://aclanthology.org/P17-1171, doi:10.18653/v1/P17-1171.
  12. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors, in: The Twelfth International Conference on Learning Representations. URL: https://openreview.net/forum?id=EHg5GDnyq1.
  13. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) .
  14. Do LLMs understand social knowledge? evaluating the sociability of large language models with SocKET benchmark, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore. pp. 11370–11403. URL: https://aclanthology.org/2023.emnlp-main.699, doi:10.18653/v1/2023.emnlp-main.699.
  15. Understanding old words with new meanings. Journal of verbal learning and verbal behavior 22, 591–608.
  16. From birth to sixteen: Children’s health, social, emotional and linguistic development. Routledge.
  17. Common words, uncommon meanings: Evidence for widespread gender differences in word meaning., in: Proceedings of the Annual Meeting of the Cognitive Science Society.
  18. Glm: General language model pretraining with autoregressive blank infilling, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335.
  19. Common words with uncommon meanings .
  20. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 .
  21. The theoretical and descriptive development of lexical semantics. The lexicon in focus. Competition and convergence in current lexicology , 23–42.
  22. 16-year-old child development milestones: Your child’s growth and development at age 16. Very well Family. Medically reviewed by a board-certified physician .
  23. Are large language models intelligent? are humans?, in: Computer Sciences & Mathematics Forum, MDPI. p. 68.
  24. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR) .
  25. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232 .
  26. Can large language models truly understand prompts? a case study with negated prompts, in: Transfer Learning for Natural Language Processing Workshop, PMLR. pp. 52–62.
  27. Large language models struggle to learn long-tail knowledge, in: International Conference on Machine Learning, PMLR. pp. 15696–15707.
  28. Large language models are zero-shot reasoners. Advances in neural information processing systems 35, 22199–22213.
  29. The dark side of chatgpt: legal and ethical challenges from stochastic parrots and hallucination. arXiv preprint arXiv:2304.14347 .
  30. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688 .
  31. Insight; a study of human understanding. .
  32. Search augmented instruction learning, in: Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore. pp. 3717–3729. URL: https://aclanthology.org/2023.findings-emnlp.242, doi:10.18653/v1/2023.findings-emnlp.242.
  33. The psychology of communication. Human Resource Management 6, 43.
  34. OpenAI, 2023. Introducing chatgpt.
  35. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744.
  36. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering, in: Conference on Health, Inference, and Learning, PMLR. pp. 248–260.
  37. An empirical study on the language modal in visual question answering, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, International Joint Conferences on Artificial Intelligence Organization. pp. 4109–4117. URL: https://doi.org/10.24963/ijcai.2023/457, doi:10.24963/ijcai.2023/457. main Track.
  38. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online. pp. 5835–5847. URL: https://aclanthology.org/2021.naacl-main.466, doi:10.18653/v1/2021.naacl-main.466.
  39. Explaining large language model-based neural semantic parsers (student abstract), in: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI Press. URL: https://doi.org/10.1609/aaai.v37i13.27014, doi:10.1609/aaai.v37i13.27014.
  40. The two word test: A semantic benchmark for large language models. arXiv preprint arXiv:2306.04610 .
  41. Leveraging large language models for multiple choice question answering, in: The Eleventh International Conference on Learning Representations. URL: https://openreview.net/forum?id=yKbprarjc5B.
  42. Large language models can be easily distracted by irrelevant context, in: International Conference on Machine Learning, PMLR. pp. 31210–31227.
  43. Prompting gpt-3 to be reliable, in: International Conference on Learning Representations (ICLR). URL: https://arxiv.org/abs/2210.09150.
  44. Components of human intelligence. Cognition 15, 1–48.
  45. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 .
  46. GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Brussels, Belgium. pp. 353–355. URL: https://aclanthology.org/W18-5446, doi:10.18653/v1/W18-5446.
  47. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837.
  48. Academically intelligent llms are not necessarily socially intelligent. arXiv preprint arXiv:2403.06591 .
  49. Intuitive or dependent? investigating llms’ robustness to conflicting prompts. arXiv preprint arXiv:2309.17415 .
  50. Making retrieval-augmented language models robust to irrelevant context, in: The Twelfth International Conference on Learning Representations. URL: https://openreview.net/forum?id=ZS4m74kZpH.
  51. Generate rather than retrieve: Large language models are strong context generators, in: The Eleventh International Conference on Learning Representations. URL: https://openreview.net/forum?id=fB0hRu9GZUS.
  52. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474 .
  53. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473 .
  54. FewNLU: Benchmarking state-of-the-art methods for few-shot natural language understanding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 501–516. URL: https://aclanthology.org/2022.acl-long.38, doi:10.18653/v1/2022.acl-long.38.
  55. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528 .
Authors (8)
  1. Jinyang Wu (11 papers)
  2. Feihu Che (13 papers)
  3. Xinxin Zheng (2 papers)
  4. Shuai Zhang (319 papers)
  5. Ruihan Jin (6 papers)
  6. Shuai Nie (17 papers)
  7. Pengpeng Shao (14 papers)
  8. Jianhua Tao (139 papers)
Citations (1)