PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities (2401.07078v1)

Published 13 Jan 2024 in cs.CL

Abstract: LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller LLMs. However, for larger models, the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable performance gap between human capabilities and model capabilities. Furthermore, unlike the consistent performance of humans across various tasks, the models demonstrate variability in their proficiency, with performance levels fluctuating due to different hints and the complexities of tasks within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLM's ability to handle real-world language tasks that require pragmatic reasoning.

Introduction to Pragmatics in LLMs

The field of NLP has been revolutionized by LLMs capable of performing a wide range of language-based tasks with increasing competency. An important aspect of language understanding is pragmatics: the ability to interpret language based on context, intentions, presuppositions, and implied meanings. Although LLMs excel at understanding semantics, their grasp of pragmatics is far less well studied. The paper addresses this gap by introducing a benchmark called the Pragmatics Understanding Benchmark (PUB).

Evaluating LLMs with PUB

PUB consists of 28,000 data points curated for 14 tasks spanning four pragmatic phenomena: Implicature, Presupposition, Reference, and Deixis. The tasks are framed as multiple-choice question answering (MCQA) that simulates real-world language use. The paper evaluates nine models, including base and chat-adapted versions varying in size and training approach. The study shows that fine-tuning smaller models for instruction-following and chat substantially enhances their pragmatic understanding.
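
To make the evaluation setup concrete, the following is a minimal, hypothetical sketch of scoring a single MCQA item with a causal language model by comparing the average log-likelihood of each answer option given the prompt. The model name, the item, and the prompt format are illustrative assumptions, not the paper's actual harness.

```python
# Hypothetical MCQA scoring sketch for a PUB-style item (illustrative only).
# Assumes the prompt's tokens form a prefix of the full token sequence, which
# holds for typical BPE tokenizers when the option is appended after a space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates far larger models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Average log-probability of the option's tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]                        # next-token targets
    # Keep only the positions that predict the option's tokens.
    opt_scores = log_probs[prompt_len - 1:].gather(
        1, targets[prompt_len - 1:].unsqueeze(1))
    return opt_scores.mean().item()

# Invented implicature-style item, not an actual PUB data point.
item = {
    "context": 'A: "Are you coming to the party tonight?"  B: "I have an early flight."',
    "question": "What does B most plausibly mean?",
    "options": ["B will attend the party.",
                "B is declining the invitation.",
                "B is asking about the flight."],
}
prompt = f'{item["context"]}\nQuestion: {item["question"]}\nAnswer:'
scores = [option_logprob(prompt, opt) for opt in item["options"]]
print("Predicted:", item["options"][scores.index(max(scores))])
```

Likelihood-based option scoring is one common way to run MCQA with base models that lack a chat interface; chat-adapted models can instead be prompted to output an option label directly.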

Interpretation of Pragmatic Phenomena

The benchmark covers tasks such as distinguishing indirect from direct responses, classifying responses, recovering implicatures in dialogue contexts, and tasks involving figurative language, such as sarcasm and agreement detection. The paper makes it evident that instruction-tuned and chat-optimized LLMs exhibit improved pragmatic capabilities over their base counterparts. For the larger models, however, chat adaptation yields little additional benefit: their base versions perform comparably to their chat-adapted equivalents.
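
For illustration only, the two invented items below (not taken from PUB) show how an indirect-response classification task and a presupposition task can be cast as MCQA.

```python
# Invented examples (not actual PUB items) showing how two pragmatic
# phenomena can be framed as multiple-choice questions.
ITEMS = [
    {
        "phenomenon": "implicature",
        "task": "direct vs. indirect response classification",
        "dialogue": 'Q: "Did you finish the report?"  A: "My laptop crashed this morning."',
        "options": ["direct answer", "indirect answer"],
        "label": "indirect answer",
    },
    {
        "phenomenon": "presupposition",
        "task": "identify the presupposition",
        "sentence": "Riya stopped playing the violin.",
        "options": ["Riya used to play the violin.",
                    "Riya never played the violin.",
                    "Riya dislikes music."],
        "label": "Riya used to play the violin.",
    },
]

for item in ITEMS:
    print(f'{item["phenomenon"]:>14} | {item["task"]} -> gold: {item["label"]}')
```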

Insights and Future Directions

Notwithstanding significant progress, LLMs have yet to match human-level pragmatic understanding. Human evaluators perform consistently across tasks, whereas the models show varied proficiency, indicating clear room for improvement. One takeaway is the importance of context-based understanding if LLMs are to provide more nuanced, human-like interactions. PUB substantiates specific gaps in LLMs' ability to fully comprehend pragmatics and is expected to steer further research toward refining their interactive abilities, moving closer to genuine conversational understanding.

Authors (6)
  1. Settaluri Lakshmi Sravanthi
  2. Meet Doshi
  3. Tankala Pavan Kalyan
  4. Rudra Murthy
  5. Pushpak Bhattacharyya
  6. Raj Dabre