PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities (2401.07078v1)

Published 13 Jan 2024 in cs.CL

Abstract: LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of Multiple Choice Question Answers (MCQA). PUB includes a total of 28k data points, 6.1k of which have been created by us, and the rest are adapted from existing datasets. We evaluated nine models varying in the number of parameters and type of training. Our study indicates that fine-tuning for instruction-following and chat significantly enhances the pragmatics capabilities of smaller LLMs. However, for larger models, the base versions perform comparably with their chat-adapted counterparts. Additionally, there is a noticeable performance gap between human capabilities and model capabilities. Furthermore, unlike the consistent performance of humans across various tasks, the models demonstrate variability in their proficiency, with performance levels fluctuating due to different hints and the complexities of tasks within the same dataset. Overall, the benchmark aims to provide a comprehensive evaluation of LLM's ability to handle real-world language tasks that require pragmatic reasoning.

Introduction to Pragmatics in LLMs

The field of NLP has been revolutionized by LLMs capable of performing a wide range of language-based tasks with increasing competency. An important aspect of language understanding is pragmatics: the ability to interpret language based on context, intentions, presuppositions, and implied meanings. Although LLMs excel at understanding semantics, their grasp of pragmatics is far less well studied. The paper addresses this gap by introducing a benchmark called the Pragmatics Understanding Benchmark (PUB).

Evaluating LLMs with PUB

PUB consists of 28,000 data points curated for 14 tasks spanning four pragmatic phenomena: Implicature, Presupposition, Reference, and Deixis. The tasks are framed as multiple-choice question answering (MCQA) that simulates real-world language use. The paper evaluates nine models, including base and chat-adapted versions varying in size and training approach. The study shows that fine-tuning smaller models for instruction-following and chat substantially enhances their pragmatic understanding.
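
To make the evaluation setup concrete, the following is a minimal, hypothetical sketch of scoring a single MCQA item with a causal language model by comparing the average log-likelihood of each answer option given the prompt. The model name, the item, and the prompt format are illustrative assumptions, not the paper's actual harness.

```python
# Hypothetical MCQA scoring sketch for a PUB-style item (illustrative only).
# Assumes the prompt's tokens form a prefix of the full token sequence, which
# holds for typical BPE tokenizers when the option is appended after a space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates far larger models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def option_logprob(prompt: str, option: str) -> float:
    """Average log-probability of the option's tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]                        # next-token targets
    # Keep only the positions that predict the option's tokens.
    opt_scores = log_probs[prompt_len - 1:].gather(
        1, targets[prompt_len - 1:].unsqueeze(1))
    return opt_scores.mean().item()

# Invented implicature-style item, not an actual PUB data point.
item = {
    "context": 'A: "Are you coming to the party tonight?"  B: "I have an early flight."',
    "question": "What does B most plausibly mean?",
    "options": ["B will attend the party.",
                "B is declining the invitation.",
                "B is asking about the flight."],
}
prompt = f'{item["context"]}\nQuestion: {item["question"]}\nAnswer:'
scores = [option_logprob(prompt, opt) for opt in item["options"]]
print("Predicted:", item["options"][scores.index(max(scores))])
```

Likelihood-based option scoring is one common way to run MCQA with base models that lack a chat interface; chat-adapted models can instead be prompted to output an option label directly.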

Interpretation of Pragmatic Phenomena

The benchmark covers tasks such as distinguishing indirect from direct responses, classifying responses, recovering implicatures in dialogue contexts, and tasks involving figurative language, such as sarcasm and agreement detection. The paper makes it evident that instruction-tuned and chat-optimized LLMs exhibit improved pragmatic capabilities over their base counterparts. For the larger models, however, chat adaptation yields little additional benefit: their base versions perform comparably to their chat-adapted equivalents.
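
For illustration only, the two invented items below (not taken from PUB) show how an indirect-response classification task and a presupposition task can be cast as MCQA.

```python
# Invented examples (not actual PUB items) showing how two pragmatic
# phenomena can be framed as multiple-choice questions.
ITEMS = [
    {
        "phenomenon": "implicature",
        "task": "direct vs. indirect response classification",
        "dialogue": 'Q: "Did you finish the report?"  A: "My laptop crashed this morning."',
        "options": ["direct answer", "indirect answer"],
        "label": "indirect answer",
    },
    {
        "phenomenon": "presupposition",
        "task": "identify the presupposition",
        "sentence": "Riya stopped playing the violin.",
        "options": ["Riya used to play the violin.",
                    "Riya never played the violin.",
                    "Riya dislikes music."],
        "label": "Riya used to play the violin.",
    },
]

for item in ITEMS:
    print(f'{item["phenomenon"]:>14} | {item["task"]} -> gold: {item["label"]}')
```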

Insights and Future Directions

Notwithstanding significant progress, LLMs have yet to match human-level pragmatic understanding. Human evaluators perform consistently across tasks, whereas the models show varied proficiency, indicating clear room for improvement. One takeaway is the importance of context-based understanding if LLMs are to provide more nuanced, human-like interactions. PUB substantiates specific gaps in LLMs' ability to fully comprehend pragmatics and is expected to steer further research toward refining their interactive abilities, moving closer to genuine conversational understanding.

Authors (6)
  1. Settaluri Lakshmi Sravanthi
  2. Meet Doshi
  3. Tankala Pavan Kalyan
  4. Rudra Murthy
  5. Pushpak Bhattacharyya
  6. Raj Dabre