
The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs (2210.14986v2)

Published 26 Oct 2022 in cs.CL

Abstract: Despite widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context -- incorporating its pragmatics. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response "I wore gloves" to the question "Did you leave fingerprints?" as meaning "No". To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate four categories of widely used state-of-the-art models. We find that, despite only evaluating on utterances that require a binary inference (yes or no), models in three of these categories perform close to random. However, LLMs instruction-tuned at the example-level perform significantly better. These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models. We present our findings as the starting point for further research into evaluating how LLMs interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs

In "The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs", Laura Ruis et al. investigate the ability of LLMs to resolve implicatures, an essential facet of pragmatic language understanding. Using a benchmark that requires models to make a binary inference, the authors hypothesize and demonstrate that the fine-tuning strategy significantly affects model performance on this task.

The paper emphasizes the importance of context in language comprehension. Through the lens of pragmatic understanding, it examines how LLMs interpret meanings that extend beyond the literal content of a sentence: in the exchange "Did you leave fingerprints?" / "I wore gloves", for example, the response implies "no" without stating it explicitly.

The experiments evaluate several categories of LLMs, with particular attention to models fine-tuned on diverse instruction sets. Using an evaluation methodology built on multiple zero-shot and few-shot prompt templates, the study examines how variations in fine-tuning affect performance on this pragmatic task.
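
To make this setup concrete, the following is a minimal sketch, assuming a HuggingFace causal language model and invented template wording rather than the authors' released evaluation code, of how a single utterance/response pair can be scored under several prompt templates by comparing the likelihood the model assigns to each candidate answer:

    # Minimal sketch (not the paper's released code): score a binary implicature
    # example under several prompt templates by comparing the log-likelihood the
    # model assigns to "yes" versus "no". Templates and example are illustrative.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    TEMPLATES = [
        "Question: {q}\nResponse: {r}\nDoes the response mean yes or no? Answer: {a}",
        "{q} {r} The implied answer is {a}",
    ]

    def sequence_logprob(text: str) -> float:
        """Sum of the log-probabilities the model assigns to each token of `text`."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = ids[:, 1:]
        return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()

    def resolve_implicature(question: str, response: str, template: str) -> str:
        """Return whichever answer ("yes" or "no") the model finds more likely."""
        scores = {a: sequence_logprob(template.format(q=question, r=response, a=a))
                  for a in ("yes", "no")}
        return max(scores, key=scores.get)

    for template in TEMPLATES:
        print(resolve_implicature("Did you leave fingerprints?", "I wore gloves.", template))

Accuracy on the benchmark would then be the fraction of examples whose chosen answer matches the label, aggregated per template to assess robustness to prompt wording.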

Key Findings and Contributions

  1. Instruction-Level Fine-Tuning:
    • Models fine-tuned with instructions at the example level demonstrated superior performance in implicature resolution relative to other fine-tuning strategies. Specifically, when compared to baseline models that solely depend on large-scale pre-training, these instruction-tuned models better grasped the pragmatic nuances necessary for resolving implicatures.
  2. Model Category Performance:
    • Four distinct categories of LLMs were evaluated: base models, dialogue fine-tuned models, models with benchmark-level instruction-tuning, and models with example-level instruction-tuning. The latter category outperformed others significantly, suggesting that example-level instruction fine-tuning is more effective in cultivating pragmatic understanding.
  3. Scaling Analysis and Performance Implications:
    • While model size correlates with improved performance, the scaling behavior notably favors example-level instruction-tuned models. This suggests that large-scale pre-training is a foundational requirement for implicature comprehension but does not by itself guarantee pragmatic understanding without appropriate fine-tuning.
  4. Human-Level Accuracy with CoT:
    • Chain-of-thought prompting further enhanced the performance of the most capable models, such as GPT-4, allowing them to reach human-level accuracy on the implicature resolution benchmark. This result underscores the potential of methods that have models reason through tasks explicitly (see the prompt sketch after this list).
  5. Robustness Across Templates:
    • The models were evaluated across a variety of prompt templates. The consistency of performance across these templates suggests the results are generalizable and not artifacts of template-induced bias or variability in model predictions.
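
As noted in item 4, the sketch below contrasts a direct prompt with a chain-of-thought prompt for the same kind of example; the prompt wording and the query_model stub are assumptions for illustration, not the paper's exact templates or evaluation harness:

    # Minimal sketch of direct vs. chain-of-thought prompting for an implicature
    # example. `query_model` stands in for whatever completion API or local model
    # is being evaluated; the prompt wording is illustrative, not the paper's.
    import re

    EXAMPLE = {"question": "Did you leave fingerprints?", "response": "I wore gloves."}

    DIRECT_PROMPT = (
        "Question: {question}\n"
        "Response: {response}\n"
        "Does the response mean yes or no? Answer:"
    )

    COT_PROMPT = (
        "Question: {question}\n"
        "Response: {response}\n"
        "Let's think step by step about what the speaker implies, "
        "then answer yes or no.\n"
        "Reasoning:"
    )

    def parse_answer(completion: str) -> str:
        """Take the last "yes" or "no" mentioned in the completion as the verdict."""
        matches = re.findall(r"\b(yes|no)\b", completion.lower())
        return matches[-1] if matches else "unparsed"

    def evaluate(query_model, use_cot: bool) -> str:
        template = COT_PROMPT if use_cot else DIRECT_PROMPT
        completion = query_model(template.format(**EXAMPLE))
        return parse_answer(completion)

    # Stubbed model so the sketch runs end to end:
    fake_model = lambda prompt: "Gloves prevent fingerprints, so the answer is no."
    print(evaluate(fake_model, use_cot=True))  # -> "no"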

Implications and Future Directions

The findings have significant implications for both the development and evaluation of LLMs. First, fine-tuning emerges as a crucial variable that cannot be overlooked if models are to be deployed in contexts requiring human-like contextual understanding and conversational competence. This suggests an avenue for future work in which fine-tuning is treated not merely as a step that augments LLM capabilities but as a process that shapes fundamental competencies such as pragmatic understanding.

Moreover, the interplay between scale (larger parameter counts) and specific fine-tuning techniques points to cost-effective training strategies that improve pragmatic competence without proportionally greater computational overhead. This might inspire new architectures or training paradigms that incorporate the principles of example-level instruction-tuning from the outset to achieve more human-like comprehension.

In conclusion, Ruis et al. provide a compelling examination of the ability of LLMs to resolve implicatures and set a precedent for how training data and fine-tuning should be treated to elicit desired competencies in conversational agents. Through its investigative rigor and comprehensive analysis, the paper contributes to a deeper understanding of how task-specific tuning can bridge semantic and pragmatic capabilities in LLMs, advancing the field toward more contextually aware AI systems.

Authors (6)
  1. Laura Ruis (10 papers)
  2. Akbir Khan (17 papers)
  3. Stella Biderman (55 papers)
  4. Sara Hooker (71 papers)
  5. Tim Rocktäschel (86 papers)
  6. Edward Grefenstette (66 papers)
Citations (30)