
Can Large Language Models Understand Context? (2402.00858v1)

Published 1 Feb 2024 in cs.CL

Abstract: Understanding context is key to understanding human language, an ability which LLMs have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various domains within the realm of Natural Language Processing, limited attention has been paid to probing their linguistic capability of understanding contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. This benchmark comprises four distinct tasks and nine datasets, all featuring prompts designed to assess the models' ability to understand context. First, we evaluate the performance of LLMs under the in-context learning pretraining scenario. Experimental results indicate that pre-trained dense models struggle to understand more nuanced contextual features when compared to state-of-the-art fine-tuned models. Second, as LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context learning settings. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.

Introduction

LLMs have been increasingly employed for a variety of NLP applications, displaying impressive linguistic comprehension and world knowledge. While their performance on various benchmarks is noteworthy, these evaluations may not sufficiently address the models' ability to understand contextual nuances in language. This paper introduces a benchmark specifically crafted to probe LLMs' contextual understanding, comprising four tasks and nine datasets adapted for generative models.
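
To make this adaptation concrete: a common recipe is to serialize each labeled example into an instruction-style prompt and have the model generate the answer, which turns a discriminative dataset into a generative one. The sketch below illustrates this pattern for a Winograd-style coreference item; the template wording, the example, and the function name are assumptions for illustration, not the paper's actual prompts.

```python
# Hypothetical sketch: casting a Winograd-style coreference example as a
# few-shot in-context-learning prompt. The template wording is assumed,
# not taken from the paper.

FEW_SHOT_EXAMPLES = [
    {
        "text": "The trophy didn't fit in the suitcase because it was too big.",
        "mention": "it",
        "answer": "the trophy",
    },
]

def build_prompt(text: str, mention: str, shots=FEW_SHOT_EXAMPLES) -> str:
    """Serialize k labeled examples plus the query into one ICL prompt."""
    lines = ["Resolve the pronoun to the entity it refers to.", ""]
    for ex in shots:
        lines += [f"Text: {ex['text']}",
                  f"Pronoun: {ex['mention']}",
                  f"Referent: {ex['answer']}", ""]
    lines += [f"Text: {text}", f"Pronoun: {mention}", "Referent:"]
    return "\n".join(lines)

print(build_prompt(
    "The councilmen refused the demonstrators a permit because they feared violence.",
    "they",
))
```

The model's completion is then compared against the gold referent, so accuracy can be scored without any task-specific classification head.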

Model Evaluation and Compression

The paper first assesses LLM performance under in-context learning (ICL), comparing pre-trained dense models against fine-tuned state-of-the-art models. The findings indicate that dense models fall short in grasping complex contextual features. As LLMs grow larger, their resource demands grow with them, prompting research into model compression techniques such as post-training quantization (PTQ). The paper therefore also examines how 3-bit post-training quantization affects LLM performance on the proposed benchmark.
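
For intuition about what post-training quantization does to a model's weights, the sketch below implements the simplest 3-bit round-to-nearest baseline with per-row absmax scaling. This is a minimal illustrative assumption rather than the paper's setup, which relies on stronger PTQ methods (its references include GPTQ, which additionally corrects rounding error using second-order statistics); all names here are hypothetical.

```python
# Minimal sketch of 3-bit round-to-nearest post-training quantization with
# per-row absmax scaling. Real PTQ methods such as GPTQ are considerably
# more careful about where the rounding error goes.

import numpy as np

def quantize_rtn(w: np.ndarray, bits: int = 3):
    """Map each row of w to signed integers in [-2**(bits-1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1                        # 3 for 3-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float matrix from integers plus per-row scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_rtn(w)
print(f"mean absolute rounding error: {np.abs(w - dequantize(q, scale)).mean():.4f}")
```

With only eight representable levels per weight at 3 bits, rounding error of this kind is the plausible source of the benchmark degradation the paper measures.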

Extensive Analysis

On tasks rich in linguistic structure, such as coreference resolution and discourse parsing, LLMs show uneven performance. Larger models fare better on the more straightforward tasks, yet struggle with document-level coreference and nuanced discourse relations, often falling short of the capabilities displayed by fine-tuned models. Together with the quantization results, this points to a sensitivity of contextual understanding to both model scale and compression, and marks an area ripe for further optimization.

Implications and Insights

This paper presents an in-depth look at the current limitations of LLMs' contextual understanding, revealing a performance gap between pre-trained models employing ICL and their fine-tuned counterparts. The performance reduction observed under quantization highlights a trade-off between model efficiency and linguistic capability. Through the lens of the newly introduced benchmark, the paper points to concrete room for improving the contextual acuity of LLMs and underscores the importance of developing models that balance performance with practicality for real-world deployment.

Authors (9)
  1. Yilun Zhu
  2. Joel Ruben Antony Moniz
  3. Shruti Bhargava
  4. Jiarui Lu
  5. Dhivya Piraviperumal
  6. Site Li
  7. Yuan Zhang
  8. Hong Yu
  9. Bo-Hsiang Tseng