
LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements (2404.06283v1)

Published 9 Apr 2024 in cs.CL

Abstract: The task of reading comprehension (RC), often implemented as context-based question answering (QA), provides a primary means to assess LLMs' natural language understanding (NLU) capabilities. Yet, when applied to LLMs with extensive built-in world knowledge, this method can be deceptive. If the context aligns with the LLMs' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from LLMs' internal information. Conversely, using data that conflicts with the models' knowledge creates erroneous trends which distort the results. To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities. This task is entirely independent of the models' world knowledge, enabling us to evaluate LLMs' linguistic abilities without the interference of parametric knowledge. Testing ChatGPT, GPT-4, LLaMA 2, and Mixtral on such imaginary data, we uncover a class of linguistic phenomena posing a challenge to current LLMs, involving thinking in terms of alternative, hypothetical scenarios. While all the models handle simple affirmative and negative contexts with high accuracy, they are much more prone to error when dealing with modal and conditional contexts. Crucially, these phenomena also trigger the LLMs' vulnerability to knowledge conflicts again. In particular, while some models prove virtually unaffected by knowledge conflicts in affirmative and negative contexts, when faced with more semantically involved modal and conditional environments, they often fail to separate the text from their internal knowledge.

Revisiting Text Understanding and Context Faithfulness in LLMs through Imaginary Instances

Introduction to Imaginary Instances for Reading Comprehension

Reading comprehension (RC) tasks, typically implemented as context-based question answering (QA), remain a primary means of assessing the natural language understanding (NLU) capabilities of LLMs. Traditional evaluations can fall short, however, because the context may either align closely with or conflict with an LLM's extensive built-in knowledge, skewing the results in either case. This paper introduces an approach based on "imaginary instances" in RC tasks to bypass this issue, providing a cleaner measure of an LLM's text understanding, free from the distortions of built-in knowledge.

Evaluation with Imaginary Instances

Creating Neutral Testing Conditions

The proposed method modifies traditional QA tasks by replacing real-world entities and facts with fictitious counterparts, ensuring that the LLMs' responses are not influenced by pre-existing knowledge. The invented entities and facts are crafted to have no overlap with real-world knowledge, so the models must rely solely on the linguistic content of the provided context to answer correctly.
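
To make the setup concrete, the following is a minimal sketch of how such imaginary RC instances could be constructed (the entities, attributes, and templates below are invented for illustration and are not the paper's actual data or code):

```python
import random

# All facts are about invented entities, so no parametric knowledge can help the model.
FACTS = [
    ("the Vorandel river", "length", "742 kilometers"),
    ("the planet Quenor", "diameter", "9,400 kilometers"),
    ("the Tilvane tree", "maximum height", "31 meters"),
]

# Context templates paired with the grounded gold answer. Affirmative contexts state the
# fact outright; negative, modal, and conditional contexts do not determine the value,
# so the grounded answer is "unknown" and the model should abstain.
TEMPLATES = {
    "affirmative": ("{entity} has a {attr} of {value}.", "{value}"),
    "negative":    ("{entity} does not have a {attr} of {value}.", "unknown"),
    "modal":       ("{entity} might have a {attr} of {value}.", "unknown"),
    "conditional": ("If {entity} had a {attr} of {value}, it would attract attention.", "unknown"),
}

def make_item(context_type: str) -> dict:
    """Build one reading-comprehension item of the requested context type."""
    entity, attr, value = random.choice(FACTS)
    template, gold = TEMPLATES[context_type]
    context = template.format(entity=entity, attr=attr, value=value)
    return {
        "context": context[0].upper() + context[1:],
        "question": f"What is the {attr} of {entity}?",
        "gold_answer": gold.format(value=value),
        "context_type": context_type,
    }

if __name__ == "__main__":
    for ctype in TEMPLATES:
        print(make_item(ctype))
```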

Strong Numerical Results and Implications

Results from testing top-performing models (ChatGPT, GPT-4, LLaMA 2, and Mixtral) on these imaginary datasets show a marked gap between their handling of simple affirmative and negative scenarios and their handling of more complex modal and conditional statements. While the models answer straightforward contexts with high accuracy, their performance drops sharply in scenarios that require interpreting hypotheticals (modal verbs and conditionals), highlighting a crucial gap in current NLU capabilities.
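
As a hedged illustration of how such a per-context-type comparison could be run (this is not the paper's evaluation code; `ask_model` stands in for any function that sends a prompt to an LLM and returns its text reply), accuracy can be tallied separately for each context type using items like those sketched above:

```python
from collections import defaultdict

# Instruct the model to stay grounded in the context and to abstain when it must.
PROMPT = (
    "Answer the question using only the context. If the context does not "
    "determine the answer, reply 'unknown'.\n\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)

def accuracy_by_context_type(items, ask_model):
    """Return accuracy per context type (affirmative, negative, modal, conditional).

    `items` are dicts like those produced by make_item above; `ask_model` is any
    callable mapping a prompt string to the model's text reply.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prompt = PROMPT.format(context=item["context"], question=item["question"])
        reply = ask_model(prompt).strip().lower()
        total[item["context_type"]] += 1
        if item["gold_answer"].lower() in reply:
            correct[item["context_type"]] += 1
    return {ctype: correct[ctype] / total[ctype] for ctype in total}
```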

Deep Dive Into Non-Affirmative Text Handling

The investigation extends to non-affirmative text structures, such as negations and hypothetical contexts, which often require the model to abstain from giving a definitive answer when the context does not supply sufficient information. This ability to abstain is crucial in real-world applications, yet, as demonstrated, models frequently default to incorrect or ungrounded answers when faced with such structures. In particular, the paper shows that models struggle most with hypothetical constructs, indicating a significant difficulty in reasoning about alternative, "possible worlds" scenarios.
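
One simple way to operationalize this ability to abstain when scoring replies (again a hedged sketch rather than the paper's procedure; the marker list is illustrative) is to treat any reply that declines to commit as an abstention, refining the plain string match used in the evaluation sketch above:

```python
# Phrases that signal the model is declining to commit to an answer (illustrative list).
ABSTAIN_MARKERS = (
    "unknown", "cannot be determined", "not enough information",
    "does not say", "cannot answer", "unanswerable",
)

def is_abstention(reply: str) -> bool:
    """True if the reply signals that the context does not determine an answer."""
    reply = reply.lower()
    return any(marker in reply for marker in ABSTAIN_MARKERS)

# For a modal context such as "The planet Quenor might have a diameter of 9,400 kilometers."
# and the question "What is the diameter of the planet Quenor?", a reply like
# "The context does not say." abstains correctly, whereas a bare "9,400 kilometers"
# is an ungrounded guess.
```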

Assessing Context-Faithfulness Across Affirmative and Hypothetical Constructs

The effectiveness of LLMs in adhering strictly to the provided text (context-faithfulness) is further scrutinized under different setups: contexts that align with, contradict, or are independent of the models' built-in knowledge. Notably, while some models are robust in affirmative and negative contexts, their reliability wavers in hypothetical scenarios, suggesting a susceptibility to internal knowledge even when it conflicts with the given text. This nuanced exploration underlines that even models demonstrating high context-faithfulness on simpler tasks may falter in more semantically complex environments.
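
To make these setups concrete, the sketch below contrasts the three regimes and measures faithfulness as the share of replies that follow the context's answer rather than parametric knowledge (the specific contexts, the counterfactual fact, and the fictitious entity are invented for illustration and are not taken from the paper's data):

```python
# One example context per regime: consistent with world knowledge, conflicting with it,
# and about a fictitious entity for which no parametric knowledge exists.
REGIMES = {
    "consistent": {
        "context": "The Eiffel Tower is located in Paris.",
        "question": "Where is the Eiffel Tower located?",
        "context_answer": "Paris",
    },
    "conflicting": {
        "context": "The Eiffel Tower is located in Rome.",  # counterfactual on purpose
        "question": "Where is the Eiffel Tower located?",
        "context_answer": "Rome",  # the context-faithful answer follows the text
    },
    "imaginary": {
        "context": "The Velmar Spire is located in Ostrenne.",  # fictitious entity
        "question": "Where is the Velmar Spire located?",
        "context_answer": "Ostrenne",
    },
}

def context_faithfulness(replies_by_regime: dict) -> dict:
    """Share of replies that repeat the context's answer, per regime.

    `replies_by_regime` maps a regime name to the list of model replies collected
    for prompts built from that regime's context and question.
    """
    scores = {}
    for regime, spec in REGIMES.items():
        replies = replies_by_regime.get(regime, [])
        hits = sum(spec["context_answer"].lower() in r.lower() for r in replies)
        scores[regime] = hits / max(len(replies), 1)
    return scores
```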

Speculations on Future Developments

Practical and Theoretical Advancements

The findings suggest an urgent need for future models to better handle modal and conditional contexts, which involve abstract, non-real-world scenarios. Such an advancement could significantly enhance the applicability and reliability of LLMs in tasks requiring deep comprehension and factual adherence, such as automated content generation, academic research, or legal document analysis.

Forward-looking Theoretical Implications

Theoretically, the paper challenges current understandings of LLMs' language comprehension and posits that true NLU might still be an elusive goal, particularly in dealing with non-concrete, speculative content. This opens further avenues in AI research to develop models that better mimic human-like understanding and reasoning in uncertain or abstract realms.

Conclusion

By introducing imaginary instances, this research shifts the paradigm of evaluating LLMs' understanding and faithfulness to text. It presents a foundational step toward more accurately measuring true language comprehension capabilities, which are critical for both practical applications and the theoretical advancement of AI technology. The rigorous assessment of LLMs across different contexts and the revealing insights into their operational limits provide a benchmark for future developments aimed at achieving more sophisticated and reliable natural language processing systems.

Authors (3)
  1. Victoria Basmov (4 papers)
  2. Yoav Goldberg (142 papers)
  3. Reut Tsarfaty (54 papers)