Failure Modes of LLMs for Causal Reasoning on Narratives (2410.23884v5)
Abstract: The ability to robustly identify causal relationships is essential for autonomous decision-making and adaptation to novel scenarios. However, accurately inferring causal structure requires integrating both world knowledge and abstract logical reasoning. In this work, we investigate the interaction between these two capabilities through the representative task of causal reasoning over narratives. Through controlled synthetic, semi-synthetic, and real-world experiments, we find that state-of-the-art LLMs often rely on superficial heuristics -- for example, inferring causality from event order or recalling memorized world knowledge without attending to context. Furthermore, we show that simple reformulations of the task can elicit more robust reasoning behavior. Our evaluation spans a range of causal structures, from linear chains to complex graphs involving colliders and forks. These findings uncover systematic patterns in how LLMs perform causal reasoning and lay the groundwork for developing methods that better align LLM behavior with principled causal inference.