CausalBench: A Comprehensive Benchmark for Causal Learning Capability of LLMs (2404.06349v2)
Abstract: The ability to understand causality significantly impacts the competence of LLMs in output explanation and counterfactual reasoning, as causality reveals the underlying data distribution. However, the lack of a comprehensive benchmark currently limits the evaluation of LLMs' causal learning capabilities. To fill this gap, this paper develops CausalBench, a benchmark built on data from the causal research community that enables comparative evaluation of LLMs against traditional causal learning algorithms. To provide a comprehensive investigation, CausalBench offers three tasks of increasing difficulty: correlation, causal skeleton, and causality identification. Evaluations of 19 leading LLMs reveal that, while closed-source LLMs show potential on simple causal relationships, they lag significantly behind traditional algorithms on larger-scale networks ($>50$ nodes). Specifically, LLMs struggle with collider structures but excel at chain structures, especially long-chain causality analogous to Chain-of-Thought techniques; this supports current prompting approaches while suggesting directions for enhancing LLMs' causal reasoning capability. Furthermore, CausalBench incorporates background knowledge and training data into prompts to fully exploit LLMs' text-comprehension ability during evaluation. The findings indicate that LLMs understand causality through semantic associations between distinct entities, rather than directly from contextual information or numerical distributions. A sketch of how such a pairwise evaluation might be scored follows below.
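To make the evaluation setup concrete, here is a minimal Python sketch of how a CausalBench-style "causality identification" task could be scored: each ordered variable pair is posed to a model as a yes/no question, and the predicted directed edges are compared against a ground-truth network by edge-level F1. The prompt wording, the `query_llm` stub, the toy three-variable graph, and the F1 metric are all illustrative assumptions for this sketch, not the paper's exact protocol (the benchmark itself uses bnlearn networks that are far larger).

```python
# Hedged sketch of pairwise causality-identification scoring.
# Assumptions (not from the paper): the prompt template, the trivial
# query_llm stub, the toy graph, and the edge-level F1 metric.
from itertools import permutations

# Toy ground-truth DAG (edges point cause -> effect), standing in for a
# real bnlearn network; CausalBench evaluates much larger graphs.
GROUND_TRUTH = {("smoking", "lung_cancer"), ("lung_cancer", "dyspnoea")}
VARIABLES = ["smoking", "lung_cancer", "dyspnoea"]

def build_prompt(cause: str, effect: str) -> str:
    """Pairwise yes/no causality question (illustrative wording only)."""
    return (
        f"Variables: {', '.join(VARIABLES)}.\n"
        f"Does '{cause}' directly cause '{effect}'? Answer yes or no."
    )

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call; here a trivial heuristic
    stub so the sketch runs end to end without network access."""
    cause = prompt.split("'")[1]  # first quoted variable in the prompt
    return "yes" if cause == "smoking" else "no"

def evaluate() -> float:
    """Edge-level F1 of predicted directed edges vs. the ground truth."""
    predicted = {
        (a, b) for a, b in permutations(VARIABLES, 2)
        if query_llm(build_prompt(a, b)).strip().lower().startswith("yes")
    }
    tp = len(predicted & GROUND_TRUTH)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(GROUND_TRUTH)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

if __name__ == "__main__":
    print(f"Edge-level F1: {evaluate():.2f}")
```

In the same spirit, the correlation and causal-skeleton tasks could be scored by comparing undirected edge sets, and the background-knowledge variants by prepending variable descriptions or observational data to the prompt; those extensions are left out of this sketch.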
Authors: Yu Zhou, Xingyu Wu, Beicheng Huang, Jibin Wu, Liang Feng, Kay Chen Tan