Can Large Language Models Infer Causation from Correlation? (2306.05836v3)

Published 9 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in the recent years, existing causal inference datasets in NLP primarily rely on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of LLMs. Specifically, we formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming of LLMs in terms of their causal inference skills, and show that these models achieve almost close to random performance on the task. This shortcoming is somewhat mitigated when we try to re-purpose LLMs for this skill via finetuning, but we find that these models still fail to generalize -- they can only perform causal inference in in-distribution settings when variable names and textual expressions used in the queries are similar to those in the training set, but fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is a challenging task for LLMs, and would be helpful in guiding future research on improving LLMs' pure reasoning skills and generalizability. Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at https://github.com/causalNLP/corr2cause.

Can LLMs Infer Causation from Correlation?

The exploration of causal inference capabilities in LLMs remains an area of significant interest, particularly when disentangling pure causal reasoning from empirical knowledge. The paper "Can Large Language Models Infer Causation from Correlation?" addresses this by proposing a benchmark dataset that assesses the ability of LLMs to deduce causal relationships from correlational data without relying on prior empirical knowledge.

Research Motivation and Task Definition

The fundamental challenge posed by the paper is evaluating LLMs' ability to perform causal reasoning, a hallmark of human cognitive capability. This research steps away from empirical causality, investigating whether models can discern causation purely through formal reasoning principles. The pivotal task probes LLMs by providing correlational statements and assessing whether these models can correctly identify the causal links those statements entail.

Dataset Description

The dataset is meticulously crafted, containing over 200,000 samples, each comprising a correlational statement and a causal hypothesis about variable relationships. The formulation requires the model to determine the validity of the hypothesized causal claim. The dataset is unique in that it demands pure causal inference, as distinct from knowledge-dependent inference of causal relations.
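For concreteness, a minimal loading sketch using the Hugging Face datasets library is shown below; the dataset path comes from the abstract, but the split and column names are assumptions to verify against the loaded object.

```python
# A minimal sketch: load the released Corr2Cause data from the Hugging
# Face Hub (dataset path taken from the abstract). The split and field
# names below are assumptions; print the dataset object to confirm the
# actual schema before relying on it.
from datasets import load_dataset

ds = load_dataset("causalnlp/corr2cause")
print(ds)              # lists the available splits and their columns
print(ds["train"][0])  # one sample: a correlational premise, a causal
                       # hypothesis, and a binary validity label
```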

Methodology

The authors employ the Peter-Clark (PC) algorithm as a foundation for dataset generation. They generate causal graphs, derive d-separation sets to identify Markov equivalence classes, and convert these into correlational statements. Subsequently, hypotheses are tested to deduce their validity across all possible graphs in a given equivalence class.
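To make the pipeline concrete, here is a self-contained sketch of the core labeling logic, not the authors' released code: enumerate all DAGs over a small variable set, group them into Markov equivalence classes using the classic characterization (identical skeleton and v-structures), and mark a hypothesis valid only if it holds in every member of the class.

```python
# Sketch of the Corr2Cause labeling idea over three variables. This is an
# illustration of the described procedure, not the authors' implementation.
from itertools import combinations, product

NODES = ["A", "B", "C"]
PAIRS = list(combinations(NODES, 2))

def is_acyclic(edges):
    # Kahn-style check: repeatedly peel off nodes with no incoming edge.
    remaining, es = set(NODES), set(edges)
    while remaining:
        sources = [n for n in remaining if all(v != n for _, v in es)]
        if not sources:
            return False  # every remaining node has a parent: a cycle
        for n in sources:
            remaining.discard(n)
        es = {(u, v) for u, v in es if u in remaining}
    return True

def all_dags():
    # Each unordered pair of nodes is absent or oriented one of two ways.
    for choice in product((None, 0, 1), repeat=len(PAIRS)):
        edges = set()
        for (u, v), c in zip(PAIRS, choice):
            if c == 0:
                edges.add((u, v))
            elif c == 1:
                edges.add((v, u))
        if is_acyclic(edges):
            yield frozenset(edges)

def skeleton(edges):
    return frozenset(frozenset(e) for e in edges)

def v_structures(edges):
    # Unshielded colliders u -> m <- w with u and w non-adjacent.
    vs = set()
    for m in NODES:
        parents = sorted(u for u, v in edges if v == m)
        for u, w in combinations(parents, 2):
            if (u, w) not in edges and (w, u) not in edges:
                vs.add((u, m, w))
    return frozenset(vs)

def descendants(edges, x):
    # All nodes reachable from x along directed edges.
    seen, frontier = set(), {x}
    while frontier:
        frontier = {v for u, v in edges if u in frontier} - seen
        seen |= frontier
    return seen

# Two DAGs are Markov equivalent iff they share skeleton and v-structures,
# so grouping by that pair recovers the equivalence classes.
mecs = {}
for dag in all_dags():
    mecs.setdefault((skeleton(dag), v_structures(dag)), []).append(dag)

# Label the hypothesis "A causes B" (A is an ancestor of B): it is valid
# for a class only when it holds in all member DAGs.
for members in mecs.values():
    label = all("B" in descendants(d, "A") for d in members)
```

Each equivalence class's shared conditional independencies are then verbalized into the natural-language correlation statements that form the premise; that verbalization step is omitted here.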

The dataset construction ensures coverage over different types of causal relationships, such as parental influence, ancestral/descendant connections, and common cause/effect scenarios, providing a nuanced challenge for LLMs.
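Under the same toy representation as the sketch above, these relationship types can be written as predicates over a candidate DAG; the function names are illustrative rather than the paper's.

```python
# Illustrative predicates for the relationship types above, reusing the
# NODES list and descendants() helper from the previous sketch.
def is_parent(edges, x, y):
    return (x, y) in edges  # direct causal influence

def is_ancestor(edges, x, y):
    return y in descendants(edges, x)  # some directed path from x to y

def has_common_cause(edges, x, y):
    # A third variable is an ancestor of both x and y (a confounder).
    return any(x in descendants(edges, z) and y in descendants(edges, z)
               for z in NODES if z not in (x, y))

def has_common_effect(edges, x, y):
    # A third variable descends from both x and y (a collider).
    return any(z in descendants(edges, x) and z in descendants(edges, y)
               for z in NODES if z not in (x, y))
```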

Experimental Setup and Results

The authors evaluate 17 state-of-the-art LLMs on this benchmark. Performance is consistently low across models, hovering around random-chance levels, which suggests a significant gap in LLMs' ability to perform pure causal reasoning. This exposes a fundamental limitation of current LLMs: they excel at knowledge retrieval but falter on reasoning tasks that offer no empirical knowledge to retrieve.

Finetuned models show improved yet unreliable results, indicating a propensity to overfit the training distribution rather than genuinely learn the underlying causal principles. Robustness tests with paraphrased and variable-refactored inputs reveal substantial drops in performance, underscoring the fragility of these models once they are detached from familiar training contexts.
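As a rough illustration of the variable-refactorization perturbation (the paper's exact scheme may differ), an out-of-distribution variant of a query can be produced by renaming its variable symbols so that surface patterns memorized during finetuning no longer match.

```python
# Rough illustration of variable refactorization for robustness testing;
# the premise text is paraphrased and the renaming scheme is an assumption.
import re

def refactor_variables(text, mapping):
    # Replace whole-word variable tokens only (e.g. "A" -> "X").
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)

premise = ("Suppose there is a closed system of 3 variables, A, B and C. "
           "A correlates with B. B correlates with C.")
print(refactor_variables(premise, {"A": "X", "B": "Y", "C": "Z"}))
# -> Suppose there is a closed system of 3 variables, X, Y and Z. ...
```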

Implications and Future Directions

This research emphasizes the need for advancing LLM capabilities beyond empirical data reproduction toward genuine reasoning skill. The findings highlight a critical research avenue: enhancing LLM architectures or their training methodologies to better encapsulate logical and causal reasoning.

The ability of LLMs to infer causation from correlation has applications in numerous fields, including scientific research, where distinguishing causation from mere correlation is crucial. The implications for AI development are profound, suggesting paths toward more sophisticated models that could impact areas ranging from automated scientific hypothesis generation to advanced decision support systems.

Furthermore, the limitations observed suggest revisiting model training strategies to incorporate structured representations and reasoning frameworks, potentially drawing from disciplines such as causal discovery and logic programming.

In conclusion, the paper not only sheds light on the current capability gaps in LLMs but also sets a foundation for future explorations into enhancing AI reasoning skills—a challenge that remains paramount for advancing AI towards more human-like intelligence.

Authors (8)
  1. Zhijing Jin
  2. Jiarui Liu
  3. Zhiheng Lyu
  4. Spencer Poff
  5. Mrinmaya Sachan
  6. Rada Mihalcea
  7. Mona Diab
  8. Bernhard Schölkopf