Can Large Language Models Infer Causation from Correlation? (2306.05836v3)
Abstract: Causal inference is one of the hallmarks of human intelligence. While the field of CausalNLP has attracted much interest in recent years, existing causal inference datasets in NLP rely primarily on discovering causality from empirical knowledge (e.g., commonsense knowledge). In this work, we propose the first benchmark dataset to test the pure causal inference skills of LLMs. Specifically, we formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We curate a large-scale dataset of more than 200K samples, on which we evaluate seventeen existing LLMs. Through our experiments, we identify a key shortcoming in the causal inference skills of LLMs: these models perform close to random on the task. This shortcoming is somewhat mitigated when we re-purpose LLMs for the skill via finetuning, but the finetuned models still fail to generalize — they can perform causal inference only in in-distribution settings, where the variable names and textual expressions in the queries resemble those in the training set, and fail in out-of-distribution settings generated by perturbing these queries. Corr2Cause is thus a challenging task for LLMs, and can help guide future research on improving the pure reasoning skills and generalizability of LLMs. Our data is at https://huggingface.co/datasets/causalnlp/corr2cause. Our code is at https://github.com/causalNLP/corr2cause.
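As a rough illustration of what such a query might look like — the variable names, phrasing, and helper below are hypothetical, not drawn from the released dataset — a Corr2Cause-style sample pairs a premise of correlational and independence statements with a causal hypothesis to be judged:

```python
# Hypothetical sketch of a Corr2Cause-style query. The premise lists
# correlational/(in)dependence statements over abstract variables; the
# hypothesis asserts a causal relation whose validity must follow (or not)
# from the premise alone, with no appeal to commonsense knowledge.

def build_query(premise_statements, hypothesis):
    """Format correlational premises and a causal hypothesis as one prompt."""
    premise = " ".join(premise_statements)
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Question: Does the hypothesis necessarily hold given the premise? "
        "Answer yes or no."
    )

sample = build_query(
    [
        "A correlates with B.",
        "A correlates with C.",
        "B and C are independent given A.",
    ],
    "A directly causes B.",
)
print(sample)
```

Note that answering such a query correctly requires formal reasoning over the class of causal graphs consistent with the stated (in)dependencies, rather than retrieving memorized causal facts.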