An Analysis of Causal Reasoning in LLMs Compared to Humans
The paper "Do LLMs Reason Causally Like Us? Even Better?" investigates how closely the causal reasoning of LLMs matches that of human subjects. Causal reasoning, a pivotal aspect of human cognition, enables the understanding and inference of causal relationships beyond mere statistical correlations. This research aims to delineate the capabilities and limitations of LLMs in replicating human-like causal reasoning, or in meeting normative inference standards derived from probabilistic modeling.
Overview of Research
The authors conducted an empirical comparison between human subjects and four LLMs—GPT-3.5, GPT-4o, Claude-3-Opus, and Gemini-Pro—using tasks based on causal structures known as collider graphs. Collider graphs allow for the examination of several inference types: predictive inference, the independence of causes, and diagnostic inference, which includes the explaining-away phenomenon. Human data were sourced from previous experimental studies to provide a benchmark. The tasks required both humans and LLMs to judge the likelihood of an unobserved variable given a set of observed variables, thereby testing their understanding of causal relationships in hypothetical domains such as meteorology, economics, and sociology.
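A collider graph has two independent causes feeding a single effect (C1 → E ← C2). The sketch below illustrates the structure with hypothetical parameters (the priors, causal strengths, and the noisy-OR likelihood are illustrative assumptions, not the paper's fitted values) and computes a predictive-inference query by brute-force enumeration:

```python
# A minimal sketch of a collider graph C1 -> E <- C2 with hypothetical
# probabilities (not from the paper). Each cause acts through a noisy-OR.
from itertools import product

P_C1, P_C2 = 0.5, 0.5          # prior probability of each cause
W1, W2, LEAK = 0.8, 0.8, 0.1   # causal strengths and background leak

def p_effect(c1, c2):
    """Noisy-OR: probability the effect occurs given the two causes."""
    return 1 - (1 - LEAK) * (1 - W1) ** c1 * (1 - W2) ** c2

def joint(c1, c2, e):
    """Joint probability of one full assignment to (C1, C2, E)."""
    pc1 = P_C1 if c1 else 1 - P_C1
    pc2 = P_C2 if c2 else 1 - P_C2
    pe = p_effect(c1, c2) if e else 1 - p_effect(c1, c2)
    return pc1 * pc2 * pe

def query(target, evidence):
    """P(target=1 | evidence) by enumerating all 8 worlds over {C1, C2, E}."""
    num = den = 0.0
    for c1, c2, e in product([0, 1], repeat=3):
        world = {"C1": c1, "C2": c2, "E": e}
        if all(world[k] == v for k, v in evidence.items()):
            den += joint(c1, c2, e)
            if world[target] == 1:
                num += joint(c1, c2, e)
    return num / den

print(query("E", {"C1": 1}))   # predictive inference, P(E=1 | C1=1) ≈ 0.892
```

The same `query` function answers diagnostic questions (e.g. `query("C1", {"E": 1})`), which is how the benchmark tasks probe both directions of inference.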
Key Findings
The experimental results revealed distinct patterns of causal reasoning across both humans and LLMs, highlighting differences in the interpretation of causal structures.
- Correlation with Human Reasoning: LLMs demonstrated a degree of human-like reasoning, with Claude-3-Opus and GPT-4o showing higher correlations with human data than the other models. This suggests that, to a certain extent, LLMs can perform causal reasoning tasks in a plausibly human-like way.
- Alignment with Normative Inferences: The paper fitted Causal Bayes Nets (CBNs) as a normative benchmark against which to compare each agent's judgments. GPT-4o and Claude-3-Opus aligned more closely with these normative models and exhibited explaining away, where observing one cause decreases the inferred likelihood of another given a shared effect.
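Explaining away can be demonstrated numerically in a collider C1 → E ← C2: once the effect is observed, learning that the second cause is present should lower the posterior on the first. The numbers below are hypothetical (a noisy-OR parameterization chosen for illustration, not the paper's fitted CBN parameters):

```python
# Hypothetical collider C1 -> E <- C2 with a noisy-OR effect, illustrating
# explaining away (parameters are illustrative, not the paper's).
from itertools import product

p_c = 0.5                      # prior for each cause
w, leak = 0.8, 0.1             # causal strength, background leak

def p_e(c1, c2):
    """Noisy-OR likelihood of the effect given the two causes."""
    return 1 - (1 - leak) * (1 - w) ** c1 * (1 - w) ** c2

def posterior_c1(evidence):
    """P(C1=1 | evidence) by enumerating the four cause configurations."""
    num = den = 0.0
    for c1, c2 in product([0, 1], repeat=2):
        if "C2" in evidence and c2 != evidence["C2"]:
            continue
        pe = p_e(c1, c2) if evidence["E"] else 1 - p_e(c1, c2)
        w_joint = (p_c if c1 else 1 - p_c) * (p_c if c2 else 1 - p_c) * pe
        den += w_joint
        if c1:
            num += w_joint
    return num / den

diagnostic = posterior_c1({"E": 1})            # P(C1=1 | E=1) ≈ 0.660
discounted = posterior_c1({"E": 1, "C2": 1})   # P(C1=1 | E=1, C2=1) ≈ 0.540
assert discounted < diagnostic  # the second cause "explains away" the first
```

A purely associative reasoner would treat the two causes as unrelated even after conditioning on the effect, so the absence of this discounting is what marks Gemini-Pro and GPT-3.5 as more associative in the findings below.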
- Associative vs. Causal Understanding: Models like Gemini-Pro and GPT-3.5 often failed to show explaining away, indicating a more associative form of reasoning. Conversely, GPT-4o exhibited a more sophisticated grasp of causal semantics, on specific tasks tracking the normative model more closely than human subjects did.
- Domain Knowledge Influence: The LLMs relied heavily on domain knowledge embedded in their training data, as evidenced by variation in their causal strength estimates across domains. This dependence on domain knowledge raises questions about their generalizability and robustness when applied to unfamiliar scenarios.
Implications and Future Research Directions
The findings underscore the complexity of embedding authentic causal reasoning into LLMs. As these models are increasingly integrated into decision-making processes, understanding their potential biases and limitations becomes critical. The results suggest continued exploration into whether LLMs, especially more advanced versions, can transcend the limitations of associative reasoning apparent in current models.
Future research could expand the scope beyond collider structures to other causal network topologies and more diverse causal inference tasks. Investigating the effect of sampling parameters, such as the temperature setting of an LLM, on inferential outputs is also warranted. Another crucial direction lies in probing more human-like reasoning through intervention and causal learning tasks.
Conclusion
Overall, the research reveals that while LLMs like GPT-4o can approach human-like or even normative causal reasoning under certain conditions, significant disparities persist. The reliance on domain knowledge, the varied inference patterns across tasks, and the associative tendencies of weaker models all underscore the need for continued development of causal reasoning in AI to achieve broader applicability and reliability.