- The paper demonstrates that LLMs cannot reliably infer causal relationships due to their autoregressive, correlation-based nature.
- It shows that using LLM outputs as priors in causal discovery algorithms risks violating theoretical scoring properties and inducing inconsistencies.
- The authors recommend confining LLMs to assist with heuristic search to improve efficiency while preserving robust, theory-driven causal decisions.
This paper critically re-evaluates the role of LLMs in causal discovery, arguing that LLMs are fundamentally incapable of independently identifying causal relationships and should be restricted to non-decisional, auxiliary roles. The authors contend that the autoregressive, correlation-driven nature of LLMs lacks the theoretical grounding necessary for robust causal reasoning and introduces unreliability when their outputs are used as priors in traditional causal discovery algorithms (CDAs) (2506.00844).
The paper first establishes its critical position: LLMs cannot identify causality, and their outputs should not directly or indirectly determine the existence or directionality of causal links. This means LLMs should neither independently conclude causal relationships nor should their judgments be embedded as prior knowledge (hard constraints or soft penalties) into CDAs. The authors then present a compensatory position: LLMs can assist CDAs by improving the search process for causal graphs, such as guiding heuristic search, thereby accelerating convergence without influencing the final causal decisions, which must remain with theoretically sound CDAs.
Fundamental Limitations of LLMs in Identifying Causality (Section 3)
The paper argues that LLMs' autoregressive modeling, which predicts the next token based on preceding ones, $P(x) = P(x_1) \cdot P(x_2 \mid x_1) \cdots P(x_T \mid x_1, x_2, \ldots, x_{T-1})$, is inherently different from the probability decomposition in structural causal models (SCMs), $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i))$. This difference means LLMs model word correlations rather than true causal dependencies, often considering all preceding words even when some are conditionally independent given an intermediary in a causal chain.
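The conditional-independence point can be made concrete with a small simulation (an illustration under simple binary-variable assumptions, not taken from the paper): in a causal chain X → Y → Z, the SCM factorization drops X once Y is known, whereas an autoregressive factorization of the token sequence still conditions the last element on everything preceding it.

```python
import numpy as np

# Causal chain X -> Y -> Z with binary variables: verify numerically that
# P(Z | Y, X) = P(Z | Y), i.e. Z is conditionally independent of X given Y.
rng = np.random.default_rng(0)
n = 200_000

x = rng.binomial(1, 0.5, n)              # X ~ Bernoulli(0.5)
y = x ^ rng.binomial(1, 0.2, n)          # Y depends only on X (noisy copy)
z = y ^ rng.binomial(1, 0.2, n)          # Z depends only on Y (noisy copy)

p_z_y1_x0 = z[(y == 1) & (x == 0)].mean()
p_z_y1_x1 = z[(y == 1) & (x == 1)].mean()
print(p_z_y1_x0, p_z_y1_x1)              # both ≈ 0.8, up to sampling noise

# The SCM factorization P(x) P(y|x) P(z|y) exploits this and never conditions
# z on x, whereas an autoregressive model of the sequence (x, y, z) always
# parameterizes P(z | x, y) and must learn the independence from data, if at all.
```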
Empirical studies are cited (detailed in Appendix A.1) to show that LLM performance in identifying causality from text is highly sensitive to:
- Word order: Performance degrades if the text deviates from common causal phrasings in the training data.
- Redundant information: Excessive entities or irrelevant details can obscure true causal links, as they disperse conditional mutual information.
- Distance between entities: LLMs struggle to connect causally related entities that are far apart in the text.
When LLMs are tasked with identifying causality from observational data, they face different challenges. Unlike CDAs that perform numerical operations, LLMs process data as tokenized strings. The paper posits that this tokenization can distort numerical features, hindering the LLM's ability to grasp true causal mechanisms, especially with high-precision data. Empirical studies (Appendix A.2) demonstrate that LLM performance in recognizing correlations from simulated data does not improve with increased numerical precision; in fact, it can degrade, approaching random guessing at high precision. Performance on benchmark bnlearn datasets also declines with increasing complexity.
Risks of Integrating LLMs with CDAs (Section 4)
The paper scrutinizes methods that integrate LLM outputs as priors into CDAs. It argues that LLM-generated "knowledge" is unreliable and using it undermines the theoretical guarantees of CDAs.
For score-based methods that modify scoring functions (e.g., BIC, BDeu) by adding an LLM-derived prior term, $\sigma(G; D, \lambda) = \sigma(G; D) + \sigma(G; \lambda)$, the paper identifies several issues (a numerical sketch follows the list below):
- Mathematical incorrectness: The direct addition of a data-based score σ(G;D) and an LLM-based prior score σ(G;λ) is problematic as they originate from different probability spaces and may have incompatible scales, leading to one term unduly dominating the other.
- Conflict with existing priors: Some scoring functions (e.g., BDeu) already incorporate priors, and adding another LLM-derived prior creates conflicts.
- Violation of scoring function properties: Introducing $\sigma(G; \lambda)$ can break crucial properties like decomposability, $\sigma(G; D) = \sum_{i=1}^{n} \sigma(v_i, \mathrm{pa}(v_i); D)$, and score local consistency, rendering many optimization algorithms inapplicable and undermining the theoretical basis of the scoring function.
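To make the scale concern concrete, the following minimal sketch (an illustration under simple binary-data assumptions, not the paper's implementation) computes a decomposable BIC-style score and adds a hypothetical LLM prior term to it: the data term grows with sample size, while the prior sits on an arbitrary, data-independent scale, so which term dominates is an accident of scaling.

```python
import numpy as np

def local_bic(child, parents, data):
    """BIC contribution of one binary node given its binary parents."""
    n = len(data)
    if parents:
        keys = data[:, parents] @ (2 ** np.arange(len(parents)))  # parent config id
    else:
        keys = np.zeros(n, dtype=int)
    loglik = 0.0
    for k in np.unique(keys):
        rows = data[keys == k, child]
        for v in (0, 1):
            c = np.count_nonzero(rows == v)
            if c:
                loglik += c * np.log(c / len(rows))
    n_params = 2 ** len(parents)            # one free parameter per parent config
    return loglik - 0.5 * n_params * np.log(n)

rng = np.random.default_rng(0)
prior_against_edge = 30.0                   # hypothetical LLM prior, arbitrary scale

for n in (50, 5000):
    x = rng.binomial(1, 0.5, n)
    y = x ^ rng.binomial(1, 0.1, n)         # ground truth: X -> Y
    data = np.column_stack([x, y])
    with_edge = local_bic(0, [], data) + local_bic(1, [0], data)   # graph X -> Y
    no_edge = local_bic(0, [], data) + local_bic(1, [], data)      # empty graph
    # Data-only margin vs. margin after adding the prior to the empty graph:
    print(n, with_edge - no_edge, with_edge - (no_edge + prior_against_edge))
```

At small sample sizes the arbitrary prior can overturn the data-supported structure; at large sample sizes it becomes irrelevant. Neither regime reflects a principled combination of evidence, which is the paper's objection to direct addition.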
For constraint-based methods, which rely on conditional independence (CI) tests, using LLMs to perform CI queries or adjusting statistical measures (e.g., the $G^2$-statistic) with an LLM-derived prior term, so that the decision rule becomes $G^2(X, Y \mid Z) - p > \chi^2_{\alpha, f}$, is also criticized. Modifying the $G^2$-statistic by subtracting a prior term $p$ distorts its asymptotic distribution, invalidating hypothesis tests based on the chi-squared distribution unless new statistical properties are derived.
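A minimal sketch of the underlying test (standard $G^2$ conditional-independence testing on binary data; the prior value is a hypothetical stand-in) illustrates why the modified rule loses its guarantee:

```python
import numpy as np
from scipy.stats import chi2

def g2_statistic(x, y, z):
    """G^2 = 2 * sum_z sum_{x,y} O * log(O / E), with dof = sum_z (|X|-1)(|Y|-1)."""
    g2, dof = 0.0, 0
    for zv in np.unique(z):
        m = z == zv
        table = np.array([[np.sum((x[m] == i) & (y[m] == j)) for j in (0, 1)]
                          for i in (0, 1)], dtype=float)
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        nz = table > 0
        g2 += 2 * np.sum(table[nz] * np.log(table[nz] / expected[nz]))
        dof += 1                                   # (2-1)*(2-1) per stratum
    return g2, dof

rng = np.random.default_rng(0)
z = rng.binomial(1, 0.5, 10_000)
x = z ^ rng.binomial(1, 0.3, 10_000)               # X depends only on Z
y = z ^ rng.binomial(1, 0.3, 10_000)               # Y depends only on Z, so X ⟂ Y | Z

g2, dof = g2_statistic(x, y, z)
threshold = chi2.ppf(0.95, dof)                    # valid because G^2 ~ chi^2_dof under H0
print(g2, threshold)

# Subtracting an LLM-derived prior p changes the rule to G^2 - p > threshold,
# but G^2 - p no longer follows chi^2_dof under the null, so the nominal
# significance level alpha = 0.05 is no longer guaranteed.
p_prior = 3.0
print((g2 - p_prior) > threshold)
```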
Defining Functional Boundaries for LLMs in Collaboration with CDAs (Section 5)
The authors propose a principled way to leverage LLMs: they should not participate in determining causal relationships but can assist the search procedure.
- Prohibited Actions: LLM outputs must not serve as the final criterion for edge existence or direction, nor as core weights in scoring functions.
- Permitted Actions: LLMs can aid in initializing search spaces, guiding mutation directions in evolutionary algorithms, or resolving cycles. The final causal structure must be determined by established CDA techniques (scoring functions or CI tests).
A case study demonstrates using LLMs to guide heuristic search in CDAs (a structural sketch follows this list):
- LLM-Based Initial Population Initialization: LLMs analyze variable information and background knowledge to prune unlikely causal relationships from the search space at initialization.
- LLM-Guided Evolutionary Optimization: LLMs guide crossover and mutation operations by suggesting plausible modifications based on domain knowledge.
- Cycle Detection and Resolution: LLMs analyze cyclic structures and suggest edge removals/adjustments to maintain a Directed Acyclic Graph (DAG).
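A minimal structural sketch of this division of labor (the names guided_search, score_fn, suggest_fn, and has_cycle are illustrative placeholders, not the paper's API): the LLM only proposes edge toggles, while the CDA score makes every accept/reject decision and acyclicity is enforced explicitly.

```python
import random

def has_cycle(edges, nodes):
    """DFS check that the directed graph given by `edges` contains a cycle."""
    adj = {v: [b for a, b in edges if a == v] for v in nodes}
    state = {v: 0 for v in nodes}              # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(v):
        state[v] = 1
        for w in adj[v]:
            if state[w] == 1 or (state[w] == 0 and dfs(w)):
                return True
        state[v] = 2
        return False
    return any(state[v] == 0 and dfs(v) for v in nodes)

def guided_search(nodes, data, score_fn, suggest_fn, n_iter=100, seed=0):
    """Greedy DAG search: `suggest_fn` (e.g. an LLM) proposes edge toggles,
    but only `score_fn` (e.g. BIC) ever accepts or rejects a change."""
    rng = random.Random(seed)
    graph = set()                              # start from the empty graph
    best = score_fn(graph, data)
    for _ in range(n_iter):
        for edge in suggest_fn(graph, nodes, rng):
            candidate = graph ^ {edge}         # toggle the suggested directed edge
            if has_cycle(candidate, nodes):
                continue                       # suggestions may not break acyclicity
            s = score_fn(candidate, data)
            if s > best:                       # the CDA score is the only arbiter
                graph, best = candidate, s
    return graph

# Stand-in for LLM guidance: propose a few random directed edges to toggle.
def random_suggestions(graph, nodes, rng, k=3):
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return rng.sample(pairs, k)
```

In a real pipeline, suggest_fn would query the LLM with variable descriptions and background knowledge (covering initialization, mutation guidance, and cycle resolution), and score_fn would be a decomposable score such as the local BIC sketched earlier; the structure keeps the final causal decisions with the CDA, as the paper requires.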
Experiments (Appendix A.3) on bnlearn datasets show that LLM-guided heuristic search can outperform traditional methods and other LLM-based approaches in terms of search efficiency and accuracy (measured by F1 score and SHD), especially on medium to large datasets.
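For reference, the two metrics can be computed as follows (a sketch using the common convention that a reversed edge contributes one error to SHD, and F1 taken over directed edges):

```python
import numpy as np

def shd(true_adj, est_adj):
    """Structural Hamming distance: edge insertions, deletions, and reversals
    (a reversal counted once) needed to turn the estimated DAG into the true one."""
    diff = np.abs(np.asarray(true_adj) - np.asarray(est_adj))
    diff = diff + diff.T          # a reversed edge shows up in both triangles
    diff[diff > 1] = 1            # ...but should be counted only once
    return int(np.triu(diff).sum())

def edge_f1(true_adj, est_adj):
    """F1 over directed edges: precision/recall of the estimated edge set."""
    true_edges = {tuple(e) for e in np.argwhere(np.asarray(true_adj) == 1)}
    est_edges = {tuple(e) for e in np.argwhere(np.asarray(est_adj) == 1)}
    tp = len(true_edges & est_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(est_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

# Toy example: true graph A -> B -> C; estimate reverses B -> C and adds A -> C.
true_adj = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
est_adj  = np.array([[0, 1, 1], [0, 0, 0], [0, 1, 0]])
print(shd(true_adj, est_adj), edge_f1(true_adj, est_adj))   # SHD = 2, F1 = 0.4
```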
Alternative Views and Manipulability of Experimental Results (Section 6)
The paper addresses why many existing studies report favorable results for LLMs in causal discovery. It argues that these positive outcomes are often due to:
- Prompt Engineering: Carefully crafted prompts, sometimes containing ground-truth knowledge (e.g., "Gene X causes Disease Y"), allow LLMs to merely retrieve information rather than perform genuine causal inference.
- Flawed Methodologies Coincidentally Aligning: Soft-constrained scoring functions, though theoretically unsound, can happen to produce correct results when the prior and data terms have similar numerical scales, particularly on small-scale networks.
Experiments (Appendix A.4) using GPT-4 on bnlearn datasets with high-quality (manually curated) and low-quality (Wikipedia-sourced) prompts demonstrate that soft-constrained scoring functions struggle to filter out incorrect priors. Improvements seen with high-quality prompts are attributed to manual refinement, not the efficacy of the soft constraints.
Conclusion and Call to the Community
The paper concludes with a call for caution and rigor:
- Recognize that current LLM-based causal identification methods lack theoretical guarantees.
- Re-evaluate experimental designs to avoid information leakage and prompt engineering artifacts.
- Prioritize theoretical soundness when integrating LLMs with CDAs, confining LLMs to non-decisional roles.
- Invest in LLM architectures and training methods specifically designed for causal reasoning, rather than forcing general-purpose LLMs via prompt engineering.
Overall, the paper argues for a significant shift in how the research community approaches the use of LLMs in causal discovery, emphasizing the preservation of core causal principles and advocating for LLMs to be used as intelligent assistants rather than decision-makers in the causal inference process.