- The paper proposes a two-stage pipeline in which LLMs generate an intermediate knowledge graph, raising the F1 score on the Corr2Cause benchmark from 32.71 to 48.26.
- The methodology uses tool-calling and enforced JSON schemas to construct precise knowledge graphs that guide accurate causal reasoning.
- Empirical results on Corr2Cause show the structured approach outperforms models like GPT-4 and BART MNLI, enhancing both recall and precision.
The paper "Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks" (2505.18034) addresses the significant challenge LLMs face in reliably distinguishing causation from correlation, particularly highlighted by their poor performance and generalization on benchmarks like Corr2Cause [DBLP:conf/iclr/Jin0LPSMDS24]. Existing LLMs often act as "causal parrots," relying on patterns from training data rather than performing structured reasoning.
The authors hypothesize that this limitation stems from the lack of explicit structural reasoning. To counter this, they propose a novel structured approach that guides the LLM to externalize its thinking process. Instead of directly answering a causal query based on correlational premises, the LLM first constructs an intermediate knowledge graph that systematically encodes the provided information. This graph serves as a structured representation upon which the final causal judgment is based.
The methodology involves a two-stage pipeline:
- Knowledge Graph Generation: Given a set of correlational statements (premises), the LLM is prompted to generate a structured graph that represents variables as nodes and relationships (correlations, independencies) as edges. At this stage, edges are typically undirected, reflecting statistical dependencies without yet committing to causal directions.
- Implementation Detail: To ensure the generated graph is well-formed and machine-readable, the authors leverage an OpenAI-style tool-calling approach. They define a Pydantic schema for the knowledge graph (listing nodes and edges with properties like source, target, and label), convert it into an OpenAI tool signature (a JSON schema), and strictly enforce the model's output to conform to this schema using regex-based logits processing. This prevents the model from producing free-form text and guarantees a valid JSON representation of the graph; a minimal sketch of this setup follows the list below.
- Structure-Aware Causal Inference: In the second stage, the generated knowledge graph (provided back to the model, e.g., as JSON or a DOT format string) is used to inform the LLM's answer to the causal query. The model checks the hypothesis against the graph's structure. For instance, to determine if X causes Z, the model would examine paths between X and Z in the graph and consider potential confounders, much like a human analyzing a causal diagram. The explicit graph structure grounds the model's reasoning, making it less susceptible to superficial textual cues.
- Implementation Detail: The prompt for the second stage includes the original premise, the causal hypothesis, and the generated knowledge graph (e.g., serialized as JSON and a DOT string). The LLM is tasked with determining whether the hypothesis is consistent with the structure implied by the graph, and the authors suggest prompting it to explain its reasoning in terms of that structure to ensure fidelity (a sketch of this prompt assembly appears after the next paragraph).
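The following is a minimal sketch of the stage-1 setup, assuming Pydantic v2. The class names (`Edge`, `KnowledgeGraph`) and the tool name `emit_knowledge_graph` are illustrative assumptions; the `source`/`target`/`label` fields follow the paper's description of the schema.

```python
# Sketch of the stage-1 schema and tool signature (assumes Pydantic v2).
# Class names and the tool name are hypothetical; field names follow
# the paper's description of the graph schema.
from typing import List
from pydantic import BaseModel

class Edge(BaseModel):
    source: str  # variable at one end of the dependency
    target: str  # variable at the other end
    label: str   # e.g., "correlated" or "independent"

class KnowledgeGraph(BaseModel):
    nodes: List[str]   # all variables named in the premises
    edges: List[Edge]  # statistical dependencies among them

# Pydantic v2 emits a JSON schema that can be wrapped as an
# OpenAI-style tool definition; constrained decoding (the paper uses
# regex-based logits processing) then forces the model's output to
# validate against this schema.
graph_tool = {
    "type": "function",
    "function": {
        "name": "emit_knowledge_graph",
        "description": "Record the dependency structure implied by the premises.",
        "parameters": KnowledgeGraph.model_json_schema(),
    },
}
```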
The approach was evaluated on the Corr2Cause benchmark using the Qwen3-32B model, an open-source LLM chosen for its native tool-calling capabilities. A zero-shot baseline (direct prompting) was compared against the structured reasoning method. An auxiliary experiment also evaluated different graph edge notation styles, finding that representing each undirected correlation as two opposing directed edges (`a -> b` plus `b -> a`) yielded the best graph interpretation by the model.
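Continuing the sketch above (reusing the `Edge` and `KnowledgeGraph` models), the second stage might serialize the graph with this best-performing notation and assemble the prompt roughly as follows; the prompt wording is illustrative, not the authors' exact template.

```python
# Continues the stage-1 sketch: reuses the Edge / KnowledgeGraph models.
# Undirected correlations are rendered as two opposing directed edges,
# the notation the paper's auxiliary experiment found models interpret
# best. The prompt text is illustrative, not the authors' template.

def graph_to_dot(graph: KnowledgeGraph) -> str:
    lines = ["digraph G {"]
    for edge in graph.edges:
        if edge.label == "correlated":
            lines.append(f'  "{edge.source}" -> "{edge.target}";')
            lines.append(f'  "{edge.target}" -> "{edge.source}";')
        # Independencies contribute no edge; they constrain by absence.
    lines.append("}")
    return "\n".join(lines)

def build_stage2_prompt(premise: str, hypothesis: str, graph: KnowledgeGraph) -> str:
    return (
        f"Premise:\n{premise}\n\n"
        f"Knowledge graph (JSON):\n{graph.model_dump_json(indent=2)}\n\n"
        f"Knowledge graph (DOT):\n{graph_to_dot(graph)}\n\n"
        f"Hypothesis: {hypothesis}\n\n"
        "Judge whether the hypothesis is consistent with the graph's "
        "structure, explaining your reasoning in terms of paths and "
        "potential confounders before giving a final valid/invalid answer."
    )
```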
The results show substantial improvements with the structured reasoning framework:
- Qwen3-32B with structured reasoning achieved an F1 score of 48.26 on the Corr2Cause test set, significantly higher than its unstructured baseline performance (F1 32.71).
- This improvement is driven by large gains in recall (from 33.89% to 65.56%) and notable gains in precision (from 31.61% to 38.19%); a quick consistency check of these figures follows the list.
- The structured approach also significantly outperformed off-the-shelf models like GPT-4 (F1 29.08) and BART MNLI (F1 33.38) on this benchmark.
- The structured method demonstrated enhanced robustness to out-of-distribution queries compared to baselines.
- A reference model, Qwen2.5-32B, showed only modest gains with the structured approach, supporting the hypothesis that the model's ability to handle structured outputs (like Qwen3's tool-calling) is key to the method's effectiveness.
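As a sanity check, the reported precision and recall reproduce the reported F1 under the standard harmonic-mean definition:

F1 = 2·P·R / (P + R) = (2 × 0.3819 × 0.6556) / (0.3819 + 0.6556) ≈ 0.4826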
The paper concludes that explicitly guiding LLMs to structure their thinking via intermediate representations, such as knowledge graphs, is a promising direction for improving their generalization in complex reasoning tasks like causal inference. The method adds transparency and grounds reasoning in structure rather than in linguistic patterns alone. Potential limitations include the computational cost of graph generation, errors in graph construction cascading through the pipeline, and sensitivity to schema complexity and to the model's tool-calling fidelity. Future work could explore automated graph validation, application to other reasoning tasks, and integration with external symbolic systems.