- The paper introduces a modular in-context pipeline that decomposes the PC algorithm into discrete reasoning stages, achieving a nearly 18% F1 improvement over the baseline reasoning model.
- The paper leverages reasoning-specialist LLMs like OpenAI's o3-mini and DeepSeek-R1 to perform iterative self-checks and refine causal inference under data perturbations.
- The paper demonstrates that decomposing causal discovery into structured stages enhances robustness and accuracy, outperforming conventional zero-shot models on the Corr2Cause benchmark.
Modular In-Context Learning for Causal Discovery
The paper "Causal Reasoning in Pieces: Modular In-Context Learning for Causal Discovery" (2507.23488) addresses the challenge of robust causal discovery using LLMs, particularly in scenarios where conventional models exhibit overfitting and poor generalization under data perturbations. The authors introduce a modular in-context pipeline framework that decomposes the PC algorithm into distinct reasoning stages, leveraging state-of-the-art reasoning-specialist LLMs like OpenAI's o3-mini and DeepSeek-R1. The results demonstrate significant improvements over traditional baselines, highlighting the importance of structured in-context learning for maximizing the capabilities of advanced reasoning models in causal inference tasks.
Background and Motivation
LLMs have shown promise in various reasoning tasks, but their ability to perform robust causal discovery remains a challenge. The Corr2Cause benchmark [jin2024largelanguagemodelsinfer] reveals that even fine-tuned models struggle with minimal perturbations, often relying on superficial pattern matching rather than genuine causal inference. In response, the authors evaluate the performance of reasoning-specialist LLMs, which exhibit internally disciplined inference processes. They introduce a modular in-context pipeline inspired by the Tree-of-Thoughts and Chain-of-Thought methodologies [yao2023treethoughtsdeliberateproblem, wei2023chainofthoughtpromptingelicitsreasoning] to improve accuracy by guiding the model through a sequence of subproblems, each with its own prompt and parser.
Methodology: Modular In-Context Pipeline
The paper's core contribution is a modular in-context pipeline framework tailored for PC-algorithm-based causal discovery. The framework decomposes the causal discovery process into four focused prompts: undirected skeleton extraction, v-structure identification, Meek's rules orientation, and hypothesis evaluation. Figure 1 illustrates the stage-wise in-context pipeline, where each stage's output serves as the input for the next stage. This decomposition allows for richer intermediate artifacts, sharpens the model's focus, and transforms a brittle one-shot pass into a robust multi-stage inference procedure. The PC algorithm's correctness relies on three assumptions: that the true structure is a directed acyclic graph (DAG), the Causal Markov Condition, and faithfulness [scheines2005introduction, 10.1214/12-AOS1080].
Figure 1: Stage-wise in-context pipeline for causal discovery, illustrating the flow of data between prompts and parsing modules.
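To make the stage-wise flow concrete, the sketch below wires four prompts together in the order the paper describes. It is a minimal illustration, not the authors' implementation: the prompt wording is invented, `call_llm` is a hypothetical placeholder for any chat-completion client (e.g., an o3-mini or DeepSeek-R1 endpoint), and the per-stage parsers the paper mentions are elided.

```python
# Hypothetical sketch of the four-stage in-context pipeline.
# `call_llm` stands in for any chat-completion client; prompt
# texts are illustrative, not taken from the paper.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client here")

def run_pipeline(ci_statements: str, hypothesis: str) -> str:
    # Stage 1: extract the undirected skeleton from the CI statements.
    skeleton = call_llm(
        "From these conditional independence statements, list the "
        f"undirected edges of the causal skeleton:\n{ci_statements}"
    )
    # Stage 2: identify v-structures (colliders) on that skeleton.
    v_structures = call_llm(
        f"Skeleton:\n{skeleton}\nStatements:\n{ci_statements}\n"
        "Identify all v-structures X -> Z <- Y."
    )
    # Stage 3: propagate orientations with Meek's rules.
    cpdag = call_llm(
        f"Skeleton:\n{skeleton}\nV-structures:\n{v_structures}\n"
        "Apply Meek's rules until no further edge can be oriented."
    )
    # Stage 4: evaluate the hypothesis against the resulting graph.
    return call_llm(
        f"Graph:\n{cpdag}\nHypothesis:\n{hypothesis}\n"
        "Answer 'valid' or 'invalid' with a brief justification."
    )
```

Chaining dedicated prompts this way is what turns each stage's output into a compact, checkable artifact for the next stage, rather than an unexamined step buried inside a single monolithic response.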
Implementation Details
The implementation builds upon the PC algorithm [spirtes2001causation], a constraint-based causal discovery algorithm that uses conditional independence testing to recover a causal graph from observational data. The algorithm consists of two main phases: constructing an undirected skeleton and partially orienting the remaining edges. The modular pipeline issues each stage as a dedicated prompt, in contrast to the single, comprehensive prompt of the baseline. The authors establish this zero-shot baseline by prompting models to judge causal hypotheses directly from natural-language conditional independence statements, and compare it against the modular pipeline, in which each stage's output is consumed by the next stage's prompt.
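For orientation, the skeleton phase can be summarized in a few lines. The sketch below is a simplified rendering of the classical algorithm, not the paper's code: `indep` is a stand-in for a conditional independence oracle (e.g., a partial-correlation test on data), and for brevity conditioning sets are drawn from all other variables rather than, as in the full PC algorithm, only from current adjacencies.

```python
from itertools import combinations

def pc_skeleton(nodes, indep, max_cond=2):
    """Phase 1 of the PC algorithm: start from the complete undirected
    graph and delete edge {x, y} whenever x and y test as independent
    given some conditioning set S. Returns the surviving edges and the
    separating sets found along the way."""
    edges = {frozenset(pair) for pair in combinations(nodes, 2)}
    sepset = {}
    for size in range(max_cond + 1):      # grow |S| gradually
        for edge in list(edges):
            x, y = tuple(edge)
            others = [n for n in nodes if n not in edge]
            for s in combinations(others, size):
                if indep(x, y, set(s)):   # CI oracle decides removal
                    edges.discard(edge)
                    sepset[edge] = set(s)
                    break
    return edges, sepset

# Toy oracle for a chain X -> Y -> Z, where X is independent of Z given {Y}:
edges, seps = pc_skeleton(
    ["X", "Y", "Z"],
    indep=lambda a, b, s: {a, b} == {"X", "Z"} and "Y" in s,
)
# edges == {frozenset({"X", "Y"}), frozenset({"Y", "Z"})}
```

The separating sets recorded here are exactly what the subsequent v-structure stage consumes: a triple X - Z - Y with X and Y nonadjacent is oriented as X -> Z <- Y precisely when Z is absent from the set that separated X and Y.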
Experimental Evaluation
The empirical evaluation consists of three parts: benchmarking the zero-shot reasoning baselines and modular in-context pipeline against published Corr2Cause results, performing an inter-model analysis of reasoning LLMs to identify error-prone pipeline stages, and conducting a combined quantitative and qualitative failure analysis of conventional-model errors and the reasoning model's corrective mechanisms. Figure 2 shows the efficiency trade-off on the Corr2Cause benchmark, plotting F1 scores against average token usage for both baseline and pipeline settings. The results validate that reasoning-specialist LLMs substantially outperform prior models in a pure zero-shot setting and that a carefully structured in-context pipeline can unlock further gains without external supervision or fine-tuning.
Figure 2: Efficiency trade-off on the Corr2Cause benchmark, demonstrating the correlation between increased token investment in the pipeline and F1 improvements.
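The two axes of Figure 2 are straightforward to compute from per-instance logs; the helper below shows the arithmetic under the assumption that each record carries a binary prediction, a binary gold label, and a token count (the record layout is illustrative, not taken from the paper's code).

```python
def f1_and_avg_tokens(records):
    """records: list of (prediction, label, tokens) triples with binary
    prediction/label, as in Corr2Cause's hypothesis-validity task."""
    tp = fp = fn = total_tokens = 0
    for pred, label, tokens in records:
        total_tokens += tokens
        tp += pred and label            # true positives
        fp += pred and not label        # false positives
        fn += (not pred) and label      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, total_tokens / max(len(records), 1)
```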
The authors compared their approach with PC-SubQ [sgouritsa2024prompting], noting differences in modular design, quantitative performance, and focus on reasoning models. Their modular pipeline achieves an F1 score of 83.83% on Corr2Cause with OpenAI o3-mini, a nearly 18% improvement over the baseline reasoning model. In contrast, PC-SubQ reports substantially lower F1 scores.
Qualitative and Quantitative Analysis
A qualitative and quantitative failure analysis contrasts the performance of conventional models (LLaMA3.3-70B) and reasoning models (DeepSeek-R1-70B). Figure 3 (left) compares the average number of conditional independence tests per variable pair for instances that the conventional model classifies correctly versus those it misclassifies. Figure 3 (right) shows the average error of the conventional model on instances where the reasoning model revisits an edge k times in its reasoning trace. The reasoning model performs iterative self-checks, revisiting each edge across micro-steps of its reasoning trace and refining decisions in real time, which underpins its qualitative advantage.
Figure 3: (Left) Boxplot of conditional-independence-test metrics. (Right) Bars showing the fraction of edges misclassified by the conventional model among those the reasoning model revisited k times.
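The paper does not publish its trace-parsing procedure, but a revisit count of the kind binned in Figure 3 (right) could be approximated by counting how often a variable pair co-occurs in the reasoning trace, as in this hypothetical sketch:

```python
import re
from collections import Counter
from itertools import combinations

def count_edge_revisits(trace: str, variables: list[str]) -> Counter:
    """Rough proxy for edge revisits: how often each unordered pair of
    variables is mentioned together within one sentence of the trace."""
    revisits = Counter()
    for sentence in re.split(r"[.\n]", trace):
        present = [v for v in variables
                   if re.search(rf"\b{re.escape(v)}\b", sentence)]
        for a, b in combinations(present, 2):
            revisits[frozenset((a, b))] += 1
    return revisits
```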
Implications and Future Directions
The modular in-context pipeline offers a blueprint for causal discovery, leveraging the strengths of modern reasoning models. The gains are attributed to the decomposition of the PC algorithm into separate stages, the production of rich intermediate artifacts, and the amplification of the model's native tendency to iteratively enrich its reasoning traces. However, this comes at the cost of increased computation, as multiple sequential model requests and longer reasoning traces substantially raise token usage. Future work may focus on assessing robustness and effectiveness on other datasets.
Conclusion
The paper demonstrates that state-of-the-art reasoning models can surpass prior causal-discovery results on the Corr2Cause benchmark in zero-shot settings. The modular in-context pipeline achieves significant F1 improvements without fine-tuning, emphasizing the importance of structured reasoning processes in causal inference. While computational costs are a consideration, the framework provides a valuable approach for enhancing causal discovery through carefully designed prompt engineering and stage decomposition.