Origin of CoT success in LLM mathematical reasoning

Determine whether the observed strong performance of large language models on mathematical problems under Chain-of-Thought prompting primarily arises from search-based strategies, rote procedural execution, or rule-consistent logical inference.

Background

The paper argues that rigorously characterizing how Chain-of-Thought (CoT) works in mathematical problem solving requires distinguishing genuine logical inference from search or rote procedures. Existing evaluations typically emphasize final-answer accuracy (e.g., PASS@k), which leaves open whether a correct solution reflects coherent reasoning or merely successful exploratory sampling.
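To see why final-answer metrics can mask exploratory sampling, consider the commonly used unbiased pass@k estimator: given n sampled solutions of which c are correct, it estimates the probability that at least one of k randomly drawn samples is correct. The sketch below assumes that standard definition; the function name `pass_at_k` is illustrative, not from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few incorrect samples to fill a draw of size k.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct answer out of ten samples:
print(pass_at_k(10, 1, 1))  # about 0.1
print(pass_at_k(10, 1, 5))  # 0.5
```

A model that is right on only 10% of individual attempts still scores 0.5 at k = 5, and the metric says nothing about whether any single trajectory reasoned coherently.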

To address this uncertainty, the authors introduce a graph-based framework (DAG-MATH) that models CoT as trajectories over directed acyclic graphs and propose metrics such as logical closeness and perfect reasoning rate (PRR) to diagnose reasoning fidelity beyond final-answer correctness.
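This summary does not give the paper's formal definitions of logical closeness or PRR, so the following is only a toy sketch under simplified assumptions, not the DAG-MATH method. It assumes each CoT step records the indices of the earlier steps it relies on. A trajectory then counts as "closed" if every edge points backward (so the graph is acyclic) and every non-final step is used by some later step (so no step is a dead end). The PRR-style proxy is the share of problems that are both closed and answered correctly. `Step`, `is_logically_closed`, and `perfect_reasoning_rate` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One CoT step; `premises` holds indices of earlier steps it uses (hypothetical encoding)."""
    text: str
    premises: list[int] = field(default_factory=list)


def is_logically_closed(steps: list[Step]) -> bool:
    """Toy closure check (an assumption, not the paper's definition)."""
    if not steps:
        return False
    used = set()
    for i, step in enumerate(steps):
        for p in step.premises:
            if not 0 <= p < i:
                return False  # self or forward reference: not a valid backward DAG edge
            used.add(p)
    # If every non-final step feeds a later step, following those uses
    # from any step must eventually reach the final step.
    return all(i in used for i in range(len(steps) - 1))


def perfect_reasoning_rate(results: list[tuple[list[Step], bool]]) -> float:
    """Hypothetical PRR-style proxy: fraction of (trajectory, answer_correct)
    pairs whose trajectory is closed and whose final answer is correct."""
    if not results:
        return 0.0
    perfect = sum(1 for steps, correct in results if correct and is_logically_closed(steps))
    return perfect / len(results)


closed = [Step("x + 2 = 5"), Step("x = 3", [0])]
dangling = [Step("x + 2 = 5"), Step("try x = 4"), Step("x = 3", [0])]
print(perfect_reasoning_rate([(closed, True), (dangling, True)]))  # 0.5
```

Both toy trajectories reach the correct answer, so final-answer accuracy is 1.0, but the unused exploratory step in `dangling` halves the proxy score. A structural metric can expose this kind of gap, which answer-only evaluation hides.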

References

LLMs demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning.

— DAG-Math: Graph-Guided Mathematical Reasoning in LLMs (2510.19842 - Zhang et al., 19 Oct 2025) in Abstract (page 1)
