- The paper investigates LLM reasoning via 3-SAT phase transitions and shows performance drops near the critical threshold.
- DeepSeek R1 demonstrates superior, human-like search strategies and reduced reliance on statistical shortcuts compared to other LLMs.
- The study highlights current limitations in LLM reasoning, advocating neurosymbolic integration with classical solvers for reliable performance.
This paper investigates whether LLMs have truly learned to reason or if their apparent reasoning abilities stem from fitting statistical patterns. The authors argue that standard benchmarks can be misleading due to potential data contamination and the conflation of commonsense and logical reasoning. To address this, they propose a principled evaluation framework centered on the 3-Satisfiability (3-SAT) problem, a fundamental NP-complete problem core to logical reasoning and constraint satisfaction tasks (Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition, 4 Apr 2025).
The core idea is to leverage the well-known "phase transition" phenomenon in random 3-SAT. The difficulty of random 3-SAT instances varies sharply with the clause-to-variable ratio α (number of clauses divided by number of variables). Instances far below or far above the critical threshold (αc ≈ 4.267) are typically "easy" to solve (often satisfiable if under-constrained, unsatisfiable if over-constrained), while instances near the threshold are computationally "hard". The authors hypothesize that if LLMs rely on statistical shortcuts, their performance will be high in the easy regions but drop sharply in the hard region, where such shortcuts are scarce and genuine multi-step reasoning is required.
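The phase transition itself is easy to reproduce empirically. The sketch below is illustrative, not the paper's generator: the fixed-clause-length instance generator, sample sizes, and variable counts are assumptions chosen so brute force stays cheap. It draws random 3-SAT instances at several values of α and checks satisfiability exhaustively; the fraction of satisfiable instances falls sharply around αc ≈ 4.267.

```python
# Illustrative sketch (not the paper's code): reproduce the sharp drop in the
# fraction of satisfiable instances around the critical ratio alpha_c ~ 4.267.
# Brute-force satisfiability checking is only feasible for small n, as used here.
import itertools
import random

def random_3sat(n_vars, n_clauses, rng):
    """Each clause: 3 distinct variables, each negated with probability 1/2."""
    return [[v if rng.random() < 0.5 else -v
             for v in rng.sample(range(1, n_vars + 1), 3)]
            for _ in range(n_clauses)]

def is_satisfiable(n_vars, clauses):
    """Exhaustively check all 2^n assignments (viable only for small n)."""
    for bits in itertools.product([False, True], repeat=n_vars):
        assign = dict(enumerate(bits, start=1))
        if all(any(assign[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False

rng, n, trials = random.Random(0), 10, 30
for alpha in (1.0, 2.0, 3.0, 4.27, 5.0, 6.0):
    m = round(alpha * n)
    frac = sum(is_satisfiable(n, random_3sat(n, m, rng)) for _ in range(trials)) / trials
    print(f"alpha = {alpha:4.2f}   fraction satisfiable = {frac:.2f}")
```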
Methodology:
- Problem Formulation: 3-SAT problems were presented to LLMs in two formats:
  - SAT-Menu: A natural language menu-selection puzzle mapping variables to food items and clauses to individual preferences (Box 1).
  - SAT-CNF: A direct representation of the formula in Conjunctive Normal Form (CNF) using lists of integers (Box 4).
- Tasks: LLMs were evaluated on both the decision problem (determining satisfiability: SAT/UNSAT) and the search problem (finding a satisfying assignment if one exists); a small verification sketch for grading the search task follows this list.
- Dataset: Random 3-SAT instances were generated across a range of α values (for n = 3 to 10 variables) spanning the easy and hard regions (Appendix A). Similar datasets were generated for 2-SAT and 1-3 Horn-SAT to test performance on problems in different complexity classes (NL-complete and P-complete, respectively).
- Models Evaluated: State-of-the-art autoregressive LLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash, DeepSeek V3) and a Large Reasoning Model, DeepSeek R1 (chosen for its accessible, autoregressive Chain-of-Thought "thinking" tokens).
- Evaluation Metrics: Accuracy on the search and decision tasks, correlated with the hardness parameter α and the satisfiability ratio (fraction of possible assignments that satisfy the formula). Qualitative analysis of R1's CoT traces was also performed.
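As referenced in the task description above, here is a minimal sketch of how an answer to the search task can be graded, assuming the SAT-CNF convention of clauses as lists of signed integers. The function name and the example instance are illustrative, not the paper's evaluation harness.

```python
# Minimal verification sketch (assumed, not the paper's harness): grade a model's
# answer to the search task by checking the proposed assignment against a
# SAT-CNF style formula given as lists of signed integers.
def satisfies(clauses, assignment):
    """True iff every clause has at least one literal made true by the assignment."""
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in clauses
    )

# (x1 OR not x2 OR x3) AND (not x1 OR x2 OR x3)
clauses = [[1, -2, 3], [-1, 2, 3]]
proposed = {1: True, 2: True, 3: False}   # e.g. parsed from the LLM's output
print(satisfies(clauses, proposed))       # True -> the search answer is correct
```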
Key Findings:
- Phase Transition Behavior: All tested LLMs exhibited an accuracy profile mirroring the phase transition on the 3-SAT search task: high accuracy in the easy regions (α ≪ αc or α ≫ αc) and a significant drop in the hard region (α ≈ αc) (Figure 4 [Left]). This supports the hypothesis that current LLMs struggle when statistical shortcuts are unavailable. Performance also correlated positively with the number of satisfying solutions, except for R1 (Figure 4 [Right], Figure 11).
- DeepSeek R1's Superiority: DeepSeek R1 significantly outperformed the other LLMs across all regions, especially in the hard region, and its performance was less dependent on the satisfiability ratio. R1 also achieved near-perfect accuracy on the simpler 2-SAT and Horn-SAT problems, unlike the other LLMs, which still showed performance dips near the 2-SAT phase transition (Figure 7).
- R1 Internalizes Search: Qualitative analysis of R1's CoT traces (Figures 5, 6) revealed patterns resembling symbolic search algorithms like DPLL/CDCL (a minimal DPLL sketch follows this list). R1 demonstrated tree search navigation, use of heuristics (e.g., Unit Propagation, MOMS, Pure Literal Elimination), backtracking (including backjumping), self-reflection upon conflicts, and self-correction. This suggests R1 may have learned aspects of the underlying reasoning process rather than just mimicking output style.
- R1 Limitations: Despite its advanced behavior, R1's reasoning was imperfect. It suffered from incompleteness (prematurely ending search), limited soundness (making logical errors), contained verbose narration in its CoT, and showed sensitivity to input format (performing better on SAT-CNF than SAT-Menu).
- Computational Effort: R1's output token count scaled polynomially with input token count, particularly increasing in the hard region, suggesting it adapts its computational effort (CoT length) to problem difficulty (Figure 7 [Right], Figure 12). This contrasts with other LLMs whose output length remained relatively constant.
- Neurosymbolic Integration: An experiment translating SAT-Menu problems to CNF for an external MiniSAT solver (SAT-Translate) showed near-perfect accuracy, highlighting the current gap between LLM reasoning and dedicated solvers and the potential of hybrid approaches (Figure 9).
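As referenced in the trace analysis above, the sketch below shows the DPLL skeleton that R1's CoT behaviour is compared against: unit propagation, branching, and backtracking on conflict. It is an illustrative minimal version, not a reconstruction of R1's internal procedure, and it omits CDCL features such as clause learning, backjumping, and the MOMS and pure-literal heuristics.

```python
# Minimal DPLL sketch (illustrative only): backtracking search with unit
# propagation over clauses given as lists of signed integers.
def dpll(clauses, assignment=None):
    """Return a satisfying assignment {var: bool}, or None if unsatisfiable."""
    assignment = dict(assignment or {})

    # Unit propagation: repeatedly assign literals forced by unit clauses.
    while True:
        remaining, forced = [], False
        for clause in clauses:
            if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
                continue                          # clause already satisfied
            open_lits = [lit for lit in clause if abs(lit) not in assignment]
            if not open_lits:
                return None                       # conflict -> backtrack
            if len(open_lits) == 1:
                assignment[abs(open_lits[0])] = open_lits[0] > 0
                forced = True
            remaining.append(clause)
        clauses = remaining
        if not forced:
            break

    if not clauses:
        return assignment                         # every clause satisfied: SAT

    # Branch on an unassigned variable and backtrack on failure.
    var = next(abs(lit) for c in clauses for lit in c if abs(lit) not in assignment)
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None                                   # both branches failed: UNSAT here

# Example: (x1 OR not x2) AND (not x1 OR x2) AND (x2 OR x3)
print(dpll([[1, -2], [-1, 2], [2, 3]]))           # -> {1: True, 2: True}
```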
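The solver side of a SAT-Translate-style pipeline takes only a few lines once the LLM has produced a clause list. This sketch assumes the third-party python-sat (pysat) package; the clause list is illustrative, and the translation step from SAT-Menu prompts to CNF is not shown.

```python
# Hedged sketch of the solver side of a SAT-Translate-style pipeline:
# assumes the python-sat package (pip install python-sat) and that the LLM has
# already translated the natural-language puzzle into integer clauses.
from pysat.solvers import Minisat22

clauses = [[1, -2, 3], [-1, 2, 3], [2, -3, 1]]   # illustrative CNF from the LLM

with Minisat22(bootstrap_with=clauses) as solver:
    if solver.solve():
        model = solver.get_model()               # signed literals, e.g. [1, 2, -3]
        print("SAT:", {abs(lit): lit > 0 for lit in model})
    else:
        print("UNSAT")
```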
Conclusion:
The paper concludes that most contemporary LLMs struggle with hard reasoning tasks that lack statistical shortcuts, as demonstrated by their performance drop near the 3-SAT phase transition. While DeepSeek R1 shows promising signs of having learned more fundamental reasoning strategies resembling symbolic search, it still falls short of classical solvers in terms of completeness and soundness. The 3-SAT phase transition offers a rigorous method for evaluating LLM reasoning capabilities beyond potentially saturated standard benchmarks. For practical, reliable reasoning, integrating LLMs with symbolic solvers remains advisable (Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition, 4 Apr 2025).