To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning (2409.12183v2)

Published 18 Sep 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from LLMs. But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.

Chain-of-Thought Effectiveness: A Focused Analysis

The paper "To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning" by Sprague et al. provides a rigorous examination of where Chain-of-Thought (CoT) prompting is most beneficial for eliciting reasoning from LLMs. The study combines a quantitative meta-analysis of over 100 papers with the authors' own evaluations on 20 datasets across 14 LLMs. The findings show that CoT delivers significant performance advantages primarily on mathematical and logical reasoning tasks, with much smaller gains in other areas.

Key Findings and Implications

Quantitative Meta-Analysis

The meta-analysis shows that CoT chiefly benefits tasks involving symbolic reasoning, including mathematical problems and formal logic. The paper reports average performance improvements of 14.2% for symbolic reasoning, 12.3% for mathematical reasoning, and 6.9% for logical reasoning. In contrast, tasks requiring commonsense reasoning, encyclopedic knowledge, or other linguistic skills show no substantial gains: performance in these categories with CoT remains nearly identical to that achieved with direct answering (DA).

Experimental Evaluations

Empirical experiments conducted by the authors corroborate the meta-analysis findings. Evaluations on mathematical datasets such as MATH and GSM8K and on symbolic-reasoning tasks such as ContextHub and MuSR show that CoT significantly outperforms DA, with gains as large as 41.6% on MATH and 66.9% on GSM8K. Conversely, on non-symbolic tasks, including commonsense QA datasets, CoT offers no substantial advantage.
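
To make the comparison concrete, the sketch below contrasts the two prompting conditions in their simplest form. The prompt wording, the query_model callable, and the answer-extraction logic are illustrative placeholders, not the authors' actual templates or evaluation harness.

```python
# A minimal sketch of the two prompting conditions compared in the paper:
# direct answering (DA) vs. zero-shot chain-of-thought (CoT). Prompt wording
# and query_model are illustrative placeholders, not the paper's setup.

def build_prompt(question: str, use_cot: bool) -> str:
    if use_cot:
        # CoT: ask the model to reason step by step before the final answer.
        return (
            f"Question: {question}\n"
            "Let's think step by step, then give the final answer on a new "
            "line in the form 'Answer: <answer>'."
        )
    # DA: ask for the final answer directly, with no intermediate reasoning.
    return f"Question: {question}\nRespond with only the final answer in the form 'Answer: <answer>'."

def evaluate(questions, gold_answers, use_cot, query_model):
    """Accuracy under one prompting condition; query_model is any LLM call."""
    correct = 0
    for question, gold in zip(questions, gold_answers):
        response = query_model(build_prompt(question, use_cot))
        predicted = response.split("Answer:")[-1].strip()
        correct += int(predicted == gold)
    return correct / len(questions)
```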

On MMLU specifically, CoT was found to improve performance primarily on questions involving an equals sign, which is indicative of symbolic operations: roughly 95% of CoT's benefit on MMLU is attributable to math-related questions containing an "=", while non-math questions show negligible improvement.
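
This finding motivates the paper's point that CoT can be applied selectively: invoke the expensive step-by-step prompt only when a question (or a cheap draft answer) looks symbolic. The sketch below is one way such a policy could look; the "=" criterion comes from the paper's MMLU analysis, while the prompts and the query_model callable are assumptions for illustration.

```python
# Illustrative selective-CoT policy motivated by the MMLU finding: an "=" in
# the question or in a cheap direct-answer draft signals symbolic content,
# which is where CoT delivered nearly all of its MMLU gains. This is a sketch,
# not code from the paper; query_model is a placeholder for any LLM call.

def needs_cot(question: str, draft_answer: str = "") -> bool:
    # Treat the presence of "=" as a proxy for symbolic operations.
    return "=" in question or "=" in draft_answer

def answer_selectively(question: str, query_model) -> str:
    # First try the cheap direct-answer prompt.
    draft = query_model(f"Question: {question}\nRespond with only the final answer.")
    if needs_cot(question, draft):
        # Escalate to the more expensive step-by-step prompt only when warranted.
        return query_model(f"Question: {question}\nLet's think step by step.")
    return draft
```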

Analysis of CoT Mechanisms

A finer-grained analysis shows that CoT's utility is predominantly tied to its ability to carry out symbolic execution more reliably than DA prompting. Symbolic reasoning problems can be decomposed into distinct planning and execution stages, and CoT improves model performance primarily during execution, particularly when handling intermediate symbolic computations. The paper also compares against LLMs augmented with external symbolic solvers and finds that these tool-assisted approaches frequently outperform CoT alone.
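
As a rough illustration of this planning/execution split, the sketch below has the model produce a symbolic plan (an expression) and then executes it either with the model itself (CoT-style) or with an external symbolic engine. SymPy is used only as a stand-in solver and query_model is a placeholder; the paper's actual tool-augmented setups may differ.

```python
# Rough sketch of separating planning from execution. The LLM produces a plan
# (a symbolic expression); execution is done either by the model itself
# (CoT-style) or by an external solver. SymPy is a stand-in symbolic engine
# and query_model is a placeholder, so this is not the paper's exact setup.

import sympy

def solve_with_tool(question: str, query_model) -> str:
    # Planning: ask the model for an expression, not for the final answer.
    plan = query_model(
        f"Question: {question}\n"
        "Write a single arithmetic or symbolic expression that evaluates to the answer."
    )
    # Execution: hand the expression to the symbolic solver instead of the LLM.
    return str(sympy.sympify(plan))

def solve_with_cot(question: str, query_model) -> str:
    # CoT: the model both plans and executes the intermediate steps itself.
    return query_model(f"Question: {question}\nLet's think step by step.")
```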

Future Directions and Considerations

The results suggest that blanket reliance on CoT across NLP tasks is often unwarranted, as its benefits are largely confined to domains requiring symbolic reasoning and intermediate computational steps. This motivates exploring alternative prompting strategies or hybrid approaches that pair LLMs' planning and decision-making with external reasoning tools such as symbolic solvers.

The research underscores the need to advance beyond prompt-based CoT to more sophisticated methods such as search-based approaches, interacting agents, and models fine-tuned specifically for generating rationales. These methodologies may unlock better performance on complex reasoning tasks outside the traditional domains of mathematics and formal logic.

Conclusion

Sprague et al.'s paper serves as a pivotal reference for understanding the strengths and limitations of CoT prompting in LLMs. Through a meticulous meta-analysis and a suite of controlled experiments, it delineates the specific scenarios where CoT is significantly advantageous. The findings encourage a targeted, nuanced use of CoT and point future research toward paradigms that harness intermediate computation effectively across diverse reasoning tasks in NLP.

The insights from this paper are crucial for researchers and practitioners aiming to deploy LLMs for reasoning-intensive applications, emphasizing the selective use of CoT and the potential of integrating LLMs with external computational tools to meet specific task demands efficiently.

Authors (10)
  1. Zayne Sprague (10 papers)
  2. Fangcong Yin (8 papers)
  3. Juan Diego Rodriguez (12 papers)
  4. Dongwei Jiang (16 papers)
  5. Manya Wadhwa (8 papers)
  6. Prasann Singhal (7 papers)
  7. Xinyu Zhao (54 papers)
  8. Xi Ye (33 papers)
  9. Kyle Mahowald (40 papers)
  10. Greg Durrett (117 papers)
Citations (27)