Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot (2506.14641v1)

Published 17 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: In-Context Learning (ICL) is an essential emergent ability of LLMs, and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1. Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

Summary

Exploring the Efficacy of Chain-of-Thought Prompting in Recent LLMs

The paper "Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot" examines the role and effectiveness of Chain-of-Thought (CoT) prompting in modern LLMs. This research primarily investigates whether CoT exemplars have any tangible impact on enhancing the reasoning capabilities of contemporary LLMs, particularly in mathematical tasks. The paper systematically evaluates zero-shot versus few-shot prompting methods to determine their effect on the reasoning performance of various strong LLMs, such as the Qwen2.5 series.

Key Findings

  1. Zero-Shot CoT Effectiveness: The research reveals that in mathematical reasoning, zero-shot CoT prompting consistently achieves robust performance, often surpassing few-shot CoT. This challenges the assumption that few-shot techniques, including exemplar-based CoT, are essential for eliciting reasoning in recent LLMs (a minimal prompt sketch contrasting the two settings follows this list).
  2. Exemplar Functionality: The experiments indicate that the primary function of CoT exemplars is to align the output format with human expectations; they act as formatting guides for model outputs rather than as a source of improved reasoning.
  3. Enhanced CoT Exemplars: Attempts to improve CoT exemplars by using answers from advanced models like Qwen2.5-Max and DeepSeek-R1 did not yield significant improvements in reasoning performance. Models tended to prioritize the instructions over the content of sophisticated exemplars, suggesting an inherent limitation in the current CoT strategies.
  4. Model's Attention Behavior: Attention analysis indicates that models focus primarily on the instructions and the test question while largely ignoring the exemplar content, which explains the lack of reasoning gains from CoT exemplars (a second sketch below approximates this kind of measurement).
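
To make the two compared settings concrete, the sketch below shows what such prompts look like. The instruction template and the GSM8K-style exemplar are illustrative assumptions, not the paper's exact prompts; in the enhanced variant studied in the paper, the exemplar solution would instead be generated by a stronger model such as Qwen2.5-Max or DeepSeek-R1.

```python
# Hypothetical prompt templates contrasting Zero-Shot CoT with few-shot CoT exemplars.
# These are illustrative; the paper's exact instructions and exemplars may differ.

INSTRUCTION = (
    "Please solve the following math problem. Let's think step by step, "
    "then give the final answer on a new line as 'Answer: <number>'."
)

# One hand-written exemplar in GSM8K style. An "enhanced" exemplar would replace the
# solution text with a response from a stronger model (e.g. Qwen2.5-Max or DeepSeek-R1).
EXEMPLARS = [
    {
        "question": "Natalia sold clips to 48 friends in April, and then half as many in May. "
                    "How many clips did she sell altogether?",
        "solution": "In May she sold 48 / 2 = 24 clips, so in total 48 + 24 = 72. Answer: 72",
    },
]

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-Shot CoT: instruction + question only, no worked examples.
    return f"{INSTRUCTION}\n\nProblem: {question}\nSolution:"

def few_shot_cot_prompt(question: str) -> str:
    # Few-shot CoT: the same instruction, preceded by worked exemplars.
    shots = "\n\n".join(
        f"Problem: {ex['question']}\nSolution: {ex['solution']}" for ex in EXEMPLARS
    )
    return f"{INSTRUCTION}\n\n{shots}\n\nProblem: {question}\nSolution:"

if __name__ == "__main__":
    q = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
    print(zero_shot_cot_prompt(q))
    print("---")
    print(few_shot_cot_prompt(q))
```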

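The attention analysis can be approximated in a few lines. The sketch below is an assumed illustration rather than the authors' analysis code: it runs a small open model (Qwen/Qwen2.5-0.5B-Instruct, a stand-in for the larger Qwen2.5 models studied in the paper) over a few-shot prompt and reports how much attention mass the final prompt token places on the exemplar span versus the instruction and the question.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in for the larger Qwen2.5 models
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

instruction = "Solve the problem. Think step by step, then give the final answer.\n\n"
exemplars = "Problem: 2 + 3 = ?\nSolution: 2 plus 3 is 5. Answer: 5\n\n"
question = "Problem: 17 * 24 = ?\nSolution:"

# Tokenize each segment separately so the span boundaries in the full prompt are exact
# by construction (tokenizing the concatenation could merge tokens across boundaries).
segments = {"instruction": instruction, "exemplars": exemplars, "question": question}
seg_ids = {name: tok(text, add_special_tokens=False)["input_ids"] for name, text in segments.items()}
input_ids = torch.tensor([[i for ids in seg_ids.values() for i in ids]])

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer.
# Average over layers and heads, then take the attention distribution of the
# last prompt token (the position from which generation would start).
att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # shape: (seq_len,)

start = 0
for name, ids in seg_ids.items():
    end = start + len(ids)
    print(f"{name:12s} attention mass: {att[start:end].sum().item():.3f}")
    start = end
```
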
Implications and Future Directions

  • Rethinking ICL Paradigm: The findings call for a reevaluation of the In-Context Learning (ICL) paradigm, particularly in mathematical contexts, where traditional CoT strategies may not be sufficient or necessary for already powerful models. This motivates exploring alternative exemplar designs that capture complex reasoning well enough that models do not simply ignore them in favor of the instructions.
  • Development of New Prompting Strategies: Given the demonstrated limitations of CoT exemplars in enhancing reasoning abilities, future research could pivot toward developing new strategies that leverage intrinsic model capabilities more effectively. This could include optimizing instruction prompts or devising new frameworks that integrate reasoning pathways intrinsically rather than through predesigned exemplars.
  • Evaluation Frameworks: The paper highlights potential biases in existing evaluation frameworks, notably scores that penalize mismatched output formats rather than incorrect reasoning. Improved evaluation metrics and methodologies are needed to assess LLM reasoning without being skewed by format alignment issues (a lenient answer-extraction sketch follows this list).
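
One way to reduce this bias is to score with a lenient answer extractor. The sketch below is hypothetical (the accepted output formats are assumptions, not the paper's evaluation protocol): it pulls the final numeric answer out of several common response styles before comparison against the reference.

```python
import re

def extract_final_answer(output: str) -> str | None:
    """Pull a final numeric answer out of a model response, tolerating common formats."""
    patterns = [
        r"Answer:\s*(-?\d+(?:\.\d+)?)",      # explicit "Answer: 72"
        r"\\boxed\{(-?\d+(?:\.\d+)?)\}",     # LaTeX-style \boxed{72}
        r"(-?\d+(?:\.\d+)?)(?!.*\d)",        # otherwise, the last number in the text
    ]
    for pattern in patterns:
        match = re.search(pattern, output, flags=re.DOTALL)
        if match:
            return match.group(1)
    return None

# All three of these responses should be scored as the same answer, 72.
assert extract_final_answer("48 + 24 = 72, so the total is 72. Answer: 72") == "72"
assert extract_final_answer("The result is \\boxed{72}.") == "72"
assert extract_final_answer("Adding them gives 72") == "72"
```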

In sum, while CoT prompting was a valuable tool in the early stages of LLM development, the paper argues that its utility has diminished as model capabilities have advanced. The research therefore invites the community to rethink the design and optimization of LLM reasoning strategies so that they remain effective and relevant as models continue to evolve.
