Zero-shot Chain-of-Thought Prompting
- Zero-shot Chain-of-Thought prompting is a test-time protocol that directs large language models to generate explicit multi-step reasoning without in-context examples.
- It leverages a simple instruction trigger to induce stepwise rationales, matching or surpassing few-shot methods on benchmarks like GSM8K and MATH.
- Recent innovations include structured formats (tabular, hierarchical) and adaptive strategies that enhance reasoning interpretability and output accuracy.
Zero-shot Chain-of-Thought (CoT) prompting is a test-time inference protocol for LLMs in which the model is guided to generate explicit multi-step reasoning traces in response to a single natural-language instruction appended to a question, without any in-context examples or parameter updates. This mechanism elicits reasoning path generation, improves accuracy on a range of mathematical, logical, and commonsense tasks, and reveals emergent capabilities in high-capacity LLMs. Recent work demonstrates that, as LLMs grow stronger, zero-shot CoT prompting can match or outperform few-shot CoT protocols and that novel variants—including hierarchical, adaptive, cross-lingual, tabular, and multimodal instantiations—further expand its reach and reliability.
1. Formal Definitions and Protocols
Zero-shot CoT prompting exploits the LLM's ability to condition on a chain-of-thought trigger (e.g. “Please reason step by step”) appended to the question, inducing the model to output stepwise rationales prior to the answer (Cheng et al., 17 Jun 2025). Let $q$ be the question and $t$ a CoT instruction. The input is the concatenation $x = [q; t]$, with output $y$ sampled from $p_\theta(\cdot \mid q, t)$.
Contrast with Few-shot CoT.
- Few-shot CoT presents $k$ exemplar pairs $(q_i, y_i)$ as demonstrations, where each $y_i$ contains a multi-step rationale followed by the answer, then asks the test question.
- Zero-shot CoT provides no such exemplars: only the instruction (e.g. “Please reason step by step, and put your final answer within \boxed{}”) directly follows the test question.
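In symbols, the two protocols condition the model differently (using the notation above; the exemplar pairs $(q_i, y_i)$ are those placed in the few-shot prompt):

$$
y \sim p_\theta\bigl(\cdot \mid q,\, t\bigr) \quad \text{(zero-shot CoT)}, \qquad
y \sim p_\theta\bigl(\cdot \mid (q_1, y_1), \ldots, (q_k, y_k),\, q\bigr) \quad \text{(few-shot CoT)}.
$$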
Prompt templates vary:
- Standard: “Let’s think step by step.”
- Structured: “Step 1: …”, “Step 2: …” (COT STEP) (Chowdhury et al., 21 Jan 2025)
- Plan–Solve: “Let’s first understand the problem and devise a plan…” (Wang et al., 2023)
- Tabular: “|step|subquestion|process|result|” (Jin et al., 2023)
- Hierarchical or domain-specific: three-stage CoT for trajectory analysis (Xie et al., 14 Oct 2025)
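As a concrete illustration, the sketch below assembles zero-shot CoT prompts $x = [q; t]$ from triggers like those listed above (a minimal sketch: the trigger strings are abbreviated or paraphrased, and `generate` is a placeholder for any LLM completion function, not a specific library API):

```python
# Minimal zero-shot CoT prompt construction: x = [q; t], no in-context exemplars.

TRIGGERS = {
    "standard":   "Let's think step by step.",
    "structured": "Answer with numbered steps: Step 1:, Step 2:, ...",  # illustrative wording
    "plan_solve": ("Let's first understand the problem and devise a plan to solve it. "
                   "Then, let's carry out the plan and solve the problem step by step."),
    "tabular":    "|step|subquestion|process|result|",
}

def build_zero_shot_cot_prompt(question: str, trigger: str = "standard") -> str:
    """Append a CoT trigger instruction t to the question q."""
    return f"Q: {question}\nA: {TRIGGERS[trigger]}\n"

def answer_with_cot(question: str, generate, trigger: str = "standard") -> str:
    """Sample a rationale-plus-answer continuation y ~ p(. | q, t)."""
    return generate(build_zero_shot_cot_prompt(question, trigger))

if __name__ == "__main__":
    stub = lambda prompt: "(model completion would appear here)"
    print(answer_with_cot("A farmer keeps 3 pens with 12 hens each. How many hens in total?", stub))
```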
2. Empirical Performance and Comparative Benchmarks
Recent multi-model, multi-task evaluations establish that zero-shot CoT often matches or surpasses few-shot CoT in strong models.
- Mathematics (GSM8K, MATH, etc.): Qwen2.5-72B achieves 81.2% on GSM8K in zero-shot CoT vs. 79.0% for 8-shot CoT; on MATH, 55.3% (0-shot) against 53.8% (8-shot) (Cheng et al., 17 Jun 2025).
- General QA: GPT-4 using the “Zhou” CoT trigger (“Answer: Let’s work this out in a step by step way to be sure we have the right answer.”) yields Krippendorff’s $\alpha$ up to 0.83 vs. 0.71 for direct prompting (Hebenstreit et al., 2023).
- Tab-CoT: Table-driven zero-shot CoT format delivers large gains, e.g. +13.1 points average accuracy over baseline CoT across arithmetic, symbolic, and commonsense reasoning benchmarks (Jin et al., 2023).
- Plan-and-Solve (PS+): Detailed variable extraction and intermediate calculation instructions in zero-shot plans close the gap to few-shot, yielding 76.7% on multi-step math vs. 77.6% for 8-shot CoT (Wang et al., 2023).
- Adaptive protocols: Instance-adaptive prompting (IAP) and evolutionary algorithms for per-question prompt selection yield further improvements: IAP-mv lifts GSM8K accuracy by ~2–4 pp; evolutionary zero-shot CoT (EoT) boosts arithmetic average from 80.7% to 83.5% (Yuan et al., 30 Sep 2024, Jin et al., 8 Feb 2024).
Retrieval-based and enhanced CoT exemplars fail to outperform zero-shot CoT on top-tier models, confirming that internalized multi-step reasoning dominates explicit example conditioning (Cheng et al., 17 Jun 2025). Ablations show that exemplar formatting mainly standardizes output (e.g. enforcing a \boxed{} answer format): masked or noisy exemplars reduce accuracy only to zero-shot levels, and attention signatures concentrate on the instruction and the question.
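The formatting point has a practical corollary for evaluation (see also Section 5): answers must be extracted according to the output convention the zero-shot prompt actually elicits. A minimal sketch, assuming the \boxed{} convention from the trigger quoted in Section 1:

```python
import re

def extract_boxed_answer(completion: str):
    """Return the content of the last \\boxed{...} span, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def extract_last_number(completion: str):
    """Common fallback heuristic: the last number appearing in the completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

# The two extractors can disagree on the same completion, which is exactly the
# evaluation artifact flagged under "Formatting and Extraction Biases" below.
text = r"Step 1: 3 pens * 12 hens = 36. Final answer: \boxed{36}"
assert extract_boxed_answer(text) == "36"
assert extract_last_number(text) == "36"
```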
3. Methodological Innovations in Zero-shot CoT
Zero-shot CoT has rapidly diversified beyond its classic string-based protocol, spawning several notable methodological advances:
Structured Reasoning Traces:
- COT STEP: Explicit numbered “Step 1:”, “Step 2:”, … markers yield parseable and verifiable chains, facilitating downstream zero-shot verification (self-checking) (Chowdhury et al., 21 Jan 2025); a minimal parsing sketch follows this list.
- PS/PS+: Two-stage plan-and-solve pipelines enforce high-level planning and variable extraction, markedly reducing calculation and missing-step errors (Wang et al., 2023).
- Tab-CoT: CoT reasoning in table format enables horizontal (per-step) and vertical (cross-step) logic propagation, outperforming text-only traces (Jin et al., 2023).
- Hint of Thought (HoT): Explicit sub-question decomposition, logical pseudocode, and answer formatting make reasoning interpretable and transferable (Lei et al., 2023).
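As referenced in the COT STEP item, explicit step markers make chains machine-parseable, so each step can be handed to a downstream verifier. A minimal sketch (the marker regex assumes a “Step N:” surface form and is illustrative):

```python
import re
from typing import List

STEP_MARKER = re.compile(r"Step\s+(\d+)\s*:", re.IGNORECASE)

def split_into_steps(completion: str) -> List[str]:
    """Split a COT STEP-style completion into its numbered reasoning steps."""
    parts = STEP_MARKER.split(completion)
    # re.split with a capturing group yields [preamble, num1, body1, num2, body2, ...]
    return [f"Step {parts[i]}: {parts[i + 1].strip()}"
            for i in range(1, len(parts) - 1, 2)]

chain = ("Step 1: There are 3 pens with 12 hens each, so 3 * 12 = 36. "
         "Step 2: Therefore the farmer has 36 hens.")
for step in split_into_steps(chain):
    print(step)  # each step could now be checked by a separate zero-shot verification prompt
```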
Adaptive and Selection Strategies:
- Instance-adaptive CoT: Saliency-based attention flow analysis identifies per-instance optimal prompt triggers, rather than a static task-level instruction (Yuan et al., 30 Sep 2024).
- Evolutionary Algorithms: Prompt populations evolve via LLM-guided crossover/mutation, with in-model selection yielding diverse and effective CoT triggers suited to each question (Jin et al., 8 Feb 2024).
- Uncertainty-driven selection (ZEUS): Predictive entropy over CoT chain answers guides in-context demonstration selection without label access, outperforming temperature-only baselines (Kumar et al., 30 Nov 2024).
- Adaptive Injection Decoding (AID): Monitors model output; if “<eos>” ranks highly, injects “Well” to nudge continuation, boosting reasoning chain completeness and accuracy (Jin et al., 13 Mar 2025).
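To make the AID idea above concrete, the sketch below wraps a plain greedy-decoding loop around a Hugging Face causal LM: whenever the end-of-sequence token enters the top-k of the next-token distribution before a minimum chain length is reached, a continuation phrase is injected instead of stopping. The top-k cutoff, minimum length, and injected phrase here are illustrative assumptions, not the paper's exact settings:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def aid_greedy_generate(model, tok, prompt, max_new_tokens=256,
                        top_k=5, min_new_tokens=64, nudge=" Well,"):
    """Greedy decoding that resists premature <eos> by injecting a nudge phrase."""
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    nudge_ids = tok(nudge, add_special_tokens=False, return_tensors="pt").input_ids
    with torch.no_grad():
        while ids.shape[1] - prompt_len < max_new_tokens:
            logits = model(ids).logits[0, -1]                 # next-token distribution
            top_ids = torch.topk(logits, top_k).indices
            too_short = ids.shape[1] - prompt_len < min_new_tokens
            if too_short and tok.eos_token_id in top_ids:
                ids = torch.cat([ids, nudge_ids], dim=-1)     # inject and keep reasoning
                continue
            next_id = logits.argmax()
            if next_id.item() == tok.eos_token_id:
                break
            ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tok.decode(ids[0, prompt_len:], skip_special_tokens=True)

# Usage with any instruction-tuned causal LM, e.g.:
# tok = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)
# print(aid_greedy_generate(model, tok, "Q: ...\nA: Let's think step by step."))
```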
Hierarchical and Domain-Targeted CoT:
- Hierarchical CoT for semantic trajectories (HiCoTraj): Multi-stage factual, behavioral, and demographic abstraction enables zero-shot mobility-based prediction (Xie et al., 14 Oct 2025).
- Cross-lingual CoT protocols (CLP, AutoCAP): Two-stage alignment and solver prompting (plus automatic language/weight selection) extend zero-shot CoT to non-English and multi-language ensembles, delivering up to +8% gains (Qin et al., 2023, 2406.13940).
- Multimodal CoT (CCoT, DDCoT, MMPlanner): Scene-graph generation (CCoT), duty-distinct decomposition (DDCoT), and object-state induction (MMPlanner) drive zero-shot CoT in visual and mixed domains; all strictly test-time, parameter-free prompting (Mitra et al., 2023, Zheng et al., 2023, Tabassum et al., 25 Sep 2025).
4. Analytical Insights and Mechanisms
Comprehensive quantitative and analytical studies reveal why zero-shot CoT succeeds, especially with advanced LLMs:
- Attention Analysis: Self-attention heads in strong models concentrate almost exclusively on the instruction and test question, with minimal weight on in-context exemplars (Cheng et al., 17 Jun 2025). Saliency flow analysis identifies that both question→prompt and question→rationale flows are essential for “good” reasoning; failing either leads to “bad” CoT chains (Yuan et al., 30 Sep 2024).
- Pretraining Effects: Modern LLMs are exposed during instruction tuning to vast quantities of CoT and multi-step traces, internalizing these patterns to the extent that explicit example conditioning is often ignored.
- Format vs. Reasoning: Traditional exemplars provide format scaffolding (i.e., answer extraction cues), but bring negligible reasoning benefit in strong LLMs. Noise-masked or content-degraded exemplars still allow robust performance due to reliance on model-internalized reasoning skills (Cheng et al., 17 Jun 2025).
- Prompt Diversity and Adaptivity: Per-instance prompt engineering (instance-adaptive, evolutionary, uncertainty-guided) outperforms static triggers. Adaptive decoding nudges, prompt specialization (Tab-CoT headers, hierarchical scaffolding), and language/resource selection all enhance chain quality.
- Cross-lingual and Multimodal Effects: Alignment and solver separation, plus weighted language ensemble (AutoCAP), enable robust multilingual CoT reasoning. Tabular, hierarchical, and object-state-driven CoT protocols generalize well to symbolic, procedural, and multimodal reasoning environments.
5. Limitations, Failure Modes, and Design Guidelines
Despite strong empirical performance, zero-shot CoT and its variants face several documented limitations and challenges.
- Commonsense and Non-mathematical Reasoning: Verification and self-consistency scores may not improve selection in hard commonsense tasks (e.g. CommonsenseQA) under zero-shot regimes (Chowdhury et al., 21 Jan 2025). Tabular and stepwise protocols plateau on highly unstructured or commonsense-biased domains (Jin et al., 2023).
- Model Size and Pretraining: Small or non-instruction-tuned models may not respond to CoT triggers, especially table-formatted protocols, due to lack of pretraining on structured documents (Jin et al., 2023).
- Prompt Sensitivity: Slight changes in wording and step arrangement can induce up to 4% accuracy swings; excessive decomposition or redundant step injection (e.g. in Hint-of-Thought) causes hallucination and chain degradation (Lei et al., 2023, Kumar et al., 30 Nov 2024).
- Reliance on English: Multilingual generalization demands explicit alignment-pivot and solver separation (CLP); models may hallucinate on extremely low-resource languages (Qin et al., 2023, 2406.13940).
- Formatting and Extraction Biases: Evaluation scripts must match the zero-shot output format (e.g. extracting the answer from \boxed{} rather than taking the last digit) to avoid artificial performance drops (Cheng et al., 17 Jun 2025).
- Multimodal Reasoning Risks: Scene-graph-driven reasoning is impaired by missed object–relation extractions and context window truncation; duty-distinct protocols can inherit base-LM biases or fail complex recognition without pretraining (Mitra et al., 2023, Zheng et al., 2023).
Design Recommendations:
- Prefer explicit planning triggers, variable extraction, and intermediate-result instructions in prompt engineering (Wang et al., 2023).
- Two-stage or tabular protocols can enhance multi-step accuracy; adapt header columns to task structure (Jin et al., 2023).
- Adaptive, per-instance prompt selection and entropy-guided demonstration choice outperform static task-level triggers (Yuan et al., 30 Sep 2024, Kumar et al., 30 Nov 2024); a minimal entropy sketch follows this list.
- In cross-lingual tasks, use automatic language selection and weight allocation rather than manual or uniform ensembles (2406.13940).
- Multimodal cases benefit from duty-distinct reasoning–recognition splits and explicit chaining of visual and textual inference (Zheng et al., 2023, Tabassum et al., 25 Sep 2025).
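A minimal sketch of the predictive-entropy signal behind the entropy-guided recommendation above: sample several zero-shot CoT chains per candidate question, extract their final answers, and score each question by the entropy of the resulting answer distribution; no gold labels are required. Preferring low-entropy candidates is an illustrative selection rule here, not necessarily the exact ZEUS criterion:

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

def predictive_entropy(answers: List[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def rank_demo_candidates(questions: List[str],
                         sample_answers: Callable[[str, int], List[str]],
                         n_samples: int = 10) -> List[Tuple[str, float]]:
    """Rank candidate demonstration questions by answer-level uncertainty."""
    scored = [(q, predictive_entropy(sample_answers(q, n_samples))) for q in questions]
    return sorted(scored, key=lambda pair: pair[1])  # ascending: most consistent first

# `sample_answers(q, n)` is a placeholder that runs n zero-shot CoT samples for
# question q and returns the n extracted final answers (e.g. via a \boxed{} extractor).
```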
6. Implications for Future In-Context Learning and Prompt Engineering
Findings from recent studies require a reexamination of the ICL+CoT paradigm. The classical assumption that “more, better, or highly tailored exemplars always yield superior reasoning” no longer holds in top-performing LLMs. For modern models, instructional triggers and adaptive prompt selection supersede in-context demonstration quality. Zero-shot CoT effectively leverages the model's internalized multi-step patterns, rendering extensive example retrieval, few-shot engineering, or complex context assembly non-essential for reasoning performance.
This calls for redirection of prompt design focus—towards refining instruction tokens, exploiting structured and adaptive prompt templates, and constructing interactive or feedback-driven prompting sequences that further unlock LLM capabilities. Exemplar utility remains for smaller or less capable models, but primarily for standardizing output rather than improving solution accuracy. Evaluation protocols must carefully separate format effects from true reasoning capability, and further innovations should target hierarchical decomposition, domain-centric chain structuring, and adaptive, instance-level selection mechanisms.
The continued evolution and diversification of zero-shot CoT prompting, encompassing adaptive, structured, multilingual, and multimodal methodologies, is central to advancing rigorous, scalable, and interpretable reasoning in foundation LLMs (Cheng et al., 17 Jun 2025, Wang et al., 2023, Yuan et al., 30 Sep 2024, Qin et al., 2023, Kumar et al., 30 Nov 2024, Lei et al., 2023, Jin et al., 2023, Jin et al., 8 Feb 2024, Jin et al., 13 Mar 2025, Xie et al., 14 Oct 2025, Mitra et al., 2023, Zheng et al., 2023, Chowdhury et al., 21 Jan 2025, Tabassum et al., 25 Sep 2025, 2406.13940, Hebenstreit et al., 2023).