
Manual Chain-of-Thought Prompting

Updated 4 July 2026
  • ManualCoT is a framework where manually authored prompts provide explicit intermediate reasoning, structured as question, rationale, and answer triples.
  • It employs few-shot in-context learning, using carefully designed demonstrations to elicit multi-step reasoning in large language models.
  • Empirical results show significant performance gains in arithmetic and commonsense tasks, though the method incurs high annotation costs and sensitivity to prompt structure.

Manual Chain-of-Thought (ManualCoT) denotes the human-authored form of chain-of-thought prompting in which an LLM is shown explicit intermediate reasoning before the final answer. In its narrow formulation, used as a baseline in later automation work, it is a few-shot in-context learning regime whose prompt consists of manually designed demonstrations, each a $\langle \text{question}, \text{rationale}, \text{answer} \rangle$ triple. In broader survey usage, the term also covers human-designed trigger instructions and related hand-crafted rationale formats. Across the literature, ManualCoT is the foundational CoT paradigm from which zero-shot, automatic, semi-automatic, symbolic-aided, and human-in-the-loop variants are contrasted or derived (Wei et al., 2022, Zhang et al., 2022, Chu et al., 2023).

1. Canonical formulation

The canonical ManualCoT prompt is a few-shot prompt in which each exemplar contains a question, a reasoning chain, and an expected answer. The Auto-CoT paper defines the manual regime as “few-shot prompting with manual reasoning demonstrations one by one,” where “a reasoning chain is composed of a rationale (a series of intermediate reasoning steps) and an expected answer,” and all demonstrations are “manually designed” (Zhang et al., 2022). In the appendix to that work, a demonstration is explicitly defined as the triple $\langle \text{question}, \text{rationale}, \text{answer} \rangle$ (Zhang et al., 2022).

Survey papers generalize this prompt schema in formal notation. One survey writes the CoT demonstration template as

$$\mathcal{T}_{\mathrm{CoT}} = \{ I, (x_1, e_1, y_1), \cdots, (x_n, e_n, y_n)\},$$

where $I$ is the instruction, $x_i$ the input, $e_i$ the explanation or reasoning trace, and $y_i$ the answer (Chu et al., 2023). The same survey factorizes answer generation through the rationale: $p(\mathcal{A}\mid \mathcal{T}, \mathcal{Q}) = p(\mathcal{A}\mid \mathcal{T}, \mathcal{Q}, \mathcal{R})\, p(\mathcal{R}\mid \mathcal{T}, \mathcal{Q})$, which captures the core ManualCoT assumption that the rationale is an explicit intermediate object rather than incidental surface text (Chu et al., 2023).
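One way to read this factorization, offered here as an interpretation under the standard law of total probability rather than as a statement from the survey: the product on the right-hand side is the joint probability of a specific rationale and the answer, and summing it over rationales gives the answer probability. The survey's form then corresponds to conditioning on a single generated rationale.

```latex
\begin{align}
p(\mathcal{A}, \mathcal{R} \mid \mathcal{T}, \mathcal{Q})
  &= p(\mathcal{A} \mid \mathcal{T}, \mathcal{Q}, \mathcal{R})\,
     p(\mathcal{R} \mid \mathcal{T}, \mathcal{Q}), \\
p(\mathcal{A} \mid \mathcal{T}, \mathcal{Q})
  &= \sum_{\mathcal{R}}
     p(\mathcal{A} \mid \mathcal{T}, \mathcal{Q}, \mathcal{R})\,
     p(\mathcal{R} \mid \mathcal{T}, \mathcal{Q}).
\end{align}
```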

In practical prompt formatting, the literature repeatedly uses a literal style of the form: Q: ... followed by A: Let's think step by step. ... The answer is ... (Zhang et al., 2022). The original CoT prompting paper manually composed such exemplars across arithmetic, commonsense, and symbolic tasks, typically using eight manually written exemplars for most math word problem datasets and four exemplars for AQuA (Wei et al., 2022). This format makes ManualCoT more specific than generic few-shot prompting: the prompt does not merely demonstrate input-output mappings, but demonstrates stepwise inferential structure.
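The literal format described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited papers: the `Demo` class, the function names, and the example questions are all hypothetical, and the exact placement of the trigger phrase is an assumption based on the "Q: ... A: Let's think step by step. ... The answer is ..." pattern.

```python
from dataclasses import dataclass

# Trigger phrase quoted in the text; where exactly it sits is an assumption.
TRIGGER = "Let's think step by step."


@dataclass(frozen=True)
class Demo:
    """One manually written demonstration: question, rationale, answer."""
    question: str
    rationale: str
    answer: str


def format_demo(demo: Demo, trigger: str = TRIGGER) -> str:
    # "Q: ..." followed by "A: <trigger> <rationale> The answer is <answer>."
    return (
        f"Q: {demo.question}\n"
        f"A: {trigger} {demo.rationale} The answer is {demo.answer}."
    )


def build_manual_cot_prompt(demos: list[Demo], test_question: str,
                            trigger: str = TRIGGER) -> str:
    # Demonstrations come first; the test question is left open after the trigger.
    blocks = [format_demo(d, trigger) for d in demos]
    blocks.append(f"Q: {test_question}\nA: {trigger}")
    return "\n\n".join(blocks)


# Hypothetical exemplar written for this sketch, not taken from any dataset.
demos = [
    Demo(
        question="Tom has 3 boxes with 4 pens each. How many pens does he have?",
        rationale="Each box has 4 pens and there are 3 boxes, so 3 * 4 = 12.",
        answer="12",
    ),
]
prompt = build_manual_cot_prompt(
    demos, "A shelf holds 5 rows of 6 books. How many books are there?"
)
print(prompt)
```

Keeping the rationale as a separate field, rather than one pre-formatted string, makes it easy to swap or ablate individual parts of a demonstration.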

2. Emergence and taxonomic position

ManualCoT became prominent with the demonstration that few-shot chain-of-thought prompting can elicit reasoning in sufficiently large LLMs without finetuning (Wei et al., 2022). In that formulation, standard prompting uses $\langle \text{input}, \text{output} \rangle$ pairs, whereas chain-of-thought prompting uses $\langle \text{input}, \text{chain of thought}, \text{output} \rangle$ triples, with the chain of thought defined as “a series of intermediate natural language reasoning steps that lead to the final answer” (Wei et al., 2022). This established ManualCoT as the classical prompt-engineering regime for eliciting latent multi-step reasoning.

Subsequent taxonomies place ManualCoT inside broader construction-based classifications. The survey “Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future” classifies it as Manual XoT under XoT Construction, contrasting it with Automatic XoT and Semi-automatic XoT (Chu et al., 2023). In that survey, manual methods are characterized by human-written reasoning chains, manual example choice and formatting, and high quality at high annotation cost (Chu et al., 2023). The survey “Towards Better Chain-of-Thought Prompting Strategies” frames the same regime as human-designed prompts and demonstrations, including hand-written textual instructions such as “Let’s think step by step,” manually curated few-shot exemplars, and hand-authored rationale templates (Yu et al., 2023).

The narrow and broad senses of the term should be distinguished. In the narrow sense used by Auto-CoT, Manual-CoT is specifically the original few-shot CoT paradigm with manually designed demonstrations (Zhang et al., 2022). In broader survey usage, manual CoT includes both few-shot demonstrations and manually designed trigger instructions, including zero-shot triggers such as “Let’s think step by step” (Yu et al., 2023, Hu et al., 2024). This suggests that the unifying criterion is human authorship of the reasoning scaffold, not any single prompt length or decoding setup.

3. Empirical performance and scope

The original CoT prompting results show that ManualCoT is most effective on tasks requiring multi-step reasoning and at sufficiently large model scales. On GSM8K, accuracy improves under CoT prompting for LaMDA 137B, GPT-3 175B, and PaLM 540B (Wei et al., 2022). On commonsense reasoning, PaLM 540B improves on StrategyQA, Sports Understanding, Date Understanding, and SayCan (Wei et al., 2022). On symbolic reasoning, the gains are especially large: for PaLM 540B, Last Letter Concatenation improves both in-domain and out-of-domain, and Coin Flip improves both in-domain and under length generalization (Wei et al., 2022).

ManualCoT also serves as the principal human-crafted baseline in automatic prompting research. In the Auto-CoT study, Manual-CoT scores with GPT-3 are reported for MultiArith, GSM8K, AddSub, AQuA, SingleEq, SVAMP, CSQA, StrategyQA, Letter, and Coin Flip (Zhang et al., 2022). The same paper reports that Auto-CoT “consistently matches or exceeds” this manually designed CoT baseline across ten public benchmark reasoning tasks, which makes ManualCoT the reference point for evaluating automation rather than a deprecated method (Zhang et al., 2022).

ManualCoT was also shown to transfer beyond English. The multilingual reasoning paper evaluates direct prompting, Native-CoT, EN-CoT, and Translate-EN on MGSM, XCOPA, and XL-WiC, using 6-shot prompts whenever possible and greedy decoding (Shi et al., 2022). On MGSM, it reports average accuracy for both GPT-3 and PaLM-540B under each of the four settings: Direct, Native-CoT, EN-CoT, and Translate-EN (Shi et al., 2022). The same study reports that Bengali and Swahili each account for only a small fraction of the training corpus, yet PaLM still solves a substantial fraction of MGSM problems in those languages, and underrepresented languages score only modestly lower on average than high-resource languages (Shi et al., 2022). In that setting, ManualCoT is not restricted to English demonstrations; it includes native-language rationales, English rationales for non-English inputs, and multilingual exemplar configurations.

4. Limitations, faithfulness, and annotation burden

The central limitation of ManualCoT is the need to hand-craft both the demonstration questions and their reasoning chains. The Auto-CoT paper states that the “superior performance hinges on the hand-drafting of effective demonstrations,” involving “nontrivial efforts in designs of both questions and their reasoning chains,” and that different tasks, such as arithmetic and commonsense reasoning, require different styles of demonstration (Zhang et al., 2022). The surveys generalize the same point: manual methods are strong but expensive, face difficulty in demonstration selection and task generalization, and do not scale as easily as automatic or semi-automatic alternatives (Chu et al., 2023, Yu et al., 2023).

Empirically, ManualCoT is sensitive to the internal coherence of the demonstration traces. In an appendix study reported by Auto-CoT, the original Manual-CoT score is compared against variants in which the questions, the rationales, or the answers are shuffled across demonstrations (Zhang et al., 2022). The comparison indicates that rationale-answer consistency is more critical than the particular question order. The same paper reports that demonstrations written by different annotators can produce a marked accuracy disparity in a symbolic reasoning task (Zhang et al., 2022). A plausible implication is that ManualCoT works not because it merely lengthens the prompt, but because it supplies a coherent inferential trace that the model can imitate.
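The shuffling ablation can be illustrated with a small helper that breaks the alignment of one field across demonstrations while leaving the other two intact. This is a sketch under stated assumptions: the `Demo` tuple, the `permute_field` name, and the toy demonstrations are hypothetical, and a fixed rotation is used as a deterministic stand-in for the random shuffle in the study so that every demonstration is guaranteed to be mismatched.

```python
from typing import NamedTuple


class Demo(NamedTuple):
    question: str
    rationale: str
    answer: str


def permute_field(demos, field, offset=1):
    """Rotate one field across demonstrations, leaving the others in place.

    A rotation stands in for a random shuffle (an assumption of this sketch):
    with offset=1 and at least two demos, every demo receives another demo's value.
    """
    values = [getattr(d, field) for d in demos]
    rotated = values[offset:] + values[:offset]
    return [d._replace(**{field: v}) for d, v in zip(demos, rotated)]


# Toy demonstrations written for this sketch.
original = [
    Demo("2 + 3?", "2 plus 3 is 5.", "5"),
    Demo("4 * 2?", "4 times 2 is 8.", "8"),
    Demo("9 - 1?", "9 minus 1 is 8.", "8"),
]

shuffled_questions = permute_field(original, "question")
shuffled_rationales = permute_field(original, "rationale")
shuffled_answers = permute_field(original, "answer")

for demo in shuffled_rationales:
    print(demo)
```

Each variant can then be formatted into a prompt and scored separately, which isolates how much each kind of misalignment costs.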

Faithfulness remains an unresolved difficulty. The original CoT prompting paper notes that the model may produce a convincing chain that is logically wrong, and that correct final answers can sometimes arise from incorrect reasoning (Wei et al., 2022). The survey literature emphasizes “false coherence” and “unfaithfulness”: rationales may look plausible without reflecting actual reasoning, and even a wrong but coherent rationale can still help performance if it preserves logical structure (Yu et al., 2023). The Hopfieldian analysis later reinforces the instability of manual prompts by showing that standard CoT can improve or hurt performance depending on prompt wording and demonstration order, motivating representation-level alternatives such as RoT (Hu et al., 2024).

5. Variants and domain-specific reinterpretations

One major reinterpretation of ManualCoT is human-in-the-loop rationale editing. “Human-in-the-Loop through Chain-of-Thought” defines a Manual Correction System (MCS) in which the human does not solve the whole problem from scratch, but edits the model’s rationale at the level of sub-logics (Cai et al., 2023). MCS has four stages: sampling multiple rationales, filtering likely-wrong cases, manual correction of sub-logics, and answer regeneration with the corrected rationale (Cai et al., 2023). The human operations are modifying, adding, and deleting sub-logics, and the paper reports that a share of CoT errors can be attributed to incorrect intermediate rationales that, if corrected, lead to the right answer (Cai et al., 2023). To decide when to involve humans, the method defines Diversity Entropy (DE), computed from the answer distribution over sampled rationales; the main experiments send the cases ranked highest by DE for manual correction (Cai et al., 2023). The same work adds a cost-utility model, CAMLOP, with a budget constraint and a Cobb-Douglas utility function, explicitly treating ManualCoT intervention as an economic optimization problem rather than only an accuracy maximization problem (Cai et al., 2023).
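The filtering stage can be sketched as an entropy ranking over sampled answers. The exact Diversity Entropy formula is not reproduced in this text, so the code below assumes plain Shannon entropy of the empirical answer distribution; the function names, the review fraction, and the sample data are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of final answers.

    Assumption of this sketch: Diversity Entropy behaves like this quantity.
    """
    total = len(answers)
    probs = [count / total for count in Counter(answers).values()]
    return sum(-p * math.log(p) for p in probs)


def select_for_review(sampled_answers, fraction):
    """Return the question ids with the highest entropy, keeping ceil(fraction * n)."""
    ranked = sorted(
        sampled_answers,
        key=lambda qid: answer_entropy(sampled_answers[qid]),
        reverse=True,
    )
    return ranked[: math.ceil(fraction * len(ranked))]


# Final answers from several sampled rationales per question (toy data).
sampled = {
    "q1": ["12", "12", "12", "12"],  # full agreement, zero entropy
    "q2": ["7", "9", "7", "8"],      # strong disagreement
    "q3": ["3", "3", "4", "3"],      # mild disagreement
}

print(select_for_review(sampled, fraction=0.5))
```

Questions on which the sampled rationales disagree most are routed to a human first, which is the intuition behind spending correction effort where the model is least self-consistent.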

In neural code generation, ManualCoT is recast as expensive human-written “implementation ideas.” The COTTON paper contrasts manual CoT, few-shot prompting, and automatic CoT generation for lightweight LLMs (Yang et al., 2023). There, a CoT is a natural-language implementation idea describing steps such as initialization, iteration, branching, and return logic. The authors argue that manual CoT is costly and not scalable, and empirically show that most lightweight models cannot independently generate high-quality CoTs through few-shot prompting, even though they can benefit from high-quality CoTs once those CoTs exist (Yang et al., 2023). COTTON responds by building CodeCoT-9k and fine-tuning CodeLlama-7B with instruction tuning and LoRA, treating CoT as a learnable intermediate artifact rather than a manually written prompt trick (Yang et al., 2023).

Other extensions preserve the manual prompting regime while restructuring the reasoning trace. Symbolic-Aided CoT is explicitly described as a non-iterative, single-turn refinement of CoT prompting that integrates lightweight symbolic representations into few-shot prompts through tokens such as Rule[i], KB, F(...), and Validate(...) (Nguyen et al., 17 Aug 2025). It remains manual prompting rather than external symbolic solving, but redesigns the rationale to expose rule selection, premise derivation, KB updating, and validation (Nguyen et al., 17 Aug 2025). In multimodal robotics, ManualVLA introduces a distinct use of the term ManualCoT: a multimodal intermediate manual consisting of text, 2D position prompts, subgoal images, and latent manual features that bridge goal-conditioned planning and action generation in a unified VLA model (Gu et al., 1 Dec 2025). This usage broadens ManualCoT beyond natural-language rationales while preserving the defining idea of explicit intermediate control conditions.

6. Mechanistic interpretations and research frontiers

Later work increasingly treats ManualCoT not only as a prompt design strategy but as an intervention on internal computation. From a Hopfieldian perspective, manual prompts and demonstrations are modeled as stimuli that move the model into low-dimensional representation spaces associated with reasoning (Hu et al., 2024). That work formalizes zero-shot CoT in terms of a manual trigger such as “Let’s think step by step,” and few-shot CoT in terms of a set of manually chosen demonstrations (Hu et al., 2024). It then defines contrastive hidden-state differences, uses PCA to construct layerwise representation spaces, and proposes Representation-of-Thought (RoT) as hidden-state steering. In that account, ManualCoT works because it steers the model into reasoning-favorable representational regimes, but remains brittle to prompt phrasing and exemplar order (Hu et al., 2024).
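The general idea of contrastive differences followed by a principal direction and additive steering can be sketched without any model. Everything here is an assumption for illustration: the vectors are toy numbers rather than real hidden states, the direction is found by uncentered PCA (power iteration on the difference vectors), and the additive update with strength `alpha` is a generic steering rule, not the paper's formula.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_direction(diffs, iters=100):
    """Leading unit direction of the difference vectors (uncentered PCA)."""
    dim = len(diffs[0])
    v = normalize([1.0] * dim)
    for _ in range(iters):
        # Power iteration: w = D^T (D v), without forming D^T D explicitly.
        proj = [dot(d, v) for d in diffs]
        w = [sum(p * d[j] for p, d in zip(proj, diffs)) for j in range(dim)]
        v = normalize(w)
    # Orient the direction from "without trigger" toward "with trigger".
    mean = [sum(d[j] for d in diffs) / len(diffs) for j in range(dim)]
    return v if dot(v, mean) >= 0 else [-x for x in v]


def steer(hidden, direction, alpha):
    """Additive steering: shift a hidden state along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy "hidden states" for the same queries with and without a CoT trigger.
with_trigger = [[1.0, 2.0, 0.0], [0.5, 3.1, 0.1], [2.0, 1.0, -0.1]]
without_trigger = [[1.0, 0.0, 0.0], [0.5, 0.0, 0.0], [2.0, -1.0, 0.0]]
diffs = [[a - b for a, b in zip(p, n)] for p, n in zip(with_trigger, without_trigger)]

direction = top_direction(diffs)
hidden = [0.2, -0.1, 0.4]
steered = steer(hidden, direction, alpha=2.0)
print(direction, steered)
```

In a real setting the vectors would come from one layer of a model, and the procedure would be repeated per layer.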

Other mechanistic analyses assign a more operational role to the tokens themselves. The paper “Chain-of-Thought Tokens are Computer Program Variables” argues that CoT tokens store intermediate values that are later read and updated by downstream steps, much like variables in computer programs (Zhu et al., 8 May 2025). On multi-digit multiplication and dynamic programming, preserving only tokens that store intermediate results yields comparable performance, and interventions on selected intermediate values causally change later CoT tokens and the final answer (Zhu et al., 8 May 2025). This suggests that ManualCoT is useful because it externalizes intermediate state, while the exact surface form of the rationale is secondary to the continued availability of that state.
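The variable analogy can be made concrete with long multiplication, where each partial product plays the role of an intermediate value that later steps read. This is an illustrative sketch only, not the paper's experimental setup: the function name and the `override` mechanism are hypothetical, and non-negative integer inputs are assumed.

```python
def multiply_with_trace(a, b, override=None):
    """Long multiplication of a by b, keeping partial products as explicit state.

    `override` maps a step index to a replacement partial product, mimicking
    an intervention on an intermediate value. Assumes b is a non-negative int.
    """
    override = override or {}
    partials = []
    for step, digit_char in enumerate(reversed(str(b))):
        value = a * int(digit_char) * 10 ** step  # one digit of b at a time
        value = override.get(step, value)         # intervention, if any
        partials.append(value)
    # The final answer is read off the stored intermediate values only.
    return partials, sum(partials)


clean_partials, clean_total = multiply_with_trace(123, 45)
edited_partials, edited_total = multiply_with_trace(123, 45, override={0: 600})

print(clean_partials, clean_total)
print(edited_partials, edited_total)
```

Because the final sum depends only on the stored partial products, changing one of them changes the answer, which mirrors the causal role the analysis attributes to intermediate-value tokens.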

A complementary analysis characterizes CoT as a decoding-space pruner rather than a faithful teacher of new reasoning content (Yang et al., 28 Jul 2025). In that account, manual CoT exemplars provide reusable answer templates such as “So the answer is ...”, and template adherence correlates strongly with improved accuracy on GSM8K (Yang et al., 28 Jul 2025). The same paper reports lower entropy in answer-token distributions under CoT than under standard prompting and finds task-dependent FFN activation changes: in later layers, CoT reduces activation for open-domain tasks such as GSM8K and Bamboogle, but increases activation for closed-domain tasks such as Coin Flip, AQuA, and Sports (Yang et al., 28 Jul 2025). Together with survey discussions of faithfulness, efficiency, and the need to move beyond linear chains, these mechanistic results shift the research frontier from prompt craftsmanship alone toward questions of internal state, structural imitation, and controllable reasoning representations (Chu et al., 2023, Yu et al., 2023).
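Template adherence of the kind discussed above could be checked with a simple pattern match on model completions. The regular expression, the function names, and the sample completions are assumptions made for this sketch; the paper's own adherence measure is not specified in this text.

```python
import re

# Matches a closing "So the answer is <X>." sentence at the end of a completion.
ANSWER_TEMPLATE = re.compile(r"[Ss]o the answer is\s+(.+?)\.?\s*$")


def extract_answer(completion):
    """Return the answer text if the completion ends with the template, else None."""
    match = ANSWER_TEMPLATE.search(completion.strip())
    return match.group(1) if match else None


def adherence_rate(completions):
    """Fraction of completions that end with the answer template."""
    hits = sum(1 for c in completions if extract_answer(c) is not None)
    return hits / len(completions)


# Toy completions written for this sketch.
completions = [
    "Each box has 4 pens, so 3 * 4 = 12. So the answer is 12.",
    "Half of 7 is 3.5. So the answer is 3.5.",
    "I think it is probably 12",
    "The result should be eight.",
]

print([extract_answer(c) for c in completions])
print(adherence_rate(completions))
```

Pairing the adherence flag with a correctness flag per example is then enough to check whether the two move together on a given benchmark.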
