Manual Chain-of-Thought Prompting

Updated 4 July 2026

ManualCoT is a framework where manually authored prompts provide explicit intermediate reasoning, structured as question, rationale, and answer triples.
It employs few-shot in-context learning, using carefully designed demonstrations to elicit multi-step reasoning in large language models.
Empirical results show significant performance gains in arithmetic and commonsense tasks, though the method incurs high annotation costs and sensitivity to prompt structure.

Manual Chain-of-Thought (ManualCoT) denotes the human-authored form of chain-of-thought prompting in which an LLM is shown explicit intermediate reasoning before the final answer. In its narrow formulation, used as a baseline in later automation work, it is a few-shot in-context learning regime whose prompt consists of manually designed demonstrations, each a $\langle \text{question}, \text{rationale}, \text{answer} \rangle$ triple. In broader survey usage, the term also covers human-designed trigger instructions and related hand-crafted rationale formats. Across the literature, ManualCoT is the foundational CoT paradigm from which zero-shot, automatic, semi-automatic, symbolic-aided, and human-in-the-loop variants are contrasted or derived (Wei et al., 2022, Zhang et al., 2022, Chu et al., 2023).

1. Canonical formulation

The canonical ManualCoT prompt is a few-shot prompt in which each exemplar contains a question, a reasoning chain, and an expected answer. The Auto-CoT paper defines the manual regime as “few-shot prompting with manual reasoning demonstrations one by one,” where “a reasoning chain is composed of a rationale (a series of intermediate reasoning steps) and an expected answer,” and all demonstrations are “manually designed” (Zhang et al., 2022). In the appendix to that work, a demonstration is explicitly defined as the triple $\langle \text{question}, \text{rationale}, \text{answer} \rangle$ (Zhang et al., 2022).

Survey papers generalize this prompt schema in formal notation. One survey writes the CoT demonstration template as

$\mathcal{T}_{\mathrm{CoT}} = \{ I, (x_1, e_1, y_1), \cdots, (x_n, e_n, y_n)\},$

where $I$ is the instruction, $x_i$ the input, $e_i$ the explanation or reasoning trace, and $y_i$ the answer (Chu et al., 2023). The same survey factorizes answer generation through the rationale: $p(\mathcal{A}\mid \mathcal{T}, \mathcal{Q}) = p(\mathcal{A}\mid \mathcal{T}, \mathcal{Q}, \mathcal{R})\, p(\mathcal{R}\mid \mathcal{T}, \mathcal{Q}),$ which captures the core ManualCoT assumption that the rationale is an explicit intermediate object rather than incidental surface text (Chu et al., 2023).
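One way to read the factorization, offered here as an interpretive sketch rather than as the survey's own derivation: if the rationale $\mathcal{R}$ is treated as a latent variable, the product on the right-hand side is the chain-rule expansion of the joint over rationale and answer, and the answer-only probability follows by summing over rationales. Greedy or single-sample decoding commits to one rationale, which corresponds to keeping a single term of that sum.

```latex
\begin{align}
p(\mathcal{R}, \mathcal{A} \mid \mathcal{T}, \mathcal{Q})
  &= p(\mathcal{A} \mid \mathcal{T}, \mathcal{Q}, \mathcal{R})\,
     p(\mathcal{R} \mid \mathcal{T}, \mathcal{Q}), \\
p(\mathcal{A} \mid \mathcal{T}, \mathcal{Q})
  &= \sum_{\mathcal{R}}
     p(\mathcal{A} \mid \mathcal{T}, \mathcal{Q}, \mathcal{R})\,
     p(\mathcal{R} \mid \mathcal{T}, \mathcal{Q}).
\end{align}
```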

In practical prompt formatting, the literature repeatedly uses a literal style of the form: Q: ... followed by A: Let's think step by step. ... The answer is ... (Zhang et al., 2022). The original CoT prompting paper manually composed such exemplars across arithmetic, commonsense, and symbolic tasks, typically using eight manually written exemplars for most math word problem datasets and four exemplars for AQuA (Wei et al., 2022). This format makes ManualCoT more specific than generic few-shot prompting: the prompt does not merely demonstrate input-output mappings, but demonstrates stepwise inferential structure.
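The prompt layout described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited papers: the `Demonstration` class, the function names, and the example word problems are hypothetical, and the optional “Let's think step by step.” prefix is left out of the rationale for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demonstration:
    """One manually written <question, rationale, answer> triple (hypothetical type)."""
    question: str
    rationale: str
    answer: str


def format_demo(demo: Demonstration) -> str:
    # The rationale comes first and the answer last, following the
    # "Q: ... A: ... The answer is ..." style.
    return f"Q: {demo.question}\nA: {demo.rationale} The answer is {demo.answer}."


def build_manual_cot_prompt(demos: list[Demonstration], test_question: str) -> str:
    # Demonstrations are separated by blank lines; the test question ends
    # with an open "A:" so the model continues with its own rationale.
    blocks = [format_demo(d) for d in demos]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


demos = [
    Demonstration(
        question="Tom has 3 apples and buys 2 more. How many apples does he have?",
        rationale="Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
        answer="5",
    ),
]
prompt = build_manual_cot_prompt(
    demos, "A shelf holds 4 books and 6 more are added. How many books are there?"
)
print(prompt)
```

Keeping the triple as a typed record, rather than a pre-formatted string, makes it easy to vary the surface format without rewriting the demonstrations themselves.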

2. Emergence and taxonomic position

ManualCoT became prominent with the demonstration that few-shot chain-of-thought prompting can elicit reasoning in sufficiently large LLMs without finetuning (Wei et al., 2022). In that formulation, standard prompting uses $\langle \text{input}, \text{output} \rangle$ pairs, whereas chain-of-thought prompting uses $\langle \text{input}, \text{chain of thought}, \text{output} \rangle$ triples, with the chain of thought defined as “a series of intermediate natural language reasoning steps that lead to the final answer” (Wei et al., 2022). This established ManualCoT as the classical prompt-engineering regime for eliciting latent multi-step reasoning.

Subsequent taxonomies place ManualCoT inside broader construction-based classifications. The survey “Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future” classifies it as Manual XoT under XoT Construction, contrasting it with Automatic XoT and Semi-automatic XoT (Chu et al., 2023). In that survey, manual methods are characterized by human-written reasoning chains, manual example choice and formatting, and high quality at high annotation cost (Chu et al., 2023). The survey “Towards Better Chain-of-Thought Prompting Strategies” frames the same regime as human-designed prompts and demonstrations, including hand-written textual instructions such as “Let’s think step by step,” manually curated few-shot exemplars, and hand-authored rationale templates (Yu et al., 2023).

The narrow and broad senses of the term should be distinguished. In the narrow sense used by Auto-CoT, Manual-CoT is specifically the original few-shot CoT paradigm with manually designed demonstrations (Zhang et al., 2022). In broader survey usage, manual CoT includes both few-shot demonstrations and manually designed trigger instructions, including zero-shot triggers such as “Let’s think step by step” (Yu et al., 2023, Hu et al., 2024). This suggests that the unifying criterion is human authorship of the reasoning scaffold, not any single prompt length or decoding setup.

3. Empirical performance and scope

The original CoT prompting results show that ManualCoT is most effective on tasks requiring multi-step reasoning and at sufficiently large model scales. On GSM8K, accuracy improves under CoT prompting relative to standard prompting for LaMDA 137B, GPT-3 175B, and PaLM 540B (Wei et al., 2022). On commonsense reasoning, PaLM 540B improves on StrategyQA, Sports Understanding, Date Understanding, and SayCan (Wei et al., 2022). On symbolic reasoning, the gains are especially large: for PaLM 540B, Last Letter Concatenation improves both in-domain and out-of-domain, and Coin Flip improves both in-domain and under length generalization (Wei et al., 2022).

ManualCoT also serves as the principal human-crafted baseline in automatic prompting research. In the Auto-CoT study, Manual-CoT scores with GPT-3 are reported for MultiArith, GSM8K, AddSub, AQuA, SingleEq, SVAMP, CSQA, StrategyQA, Letter, and Coin Flip (Zhang et al., 2022). The same paper reports that Auto-CoT “consistently matches or exceeds” this manually designed CoT baseline across ten public benchmark reasoning tasks, which makes ManualCoT the reference point for evaluating automation rather than a deprecated method (Zhang et al., 2022).

ManualCoT was also shown to transfer beyond English. The multilingual reasoning paper evaluates direct prompting, Native-CoT, EN-CoT, and Translate-EN on MGSM, XCOPA, and XL-WiC, using 6-shot prompts whenever possible and greedy decoding (Shi et al., 2022). On MGSM, it reports average accuracy for GPT-3 and for PaLM-540B under each of the four settings: Direct, Native-CoT, EN-CoT, and Translate-EN (Shi et al., 2022). The same study reports that Bengali and Swahili each account for only a very small share of the training corpus, yet PaLM still solves a substantial fraction of MGSM problems in those languages, and underrepresented languages score only slightly lower on average than high-resource languages (Shi et al., 2022). In that setting, ManualCoT is not restricted to English demonstrations; it includes native-language rationales, English rationales for non-English inputs, and multilingual exemplar configurations.

4. Limitations, faithfulness, and annotation burden

The central limitation of ManualCoT is the need to hand-craft both the demonstration questions and their reasoning chains. The Auto-CoT paper states that the “superior performance hinges on the hand-drafting of effective demonstrations,” involving “nontrivial efforts in designs of both questions and their reasoning chains,” and that different tasks, such as arithmetic and commonsense reasoning, require different styles of demonstration (Zhang et al., 2022). The surveys generalize the same point: manual methods are strong but expensive, face difficulty in demonstration selection and task generalization, and do not scale as easily as automatic or semi-automatic alternatives (Chu et al., 2023, Yu et al., 2023).

Empirically, ManualCoT is sensitive to the internal coherence of the demonstration traces. An appendix study reported by Auto-CoT compares the original Manual-CoT score with the scores obtained after shuffling questions, shuffling rationales, and shuffling answers (Zhang et al., 2022). The comparison indicates that rationale-answer consistency is more critical than the particular question order. The same paper reports that demonstrations written by different annotators can produce a noticeable accuracy disparity in a symbolic reasoning task (Zhang et al., 2022). A plausible implication is that ManualCoT works not because it merely lengthens the prompt, but because it supplies a coherent inferential trace that the model can imitate.
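A shuffling ablation of this kind can be sketched as follows. The sketch assumes that "shuffling" means permuting one component across the demonstrations while the other two stay in place; that reading, the dict layout, and the name `shuffle_component` are assumptions for illustration, not details taken from the Auto-CoT appendix.

```python
import random


def shuffle_component(demos, component, seed=0):
    """Return new demos with one component permuted across demonstrations.

    demos: list of dicts with keys "question", "rationale", "answer".
    component: the key to permute; the other two keys keep their positions.
    """
    if component not in ("question", "rationale", "answer"):
        raise ValueError(f"unknown component: {component}")
    values = [d[component] for d in demos]
    rng = random.Random(seed)  # fixed seed so the ablation is repeatable
    rng.shuffle(values)
    # Build new dicts so the original demonstrations are left untouched.
    return [{**d, component: v} for d, v in zip(demos, values)]


demos = [
    {"question": "Q1", "rationale": "R1", "answer": "A1"},
    {"question": "Q2", "rationale": "R2", "answer": "A2"},
    {"question": "Q3", "rationale": "R3", "answer": "A3"},
]
ablated = shuffle_component(demos, "rationale")
```

Each ablated prompt would then be scored on the same test set as the intact prompt, so that any drop can be attributed to the broken pairing rather than to prompt length.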

Faithfulness remains an unresolved difficulty. The original CoT prompting paper notes that the model may produce a convincing chain that is logically wrong, and that correct final answers can sometimes arise from incorrect reasoning (Wei et al., 2022). The survey literature emphasizes “false coherence” and “unfaithfulness”: rationales may look plausible without reflecting actual reasoning, and even a wrong but coherent rationale can still help performance if it preserves logical structure (Yu et al., 2023). The Hopfieldian analysis later reinforces the instability of manual prompts by showing that standard CoT can improve or hurt performance depending on prompt wording and demonstration order, motivating representation-level alternatives such as RoT (Hu et al., 2024).

5. Variants and domain-specific reinterpretations

One major reinterpretation of ManualCoT is human-in-the-loop rationale editing. “Human-in-the-Loop through Chain-of-Thought” defines a Manual Correction System (MCS) in which the human does not solve the whole problem from scratch, but edits the model’s rationale at the level of sub-logics (Cai et al., 2023). MCS has four stages: sampling multiple rationales, filtering likely-wrong cases, manual correction of sub-logics, and answer regeneration with the corrected rationale (Cai et al., 2023). The human operations are modifying, adding, and deleting sub-logics, and the paper reports that a share of CoT errors can be attributed to incorrect intermediate rationales that, if corrected, lead to the right answer (Cai et al., 2023). To decide when to involve humans, the method defines Diversity Entropy (DE), computed from the answer distribution over sampled rationales; the main experiments send the highest-ranked cases by DE for manual correction (Cai et al., 2023). The same work adds a cost-utility model, CAMLOP, with a budget constraint and a Cobb-Douglas utility function, explicitly treating ManualCoT intervention as an economic optimization problem rather than only an accuracy maximization problem (Cai et al., 2023).
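An entropy-based filter of this kind can be sketched as below. The sketch assumes that Diversity Entropy behaves like the Shannon entropy of the empirical distribution of final answers across sampled rationales; the exact definition in the paper may differ, and the function names and the `fraction` parameter are hypothetical.

```python
import math
from collections import Counter


def diversity_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution.

    Assumption: this stands in for the paper's Diversity Entropy.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_for_correction(sampled_answers, fraction):
    """Return the ids of the highest-entropy questions.

    sampled_answers: dict mapping question id -> list of sampled final answers.
    fraction: share of questions to route to a human (hypothetical knob).
    """
    ranked = sorted(
        sampled_answers,
        key=lambda qid: diversity_entropy(sampled_answers[qid]),
        reverse=True,
    )
    keep = math.ceil(fraction * len(ranked))
    return ranked[:keep]


samples = {
    "q1": ["5", "5", "5"],  # all samples agree: entropy 0
    "q2": ["1", "2", "3"],  # all samples disagree: maximal entropy
}
print(select_for_correction(samples, 0.5))  # ['q2']
```

Questions whose sampled answers disagree receive high entropy and are routed to a human first, while unanimous samples are left uncorrected.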

In neural code generation, ManualCoT is recast as expensive human-written “implementation ideas.” The COTTON paper contrasts manual CoT, few-shot prompting, and automatic CoT generation for lightweight LLMs below a fixed parameter budget (Yang et al., 2023). There, a CoT is a natural-language implementation idea describing steps such as initialization, iteration, branching, and return logic. The authors argue that manual CoT is costly and not scalable, and empirically show that most lightweight models cannot independently generate high-quality CoTs through few-shot prompting, even though they can benefit from high-quality CoTs once those CoTs exist (Yang et al., 2023). COTTON responds by building CodeCoT-9k and fine-tuning CodeLlama-7B with instruction tuning and LoRA, treating CoT as a learnable intermediate artifact rather than a manually written prompt trick (Yang et al., 2023).

Other extensions preserve the manual prompting regime while restructuring the reasoning trace. Symbolic-Aided CoT is explicitly described as a non-iterative, single-turn refinement of CoT prompting that integrates lightweight symbolic representations into few-shot prompts through tokens such as Rule[i], KB, F(...), and Validate(...) (Nguyen et al., 17 Aug 2025). It remains manual prompting rather than external symbolic solving, but redesigns the rationale to expose rule selection, premise derivation, KB updating, and validation (Nguyen et al., 17 Aug 2025). In multimodal robotics, ManualVLA introduces a distinct use of the term ManualCoT: a multimodal intermediate manual consisting of text, 2D position prompts, subgoal images, and latent manual features that bridge goal-conditioned planning and action generation in a unified VLA model (Gu et al., 1 Dec 2025). This usage broadens ManualCoT beyond natural-language rationales while preserving the defining idea of explicit intermediate control conditions.

6. Mechanistic interpretations and research frontiers

Later work increasingly treats ManualCoT not only as a prompt design strategy but as an intervention on internal computation. From a Hopfieldian perspective, manual prompts and demonstrations are modeled as stimuli that move the model into low-dimensional representation spaces associated with reasoning (Hu et al., 2024). That work formalizes zero-shot CoT as the query combined with a manual trigger such as “Let’s think step by step,” and few-shot CoT as the query combined with a set of manually chosen demonstrations (Hu et al., 2024). It then defines contrastive hidden-state differences, uses PCA to construct layerwise representation spaces, and proposes Representation-of-Thought (RoT) as hidden-state steering. In that account, ManualCoT works because it steers the model into reasoning-favorable representational regimes, but remains brittle to prompt phrasing and exemplar order (Hu et al., 2024).

Other mechanistic analyses assign a more operational role to the tokens themselves. The paper “Chain-of-Thought Tokens are Computer Program Variables” argues that CoT tokens store intermediate values that are later read and updated by downstream steps, much like variables in computer programs (Zhu et al., 8 May 2025). On multi-digit multiplication and dynamic programming, preserving only tokens that store intermediate results yields comparable performance, and interventions on selected intermediate values causally change later CoT tokens and the final answer (Zhu et al., 8 May 2025). This suggests that ManualCoT is useful because it externalizes intermediate state, while the exact surface form of the rationale is secondary to the continued availability of that state.
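The variable analogy can be made concrete with a toy example. The sketch below is not the paper's experimental setup: it simply models a multiplication rationale as a list of stored partial products and shows that editing one stored value changes the answer computed from the trace. The function names are hypothetical, and `b` is assumed to be a non-negative integer.

```python
def multiplication_trace(a, b):
    """Partial products of a * b, one per digit of b (least significant first)."""
    steps = []
    for place, digit_char in enumerate(reversed(str(b))):
        # Each entry plays the role of an intermediate value stored in the rationale.
        steps.append(a * int(digit_char) * 10 ** place)
    return steps


def answer_from_trace(steps):
    # The final answer reads only the stored intermediate values.
    return sum(steps)


steps = multiplication_trace(12, 34)
print(steps)                     # [48, 360]
print(answer_from_trace(steps))  # 408

# Intervening on one stored value propagates to the final answer.
steps[0] = 50
print(answer_from_trace(steps))  # 410
```

In this toy setting the wording around each partial product is irrelevant; only the stored values determine the result, which mirrors the claim that the surface form of the rationale is secondary.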

A complementary analysis characterizes CoT as a decoding-space pruner rather than a faithful teacher of new reasoning content (Yang et al., 28 Jul 2025). In that account, manual CoT exemplars provide reusable answer templates such as “So the answer is ...”, and template adherence correlates strongly with improved accuracy on GSM8K (Yang et al., 28 Jul 2025). The same paper reports lower entropy in answer-token distributions under CoT than under standard prompting and finds task-dependent FFN activation changes: in later layers, CoT reduces activation for open-domain tasks such as GSM8K and Bamboogle, but increases activation for closed-domain tasks such as Coin Flip, AQuA, and Sports (Yang et al., 28 Jul 2025). Together with survey discussions of faithfulness, efficiency, and the need to move beyond linear chains, these mechanistic results shift the research frontier from prompt craftsmanship alone toward questions of internal state, structural imitation, and controllable reasoning representations (Chu et al., 2023, Yu et al., 2023).
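Template adherence of the kind discussed above can be measured with a simple pattern match. The sketch below is only illustrative: the regular expression is a hypothetical stand-in for whatever closing phrase a given prompt uses, and the paper's own adherence metric may be defined differently.

```python
import re

# Hypothetical closing template; real prompts may phrase it differently.
ANSWER_PATTERN = re.compile(r"(?:So )?the answer is\s+(.+?)\.?\s*$", re.IGNORECASE)


def extract_answer(completion):
    """Return the text after the closing answer template, or None if it is absent."""
    match = ANSWER_PATTERN.search(completion.strip())
    return match.group(1) if match else None


def template_adherence(completions):
    """Fraction of completions that end with the answer template."""
    if not completions:
        return 0.0
    hits = sum(extract_answer(c) is not None for c in completions)
    return hits / len(completions)


outputs = [
    "Tom starts with 3 apples. 3 + 2 = 5. So the answer is 5.",
    "I am not sure how to solve this.",
]
print(extract_answer(outputs[0]))   # 5
print(template_adherence(outputs))  # 0.5
```

Reporting adherence alongside accuracy helps separate failures of reasoning from failures to produce a parseable final answer.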