Evolutionary Zero-shot CoT
- EoT is an evolutionary prompting technique that dynamically adapts chain-of-thought sequences using crossover and mutation for improved LLM reasoning.
- It constructs a prompt pool from fixed seeds and leverages guided rewriting to optimize instance-specific prompts across diverse tasks.
- Empirical results show EoT boosts accuracy by 1–4 points over standard zero-shot methods, rivaling few-shot approaches on multiple benchmarks.
Evolutionary Zero-shot Chain-of-Thought (EoT) is a prompting technique for LLMs that leverages evolutionary algorithms to optimize instance-specific chain-of-thought (CoT) prompts in a zero-shot setting. Distinct from conventional approaches that rely on a fixed reasoning prefix across all task inputs, EoT dynamically generates, evaluates, and deploys a population of prompt variants for each problem instance. This process employs evolutionary operators—including crossover and mutation—implemented by the LLM itself, culminating in the selection of the fittest prompt variant to guide both question rewriting and subsequent reasoning. Empirical results demonstrate that EoT surpasses standard zero-shot CoT prompting and rivals few-shot methods across arithmetic, commonsense, and symbolic reasoning tasks (Jin et al., 2024).
1. Formal Framework and Algorithmic Structure
In EoT, a chain-of-thought prompt is treated as a discrete genome, formalized as a token sequence $T = (t_1, t_2, \ldots, t_n)$, where $t_i$ denotes the $i$-th token. The evolutionary search commences with two fixed seed prompts:
- $T_1$: “Let’s think step by step.”
- $T_2$: “Let’s first understand the problem … then solve it step by step.”
The technique executes a single round of evolution via two primary operators:
- Crossover: The LLM receives both $T_1$ and $T_2$ and is prompted to generate a crossover prompt $T_c$, typically by interleaving clauses, restructuring syntax, or combining phrasings: $T_c \leftarrow \mathrm{LLM}(\mathrm{Crossover}(T_1, T_2))$.
- Mutation: The LLM subsequently mutates $T_c$ to yield $T_m$, introducing linguistic edits such as refinement or synonym substitution: $T_m \leftarrow \mathrm{LLM}(\mathrm{Mutate}(T_c))$. A sketch of both operators follows this list.
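Concretely, both operators reduce to plain meta-prompts issued to the model. The wording below is an assumption for illustration, not the paper's verbatim instructions:

```python
# Illustrative operator meta-prompts (assumed wording, not the paper's verbatim text).
CROSSOVER_PROMPT = (
    "Combine the following two instructions into one new instruction that "
    "preserves the intent of both:\n1. {t1}\n2. {t2}"
)
MUTATION_PROMPT = (
    "Rewrite the following instruction with small linguistic edits "
    "(refinement, synonym substitution) while keeping its meaning:\n{t}"
)
```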
The resulting population is $P = \{T_1, T_2, T_c, T_m\}$. For each $T \in P$, the process proceeds with optional question rewriting and answer generation. The pipeline, incorporating prompt generation, evaluation, and guided reasoning, is concisely represented by the following pseudocode:
```
Input: Q (question), (T₁, T₂) (seeds)

Stage 1: Generate prompt pool
  T_c ← LLM("Crossover T₁ and T₂")
  T_m ← LLM("Mutate T_c")
  P ← [T₁, T₂, T_c, T_m]

Stage 2: Evaluate/select
  For T in P:
    Optionally rewrite Q under T to obtain R_T(Q)
    A_T ← LLM("Reason on R_T(Q), show rationales and answer")
    Compute f(T) (0/1 accuracy)
  T_o ← argmax_{T ∈ P} f(T)

Stage 3: Final reasoning
  R ← Rewrite(Q, T_o)
  (A*, C*) ← LLM("Using T_o on R, reason step by step and provide answer")

Return: A*
```
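For concreteness, the pipeline can be rendered as executable Python. This is a minimal sketch, not the authors' implementation: the `llm(prompt: str) -> str` completion helper, all instruction wordings, and the last-line answer extractor are assumptions, while the gold-answer fitness check follows the 0/1 criterion described in the next section:

```python
from typing import Callable, List

def extract_answer(completion: str) -> str:
    """Naive extractor: take the last non-empty line as the answer.
    Real evaluations use dataset-specific parsing (e.g., a number regex)."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def eot_answer(
    llm: Callable[[str], str],  # hypothetical completion helper (assumption)
    question: str,
    gold_answer: str,           # ground-truth answer used for 0/1 fitness
    t1: str = "Let's think step by step.",
    t2: str = "Let's first understand the problem … then solve it step by step.",
) -> str:
    # Stage 1: build the prompt pool via LLM-implemented crossover and mutation.
    t_c = llm(f"Combine these two instructions into one new instruction:\n1. {t1}\n2. {t2}")
    t_m = llm(f"Rewrite this instruction with small linguistic edits, keeping its meaning:\n{t_c}")
    pool: List[str] = [t1, t2, t_c, t_m]

    def rewrite(q: str, t: str) -> str:
        # Guided rewriting: weave the candidate instruction into the question.
        return llm(f'Rewrite the question to suit this instruction: "{t}"\nQuestion: {q}')

    # Stage 2: score each candidate by 0/1 accuracy; keep the fittest.
    def fitness(t: str) -> int:
        response = llm(f"{rewrite(question, t)}\n{t}\n"
                       "Show your reasoning, then put the answer on the last line.")
        return int(extract_answer(response) == gold_answer)

    t_o = max(pool, key=fitness)  # ties resolved arbitrarily, as in the method

    # Stage 3: guided rewriting with the winning prompt, then final reasoning.
    r = rewrite(question, t_o)
    final = llm(f"{r}\n{t_o}\nReason step by step and put the final answer on the last line.")
    return extract_answer(final)
```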
2. Fitness Evaluation and Selection Mechanisms
EoT employs the model itself to quantify prompt fitness. For each candidate $T$, the pair $(R_T(Q), T)$ is submitted to the LLM, which outputs an answer $A_T$. Fitness is computed as binary accuracy against the ground-truth answer $A_{\mathrm{gt}}$:

$$f(T) = \mathbb{1}\left[A_T = A_{\mathrm{gt}}\right]$$

In self-consistency mode, the model generates $k$ independent chains per $T$, and $A_T$ is determined via majority vote over the $k$ sampled answers $A_T^{(1)}, \ldots, A_T^{(k)}$:

$$A_T = \operatorname*{arg\,max}_{a} \sum_{i=1}^{k} \mathbb{1}\left[A_T^{(i)} = a\right]$$
Selection is straightforward due to the small pool size ($|P| = 4$): the prompt with maximal fitness is chosen, $T_o = \operatorname*{arg\,max}_{T \in P} f(T)$. In case of ties, arbitrary resolution or additional LLM-based selection is employed.
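A compact sketch of the self-consistency variant of this fitness computation, again assuming a hypothetical sampling-mode `llm` helper; the chain count `k` is a placeholder, not the paper's reported value:

```python
from collections import Counter
from typing import Callable

def self_consistency_fitness(
    llm: Callable[[str], str],  # hypothetical sampling-mode completion helper
    reasoning_prompt: str,      # candidate prompt T applied to R_T(Q)
    a_gt: str,                  # gold answer for the 0/1 criterion
    k: int = 5,                 # number of sampled chains (placeholder value)
) -> int:
    # Sample k independent chains; take the last line of each as the answer.
    answers = [llm(reasoning_prompt).strip().splitlines()[-1] for _ in range(k)]
    # Majority vote determines A_T; Counter breaks ties by insertion order.
    a_t, _ = Counter(answers).most_common(1)[0]
    return int(a_t == a_gt)  # f(T) in {0, 1}
```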
3. Guided Rewriting Operation
Once the optimal prompt $T_o$ is selected, the original question $Q$ is transformed through guided rewriting. The operation prepends or weaves $T_o$ into $Q$, yielding a reformulated input:

$$R = \mathrm{Rewrite}(Q, T_o)$$

This rewritten input, coupled with the selected prompt, is then used to invoke the LLM for chain-of-thought generation, producing a rationale $C^{\ast}$ and the final answer $A^{\ast}$. The probabilistic answer generation can be characterized as:

$$(C^{\ast}, A^{\ast}) \sim p_{\mathrm{LLM}}(C, A \mid R, T_o)$$
This step is empirically critical, as ablation analyses indicate that omitting rewriting reduces performance by 0.5–1.3 points.
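A minimal sketch of the rewriting call, assuming the same hypothetical `llm` helper; the instruction template is illustrative, not the paper's verbatim prompt:

```python
from typing import Callable

# Illustrative rewrite template (assumed wording).
REWRITE_TEMPLATE = (
    'Rewrite the following question so that it can be solved by following '
    'this instruction: "{t_o}"\nQuestion: {q}\nRewritten question:'
)

def guided_rewrite(llm: Callable[[str], str], q: str, t_o: str) -> str:
    # R = Rewrite(Q, T_o): weave the selected prompt into the question.
    return llm(REWRITE_TEMPLATE.format(t_o=t_o, q=q))
```

The final answer is then sampled by conditioning the model on $(R, T_o)$, as in Stage 3 of the pipeline sketch above.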
4. Experimental Protocols and Benchmarks
EoT's evaluation employs ten benchmarks across three reasoning categories:
- Arithmetic: SingleEq, AddSub, GSM8K, MultiArith, SVAMP, AQuA
- Commonsense: CommonsenseQA (CSQA), StrategyQA
- Symbolic: Last Letter Concatenation, Coin Flip
Key experimental parameters include:
- LLMs: GPT-3.5-turbo-0613 (greedy decoding, temperature 0); GPT-4 (ablation on AQuA, AddSub, SVAMP)
- Prompt pool size: 4 (two seeds, one crossover, one mutation)
- Self-consistency: $k$ sampled chains per prompt at nonzero sampling temperature, aggregated by majority vote
- Metric: accuracy (0/1)
- Baselines: zero-shot CoT, Plan-and-Solve (PS/PS+), RE2 (“Read the question again + think step by step”), few-shot manual CoT, and automatic (cluster-based) few-shot CoT
5. Empirical Findings
EoT demonstrates consistent accuracy improvements relative to baselines. Representative results:
| Category / Dataset | Baseline | EoT Acc. (%) | Gain |
|---|---|---|---|
| Arithmetic (avg) | Zero-shot CoT | 83.5 | +2.8 pts |
| Arithmetic (avg) | PS+ (81.2) | 83.5 | +2.3 pts |
| Arithmetic (avg) | RE2 (82.3) | 83.5 | +1.2 pts |
| CSQA (commonsense) | RE2 | 72.7 | +1.2 pts |
| StrategyQA (commonsense) | | 69.9 | |
| Last Letter Concat. (symbolic) | | 76.8 | +2.5 pts |
| Coin Flip (symbolic) | | 98.9 | +1.2 pts |
| AQuA (GPT-4 ablation) | | 75.9 | +2.0 pts |
| AddSub (GPT-4 ablation) | | 97.7 | +1.5 pts |
| SVAMP (GPT-4 ablation) | | 92.7 | +2.6 pts |
Ablation studies reveal that omitting crossover costs 2.3–4.4 points, omitting mutation 3.3–6.3 points, and omitting rewriting 0.5–1.3 points. Accuracy improves monotonically as the pool size grows from 2 to 16. Careful engineering of the seed prompts affords further gains of up to 1.9 points.
Combining EoT with self-consistency adds a further 2–3 points over self-consistency baselines.
6. Analysis, Limitations, and Future Directions
EoT's principal advantage derives from its per-instance prompt adaptation, sidestepping the deficits of applying a uniform reasoning prefix to every input. A single round of evolutionary search is both efficient and effective; extending to multiple rounds or larger populations may offer further improvements at increased computational cost. The guided rewriting step is especially impactful, focusing the LLM's attention on the task structure and the tailored instruction.
Current limitations include the restriction to crossover and mutation operators; untested variants such as differential evolution could provide additional improvements. The reliance on pure accuracy as the selection metric leaves unexplored alternatives such as likelihood or calibration-based fitness. The framework was primarily tested using closed-source LLM APIs (GPT-3.5-turbo, GPT-4); generality to open-source models remains to be assessed. Joint evolution of answer-extraction cues (e.g., “Therefore, the answer is…”) is another prospective axis.
EoT represents a lightweight evolutionary wrapper for CoT prompting, configuring a compact, instance-tailored prompt pool, and utilizing LLMs for both prompt generation and evaluation. Across arithmetic, commonsense, and symbolic reasoning, EoT consistently outperforms fixed zero-shot approaches and is competitive with few-shot methods, illustrating the effectiveness of evolutionary search mediated by LLMs' generative and evaluative capabilities (Jin et al., 2024).