
Evolutionary Zero-shot CoT

Updated 17 March 2026
  • EoT is an evolutionary prompting technique that dynamically adapts chain-of-thought sequences using crossover and mutation for improved LLM reasoning.
  • It constructs a prompt pool from fixed seeds and leverages guided rewriting to optimize instance-specific prompts across diverse tasks.
  • Empirical results show EoT boosts accuracy by 1–4 points compared to standard zero-shot methods, rivaling few-shot approaches in multiple benchmarks.

Evolutionary Zero-shot Chain-of-Thought (EoT) is a prompting technique for LLMs that leverages evolutionary algorithms to optimize instance-specific chain-of-thought (CoT) prompts in a zero-shot setting. Distinct from conventional approaches that rely on a fixed reasoning prefix across all task inputs, EoT dynamically generates, evaluates, and deploys a population of prompt variants for each problem instance. This process employs evolutionary operators—including crossover and mutation—implemented by the LLM itself, culminating in the selection of the fittest prompt variant to guide both question rewriting and subsequent reasoning. Empirical results demonstrate that EoT surpasses standard zero-shot CoT prompting and rivals few-shot methods across arithmetic, commonsense, and symbolic reasoning tasks (Jin et al., 2024).

1. Formal Framework and Algorithmic Structure

In EoT, a chain-of-thought prompt is treated as a discrete genome, formalized as a sequence of tokens T = [w_1, w_2, \ldots, w_L], where w_i denotes the i-th token. The evolutionary search commences with two fixed seed prompts:

  • T_1: “Let’s think step by step.”
  • T_2: “Let’s first understand the problem … then solve it step by step.”

The technique executes a single round of evolution via two primary operators:

  • Crossover: The LLM receives both T_1 and T_2 and is prompted to generate a crossover prompt T_c, typically by interleaving clauses, restructuring syntax, or combining phrasings:

T_c = \text{LLM-Crossover}(T_1, T_2)

  • Mutation: The LLM subsequently mutates T_c to yield T_m, introducing linguistic edits such as refinement or synonym substitution:

T_m = \text{LLM-Mutation}(T_c)

The resulting population is P = \{T_1, T_2, T_c, T_m\}. For each T \in P, the process proceeds with optional question rewriting and answer generation. The pipeline, incorporating prompt generation, evaluation, and guided reasoning, is concisely represented by the following pseudocode:

Input: Q (question), (T₁, T₂) (seeds)
Stage 1: Generate prompt pool
  T_c ← LLM("Crossover T₁ and T₂")
  T_m ← LLM("Mutate T_c")
  P ← [T₁, T₂, T_c, T_m]
Stage 2: Evaluate/select
  For T in P:
    Optionally rewrite Q under T to obtain R_T(Q)
    A_T ← LLM("Reason on R_T(Q), show rationales and answer")
    Compute f(T) (0/1 accuracy)
  T_o ← argmax_{T ∈ P} f(T)
Stage 3: Final reasoning
  R ← Rewrite(Q, T_o)
  (A*, C*) ← LLM("Using T_o on R, reason step by step and provide answer")
Return: A*
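The three stages above can be sketched in runnable Python. The `llm` and `score` callables and the exact operator wordings are illustrative placeholders, not the paper's verbatim prompts:

```python
def eot_pipeline(question, llm, score,
                 seeds=("Let's think step by step.",
                        "Let's first understand the problem, then solve it step by step.")):
    """One round of Evolutionary Zero-shot CoT: crossover, mutation,
    fitness-based selection, then guided rewriting and final reasoning.
    `llm` maps a prompt string to a response string; `score` maps a
    model answer to a 0/1 fitness value."""
    t1, t2 = seeds
    # Stage 1: build the prompt pool via LLM-implemented evolutionary operators.
    t_c = llm(f"Combine these two instructions into one (crossover):\n1. {t1}\n2. {t2}")
    t_m = llm(f"Rewrite this instruction with small linguistic edits (mutation):\n{t_c}")
    pool = [t1, t2, t_c, t_m]
    # Stage 2: evaluate each candidate prompt on this instance; select the fittest.
    fitness = {}
    for t in pool:
        rewritten = llm(f"{t}\nNow rewrite the question clearly using these instructions:\n{question}")
        answer = llm(f"{t}\n{rewritten}\nReason step by step and give the final answer.")
        fitness[t] = score(answer)
    t_opt = max(pool, key=fitness.get)
    # Stage 3: final guided rewriting and reasoning with the selected prompt.
    rewritten = llm(f"{t_opt}\nNow rewrite the question clearly using these instructions:\n{question}")
    return llm(f"{t_opt}\n{rewritten}\nReason step by step and give the final answer.")
```

In practice `score` requires a ground-truth answer for the instance, which is why fitness evaluation is the costliest part of the pipeline.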

2. Fitness Evaluation and Selection Mechanisms

EoT employs the model itself to quantify prompt fitness. For each candidate T, the pair (T, Q) is submitted to the LLM, which outputs an answer A_T. Fitness is computed as binary accuracy, f(T) = \mathbb{1}_{A_T = A_{\text{true}}}, against the ground-truth answer. In self-consistency mode, the model generates N independent chains per (T, Q), and f(T) scores the majority-vote answer over the N samples:

f(T) = \mathbb{1}_{\text{majority-vote}_{i=1,\ldots,N}(A_{T,i}) = A_{\text{true}}}

Selection is straightforward due to the small pool size (|P| = 4): the prompt T_o with maximal f(T) is chosen. In case of ties, arbitrary resolution or additional LLM-based selection is employed.
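A minimal sketch of the self-consistency fitness computation, assuming a hypothetical `sample_answer` callable that draws one independent reasoning chain (and extracted answer) per call at nonzero temperature:

```python
from collections import Counter

def self_consistency_fitness(sample_answer, prompt, question, true_answer, n=16):
    """Fitness of a prompt under self-consistency: draw n independent
    answers, take the majority vote, and score 0/1 against ground truth."""
    answers = [sample_answer(prompt, question) for _ in range(n)]
    voted, _count = Counter(answers).most_common(1)[0]
    return int(voted == true_answer)
```

With n = 1 and greedy decoding this reduces to the plain binary-accuracy fitness described above.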

3. Guided Rewriting Operation

Once the optimal prompt T_o is selected, the original question Q is transformed through guided rewriting. The operation prepends or weaves T_o into Q, yielding a reformulated input:

R(Q) = T_o \parallel \text{“Now rewrite the question clearly using these instructions:”} \parallel Q

This rewritten input, coupled with the selected prompt, is then used to invoke the LLM to generate a chain of thought C and the final answer A. The probabilistic answer generation can be characterized as:

P(A \mid T_o, R(Q)) = \sum_C P(A \mid C, T_o, R(Q)) \cdot P(C \mid T_o, R(Q))

This step is empirically critical, as ablation analyses indicate that omitting rewriting reduces performance by 0.5–1.3 points.
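The concatenation R(Q) amounts to a simple string composition; a sketch follows, where the connective phrasing mirrors the formula above and the function name is illustrative:

```python
def guided_rewrite_prompt(t_opt, question):
    """Compose the guided-rewriting input R(Q): the selected prompt T_o,
    a fixed rewriting instruction, and the original question Q,
    joined in that order."""
    return (f"{t_opt}\n"
            f"Now rewrite the question clearly using these instructions:\n"
            f"{question}")
```

The LLM's response to this input (the rewritten question) is then fed, together with T_o, into the final reasoning call.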

4. Experimental Protocols and Benchmarks

EoT's evaluation employs ten benchmarks across three reasoning categories:

  • Arithmetic: SingleEq, AddSub, GSM8K, MultiArith, SVAMP, AQuA
  • Commonsense: CommonsenseQA (CSQA), StrategyQA
  • Symbolic: Last Letter Concatenation, Coin Flip

Key experimental parameters include:

  • LLMs: GPT-3.5-turbo-0613 (greedy decoding, T = 0); GPT-4 (ablations on AQuA, AddSub, SVAMP)
  • Prompt pool size: |P| = 4 (two seeds, one crossover, one mutation)
  • Self-consistency: N = 16 samples, temperature T = 0.7
  • Metric: accuracy (0/1)
  • Baselines: zero-shot CoT, Plan-and-Solve (PS/PS+), RE2 (“Read the question again + think step by step”), few-shot manual CoT, and automatic (cluster-based) few-shot CoT
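For reference, the parameters above can be collected into a single configuration object (a sketch; the key names are my own, not from the paper):

```python
# Experimental setup from the parameter list above, as one config dict.
EOT_CONFIG = {
    "models": {
        "main": {"name": "gpt-3.5-turbo-0613", "temperature": 0.0},  # greedy decoding
        "ablation": {"name": "gpt-4", "datasets": ["AQuA", "AddSub", "SVAMP"]},
    },
    "prompt_pool_size": 4,  # two seeds + one crossover + one mutation
    "self_consistency": {"n_samples": 16, "temperature": 0.7},
    "metric": "accuracy",   # 0/1 per instance
    "benchmarks": {
        "arithmetic": ["SingleEq", "AddSub", "GSM8K", "MultiArith", "SVAMP", "AQuA"],
        "commonsense": ["CSQA", "StrategyQA"],
        "symbolic": ["Last Letter Concatenation", "Coin Flip"],
    },
}
```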

5. Empirical Findings

EoT demonstrates consistent accuracy improvements relative to baselines. Representative results:

| Dataset type | Baseline | EoT Acc. (%) | Comparison |
|---|---|---|---|
| Arithmetic (avg) | Zero-shot CoT | 83.5 | +2.8 pts |
| Arithmetic (avg) | PS+ | 81.2 | +2.3 pts |
| Arithmetic (avg) | RE2 | 82.3 | +1.2 pts |
| Commonsense (CSQA) | RE2 | 72.7 | +1.2 pts |
| Commonsense (StrategyQA) | | 69.9 | |
| Symbolic (Last Letter / Coin Flip) | | 76.8 / 98.9 | +2.5 / +1.2 pts |
| GPT-4 ablations (AQuA / AddSub / SVAMP) | | 75.9 / 97.7 / 92.7 | +2.0 / +1.5 / +2.6 pts |

Ablation studies reveal that omitting crossover costs 2.3–4.4 points, omitting mutation 3.3–6.3 points, and omitting rewriting 0.5–1.3 points. Increasing the pool size from 2 to 16 monotonically improves accuracy. Careful engineering of the seed prompts affords further gains of up to 1.9 points.

Self-consistency combined with EoT augments accuracy by an additional 2–3 points relative to baselines with self-consistency.

6. Analysis, Limitations, and Future Directions

EoT's principal advantage derives from its per-instance prompt adaptation, sidestepping deficits of using a uniform reasoning prefix for all tasks. Single-round evolutionary search is both efficient and effective; extending to multiple rounds or larger populations may offer further improvements at increased computational cost. The guided rewriting step is especially impactful, focusing the LLM’s attention on task structure and the tailored instruction.

Current limitations include the restriction to crossover and mutation operators; untested variants such as differential evolution could provide additional improvements. The reliance on pure accuracy as the selection metric leaves unexplored alternatives such as likelihood or calibration-based fitness. The framework was primarily tested using closed-source LLM APIs (GPT-3.5-turbo, GPT-4); generality to open-source models remains to be assessed. Joint evolution of answer-extraction cues (e.g., “Therefore, the answer is…”) is another prospective axis.

EoT represents a lightweight evolutionary wrapper for CoT prompting, configuring a compact, instance-tailored prompt pool, and utilizing LLMs for both prompt generation and evaluation. Across arithmetic, commonsense, and symbolic reasoning, EoT consistently outperforms fixed zero-shot approaches and is competitive with few-shot methods, illustrating the effectiveness of evolutionary search mediated by LLMs' generative and evaluative capabilities (Jin et al., 2024).

References (1)
