
Evolutionary Zero-shot CoT

Updated 17 March 2026
  • EoT is an evolutionary prompting technique that dynamically adapts chain-of-thought sequences using crossover and mutation for improved LLM reasoning.
  • It constructs a prompt pool from fixed seeds and leverages guided rewriting to optimize instance-specific prompts across diverse tasks.
  • Empirical results show EoT boosts accuracy by 1–4 points compared to standard zero-shot methods, rivaling few-shot approaches in multiple benchmarks.

Evolutionary Zero-shot Chain-of-Thought (EoT) is a prompting technique for LLMs that leverages evolutionary algorithms to optimize instance-specific chain-of-thought (CoT) prompts in a zero-shot setting. Distinct from conventional approaches that rely on a fixed reasoning prefix across all task inputs, EoT dynamically generates, evaluates, and deploys a population of prompt variants for each problem instance. This process employs evolutionary operators—including crossover and mutation—implemented by the LLM itself, culminating in the selection of the fittest prompt variant to guide both question rewriting and subsequent reasoning. Empirical results demonstrate that EoT surpasses standard zero-shot CoT prompting and rivals few-shot methods across arithmetic, commonsense, and symbolic reasoning tasks (Jin et al., 2024).

1. Formal Framework and Algorithmic Structure

In EoT, a chain-of-thought prompt is treated as a discrete genome, formalized as a sequence of tokens $T = [w_1, w_2, \ldots, w_L]$, where $w_i$ denotes the $i$-th token. The evolutionary search commences with two fixed seed prompts:

  • $T_1$: “Let’s think step by step.”
  • $T_2$: “Let’s first understand the problem … then solve it step by step.”

The technique executes a single round of evolution via two primary operators:

  • Crossover: The LLM receives both $T_1$ and $T_2$ and is prompted to generate a crossover prompt $T_c$, typically by interleaving clauses, restructuring syntax, or combining phrasings:

$$T_c = \text{LLM-Crossover}(T_1, T_2)$$

  • Mutation: The LLM subsequently mutates $T_c$ to yield $T_m$, introducing linguistic edits such as refinement or synonym substitution:

$$T_m = \text{LLM-Mutation}(T_c)$$

The resulting population is $\mathcal{P} = \{T_1, T_2, T_c, T_m\}$. For each $T_i \in \mathcal{P}$, the process proceeds with optional question rewriting and answer generation. The pipeline, incorporating prompt generation, evaluation, and guided reasoning, is concisely represented by the following pseudocode:

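    Input: question Q, seed prompts T_1 and T_2
    T_c ← LLM-Crossover(T_1, T_2)
    T_m ← LLM-Mutation(T_c)
    P ← {T_1, T_2, T_c, T_m}
    for each T_i in P:
        A_i ← LLM(T_i, Q)
        f(T_i) ← 1 if A_i matches the ground-truth answer, else 0
    T* ← the T_i in P with maximal f(T_i)
    Q' ← LLM-Rewrite(Q, T*)
    return LLM(Q', T*)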
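The crossover and mutation operators can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a completion, and the instruction wording inside `crossover` and `mutate` is an assumption, since the exact operator prompts are not reproduced here.

```python
from typing import Callable, List

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]

# The two fixed seed prompts, copied as given above (ellipsis included).
SEED_1 = "Let’s think step by step."
SEED_2 = "Let’s first understand the problem … then solve it step by step."


def crossover(llm: LLM, t1: str, t2: str) -> str:
    """Ask the LLM to merge two prompts into one (wording is an assumption)."""
    instruction = (
        "Combine the two prompts below into one new prompt that keeps "
        "the key ideas of both.\n"
        f"Prompt 1: {t1}\nPrompt 2: {t2}\nNew prompt:"
    )
    return llm(instruction).strip()


def mutate(llm: LLM, t_c: str) -> str:
    """Ask the LLM for a lightly edited variant (wording is an assumption)."""
    instruction = (
        "Rewrite the prompt below with small wording changes while "
        "keeping its meaning.\n"
        f"Prompt: {t_c}\nRewritten prompt:"
    )
    return llm(instruction).strip()


def build_population(llm: LLM, t1: str = SEED_1, t2: str = SEED_2) -> List[str]:
    """One round of evolution: two seeds, one crossover, one mutation."""
    t_c = crossover(llm, t1, t2)
    t_m = mutate(llm, t_c)
    return [t1, t2, t_c, t_m]
```

Passing the model as a plain callable keeps the sketch independent of any particular API client; any function that takes a string and returns a string can stand in for `llm`.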

2. Fitness Evaluation and Selection Mechanisms

EoT employs the model itself to quantify prompt fitness. For each candidate $T_i \in \mathcal{P}$, the pair $(T_i, Q)$, where $Q$ is the input question, is submitted to the LLM, which outputs an answer $\hat{A}_i$. Fitness is computed as binary accuracy, $f(T_i) = \mathbb{1}[\hat{A}_i = A]$, against the ground-truth answer $A$. In self-consistency mode, the model generates $N$ independent chains per $T_i$, and $\hat{A}_i$ is determined via majority vote over the $N$ samples:

$$\hat{A}_i = \arg\max_{a} \sum_{n=1}^{N} \mathbb{1}\big[a_i^{(n)} = a\big]$$

Here $a_i^{(n)}$ is the answer extracted from the $n$-th chain sampled with $T_i$.

Selection is straightforward due to the small pool size ($|\mathcal{P}| = 4$): the prompt $T^{*}$ with maximal $f(T_i)$ is chosen. Ties are resolved arbitrarily or by an additional LLM-based selection step.
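The voting, scoring, and selection steps described above can be illustrated with a few pure functions. This is a sketch under stated assumptions: answers are compared by exact string match after trimming whitespace, ties are broken in favor of the first candidate seen (one possible "arbitrary" rule), and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def fitness(predicted: str, gold: str) -> int:
    """Binary accuracy: 1 if the prediction matches the reference, else 0."""
    return int(predicted.strip() == gold.strip())


def select_prompt(samples: Dict[str, List[str]], gold: str) -> str:
    """Pick the prompt whose voted answer scores highest.

    `samples` maps each candidate prompt to the answers extracted from its
    sampled chains. Ties go to the prompt inserted first (an assumption).
    """
    scores = {p: fitness(majority_vote(a), gold) for p, a in samples.items()}
    return max(scores, key=scores.get)


# Usage with made-up answers for a pool of four prompts.
samples = {
    "T1": ["7", "8", "7"],
    "T2": ["9", "9", "8"],
    "Tc": ["8", "8", "7"],
    "Tm": ["8", "7", "8"],
}
best = select_prompt(samples, gold="8")  # "Tc": first prompt whose vote is "8"
```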

3. Guided Rewriting Operation

Once the optimal prompt $T^{*}$ is selected, the original question $Q$ is transformed through guided rewriting. The operation prepends or weaves $T^{*}$ into $Q$, yielding a reformulated input:

$$Q' = \text{LLM-Rewrite}(Q, T^{*})$$

This rewritten input, coupled with the selected prompt, is then used to invoke the LLM for chain-of-thought generation $C$ and the final answer $\hat{A}$. The probabilistic answer generation can be characterized as:

$$P(C, \hat{A} \mid Q', T^{*}) = P(C \mid Q', T^{*}) \, P(\hat{A} \mid C, Q', T^{*})$$

This step contributes measurably to accuracy: ablation analyses indicate that omitting rewriting reduces performance by 0.5–1.3 points.
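A minimal sketch of the rewrite-then-reason step is shown below. It assumes a hypothetical `llm` callable (prompt string in, completion string out); the rewriting instruction and the `Q:`/`A:` layout are illustrative assumptions rather than the prompts used in the paper, and answer extraction from the returned chain is left out.

```python
from typing import Callable, Tuple

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def rewrite_question(llm: LLM, question: str, best_prompt: str) -> str:
    """Guided rewriting: reformulate the question under the selected prompt."""
    instruction = (
        "Rewrite the question below so that it is clear and complete, "
        "following this guidance.\n"
        f"Guidance: {best_prompt}\nQuestion: {question}\nRewritten question:"
    )
    return llm(instruction).strip()


def solve(llm: LLM, question: str, best_prompt: str) -> Tuple[str, str]:
    """Return the rewritten question and the chain of thought generated for it."""
    rewritten = rewrite_question(llm, question, best_prompt)
    # The selected prompt is placed where a zero-shot CoT trigger usually goes.
    chain = llm(f"Q: {rewritten}\nA: {best_prompt}").strip()
    return rewritten, chain
```

The function makes two model calls per question: one to rewrite, one to reason over the rewritten input with the selected prompt.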

4. Experimental Protocols and Benchmarks

EoT's evaluation employs ten benchmarks across three reasoning categories:

  • Arithmetic: SingleEq, AddSub, GSM8K, MultiArith, SVAMP, AQuA
  • Commonsense: CommonsenseQA (CSQA), StrategyQA
  • Symbolic: Last Letter Concatenation, Coin Flip

Key experimental parameters include:

  • LLMs: GPT-3.5-turbo-0613 (greedy decoding, temperature $0$); GPT-4 (ablation on AQuA, AddSub, SVAMP)
  • Prompt pool size: $|\mathcal{P}| = 4$ (two seeds, one crossover, one mutation)
  • Self-consistency: $N$ sampled chains per prompt, drawn at a nonzero temperature
  • Metric: accuracy (0/1)
  • Baselines: zero-shot CoT, Plan-and-Solve (PS/PS+), RE2 (“Read the question again + think step by step”), few-shot manual CoT, and automatic (cluster-based) few-shot CoT

5. Empirical Findings

EoT demonstrates consistent accuracy improvements relative to baselines. Representative results:

| Dataset Type | Baseline/Metric | EoT Acc. (%) | Comparison |
|---|---|---|---|
| Arithmetic (avg) | Zero-shot CoT | 83.5 | +2.8 pts |
| | PS+ | 81.2 | +2.3 pts |
| | RE2 | 82.3 | +1.2 pts |
| Commonsense | CSQA | 72.7 | +1.2 pts (RE2) |
| | StrategyQA | 69.9 | |
| Symbolic | Last Letter Concatenation / Coin Flip | 76.8 / 98.9 | +2.5 / +1.2 pts |
| GPT-4 ablations | AQuA / AddSub / SVAMP | 75.9 / 97.7 / 92.7 | +2.0 / +1.5 / +2.6 pts |

Ablation studies reveal that omitting crossover costs 2.3–4.4 points, omitting mutation costs 3.3–6.3 points, and omitting rewriting costs 0.5–1.3 points. Accuracy improves monotonically as the pool size grows from 2 to 16. Carefully engineered seed prompts yield further gains of up to 1.9 points.

Self-consistency combined with EoT augments accuracy by an additional 2–3 points relative to baselines with self-consistency.

6. Analysis, Limitations, and Future Directions

EoT's principal advantage derives from its per-instance prompt adaptation, which avoids the shortcomings of applying one uniform reasoning prefix to every input. Single-round evolutionary search is both efficient and effective; extending to multiple rounds or larger populations may offer further improvements at increased computational cost. The guided rewriting step also helps by focusing the LLM’s attention on task structure and the tailored instruction.

Current limitations include the restriction to crossover and mutation operators; untested variants such as differential evolution could provide additional improvements. The reliance on pure accuracy as the selection metric leaves unexplored alternatives such as likelihood or calibration-based fitness. The framework was primarily tested using closed-source LLM APIs (GPT-3.5-turbo, GPT-4); generality to open-source models remains to be assessed. Joint evolution of answer-extraction cues (e.g., “Therefore, the answer is…”) is another prospective axis.

EoT represents a lightweight evolutionary wrapper for CoT prompting, configuring a compact, instance-tailored prompt pool, and utilizing LLMs for both prompt generation and evaluation. Across arithmetic, commonsense, and symbolic reasoning, EoT consistently outperforms fixed zero-shot approaches and is competitive with few-shot methods, illustrating the effectiveness of evolutionary search mediated by LLMs' generative and evaluative capabilities (Jin et al., 2024).

References (1)
