Evolutionary Zero-shot CoT
- EoT is an evolutionary prompting technique that dynamically adapts chain-of-thought sequences using crossover and mutation for improved LLM reasoning.
- It constructs a prompt pool from fixed seeds and leverages guided rewriting to optimize instance-specific prompts across diverse tasks.
- Empirical results show EoT boosts accuracy by 1–4 points over standard zero-shot methods, rivaling few-shot approaches on multiple benchmarks.
Evolutionary Zero-shot Chain-of-Thought (EoT) is a prompting technique for LLMs that leverages evolutionary algorithms to optimize instance-specific chain-of-thought (CoT) prompts in a zero-shot setting. Distinct from conventional approaches that rely on a fixed reasoning prefix across all task inputs, EoT dynamically generates, evaluates, and deploys a population of prompt variants for each problem instance. This process employs evolutionary operators—including crossover and mutation—implemented by the LLM itself, culminating in the selection of the fittest prompt variant to guide both question rewriting and subsequent reasoning. Empirical results demonstrate that EoT surpasses standard zero-shot CoT prompting and rivals few-shot methods across arithmetic, commonsense, and symbolic reasoning tasks (Jin et al., 2024).
1. Formal Framework and Algorithmic Structure
In EoT, a chain-of-thought prompt is treated as a discrete genome, formalized as a token sequence $T = (t_1, t_2, \ldots, t_n)$, where $t_i$ denotes the $i$-th token. The evolutionary search commences with two fixed seed prompts:
- $T_1$: “Let’s think step by step.”
- $T_2$: “Let’s first understand the problem … then solve it step by step.”
The technique executes a single round of evolution via two primary operators:
- Crossover: The LLM receives both $T_1$ and $T_2$ and is prompted to generate a crossover prompt $T_c$, typically by interleaving clauses, restructuring syntax, or combining phrasings: $T_c \leftarrow \mathrm{LLM}(\mathrm{Crossover}(T_1, T_2))$.
- Mutation: The LLM subsequently mutates $T_c$ to yield $T_m$, introducing linguistic edits such as refinement or synonym substitution: $T_m \leftarrow \mathrm{LLM}(\mathrm{Mutate}(T_c))$. A sketch of both operators follows this list.
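Concretely, both operators reduce to plain meta-prompts issued to the model. The wording below is an assumption for illustration, not the paper's verbatim instructions:

```python
# Illustrative operator meta-prompts (assumed wording, not the paper's verbatim text).
CROSSOVER_PROMPT = (
    "Combine the following two instructions into one new instruction that "
    "preserves the intent of both:\n1. {t1}\n2. {t2}"
)
MUTATION_PROMPT = (
    "Rewrite the following instruction with small linguistic edits "
    "(refinement, synonym substitution) while keeping its meaning:\n{t}"
)
```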
The resulting population is $P = \{T_1, T_2, T_c, T_m\}$. For each $T \in P$, the process proceeds with optional question rewriting and answer generation. The pipeline, incorporating prompt generation, evaluation, and guided reasoning, is concisely represented by the following pseudocode:
```
Input: Q (question), (T₁, T₂) (seeds)

Stage 1: Generate prompt pool
  T_c ← LLM("Crossover T₁ and T₂")
  T_m ← LLM("Mutate T_c")
  P ← [T₁, T₂, T_c, T_m]

Stage 2: Evaluate/select
  For T in P:
    Optionally rewrite Q under T to obtain R_T(Q)
    A_T ← LLM("Reason on R_T(Q), show rationales and answer")
    Compute f(T) (0/1 accuracy)
  T_o ← argmax_{T ∈ P} f(T)

Stage 3: Final reasoning
  R ← Rewrite(Q, T_o)
  (A*, C*) ← LLM("Using T_o on R, reason step by step and provide answer")

Return: A*
```
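For concreteness, the pipeline can be rendered as executable Python. This is a minimal sketch, not the authors' implementation: the `llm(prompt: str) -> str` completion helper, all instruction wordings, and the last-line answer extractor are assumptions, while the gold-answer fitness check follows the 0/1 criterion described in the next section:

```python
from typing import Callable, List

def extract_answer(completion: str) -> str:
    """Naive extractor: take the last non-empty line as the answer.
    Real evaluations use dataset-specific parsing (e.g., a number regex)."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def eot_answer(
    llm: Callable[[str], str],  # hypothetical completion helper (assumption)
    question: str,
    gold_answer: str,           # ground-truth answer used for 0/1 fitness
    t1: str = "Let's think step by step.",
    t2: str = "Let's first understand the problem … then solve it step by step.",
) -> str:
    # Stage 1: build the prompt pool via LLM-implemented crossover and mutation.
    t_c = llm(f"Combine these two instructions into one new instruction:\n1. {t1}\n2. {t2}")
    t_m = llm(f"Rewrite this instruction with small linguistic edits, keeping its meaning:\n{t_c}")
    pool: List[str] = [t1, t2, t_c, t_m]

    def rewrite(q: str, t: str) -> str:
        # Guided rewriting: weave the candidate instruction into the question.
        return llm(f'Rewrite the question to suit this instruction: "{t}"\nQuestion: {q}')

    # Stage 2: score each candidate by 0/1 accuracy; keep the fittest.
    def fitness(t: str) -> int:
        response = llm(f"{rewrite(question, t)}\n{t}\n"
                       "Show your reasoning, then put the answer on the last line.")
        return int(extract_answer(response) == gold_answer)

    t_o = max(pool, key=fitness)  # ties resolved arbitrarily, as in the method

    # Stage 3: guided rewriting with the winning prompt, then final reasoning.
    r = rewrite(question, t_o)
    final = llm(f"{r}\n{t_o}\nReason step by step and put the final answer on the last line.")
    return extract_answer(final)
```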
2. Fitness Evaluation and Selection Mechanisms
EoT employs the model itself to quantify prompt fitness. For each candidate $T$, the pair $(R_T(Q), T)$ is submitted to the LLM, which outputs an answer $A_T$. Fitness is computed as binary accuracy against the ground-truth answer $A_{\mathrm{gt}}$:

$$f(T) = \mathbb{1}\left[A_T = A_{\mathrm{gt}}\right]$$

In self-consistency mode, the model generates $k$ independent chains per $T$, and $A_T$ is determined via majority vote over the $k$ sampled answers $A_T^{(1)}, \ldots, A_T^{(k)}$:

$$A_T = \operatorname*{arg\,max}_{a} \sum_{i=1}^{k} \mathbb{1}\left[A_T^{(i)} = a\right]$$
Selection is straightforward due to the small pool size ($|P| = 4$): the prompt with maximal fitness is chosen, $T_o = \operatorname*{arg\,max}_{T \in P} f(T)$. In case of ties, arbitrary resolution or additional LLM-based selection is employed.
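A compact sketch of the self-consistency variant of this fitness computation, again assuming a hypothetical sampling-mode `llm` helper; the chain count `k` is a placeholder, not the paper's reported value:

```python
from collections import Counter
from typing import Callable

def self_consistency_fitness(
    llm: Callable[[str], str],  # hypothetical sampling-mode completion helper
    reasoning_prompt: str,      # candidate prompt T applied to R_T(Q)
    a_gt: str,                  # gold answer for the 0/1 criterion
    k: int = 5,                 # number of sampled chains (placeholder value)
) -> int:
    # Sample k independent chains; take the last line of each as the answer.
    answers = [llm(reasoning_prompt).strip().splitlines()[-1] for _ in range(k)]
    # Majority vote determines A_T; Counter breaks ties by insertion order.
    a_t, _ = Counter(answers).most_common(1)[0]
    return int(a_t == a_gt)  # f(T) in {0, 1}
```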
3. Guided Rewriting Operation
Once the optimal prompt $T_o$ is selected, the original question $Q$ is transformed through guided rewriting. The operation prepends or weaves $T_o$ into $Q$, yielding a reformulated input:

$$R = \mathrm{Rewrite}(Q, T_o)$$

This rewritten input, coupled with the selected prompt, is then used to invoke the LLM for chain-of-thought generation, producing a rationale $C^{\ast}$ and the final answer $A^{\ast}$. The probabilistic answer generation can be characterized as:

$$(C^{\ast}, A^{\ast}) \sim p_{\mathrm{LLM}}(C, A \mid R, T_o)$$
This step is empirically critical, as ablation analyses indicate that omitting rewriting reduces performance by 0.5–1.3 points.
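A minimal sketch of the rewriting call, assuming the same hypothetical `llm` helper; the instruction template is illustrative, not the paper's verbatim prompt:

```python
from typing import Callable

# Illustrative rewrite template (assumed wording).
REWRITE_TEMPLATE = (
    'Rewrite the following question so that it can be solved by following '
    'this instruction: "{t_o}"\nQuestion: {q}\nRewritten question:'
)

def guided_rewrite(llm: Callable[[str], str], q: str, t_o: str) -> str:
    # R = Rewrite(Q, T_o): weave the selected prompt into the question.
    return llm(REWRITE_TEMPLATE.format(t_o=t_o, q=q))
```

The final answer is then sampled by conditioning the model on $(R, T_o)$, as in Stage 3 of the pipeline sketch above.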
4. Experimental Protocols and Benchmarks
EoT's evaluation employs ten benchmarks across three reasoning categories:
- Arithmetic: SingleEq, AddSub, GSM8K, MultiArith, SVAMP, AQuA
- Commonsense: CommonsenseQA (CSQA), StrategyQA
- Symbolic: Last Letter Concatenation, Coin Flip
Key experimental parameters include:
- LLMs: GPT-3.5-turbo-0613 (greedy decoding, temperature 0); GPT-4 (ablation on AQuA, AddSub, SVAMP)
- Prompt pool size: 4 (two seeds, one crossover, one mutation)
- Self-consistency: $k$ sampled chains per prompt at nonzero sampling temperature, aggregated by majority vote
- Metric: accuracy (0/1)
- Baselines: zero-shot CoT, Plan-and-Solve (PS/PS+), RE2 (“Read the question again + think step by step”), few-shot manual CoT, and automatic (cluster-based) few-shot CoT
5. Empirical Findings
EoT demonstrates consistent accuracy improvements relative to baselines. Representative results:
| Category / Dataset | Baseline | EoT Acc. (%) | Gain |
|---|---|---|---|
| Arithmetic (avg) | Zero-shot CoT | 83.5 | +2.8 pts |
| Arithmetic (avg) | PS+ (81.2) | 83.5 | +2.3 pts |
| Arithmetic (avg) | RE2 (82.3) | 83.5 | +1.2 pts |
| CSQA (commonsense) | RE2 | 72.7 | +1.2 pts |
| StrategyQA (commonsense) | | 69.9 | |
| Last Letter Concat. (symbolic) | | 76.8 | +2.5 pts |
| Coin Flip (symbolic) | | 98.9 | +1.2 pts |
| AQuA (GPT-4 ablation) | | 75.9 | +2.0 pts |
| AddSub (GPT-4 ablation) | | 97.7 | +1.5 pts |
| SVAMP (GPT-4 ablation) | | 92.7 | +2.6 pts |
Ablation studies reveal that omitting crossover costs 2.3–4.4 points, omitting mutation 3.3–6.3 points, and omitting rewriting 0.5–1.3 points. Accuracy improves monotonically as the pool size grows from 2 to 16. Careful engineering of the seed prompts affords further gains of up to 1.9 points.
Combining EoT with self-consistency adds a further 2–3 points over self-consistency baselines.
6. Analysis, Limitations, and Future Directions
EoT's principal advantage derives from its per-instance prompt adaptation, sidestepping the deficits of applying a uniform reasoning prefix to every input. A single round of evolutionary search is both efficient and effective; extending to multiple rounds or larger populations may offer further improvements at increased computational cost. The guided rewriting step is especially impactful, focusing the LLM's attention on the task structure and the tailored instruction.
Current limitations include the restriction to crossover and mutation operators; untested variants such as differential evolution could provide additional improvements. The reliance on pure accuracy as the selection metric leaves unexplored alternatives such as likelihood or calibration-based fitness. The framework was primarily tested using closed-source LLM APIs (GPT-3.5-turbo, GPT-4); generality to open-source models remains to be assessed. Joint evolution of answer-extraction cues (e.g., “Therefore, the answer is…”) is another prospective axis.
EoT represents a lightweight evolutionary wrapper for CoT prompting, configuring a compact, instance-tailored prompt pool, and utilizing LLMs for both prompt generation and evaluation. Across arithmetic, commonsense, and symbolic reasoning, EoT consistently outperforms fixed zero-shot approaches and is competitive with few-shot methods, illustrating the effectiveness of evolutionary search mediated by LLMs' generative and evaluative capabilities (Jin et al., 2024).