Evolving Deeper LLM Thinking (2501.09891v1)
Abstract: We explore an evolutionary search strategy for scaling inference time compute in LLMs. The proposed approach, Mind Evolution, uses an LLM to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
Summary
- The paper presents Mind Evolution, an evolutionary search strategy that iteratively refines candidate solutions via structured critic-author roles.
- It employs a custom evaluator and genetic algorithm principles, using Boltzmann tournament selection, to enhance natural language planning outputs.
- Experimental results show significant improvements over baselines, achieving near-perfect success rates on planning benchmarks such as TravelPlanner and Meeting Planning.
This paper introduces Mind Evolution, a technique designed to enhance the problem-solving capabilities of LLMs by applying more computational effort during inference (2501.09891). It tackles complex tasks, particularly natural language planning problems, where generating a correct solution is difficult, but evaluating a proposed solution is feasible programmatically. Mind Evolution employs an evolutionary search strategy operating directly on natural language representations of solutions.
The core idea is inspired by genetic algorithms. A population of candidate solutions (e.g., travel plans expressed in text) is maintained and iteratively improved across generations. The key components and implementation flow are:
- Population Initialization: The process starts by prompting an LLM to generate an initial set of diverse candidate solutions for the given problem.
- Fitness Evaluation: A crucial component is a custom-built, programmatic evaluator for the specific task. This evaluator must:
- Parse the natural language candidate solution.
- Score the solution based on how well it meets objectives and constraints (fitness score).
- Verify constraint satisfaction.
- Provide textual feedback detailing errors or constraint violations. The ablation studies show this textual feedback is critical for performance (2501.09891).
- Selection: Solutions are selected probabilistically based on their fitness scores (using Boltzmann tournament selection) to become "parents" for the next generation. Higher-scoring solutions are more likely to be chosen, but lower-scoring ones retain a nonzero selection probability, which preserves diversity.
- Recombination (Crossover & Mutation): Selected parent solutions are fed back into the LLM. Using specially designed prompts, the LLM is instructed to combine ideas from the parents and generate improved "child" solutions. This step inherently performs both crossover (mixing parent ideas) and mutation (introducing novel variations via the LLM's generation process).
- Refinement through Critical Conversation (RCC): This is a key process used both during initialization and recombination. It involves prompting the LLM to adopt two roles:
- Critic: Analyzes the input solution(s) and the evaluator's textual feedback, identifying flaws and suggesting improvements.
- Author: Takes the original solution(s), feedback, and the critic's analysis to propose a refined solution. This structured conversation loop (evaluate -> critique -> refine) helps the LLM focus on improving specific weaknesses. The ablation paper confirms the critic step significantly boosts performance (2501.09891).
- Island Model: To further enhance diversity and avoid premature convergence, Mind Evolution uses an island model. The population is split into sub-populations ("islands") that evolve semi-independently. Periodically:
- Migration: Top-performing solutions from one island are copied to the next, allowing good solutions to spread.
- Island Reset: Low-performing islands are reset and repopulated with diverse, high-quality solutions selected from the global population, potentially using the LLM itself to ensure diversity among the selected elites (2501.09891).
- Termination: The process repeats for a fixed number of generations (N_gens) or until a solution meeting all constraints is found.
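The loop described above can be sketched compactly. This is a minimal illustration, not the paper's implementation: `llm_generate` and `evaluate` are hypothetical stand-ins for the LLM call and the task-specific evaluator, and the termination check assumes the evaluator returns empty feedback when all constraints are satisfied.

```python
import math
import random

# Hypothetical stand-ins: llm_generate wraps an LLM API call; evaluate is the
# task-specific programmatic evaluator returning (fitness, textual feedback).
def llm_generate(prompt: str) -> str: ...
def evaluate(candidate: str) -> tuple[float, str]: ...

def boltzmann_select(population, scores, temperature=1.0):
    """Sample one parent with probability proportional to exp(score / T)."""
    weights = [math.exp(s / temperature) for s in scores]
    return random.choices(population, weights=weights, k=1)[0]

def mind_evolution(task: str, pop_size: int = 8, n_gens: int = 10) -> str:
    # Initialization: prompt the LLM for a diverse starting population.
    population = [llm_generate(f"Propose a solution to: {task}")
                  for _ in range(pop_size)]
    for _ in range(n_gens):
        scored = [evaluate(c) for c in population]
        for cand, (_, feedback) in zip(population, scored):
            if not feedback:  # assumption: empty feedback = all constraints hold
                return cand
        scores = [s for s, _ in scored]
        feedbacks = [f for _, f in scored]
        # Recombination: the LLM critiques two parents (with evaluator feedback)
        # and authors an improved child -- crossover and mutation in one step.
        children = []
        for _ in range(pop_size):
            i = boltzmann_select(range(pop_size), scores)
            j = boltzmann_select(range(pop_size), scores)
            prompt = (f"Task: {task}\n"
                      f"Parent A: {population[i]}\nFeedback A: {feedbacks[i]}\n"
                      f"Parent B: {population[j]}\nFeedback B: {feedbacks[j]}\n"
                      "Critique both plans, then write an improved plan.")
            children.append(llm_generate(prompt))
        population = children
    # No fully valid solution found: return the best of the final population.
    final = [(evaluate(c)[0], c) for c in population]
    return max(final)[1]
```

A full implementation would add the island model on top of this loop (one such population per island, with periodic migration and resets), which this sketch omits for brevity.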
Practical Implementation Considerations:
- Evaluator Development: The main engineering effort lies in creating the task-specific evaluator. While potentially easier than building a solver, it requires careful parsing of the LLM's natural language output and implementing logic to check all constraints and objectives. The evaluator's quality directly impacts search effectiveness. Appendix A.2 provides details on the evaluators built for the benchmark tasks (2501.09891).
- Prompt Engineering: The performance heavily relies on the prompts used to guide the LLM for initialization, RCC (critic/author roles), recombination, and potentially island reset. The paper provides example prompts in Appendix A.1 (2501.09891). These often include general instructions, few-shot examples, the task description, parent solutions with feedback, and specific critical thinking instructions.
- Hyperparameter Tuning: Several hyperparameters influence the search (see Table 1 (2501.09891)), including the number of generations (N_gens), islands (N_island), conversations per island (N_convs), refinement steps per conversation (N_seq), reset frequency, migration size, etc. Default values are provided, but tuning may be necessary for optimal performance on new tasks or with different LLMs.
- Computational Cost: Mind Evolution uses significantly more inference computation than a single LLM call or even Best-of-N sampling. The paper provides detailed cost breakdowns (LLM calls, tokens, API cost) in Table 2 and Figures 7-9, 16 (2501.09891). Practitioners must weigh the improved success rate against the increased cost.
- LLM Choice & Two-Stage Approach: The choice of LLM (e.g., Gemini 1.5 Flash vs. Pro) impacts both cost and capability. The paper proposes a cost-effective two-stage strategy: run Mind Evolution with a cheaper model (Flash) first, and only use a more powerful model (Pro) on the instances that remain unsolved (2501.09891).
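As a concrete illustration of the evaluator contract described above (parse the candidate, score it, and return textual feedback), here is a toy sketch for a simplified trip-planning constraint. The plan format and the constraints are invented for illustration; they do not reproduce the paper's Appendix A.2 evaluators.

```python
import re

def evaluate_trip_plan(plan: str, required_cities: list[str], max_days: int):
    """Toy evaluator: parse 'Day N: City' lines, check constraints, and
    return (fitness, textual_feedback). The feedback string is what the
    LLM would see in the next refinement round."""
    days = re.findall(r"Day (\d+): (\w+)", plan)
    visited = {city for _, city in days}
    violations = []
    if len(days) > max_days:
        violations.append(f"Plan uses {len(days)} days but only "
                          f"{max_days} are allowed.")
    for city in required_cities:
        if city not in visited:
            violations.append(f"Required city {city} is never visited.")
    # Fitness: fraction of constraints satisfied (1.0 = fully valid plan).
    n_constraints = 1 + len(required_cities)
    fitness = 1.0 - len(violations) / n_constraints
    return fitness, "\n".join(violations)

score, feedback = evaluate_trip_plan(
    "Day 1: Paris\nDay 2: Rome", ["Paris", "Rome", "Berlin"], max_days=3)
# score == 0.75; feedback notes that Berlin is never visited
```

Even this toy version shows why the feedback matters: a bare score of 0.75 tells the LLM a plan is imperfect, but the message "Required city Berlin is never visited" tells it exactly what to fix.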
Applications & Performance:
Mind Evolution was evaluated on challenging natural language planning benchmarks: TravelPlanner [xie2024travelplanner] and Natural Plan (Trip Planning, Meeting Planning) [zheng2024natural].
- Results: It significantly outperformed baselines like 1-Pass generation, Best-of-N sampling, and Sequential Revision (similar to multi-turn Reflexion [shinn2024reflexion]) in terms of final success rate, often achieving this with comparable or lower token counts than sequential revision methods (Table 2 (2501.09891)).
- High Success Rates: Using Gemini 1.5 Flash, it achieved >95% success on TravelPlanner and Trip Planning validation sets, and 85% on Meeting Planning. The two-stage approach with Gemini 1.5 Pro pushed these rates to nearly 100% on TravelPlanner and Trip Planning, and over 98% on Meeting Planning (2501.09891).
- No Formal Solver: Notably, these results were achieved without requiring the LLM to translate the problem into a formal representation or using an external symbolic solver, unlike some prior high-performing approaches on these tasks [hao2024large]. This makes Mind Evolution applicable to domains where formalization is difficult or impractical.
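The two-stage strategy behind the strongest numbers above (run the cheap model on everything, escalate only the unsolved remainder) is a simple model cascade. A sketch, under the assumption that a `run_mind_evolution(problem, model)` helper returns a valid solution string or `None`:

```python
def solve_with_cascade(problems, run_mind_evolution, cheap_model, strong_model):
    """Two-stage inference: try the cheaper model on every instance, then
    escalate only the unsolved remainder to the stronger (pricier) model."""
    solutions, unsolved = {}, []
    for p in problems:
        sol = run_mind_evolution(p, cheap_model)   # e.g. Gemini 1.5 Flash
        if sol is not None:
            solutions[p] = sol
        else:
            unsolved.append(p)
    for p in unsolved:
        solutions[p] = run_mind_evolution(p, strong_model)  # e.g. Gemini 1.5 Pro
    return solutions
```

Because the evaluator verifies solutions exactly, escalation decisions are reliable: the strong model is only ever paid for on instances the cheap model provably failed.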
In summary, Mind Evolution offers a practical framework for leveraging LLMs and inference-time compute to solve complex natural language-based problems, particularly planning and constraint satisfaction tasks. Its main advantage is the ability to operate directly in the natural language space, bypassing the need for formal problem specification, provided a programmatic solution evaluator can be implemented. The key implementation steps involve building this evaluator and carefully crafting prompts to guide the evolutionary search process driven by the LLM.