Evolving Deeper LLM Thinking (2501.09891v1)
Abstract: We explore an evolutionary search strategy for scaling inference time compute in LLMs. The proposed approach, Mind Evolution, uses an LLM to generate, recombine and refine candidate responses. The proposed approach avoids the need to formalize the underlying inference problem whenever a solution evaluator is available. Controlling for inference cost, we find that Mind Evolution significantly outperforms other inference strategies such as Best-of-N and Sequential Revision in natural language planning tasks. In the TravelPlanner and Natural Plan benchmarks, Mind Evolution solves more than 98% of the problem instances using Gemini 1.5 Pro without the use of a formal solver.
Summary
- The paper presents Mind Evolution, an evolutionary search strategy that iteratively refines candidate solutions via structured critic-author roles.
- It employs a custom evaluator and genetic algorithm principles, using Boltzmann tournament selection, to enhance natural language planning outputs.
- Experimental results show significant improvements over baselines, achieving near-perfect success rates on planning benchmarks such as TravelPlanner and Meeting Planning.
This paper introduces Mind Evolution, a technique designed to enhance the problem-solving capabilities of LLMs by applying more computational effort during inference (2501.09891). It tackles complex tasks, particularly natural language planning problems, where generating a correct solution is difficult, but evaluating a proposed solution is feasible programmatically. Mind Evolution employs an evolutionary search strategy operating directly on natural language representations of solutions.
The core idea is inspired by genetic algorithms. A population of candidate solutions (e.g., travel plans expressed in text) is maintained and iteratively improved across generations. The key components and implementation flow are:
- Population Initialization: The process starts by prompting an LLM to generate an initial set of diverse candidate solutions for the given problem.
- Fitness Evaluation: A crucial component is a custom-built, programmatic evaluator for the specific task. This evaluator must:
- Parse the natural language candidate solution.
- Score the solution based on how well it meets objectives and constraints (fitness score).
- Verify constraint satisfaction.
- Provide textual feedback detailing errors or constraint violations. The ablation studies show this textual feedback is critical for performance (2501.09891).
- Selection: Solutions are selected probabilistically based on their fitness scores (using Boltzmann tournament selection) to become "parents" for the next generation. Higher-scoring solutions are more likely to be chosen, but lower-scoring ones retain a nonzero selection probability, which preserves diversity.
- Recombination (Crossover & Mutation): Selected parent solutions are fed back into the LLM. Using specially designed prompts, the LLM is instructed to combine ideas from the parents and generate improved "child" solutions. This step inherently performs both crossover (mixing parent ideas) and mutation (introducing novel variations via the LLM's generation process).
- Refinement through Critical Conversation (RCC): This is a key process used both during initialization and recombination. It involves prompting the LLM to adopt two roles:
- Critic: Analyzes the input solution(s) and the evaluator's textual feedback, identifying flaws and suggesting improvements.
- Author: Takes the original solution(s), feedback, and the critic's analysis to propose a refined solution. This structured conversation loop (evaluate -> critique -> refine) helps the LLM focus on improving specific weaknesses. The ablation paper confirms the critic step significantly boosts performance (2501.09891).
- Island Model: To further enhance diversity and avoid premature convergence, Mind Evolution uses an island model. The population is split into sub-populations ("islands") that evolve semi-independently. Periodically:
- Migration: Top-performing solutions from one island are copied to the next, allowing good solutions to spread.
- Island Reset: Low-performing islands are reset and repopulated with diverse, high-quality solutions selected from the global population, potentially using the LLM itself to ensure diversity among the selected elites (2501.09891).
- Termination: The process repeats for a fixed number of generations (N_gens) or until a solution meeting all constraints is found.
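The loop described above can be sketched compactly. This is a minimal illustration, not the paper's implementation: `llm_generate` and `evaluate` are hypothetical stand-ins for the LLM call and the task-specific evaluator, and the termination check assumes the evaluator returns empty feedback when all constraints are satisfied.

```python
import math
import random

# Hypothetical stand-ins: llm_generate wraps an LLM API call; evaluate is the
# task-specific programmatic evaluator returning (fitness, textual feedback).
def llm_generate(prompt: str) -> str: ...
def evaluate(candidate: str) -> tuple[float, str]: ...

def boltzmann_select(population, scores, temperature=1.0):
    """Sample one parent with probability proportional to exp(score / T)."""
    weights = [math.exp(s / temperature) for s in scores]
    return random.choices(population, weights=weights, k=1)[0]

def mind_evolution(task: str, pop_size: int = 8, n_gens: int = 10) -> str:
    # Initialization: prompt the LLM for a diverse starting population.
    population = [llm_generate(f"Propose a solution to: {task}")
                  for _ in range(pop_size)]
    for _ in range(n_gens):
        scored = [evaluate(c) for c in population]
        for cand, (_, feedback) in zip(population, scored):
            if not feedback:  # assumption: empty feedback = all constraints hold
                return cand
        scores = [s for s, _ in scored]
        feedbacks = [f for _, f in scored]
        # Recombination: the LLM critiques two parents (with evaluator feedback)
        # and authors an improved child -- crossover and mutation in one step.
        children = []
        for _ in range(pop_size):
            i = boltzmann_select(range(pop_size), scores)
            j = boltzmann_select(range(pop_size), scores)
            prompt = (f"Task: {task}\n"
                      f"Parent A: {population[i]}\nFeedback A: {feedbacks[i]}\n"
                      f"Parent B: {population[j]}\nFeedback B: {feedbacks[j]}\n"
                      "Critique both plans, then write an improved plan.")
            children.append(llm_generate(prompt))
        population = children
    # No fully valid solution found: return the best of the final population.
    final = [(evaluate(c)[0], c) for c in population]
    return max(final)[1]
```

A full implementation would add the island model on top of this loop (one such population per island, with periodic migration and resets), which this sketch omits for brevity.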
Practical Implementation Considerations:
- Evaluator Development: The main engineering effort lies in creating the task-specific evaluator. While potentially easier than building a solver, it requires careful parsing of the LLM's natural language output and implementing logic to check all constraints and objectives. The evaluator's quality directly impacts search effectiveness. Appendix A.2 provides details on the evaluators built for the benchmark tasks (2501.09891).
- Prompt Engineering: The performance heavily relies on the prompts used to guide the LLM for initialization, RCC (critic/author roles), recombination, and potentially island reset. The paper provides example prompts in Appendix A.1 (2501.09891). These often include general instructions, few-shot examples, the task description, parent solutions with feedback, and specific critical thinking instructions.
- Hyperparameter Tuning: Several hyperparameters influence the search (see Table 1 (2501.09891)), including the number of generations (N_gens), islands (N_island), conversations per island (N_convs), refinement steps per conversation (N_seq), reset frequency, migration size, etc. Default values are provided, but tuning may be necessary for optimal performance on new tasks or with different LLMs.
- Computational Cost: Mind Evolution uses significantly more inference computation than a single LLM call or even Best-of-N sampling. The paper provides detailed cost breakdowns (LLM calls, tokens, API cost) in Table 2 and Figures 7-9, 16 (2501.09891). Practitioners must weigh the improved success rate against the increased cost.
- LLM Choice & Two-Stage Approach: The choice of LLM (e.g., Gemini 1.5 Flash vs. Pro) impacts both cost and capability. The paper proposes a cost-effective two-stage strategy: run Mind Evolution with a cheaper model (Flash) first, and only use a more powerful model (Pro) on the instances that remain unsolved (2501.09891).
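As a concrete illustration of the evaluator contract described above (parse the candidate, score it, and return textual feedback), here is a toy sketch for a simplified trip-planning constraint. The plan format and the constraints are invented for illustration; they do not reproduce the paper's Appendix A.2 evaluators.

```python
import re

def evaluate_trip_plan(plan: str, required_cities: list[str], max_days: int):
    """Toy evaluator: parse 'Day N: City' lines, check constraints, and
    return (fitness, textual_feedback). The feedback string is what the
    LLM would see in the next refinement round."""
    days = re.findall(r"Day (\d+): (\w+)", plan)
    visited = {city for _, city in days}
    violations = []
    if len(days) > max_days:
        violations.append(f"Plan uses {len(days)} days but only "
                          f"{max_days} are allowed.")
    for city in required_cities:
        if city not in visited:
            violations.append(f"Required city {city} is never visited.")
    # Fitness: fraction of constraints satisfied (1.0 = fully valid plan).
    n_constraints = 1 + len(required_cities)
    fitness = 1.0 - len(violations) / n_constraints
    return fitness, "\n".join(violations)

score, feedback = evaluate_trip_plan(
    "Day 1: Paris\nDay 2: Rome", ["Paris", "Rome", "Berlin"], max_days=3)
# score == 0.75; feedback notes that Berlin is never visited
```

Even this toy version shows why the feedback matters: a bare score of 0.75 tells the LLM a plan is imperfect, but the message "Required city Berlin is never visited" tells it exactly what to fix.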
Applications & Performance:
Mind Evolution was evaluated on challenging natural language planning benchmarks: TravelPlanner [xie2024travelplanner] and Natural Plan (Trip Planning, Meeting Planning) [zheng2024natural].
- Results: It significantly outperformed baselines like 1-Pass generation, Best-of-N sampling, and Sequential Revision (similar to multi-turn Reflexion [shinn2024reflexion]) in terms of final success rate, often achieving this with comparable or lower token counts than sequential revision methods (Table 2 (2501.09891)).
- High Success Rates: Using Gemini 1.5 Flash, it achieved >95% success on TravelPlanner and Trip Planning validation sets, and 85% on Meeting Planning. The two-stage approach with Gemini 1.5 Pro pushed these rates to nearly 100% on TravelPlanner and Trip Planning, and over 98% on Meeting Planning (2501.09891).
- No Formal Solver: Notably, these results were achieved without requiring the LLM to translate the problem into a formal representation or using an external symbolic solver, unlike some prior high-performing approaches on these tasks [hao2024large]. This makes Mind Evolution applicable to domains where formalization is difficult or impractical.
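The two-stage strategy behind the strongest numbers above (run the cheap model on everything, escalate only the unsolved remainder) is a simple model cascade. A sketch, under the assumption that a `run_mind_evolution(problem, model)` helper returns a valid solution string or `None`:

```python
def solve_with_cascade(problems, run_mind_evolution, cheap_model, strong_model):
    """Two-stage inference: try the cheaper model on every instance, then
    escalate only the unsolved remainder to the stronger (pricier) model."""
    solutions, unsolved = {}, []
    for p in problems:
        sol = run_mind_evolution(p, cheap_model)   # e.g. Gemini 1.5 Flash
        if sol is not None:
            solutions[p] = sol
        else:
            unsolved.append(p)
    for p in unsolved:
        solutions[p] = run_mind_evolution(p, strong_model)  # e.g. Gemini 1.5 Pro
    return solutions
```

Because the evaluator verifies solutions exactly, escalation decisions are reliable: the strong model is only ever paid for on instances the cheap model provably failed.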
In summary, Mind Evolution offers a practical framework for leveraging LLMs and inference-time compute to solve complex natural language-based problems, particularly planning and constraint satisfaction tasks. Its main advantage is the ability to operate directly in the natural language space, bypassing the need for formal problem specification, provided a programmatic solution evaluator can be implemented. The key implementation steps involve building this evaluator and carefully crafting prompts to guide the evolutionary search process driven by the LLM.