
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents (2505.22954v1)

Published 29 May 2025 in cs.AI

Abstract: Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.

Authors (5)
  1. Jenny Zhang (10 papers)
  2. Shengran Hu (8 papers)
  3. Cong Lu (23 papers)
  4. Robert Lange (1 paper)
  5. Jeff Clune (65 papers)

Summary

The paper introduces the Darwin Gödel Machine (DGM) (Zhang et al., 29 May 2025), a novel system designed for the open-ended evolution of self-improving agents. It aims to address a limitation of most current AI systems: they are built on fixed architectures and require significant human intervention for substantial improvements. Inspired by the theoretical Gödel machine and Darwinian evolution, the DGM empowers AI agents to autonomously modify their own source code and empirically validate these changes through performance on coding benchmarks.

Unlike the original Gödel machine, which requires formal proofs that self-modifications are beneficial (a practical impossibility for complex systems), the DGM relies on empirical evaluation. This mirrors biological evolution, where changes are tested against the environment. The DGM's second key inspiration from Darwinian evolution and open-endedness research is the maintenance of an archive of all previously generated agents. This archive serves as a source of diverse "stepping stones" for future improvements, preventing the system from getting stuck in local optima and promoting continuous exploration.

The DGM process is iterative:

  1. It starts with an initial coding agent implemented in Python, utilizing frozen foundation models (FMs) and tool use (like Bash execution and file editing).
  2. In each iteration, parent agents that retain the ability to edit their own codebase are selected from the archive, biased towards those with higher performance and fewer offspring.
  3. Selected parents analyze their past performance logs on coding benchmarks and propose potential improvements to their own codebase.
  4. The parent agent then attempts to implement this proposed improvement, generating a new child agent with modified code.
  5. The child agent is evaluated on coding benchmarks to estimate its coding capabilities.
  6. Only child agents that successfully compile and retain the ability to edit code are added to the archive.

This cycle of selection, self-modification (itself a coding task the agent performs on its own code), and evaluation allows the DGM to recursively improve its ability to solve coding problems and, implicitly, its ability to self-improve, as sketched below.
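To make the cycle concrete, the following is a minimal Python sketch of the outer loop. The `Agent` class and all of its methods are hypothetical stand-ins for the paper's actual implementation (which drives a frozen FM with tools); evaluation is reduced to a random score perturbation purely so the sketch runs.

```python
import random
from dataclasses import dataclass

# Minimal, runnable sketch of the DGM outer loop summarized above.
# Every name here is an illustrative assumption, not the paper's code.

@dataclass
class Agent:
    score: float = 0.2       # benchmark solve rate in [0, 1]
    num_children: int = 0    # functioning offspring produced so far

    def propose_improvement(self) -> str:
        # In the DGM, an FM diagnoses failures from evaluation logs.
        return "placeholder improvement proposal"

    def self_modify(self, proposal: str) -> "Agent":
        # The real parent edits a copy of its own codebase; a random
        # score perturbation keeps this sketch executable.
        self.num_children += 1
        new_score = min(1.0, max(0.0, self.score + random.gauss(0.0, 0.05)))
        return Agent(score=new_score)

    def is_valid(self) -> bool:
        # Stands in for "compiles and can still edit code".
        return True

def dgm(initial: Agent, iterations: int) -> list[Agent]:
    archive = [initial]                  # every valid agent is kept
    for _ in range(iterations):
        parent = random.choice(archive)  # the real rule weighs score and novelty
        child = parent.self_modify(parent.propose_improvement())
        if child.is_valid():
            archive.append(child)        # stepping stones, not just the current best
    return archive

best = max(dgm(Agent(), iterations=80), key=lambda a: a.score)
```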

The paper demonstrates the DGM's effectiveness on two coding benchmarks: SWE-bench [jimenez2024swebench] and Polyglot [gauthier2024polyglot].

  • On SWE-bench Verified, the DGM improved the initial agent's performance from 20.0% to 50.0% over 80 iterations.
  • On Polyglot, performance increased from 14.2% to 30.7% on the full benchmark.

These results are shown to significantly outperform two baselines:

  • DGM without self-improvement (where the meta-agent for modification remains fixed as the initial agent).
  • DGM without open-ended exploration (where only the most recent agent is kept, lacking the archive for diverse branching; see the sketch after this list).
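The second ablation amounts to a structural change in the outer loop: discard the archive and keep only the latest agent. A sketch, reusing the hypothetical `Agent` class from the loop sketch above:

```python
def dgm_no_archive(initial: Agent, iterations: int) -> Agent:
    # Without open-ended exploration there is no archive: each accepted
    # child replaces its parent, so earlier stepping stones are lost and
    # the search can stall in a local optimum.
    current = initial
    for _ in range(iterations):
        child = current.self_modify(current.propose_improvement())
        if child.is_valid():
            current = child
    return current
```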

This highlights that both the recursive self-improvement loop and the open-ended exploration mechanism (via the archive) are crucial for sustained progress. The performance of the best DGM-discovered agent on SWE-bench is comparable to state-of-the-art open-source human-designed agents, while on Polyglot, it surpasses the performance of Aider, a widely used handcrafted agent continuously developed over a year.

The DGM automatically discovered various improvements to the coding agent's design and workflow, such as:

  • Enhancing the edit tool with more granular operations, such as viewing specific line ranges and performing string replacements, instead of just overwriting entire files (see the sketch after this list).
  • Implementing multi-attempt strategies for solving tasks.
  • Introducing a mechanism akin to peer review, where multiple candidate solutions are generated and the best one is selected.
  • Improving context management by incorporating previous attempts and their results into subsequent prompts.
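As an illustration of the first item, a granular edit tool might expose operations like the following. This is a hedged sketch with illustrative names (`view_lines`, `str_replace`); it is not the tool interface the DGM actually evolved.

```python
from pathlib import Path

# Illustrative sketch of more granular edit operations, as opposed to
# overwriting a whole file. Names and semantics are assumptions for
# illustration, not the DGM's actual evolved tool interface.

def view_lines(path: str, start: int, end: int) -> str:
    """Return only lines start..end (1-indexed) instead of the whole file."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[start - 1:end])

def str_replace(path: str, old: str, new: str) -> None:
    """Replace exactly one occurrence of `old`; refuse ambiguous edits."""
    text = Path(path).read_text()
    if text.count(old) != 1:
        raise ValueError("`old` must occur exactly once for a safe, targeted edit")
    Path(path).write_text(text.replace(old, new, 1))
```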

The paper also investigates the transferability of the discovered improvements, showing that the DGM-generated agents perform better than the initial agent when using different FMs (Claude 3.7 Sonnet, o3-mini) than those used during the DGM run. Furthermore, improvements learned on Python tasks in Polyglot demonstrated robustness and transferred well to tasks in other languages (Rust, C++, Go, etc.).

Implementation Considerations:

  • Codebase: The agents modify their own Python code.
  • Dependencies: The agents rely on frozen FMs for their core reasoning and code generation capabilities and utilize tools like Bash and file editors.
  • Computational Resources: Running the DGM is computationally expensive. A single 80-iteration run on SWE-bench is estimated to cost around USD 22,000 and take approximately 2 weeks, highlighting the need for efficiency improvements.
  • Evaluation Strategy: A staged evaluation on subsets of benchmarks helps manage computational costs, focusing more extensive evaluation on promising candidates.
  • Parent Selection: A mechanism balancing performance and novelty (agents with fewer functioning children count as more novel) is used to select agents from the archive for self-modification, promoting both exploitation and exploration; a minimal sketch follows this list.
  • Self-Improvement Prompting: A specific prompt structure is used to feed evaluation logs to a high-capability FM (o1 was used) to diagnose failures and suggest concrete implementation improvements phrased as problem descriptions for the agent to solve.
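For the parent-selection item above, one plausible reading is a sampling weight that grows with an agent's benchmark score and shrinks with its number of functioning children. The exact formula below is an assumption for illustration, not the paper's published rule:

```python
import random
from dataclasses import dataclass

@dataclass
class Entry:
    score: float         # benchmark solve rate in [0, 1]
    num_children: int    # functioning offspring produced so far

def sample_parent(archive: list[Entry]) -> Entry:
    # Exploitation/exploration balance: favor high performers, but
    # discount agents that have already produced many children.
    # This specific weighting is an assumption, not the paper's formula.
    weights = [e.score / (1.0 + e.num_children) for e in archive]
    return random.choices(archive, weights=weights, k=1)[0]

# Example: a strong but heavily-used agent vs. a weaker, novel one.
archive = [Entry(0.5, 4), Entry(0.3, 0)]
parent = sample_parent(archive)
```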

Safety Discussion:

The paper proactively discusses the significant safety implications of self-improving AI. Potential risks include the system optimizing for benchmark metrics in ways that introduce vulnerabilities or misaligned behaviors, and the increasing complexity/uninterpretability of the autonomously generated code.

Current safeguards in the DGM implementation include:

  • Sandboxing: All agent execution and self-modification occur within isolated environments.
  • Time Limits: Strict time limits on execution to prevent unbounded behavior.
  • Scope Limitation: The current scope is limited to modifying the agent's own Python codebase to improve performance on coding benchmarks.
  • Monitoring and Traceability: The DGM archive provides a full lineage of modifications for human review.

A supplementary case study on mitigating FM hallucination demonstrates that the DGM can be applied to safety-related objectives, but it also highlights the risk of "objective hacking," where the agent optimizes the measurable metric rather than truly solving the underlying problem, akin to reward hacking in reinforcement learning. The authors argue that self-improvement could potentially be directed towards enhancing safety and interpretability if these properties were included in the evaluation criteria.

Limitations and Future Work:

Current limitations include the DGM being constrained by the capabilities and costs of the underlying FMs, the significant computational expense, and the scope being limited to modifying agents built around frozen FMs for coding tasks.

Future directions include:

  • Running the DGM for longer durations to see if it can surpass closed-source state-of-the-art agents.
  • Improving computational efficiency and integrating better reasoning capabilities into the agents.
  • Extending self-modification to include training/fine-tuning the underlying FMs.
  • Developing self-improving systems for domains beyond coding.
  • Exploring alternative approaches where the target task distribution co-evolves with the agent, removing the constraint of a fixed benchmark objective.
  • Continued research into safely navigating the development of self-improving AI systems and AI-Generating Algorithms.

In conclusion, the DGM represents a significant step toward automating AI development: it enables systems to self-improve through empirical validation and open-ended exploration of their own codebase. The results showcase the potential for continuous, self-accelerating innovation while underscoring the crucial need for careful attention to safety throughout development.
