Does extended DGM runtime surpass closed-source SWE-bench systems?

Determine whether increasing the iterations and compute allocated to the Darwin Gödel Machine—an open-ended, self-improving system that iteratively modifies its own code to design LLM-based coding agents—continues to yield performance gains on SWE-bench, and whether it can eventually exceed the performance of closed-source state-of-the-art SWE-bench systems.

Background

The Darwin Gödel Machine (DGM) is a self-referential system that edits its own codebase to improve a coding agent, using open-ended exploration over an archive of agents and empirical evaluation on benchmarks such as SWE-bench Verified. In reported experiments, the DGM improved performance from 20.0% to 50.0% on SWE-bench and produced agents competitive with open-source baselines, but it still lagged closed-source state-of-the-art systems.
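The loop described above—maintaining an archive of agent variants, sampling a parent, self-modifying it, and scoring the child empirically—can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: `evaluate` and `self_modify` are stub placeholders for SWE-bench evaluation and LLM-driven code editing, and all names are invented for this sketch.

```python
import random

def evaluate(agent):
    # Stub benchmark: score grows with the number of accumulated "patches".
    # A real DGM run would instead measure resolve rate on SWE-bench tasks.
    return len(agent["patches"]) / 10.0

def self_modify(parent):
    # Stub for the LLM-driven edit; the real DGM rewrites its own codebase.
    return {"patches": parent["patches"] + [random.random()]}

def dgm_loop(iterations=20, seed=0):
    random.seed(seed)
    archive = [{"patches": []}]           # start from a single initial agent
    scores = [evaluate(archive[0])]
    for _ in range(iterations):
        # Open-ended selection: any archived agent can serve as a parent,
        # with higher-scoring agents sampled more often.
        parent = random.choices(archive, weights=[s + 0.1 for s in scores])[0]
        child = self_modify(parent)
        scores.append(evaluate(child))    # empirical evaluation, not proofs
        archive.append(child)             # keep all children, not just the best
    return max(scores)

best = dgm_loop()
```

The key design choice mirrored here is that the archive retains every variant rather than only the current best, so later generations can branch from stepping stones that initially looked unpromising.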

Because the current implementation requires substantial compute—roughly two weeks of wall-clock time and significant API costs for a single SWE-bench run—the authors explicitly ask whether simply running the DGM longer would continue to improve performance enough to surpass closed-source systems. The question targets the practical limits of continued self-improvement via extended runtime and compute allocation.

References

However, it still falls short of closed-source SoTA SWE-bench solutions. An open question is whether running the DGM for longer would continue to yield performance gains and eventually surpass closed-source solutions.

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents (arXiv:2505.22954, Zhang et al., 29 May 2025), Section 6: Conclusion and Limitations.