Huxley-Gödel Machine: Self-Improving Code Agents
- HGM is a self-improving coding agent that leverages lineage-based self-modification to optimize cumulative descendant performance.
- It introduces the Clade-Metaproductivity (CMP) metric to aggregate and evaluate long-term improvements over evolutionary code variants.
- Empirical results show that HGM achieves human-level coding performance on benchmarks with enhanced resource efficiency.
The Huxley-Gödel Machine (HGM) is a self-improving coding agent that operationalizes optimal self-modification through a practical lineage-based approach to agent development. Deriving theoretical inspiration from the Gödel Machine and Huxley's concept of clade, HGM addresses fundamental limitations in prior coding agent frameworks by introducing a lineage-aggregating metric for evaluating self-improvement potential. The result is a system that efficiently searches the space of codebase modifications—guided not merely by short-term benchmark performance but by the aggregated prospects for improvement across all descendants—yielding strong, transferable, and in some instances, human-level coding performance.
1. Conceptual Foundations and Motivation
HGM builds directly on the notion of self-improving machines as formalized by the Gödel Machine, originally conceived for general-purpose, utility-driven self-modification via proofs of reward improvement. In practice, prior coding agents used expansion strategies that favored single-step benchmark accuracy, implicitly assuming that immediate performance correlates with future improvement through subsequent modifications. HGM starts from the insight that this assumption is empirically invalid: the "Metaproductivity-Performance Mismatch" shows that short-term benchmark performance is not a reliable indicator of long-term self-improvement potential.
Leveraging Huxley’s clade concept, HGM shifts focus from individual agents to lineage aggregates, seeking those codebases whose descendants collectively achieve higher performance, even if the root agent itself does not excel immediately. This approach approximates a key theoretical attribute of the Gödel Machine: sustained, optimal self-improvement in an agent’s developmental trajectory.
2. Addressing the Metaproductivity-Performance Mismatch
Systematic empirical analysis reveals that coding agents with high benchmark accuracy often fail to seed promising descendants, whereas others with lower immediate scores may yield lineages of agents that surpass initial benchmarks over successive self-modification steps. This mismatch is termed the "Metaproductivity-Performance Mismatch."
Traditional methods relied on node-centric evaluation metrics, which were insufficient for capturing an agent's clade-level potential. This misalignment led to suboptimal allocation of resources to agent variants that did not drive genuine long-term improvement. HGM circumvents this by assessing metaproductivity not through static node-level measures but through aggregation over observed evolutionary outcomes.
3. Clade-Metaproductivity Metric (CMP)
To formally quantify lineage potential, HGM introduces the Clade-Metaproductivity (CMP) metric. CMP computes the expected utility over the clade (the set of all descendants spawned through self-modification from a particular agent), effectively generalizing node-level scoring to tree-level aggregation.
Let $\mathcal{T}$ denote the current self-modification tree, $a \in \mathcal{T}$ an agent node, and $C(a)$ its clade (the set of all agents reachable from $a$ via self-modification, including $a$ itself). The CMP of $a$ under expansion policy $\pi$ is the expected utility aggregated over the clade:

$$\mathrm{CMP}_{\pi}(a) = \mathbb{E}_{\pi}\big[\, u(a') \mid a' \in C(a) \,\big]$$

If $u$ is the score function (e.g., accuracy on SWE-bench), in practice CMP is estimated heuristically from empirical success and failure counts in the clade:

$$\widehat{\mathrm{CMP}}(a) = \frac{s(a)}{s(a) + f(a)},$$

where $s(a)$ and $f(a)$ are the total numbers of successful and failed task evaluations observed across $C(a)$.
This lineage-centric metric provides a robust prediction of an agent’s true self-improvement capacity via aggregation over observed evolutionary trajectories.
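The clade-level estimate can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation; `AgentNode`, `clade`, and `cmp_estimate` are hypothetical names, and the pooled success rate stands in for the heuristic CMP estimator described above:

```python
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    """One node in the self-modification tree (hypothetical structure)."""
    name: str
    successes: int = 0                      # solved benchmark tasks at this node
    failures: int = 0                       # failed benchmark tasks at this node
    children: list = field(default_factory=list)

def clade(node):
    """Yield the node and every descendant reachable by self-modification."""
    yield node
    for child in node.children:
        yield from clade(child)

def cmp_estimate(node):
    """Heuristic CMP: pooled success rate over the node's entire clade."""
    s = sum(n.successes for n in clade(node))
    f = sum(n.failures for n in clade(node))
    return s / (s + f) if (s + f) else 0.0
```

A mediocre root whose descendants excel (say, a root scoring 2/10 with a child scoring 9/10) receives a high clade-level estimate despite its poor node-level accuracy, which is precisely the mismatch the metric is designed to capture.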
4. Tree-Guided Self-Modifying Search and Simulation of Gödel Machine Behavior
HGM operationalizes self-improvement by asynchronously expanding and evaluating a tree of self-modification candidates, guided by CMP estimates. Crucially, the process decouples the expansion (generation of new agent variants) from evaluation (performance measurement), enabling adaptive scheduling through techniques such as Thompson Sampling.
Under stated assumptions—budgeted expansion, repeatable evaluations, and reward confined to the terminal agent's utility—the paper proves (Theorem 1, Wang et al., 24 Oct 2025) that access to a true CMP oracle suffices to simulate Gödel Machine behavior in this context. CMP acts analogously to the state-action value function of a Markov decision process, ensuring that the selected expansion path optimizes long-term expected utility. Unlike traditional Gödel Machine instantiations, the simulation does not require a formal proof for each modification; it relies instead on empirical value aggregation over evolutionary trees.
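The Thompson Sampling step can be sketched over clade statistics. This is a hedged illustration rather than the paper's scheduler; `thompson_select` and the Beta(1, 1) prior are assumptions, with each candidate's clade-aggregated success/failure counts treated as a Bernoulli posterior:

```python
import random

def thompson_select(clade_stats, rng=None):
    """Choose the next node to expand by Thompson Sampling.

    clade_stats maps a node id to (successes, failures) aggregated over
    that node's clade. For each candidate we draw a plausible CMP value
    from a Beta posterior and expand the node with the highest draw,
    trading off exploitation (high observed CMP) against exploration
    (few trials, wide posterior).
    """
    rng = rng or random.Random()
    def draw(node_id):
        s, f = clade_stats[node_id]
        return rng.betavariate(s + 1, f + 1)  # Beta(1, 1) uniform prior
    return max(clade_stats, key=draw)
```

Because draws are stochastic, under-evaluated nodes with wide posteriors are still selected occasionally, which lets the search decouple expansion from evaluation without committing prematurely to the currently best-looking node.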
5. Empirical Performance and Dataset Transferability
Experimental results on established software engineering benchmarks—SWE-bench Verified and Polyglot—demonstrate that HGM substantially outperforms baseline agents, including the Darwin Gödel Machine and the Self-Improving Coding Agent. On a 60-task subset of SWE-bench Verified, HGM's best-belief final agent reaches a higher final accuracy than either baseline and shows a larger initial-to-final accuracy gain over the course of self-modification. On Polyglot, HGM likewise secures higher accuracy while consuming fewer computational resources.
The asynchronous, CMP-driven expansion and evaluation procedures contribute to both enhanced discovery of high-performing agents and heightened computational efficiency. These capabilities are validated quantitatively through accuracy metrics and resource consumption profiles reported in the benchmark results.
6. Human-Level Coding Agent Discovery
A significant milestone achieved by HGM is the discovery of coding agents whose performance matches human-engineered solutions on canonical benchmarks. When optimized with the GPT-5-mini backbone and evaluated on SWE-bench Lite, the best HGM-discovered agent matches the best officially checked results reported for human-designed agents on that benchmark.
This outcome demonstrates the effectiveness of lineage-based self-improvement over node-centric selection strategies, highlighting that agents optimized for longer-term clade utility can reach the upper bounds established by human expertise.
7. Code Release and Open Research Directions
All code and implementation details for HGM are publicly available at https://github.com/metauto-ai/HGM. This release facilitates reproducibility and further investigation by the research community. The lineage-based self-improvement paradigm instantiated in HGM is open to extension in broader coding and agent design domains, and the system’s transferability across backbone models and datasets underscores its generalizability.
Ongoing and future directions include refinement of CMP estimation, exploration of alternative reward aggregation strategies, and theoretical advances in simulation fidelity under less restrictive assumptions. Potential integration with self-reflective formal methods—as exemplified by machine-assisted proofs of Gödel’s incompleteness using hereditarily finite sets (Paulson, 2021)—may yield extended frameworks for automated verification and design in agent self-modification systems.