The Red Queen Gödel Machine: Co-Evolving Agents and Their Evaluators

Published 24 Jun 2026 in cs.LG, cs.AI, cs.MA, and cs.NE | (2606.26294v1)

Abstract: Self-improving agents are state-of-the-art (SOTA) on agentic coding benchmarks and have recently been extended to general domains. However, their search methods generally assume a stationary evaluation criterion: a fixed verifier, benchmark, or labeled dataset that remains valid as the agent improves. This ignores a central feature of evolution: species adapt as their environments change with them. We aim to bring the same principle to recursive self-improvement, making evaluation part of the improvement loop and opening search to evolving evaluators, adversarial objectives, and dynamic utilities that may surpass static benchmarks. We introduce the Red Queen Gödel Machine (RQGM), an evolutionary framework for recursive self-improvement under non-stationary utilities. The RQGM makes this possible through controlled utility evolution: search is organized into epochs with a fixed within-epoch evaluation criterion, while the utility can be updated at epoch boundaries, so self-improvement guarantees hold per epoch as the objective evolves across them. We begin by showing that even on verifiable coding tasks, the RQGM improves test pass rate over the prior SOTA by adding a complementary agent-as-a-judge code-review signal. This signal is cheaper and the RQGM uses 1.35x-1.72x fewer tokens. We then turn to scientific paper writing and reviewing, and Olympiad-level proof writing and grading, where the RQGM improves performance over prior self-improving agents: co-evolved writers reach 1.78x-1.86x higher acceptance rates under a diverse agent-as-a-judge panel, while co-evolved graders reach 9% higher ground-truth accuracy. In paper reviewing, the strongest baseline reviewer over-accepts AI-generated papers at up to 1.91x the human rate. The RQGM corrects this by introducing an adversarial objective that discovers reviewers equally stringent on AI and human work.

Summary

  • The paper introduces the RQGM framework that co-evolves agents and evaluators via epoch-based recursive self-improvement, overcoming the limits of fixed utility models.
  • The methodology employs controlled utility evolution with selective erasure and adversarial objectives, correcting a reviewer bias in which AI-generated papers were over-accepted at up to 1.91× the human rate, and improving performance across tasks.
  • Empirical results in code generation, scientific writing, and mathematical proofs show improved performance and efficiency, supported by per-epoch theoretical guarantees for self-improvement.

Introduction and Motivation

The Red Queen Gödel Machine (RQGM) introduces a principled framework for agentic recursive self-improvement under non-stationary, co-evolving utility functions. The central motivation stems from the observation that current self-improving AI systems overwhelmingly operate with stationary evaluation criteria—i.e., their objective function, benchmark, or verifier is externally fixed and remains invariant as the agent modifies itself. This paradigm stands in sharp contrast to the open-endedness and adaptability of biological evolution, where the fitness landscape is shaped by co-evolving competitors and adversarial dynamics. The RQGM addresses three intrinsic limitations of fixed evaluation: (1) its inapplicability to tasks lacking direct benchmarks (e.g., scientific writing), (2) inefficiency when evaluation is expensive or uninformative, and (3) vulnerability to benchmark saturation and reward hacking as agents adapt to static objectives.

Framework and Controlled Utility Evolution

The RQGM operationalizes recursive self-improvement with controlled, epochal evolution of the utility function. Search is partitioned into epochs, each with a frozen evaluator that provides a stationary utility signal. At epoch boundaries, a learned evaluator can be replaced by a challenger if the latter yields statistically superior performance on a held-out ground-truth anchor, operationalized via an $\epsilon$-best-belief score. When a promotion occurs, all utility records attributable to the displaced evaluator are erased (selective erasure), ensuring that per-epoch search is compatible with the self-improvement guarantees of fixed-criterion frameworks like HGM. This implementation supports both evaluator-dependent and evaluator-independent tasks within a unified, multi-agent metacognitive workspace, where both task agents and evaluators are subject to modification by a meta-agent.
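
The promotion rule can be illustrated with a minimal Python sketch. It assumes binary anchor outcomes, a uniform Beta(1, 1) prior, and that the $\epsilon$-best-belief score behaves like the $\epsilon$-quantile of the posterior over anchor accuracy; the paper's exact definition may differ. The names `best_belief` and `promote` are hypothetical, and the quantile is approximated by seeded Monte Carlo sampling to stay within the standard library.

```python
import random

def best_belief(successes, failures, eps=0.05, draws=20000, seed=0):
    """Approximate the eps-quantile of a Beta(1 + s, 1 + f) posterior.

    Assumption: this stands in for the eps-best-belief score; it is a
    conservative (lower-bound style) estimate of anchor accuracy.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + successes, 1 + failures) for _ in range(draws)
    )
    return samples[int(eps * draws)]

def promote(incumbent, challenger, eps=0.05):
    """Decide whether the challenger replaces the incumbent evaluator.

    Each argument is a (correct, incorrect) count pair measured on the
    held-out ground-truth anchor. The strict inequality means ties keep
    the incumbent.
    """
    return best_belief(*challenger, eps=eps) > best_belief(*incumbent, eps=eps)

# Toy anchor results: incumbent 60/100 correct, challenger 85/100 correct.
print(promote((60, 40), (85, 15)))
print(promote((60, 40), (60, 40)))
```

Scoring both evaluators by a low quantile rather than the mean penalizes challengers that have been measured on few anchor examples, which matches the conservative spirit of the promotion rule described above.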

Co-Evolution Mechanisms

Unlike previous frameworks (Darwin-Gödel Machine, Huxley-Gödel Machine, HyperAgents) that keep the utility signal fixed, RQGM enables multi-agent co-evolution: agents and their evaluators mutually bootstrap over evolutionary time. Evaluators themselves are LLM-based agentic processes, with learned behaviors and modifiable evaluation rubrics. Importantly, the multi-agent tree search maintains theoretical convergence guarantees per epoch and amortizes the cost of utility transitions through exponentially spaced checkpoints, so the additional overhead is only linear in the search budget.

Regularization and adversarial objectives can be injected at epoch boundaries. In one concrete demonstration, the RQGM incorporates an adversarial objective that reduces the over-acceptance bias of LLM reviewers towards AI-generated scientific papers, producing evaluators that maintain consistent stringency between human and AI submissions—a failure mode not addressed by fixed benchmarks.

Empirical Evaluation

RQGM is empirically validated across three domains: verifiable code generation (Polyglot), scientific paper writing/review (APReS), and Olympiad-level mathematical proof writing/grading (IMO-GradingBench).

Verifiable Coding (Polyglot)

By co-evolving a code reviewer alongside the code-generating agent, RQGM augments the standard test execution utility with a learned code-review signal, resulting in higher test pass rates than HGM-H, while using 1.35×–1.72× fewer tokens. The reviewer enables cheaper, single-turn evaluation, whereas standard execution is multi-turn and computationally expensive.

Scientific Writing & Reviewing (APReS)

In tasks without objective evaluation, RQGM produces writers co-optimized with reviewers. Co-evolved writers reached up to 1.86× higher acceptance rates (40.5% vs. 21.8%) than the state-of-the-art HGM-H under panel evaluation. Crucially, introducing an adversarial objective during evaluator transitions produces reviewers that are as stringent on AI-generated papers as on human papers, correcting the over-acceptance of AI-generated papers (up to 1.91× the human rate) observed in the strongest baseline reviewer.

Mathematical Proofs (IMO-GradingBench)

For Olympiad-level proofs, RQGM co-evolves both prover and grader. The co-evolved grader attains 9% higher ground-truth accuracy at 3× lower search cost compared to HGM-H. At increased search horizons, co-evolved provers surpass both SOTA baselines and crowd-engineered verification pipelines (IMO25), delivering higher mean scores and better Pass@6 rates, with remaining gaps on Pass@7 attributable to search budget limits.

Theoretical Guarantees

The RQGM framework is theoretically underpinned by a sequence of results:

  • Epoch-Local Validity: Because the evaluator is fixed within each epoch, the self-improvement guarantees of HGM carry over to each epoch.
  • Anchor-Guided Evaluator Promotion: Evaluators are only replaced when empirical evidence (on held-out ground-truth) warrants it, and replacements are guarded by conservative lower-bound estimates on performance.
  • Amortized Bookkeeping: The cost of maintaining consistent utility statistics over evaluator transitions is linear in search budget with exponential checkpoints.

These properties ensure that self-improvement progresses under sound estimation and cannot regress due to unprincipled evaluator churn, although global convergence remains epoch-local due to the changing objective.
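
The linear-overhead claim is consistent with a standard geometric-series argument. As a sketch under assumptions not stated in the source: let checkpoints occur at budgets $b_k = b_0 r^k$ with ratio $r > 1$, let recovery at checkpoint $k$ cost at most $c\,b_k$ for some constant $c$, and let $K$ be the last checkpoint with $b_K \le B$, where $B$ is the total search budget. The symbols $b_0$, $r$, $c$, and $K$ are introduced here for illustration only.

```latex
\sum_{k=0}^{K} c\, b_0 r^{k}
  \;=\; c\, b_0\, \frac{r^{K+1} - 1}{r - 1}
  \;<\; \frac{r}{r - 1}\; c\, b_0 r^{K}
  \;\le\; \frac{r}{r - 1}\; c\, B .
```

With $r = 2$, for example, the total recovery cost stays below $2cB$. Evenly spaced checkpoints under the same per-checkpoint cost model would instead sum to a cost that grows quadratically in $B$.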

Practical and Theoretical Implications

From a practical standpoint, RQGM opens agentic search to domains without robust or cheap evaluation, reducing compute cost and enabling progress in scientific discovery and mathematical reasoning tasks. The framework also enables in-place debiasing of evaluators, mitigating self-referential failure modes inherent to LLM-based judges.

Theoretically, RQGM demonstrates that recursive self-improvement can be robustly extended beyond stationary utility models, laying groundwork for truly open-ended, co-evolutionary agentic systems. It links progress in AI safety (via guardrails such as selective erasure and anchor-based regularization) with rapid capability gains across modalities. However, improvement is bounded by anchor dataset quality and stationarity assumptions, and epoch-local guarantees cannot enforce global optimality.

Limits and Future Directions

Current limitations include reliance on the quality and calibration of anchor datasets (APReS, IMO-GradingBench), epoch-local validity of theoretical guarantees, and evaluation exclusively in intellectual-artifact domains. Extending the framework to cross-domain, longer-horizon co-evolution with adaptive meta-agent search and automated anchor generation is necessary for more robust, scalable open-ended AI. Furthermore, decoupling the framework from handcrafted meta-reasoning or scheduler components will require additional theoretical tools and more sophisticated guardrails.

Conclusion

The RQGM establishes co-evolutionary utility as a viable and efficient extension to recursive self-improvement in agentic AI. Empirical results demonstrate stronger and more efficient agents and evaluators, robust debiasing under adversarial objectives, and enhancements on tasks with and without direct benchmarks. By combining controlled utility evolution, epochal stationarity, and multi-agent co-evolution, RQGM provides a rigorous foundation for self-improving AI in dynamically evolving environments, and delineates a clear research agenda toward open-ended, automated machine reasoning and discovery.

Explain it Like I'm 14

What this paper is about (big picture)

This paper introduces the Red Queen Gödel Machine (RQGM), a way for AI systems to improve themselves over time by not only upgrading the “doers” (agents that write code, papers, or math proofs) but also upgrading the “judges” that score their work. Think of it like a never‑ending science fair where both the students and the judges get better each round. This solves a big problem: many creative tasks (like writing a paper or proving a theorem) don’t have a simple, perfect score you can check automatically. RQGM learns better judges as it goes, so improvement doesn’t stall.

The main questions the paper asks

  • Can learning a judge help even when we already have a ground-truth checker (like running code tests), especially if that checker is slow or expensive?
  • Can learned judges make progress possible in areas with no clear automatic score (like paper writing or proof writing)?
  • As judges get better over time, do they guide the “doers” to become better too?
  • Do the judges themselves actually improve and become fairer and more reliable?

How the method works (in everyday language)

Picture a tournament that runs in rounds (the paper calls them epochs):

  • During each round, there’s a fixed judge that scores all the work. Keeping the judge fixed per round makes the scoring fair and stable for that round.
  • In the background, new “challenger judges” are trained. At the end of the round, we compare the current judge to these challengers using a secret, trustworthy answer set (called a ground-truth anchor, like human-graded examples). If a challenger is statistically better, it becomes the new judge for the next round.

Two key ideas make this work:

  • Controlled utility evolution: Only switch judges at round boundaries, not in the middle. That keeps each round fair and comparable.
  • Selective erasure: If we replace the judge, we delete only the old scores that depended on the old judge. That way we don’t mix apples and oranges (scores from different judges).
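
Selective erasure can be pictured as a filter over score records. The following is a minimal sketch, assuming each record remembers which judge produced it; the `Record` fields and the `selective_erase` name are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Record:
    agent_id: str
    task_id: str
    score: int                    # binary outcome: 1 = pass, 0 = fail
    evaluator_id: Optional[str]   # None: scored by a ground-truth checker

def selective_erase(records, displaced_evaluator_id):
    """Keep every record except those scored by the displaced judge."""
    return [r for r in records if r.evaluator_id != displaced_evaluator_id]

records = [
    Record("agent_1", "task_1", 1, "judge_v1"),
    Record("agent_1", "task_2", 0, None),        # ground-truth test result
    Record("agent_2", "task_1", 1, "judge_v2"),
]

# judge_v1 was replaced at the round boundary, so only its scores go away.
kept = selective_erase(records, "judge_v1")
print(len(kept))
```

Scores from ground-truth checkers and from other judges survive, so only the comparisons that depended on the replaced judge have to be rebuilt.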

How the search for better agents actually happens:

  • Family tree of versions: The system keeps a tree of agent versions (like a family tree). It tries edits to the code or strategy to make new “children,” evaluates them, and keeps the promising lines going.
  • Picking what to try next: It prefers branches that look more promising but still explores new ones. You can imagine drawing marbles from a bag where more promising branches have more marbles—this is a friendly way to think about “Thompson sampling.”
  • Conservative scoring: When choosing the current best agent, it uses a cautious estimate of performance (called “best-belief”), which gives credit only when the evidence is strong. Think “better safe than sorry.”
  • Multi-agent workspace: Each version doesn’t just include a single doer; it can include multiple roles—like a coder and a code reviewer—sharing the same codebase. An improvement can benefit more than one role at once (for example, a helper function that both coder and reviewer use).
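
The "marbles in a bag" picture of Thompson sampling can be sketched in a few lines of Python. This is an illustration only: it assumes each branch is summarized by simple pass/fail counts with a uniform Beta(1, 1) prior, whereas the actual system may track richer per-branch statistics. The names `thompson_pick`, `stats`, and the branch labels are hypothetical.

```python
import random

def thompson_pick(stats, rng):
    """Sample each branch's Beta posterior once and pick the largest draw.

    stats maps a branch name to a (successes, failures) pair.
    """
    draws = {
        name: rng.betavariate(1 + s, 1 + f) for name, (s, f) in stats.items()
    }
    return max(draws, key=draws.get)

rng = random.Random(0)
stats = {
    "branch_a": (18, 2),   # looks strong
    "branch_b": (5, 15),   # looks weak
    "branch_c": (0, 0),    # never tried: posterior is still uniform
}

counts = {name: 0 for name in stats}
for _ in range(2000):
    counts[thompson_pick(stats, rng)] += 1
print(counts)
```

The strong branch is chosen most often, but the untried branch still gets a share of picks because its posterior is wide. That is the explore-versus-exploit balance the bullet above describes.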

Judges and tasks come in two types:

  • Evaluator-independent (has a clear, automatic checker): Example—coding problems with runnable tests.
  • Evaluator-dependent (needs a learned judge): Example—reviewing a research paper or grading a proof.

To keep things fair and avoid overfitting:

  • Training and judging are separated. The system learns on one set of examples and is ranked on a different, held‑out set. Final results are reported on a third, fully unseen test set.

Efficiency tricks:

  • The judge is only switched at checkpoints (spaced out in time) to keep costs down.
  • When a judge changes, old agent outputs can often be re-checked without re‑generating them, saving compute.
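
The second trick, re-checking old outputs instead of regenerating them, amounts to caching artifacts separately from their scores. A minimal sketch follows, assuming artifacts are plain strings and judges are functions that return 0 or 1; `ArtifactCache`, `score_all`, and the toy judges are hypothetical names used only for illustration.

```python
class ArtifactCache:
    """Stores each generated artifact once, keyed by (agent, task)."""

    def __init__(self, generate):
        self._generate = generate      # expensive generation function
        self._store = {}
        self.generation_calls = 0      # counts how often we truly generate

    def get(self, agent_id, task_id):
        key = (agent_id, task_id)
        if key not in self._store:
            self.generation_calls += 1
            self._store[key] = self._generate(agent_id, task_id)
        return self._store[key]

def score_all(cache, pairs, judge):
    """Score every (agent, task) pair with the given judge."""
    return {pair: judge(cache.get(*pair)) for pair in pairs}

cache = ArtifactCache(lambda agent, task: f"{agent} answer to {task}")
pairs = [("agent_1", "task_1"), ("agent_2", "task_1")]

old_judge = lambda text: int(len(text) > 0)        # toy lenient judge
new_judge = lambda text: int("agent_2" in text)    # toy stricter judge

old_scores = score_all(cache, pairs, old_judge)
new_scores = score_all(cache, pairs, new_judge)    # re-score after the swap
print(cache.generation_calls)
```

Both scoring passes read the same cached artifacts, so the judge swap costs judge calls but no additional generation.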

What the experiments did and what they found

The team tested RQGM in three areas:

  1. Coding (Polyglot benchmark)
  • What they did: Co-evolved a code-writing agent and a learned code reviewer. The code still had ground-truth tests, but the learned reviewer gave cheaper, faster guidance during search.
  • What they found: RQGM beat the previous best method’s pass rate (71.7% vs 69.9%) while using about 1.35×–1.72× fewer “search tokens” (a measure of compute cost). Because the reviewer and coder shared useful code, upgrades often helped both at the same time.
  2. Paper writing and reviewing
  • What they did: Co-evolved a paper writer and a paper reviewer. The reviewer was checked against a human-judged dataset (APReS). The writer was then guided by that improving reviewer.
  • What they found: Papers from the co-evolved writer were accepted much more often by strong, fixed reviewers—rising from 21.8% (previous best) to 40.5%.
  3. Olympiad-level math proof writing and grading
  • What they did: Co-evolved a proof writer and a proof grader. The grader was anchored to human scoring (e.g., full credit = 7/7).
  • What they found: The learned grader outperformed static baselines and did so with about 3× lower search cost than the previous best system.

Extra observations:

  • Curriculum-like effect: As judges get stricter each round, the system reorders which agents look best. This nudges the agents to learn harder things over time.
  • Debiasing: Some learned judges tend to be too kind to AI-generated text. RQGM fixed this by collecting “too-kind approvals” as adversarial examples and training later judges to avoid those mistakes—making judgments fairer to both human and AI writing.
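
One way to picture the debiasing step is as a penalty applied when judges are compared at a round boundary. The sketch below is an assumption-laden illustration, not the paper's procedure: it assumes the adversarial pool is a list of previously over-accepted papers, that a judge is a function returning 1 (accept) or 0 (reject), and that the penalty is a simple weighted subtraction. The names `pool_acceptance_rate` and `regularized_score` and the `weight` value are hypothetical.

```python
def pool_acceptance_rate(judge, pool):
    """Fraction of adversarial-pool papers the judge still accepts."""
    if not pool:
        return 0.0
    return sum(judge(paper) for paper in pool) / len(pool)

def regularized_score(anchor_accuracy, judge, pool, weight=0.5):
    """Anchor accuracy minus a penalty for accepting pool papers.

    Assumption: a linear penalty; the real objective may be different.
    """
    return anchor_accuracy - weight * pool_acceptance_rate(judge, pool)

# Toy pool of papers that an earlier judge approved too easily.
pool = ["We achieve strong results.", "Our novel method is novel."]

lenient_judge = lambda paper: 1                         # accepts everything
strict_judge = lambda paper: int("ablation" in paper)   # wants evidence

lenient = regularized_score(0.80, lenient_judge, pool)
strict = regularized_score(0.78, strict_judge, pool)
print(strict > lenient)
```

In this toy setup the stricter judge wins the comparison even though its raw anchor accuracy is slightly lower, because it no longer repeats the harvested mistakes.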

Why these results matter

  • Works where no simple score exists: Many creative tasks don’t have a perfect, automatic checker. RQGM supplies its own evolving judge to push progress forward.
  • Speed and cost: Even when ground truth exists, a good learned judge can guide search more cheaply, saving compute while improving results.
  • More robust and fair judging: By co‑evolving judges and using adversarial examples, RQGM can reduce bias and reward hacking, making evaluations closer to what we really care about.
  • Shared benefits: When multiple roles share a codebase, improvements can help several parts of the system at once.

What this could mean for the future

RQGM points toward AI systems that can steadily improve themselves in open‑ended areas like science and math—without humans constantly rebuilding benchmarks. Agents and judges can bootstrap one another to reach stronger capabilities. The trade‑off is that guarantees about perfect convergence are looser than in fixed‑benchmark settings. Still, by changing judges only at clear checkpoints and carefully comparing them on trustworthy anchors, RQGM keeps improvement controlled and meaningful.

In short: RQGM makes self‑improving AI more practical and more general by letting both the doers and the judges learn together—much like how species evolve together in nature’s Red Queen race.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a focused list of concrete gaps and unresolved questions that future work could address to strengthen, generalize, and stress-test the Red Queen Gödel Machine (RQGM) framework.

  • Dependence on ground-truth anchors for evaluator replacement
    • How does performance degrade with sparse, noisy, or distribution-shifted anchors? Quantify sample complexity, noise tolerance, and anchor-size sensitivity for reliable evaluator promotion.
    • What alternatives work when no objective anchor is available (e.g., cross-play/tournaments, meta-judging, human-in-the-loop batched audits, or bootstrap tests), and how do they compare in reliability and cost?
  • Epoch-local stationarity assumption
    • Generated data distributions evolve as agents improve; formally characterize and detect violations of epoch-local stationarity and their impact on guarantees and empirical performance.
    • Develop diagnostics and control mechanisms (e.g., drift tests, gating, or adaptive epoch lengths) to keep within-epoch utilities effectively stationary.
  • Best-belief replacement criterion and statistical control
    • The choice of $\epsilon$ in $BB_\epsilon$ is under-specified: analyze sensitivity, calibration, and false-promotion risk, especially under sequential, multi-epoch comparisons (multiple testing).
    • Compare $BB_\epsilon$ against alternative decision rules (e.g., Bayes factors, sequential probability ratio tests, or Q-tests with alpha-spending) that control family-wise error or false discovery over many epochs.
  • Selective erasure and recovery overhead
    • Theoretical overhead is linear under exponential checkpoints, but constant factors and practical latency are unquantified. Benchmark wall-clock impacts, cache sizes, and re-evaluation costs across scales.
    • Investigate smarter recovery policies (e.g., priority re-scoring, importance sampling, partial credit reuse) that reduce disruption while preserving stationarity.
  • Role weighting and potential negative transfer
    • Utilities are uniformly averaged across roles and tasks, which may misalign with real-world priorities and difficulty. Study learned or adaptive weighting, multi-objective optimization, or constrained optimization (e.g., lexicographic or fairness-constrained).
    • Quantify negative transfer across roles, characterize when co-evolution helps vs. hurts, and devise mechanisms to mitigate interference (e.g., decoupled training phases, gradient orthogonalization for shared code, or role-specific isolation toggles).
  • Collusion and reward hacking between agents and learned evaluators
    • Within an epoch, agents can overfit to or collude with the frozen evaluator. Develop defenses (e.g., randomized holdout queries, cryptographic commitments on evaluation seeds, adversarial red teaming, or ensembles with disagreement penalties).
    • Formalize and detect emergent collusion signals (e.g., anomalous agreement patterns, shortcut features, or suspicious artifact–score correlations).
  • Adversarial debiasing via replayed samples
    • The adversarial pool approach to reduce over-leniency risks overfitting to the harvested set. Validate across multiple distributions (human vs. AI-generated across domains/genres) and evaluate fairness metrics (e.g., equalized odds, calibration).
    • Explore principled adversarial training objectives (e.g., distributionally robust optimization, minimax formulations) with formal guarantees.
  • Binary outcome restriction and heterogeneous difficulty
    • The framework assumes binary outcomes and uniform task weighting. Extend to graded/continuous evaluations, structured rubrics, and task-difficulty-aware sampling (e.g., IRT-style models, variance-aware Thompson sampling).
  • Guarantees limited to per-epoch convergence
    • Provide end-to-end regret/convergence guarantees under utility transitions; identify sufficient conditions for monotonic global improvement and non-degeneracy across epochs.
  • Simultaneous multi-slot evaluator replacements
    • Though order-independence of erasure is stated, interactions among multiple evaluator-dependent roles (shared code, correlated artifacts) are untested. Empirically probe and theoretically characterize cross-slot interference.
  • Scaling to many roles and large archives
    • Analyze computational complexity, memory footprint of caches, and Thompson sampling scalability with many roles/tasks and deep trees. Investigate pruning, batched Thompson sampling, or bandit approximations to maintain responsiveness.
  • Anchor reliability and external validity
    • APReS and IMO-grade anchors may carry label noise or domain biases. Measure robustness to label noise, re-annotate subsets for reliability, and validate cross-domain generalization beyond the anchor’s distribution.
  • Foundation model and inference setting dependence
    • Results rely on a specific model (e.g., GPT-5.5 low). Test across open-source and frontier models, different decoding parameters, and context lengths to assess robustness and reproducibility.
  • Omission of long-horizon, heavy-benchmark domains
    • SWE-bench was excluded due to runtime. Evaluate RQGM on long-horizon, interactive, or high-latency domains to understand real-world scalability and the effectiveness of learned evaluators as cheap proxies.
  • Cost reporting and comparability
    • Blended token metrics vary across providers and over time. Standardize cost/energy/wall-clock reporting and study trade-offs under varying model pricing curves and hardware.
  • Safety and containment
    • Self-modifying agents with evolving evaluators pose safety risks. Specify sandboxing, permission boundaries, supply-chain security for code edits, and intervention triggers for anomalous behavior.
  • Hyperparameter sensitivity and tuning
    • System performance likely depends on expansion exponent, checkpoint ratio, and $\epsilon$. Provide ablations, auto-tuning strategies, and principled defaults with sensitivity analyses.
  • Data isolation and leakage audits
    • Although isolation is described, formal audits are not provided. Implement and report comprehensive leakage checks across training/validation/test and across epochs, including generated-artifact reuse and cache effects.
  • Evaluator overfitting to anchors
    • Repeated promotion against a fixed anchor risks anchor overfitting. Introduce rotating or stratified anchors, out-of-anchor validation, or meta-regularization to preserve generalization.
  • Transfer claims require broader evidence
    • The observed 90% shared-edit transfer in Polyglot may be anecdotal. Quantify transfer rates, variance across seeds/domains, and conditions for positive vs. negative transfer.
  • Handling nondeterminism and flaky evaluations
    • Many evaluators (e.g., code tests, LLM judges) are stochastic or flaky. Develop robustness strategies (replicated evaluations, confidence intervals in CMP, de-flaking protocols).
  • Stability of search dynamics under changing evaluators
    • Evaluate for oscillations, cycles, or premature stagnation after evaluator swaps. Study adaptive epoch lengths, warm-starting, or smoothing to stabilize progress.
  • Cross-epoch comparability for unanchored roles
    • Best-belief scores are not comparable across epochs for unanchored roles. Propose cross-epoch normalization, periodic scoring on shared reference sets, or calibration models enabling longitudinal comparisons.
  • Incumbent tie-breaking policy
    • Favoring incumbents on ties may introduce inertia. Test alternative tie-breakers or statistical equivalence tests to balance stability vs. innovation.
  • Shared-artifact dependencies under erasure
    • When roles share artifacts or code, erasing slot-dependent records may have unintended knock-on effects. Define and validate dependency-aware erasure policies that preserve correctness without excessive data loss.
  • Reproducibility and release artifacts
    • The paper references appendices for prompts and ablations; to support replication, release full code, seeds, evaluator snapshots per epoch, and cached artifacts to enable exact re-runs.

Practical Applications

Immediate Applications

The following applications can be deployed with current LLM capabilities and standard MLOps tooling, leveraging RQGM’s epoch-based evaluator replacement, multi-agent workspaces, and anchor-guided best-belief selection.

  • Software engineering: co-evolved code reviewers in CI/CD
    • Sector: Software
    • Description: Integrate a learned code reviewer that co-evolves with coding agents to score patches before expensive test runs, improving search efficiency and patch quality (as shown on Polyglot).
    • Tools/Products/Workflows: GitHub/GitLab app that adds an “evaluator slot” to PR pipelines; cache-and-rescore module for selective erasure; epoch scheduler; CRAVE-like label ingestion; best-belief dashboards.
    • Assumptions/Dependencies: Availability of historical PR acceptance labels or internal style guides as a ground-truth anchor; policy for auto-scoring PRs; cheap LLM inference; tests remain the ultimate arbiter.
  • Low-cost proxy evaluators for compute-heavy verifiers
    • Sector: Software, Robotics, Simulation
    • Description: Use a co-evolved “agent-as-a-judge” as a fast prefilter to reduce calls to slow verifiers (e.g., integration tests, long simulations).
    • Tools/Products/Workflows: Surrogate scoring service; adjustable acceptance thresholds per epoch; Thompson-sampling based scheduling to decide when to use the cheap proxy vs. ground truth.
    • Assumptions/Dependencies: Anchors that quantify agreement between proxy and ground truth; monitoring for drift; fallback-to-verifier gates.
  • Internal manuscript writing and self-review loops
    • Sector: Academia, R&D
    • Description: Co-evolve a paper-writing agent and a reviewer using internal or public anchors (e.g., APReS), improving acceptance likelihood and reducing human editorial load.
    • Tools/Products/Workflows: Lab-internal writing assistant; reviewer slot anchored to curated accept/reject datasets; epoch checkpoints that introduce adversarial pools to debias reviewers.
    • Assumptions/Dependencies: Access to de-identified decision data (or public anchors); editorial policies for AI-assisted drafting; IRB/ethics compliance.
  • Bias-calibrated LLM-as-a-judge services
    • Sector: Publishing, Platforms, Policy
    • Description: Deploy reviewers/raters that are debiased using RQGM’s adversarial replay (e.g., correcting over-acceptance of AI-generated text), improving fairness of automated judgments.
    • Tools/Products/Workflows: Reviewer slot with adversarial-sample regularizer introduced at epoch boundaries; cross-check against human-labeled anchors; bias/fairness reports per epoch.
    • Assumptions/Dependencies: High-quality, representative anchors; governance over what constitutes “bias”; capacity to monitor and roll back evaluator transitions.
  • Auto-grading and formative feedback for STEM proofs
    • Sector: Education
    • Description: Co-evolve a grader with a proof-writing agent anchored to human rubrics (e.g., IMO-GradingBench), delivering robust grading and adaptive feedback.
    • Tools/Products/Workflows: LMS plugin; grader slot with epoch-local stationarity; caching of student submissions for re-scoring after evaluator updates; curriculum-like progression via stricter evaluators.
    • Assumptions/Dependencies: Institution-approved rubrics as anchors; transparency to students; safeguards against evaluator drift during a course.
  • Automated triage of issues and PRs
    • Sector: Software
    • Description: Use evaluator slots trained on historical triage decisions to prioritize incoming issues/PRs and route to teams, reducing cycle time.
    • Tools/Products/Workflows: Backlog scoring service; anchor from historical “merged/closed/priority” labels; selective erasure to keep triage metrics consistent across evaluator updates.
    • Assumptions/Dependencies: Clean historical logs; agreement on triage definitions; privacy/security for repositories.
  • Dynamic evaluation harness for ML teams
    • Sector: AI/ML Engineering
    • Description: Replace static eval suites with RQGM’s controlled utility evolution—epoch-local, anchor-based evaluator replacement, and bounded recovery—to reduce reward hacking and stale benchmarks.
    • Tools/Products/Workflows: Eval orchestration library (evaluator slots, best-belief selection, epoch checkpoints, Thompson sampling over clade metaproductivity); anchor registries.
    • Assumptions/Dependencies: Curated anchor datasets; CI for evaluators; organizational acceptance of changing evaluation criteria at defined checkpoints.
  • Safety red-teaming with adversarial evaluators
    • Sector: AI Safety, Security
    • Description: Co-evolve adversarial evaluators and target agents to surface reward hacking and specification gaming; replay exploit samples at later epochs as an “adversarial pool.”
    • Tools/Products/Workflows: Red-team evaluator slots; adversarial objective toggled at checkpoints; selective erasure to avoid mixing utilities; audit logs for transitions.
    • Assumptions/Dependencies: Clear safety anchors (policy violations, jailbreak taxonomies); human oversight; incident response workflows.
  • Automated discovery/ideation ranking in labs
    • Sector: R&D, Pharma, Materials
    • Description: Use learned evaluators to rank generated hypotheses/experiments when direct metrics are weak or delayed, anchored to expert labels or proxy assays.
    • Tools/Products/Workflows: Hypothesis generator + evaluator slots; anchor curation from past expert decisions; staged promotion of evaluators; post-hoc scoring against fixed baselines.
    • Assumptions/Dependencies: Availability of expert-labeled anchors; domain-specific validation; careful separation of training/validation/test to prevent leakage.
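
As an illustration of the low-cost proxy evaluator workflow listed above, the following minimal Python sketch runs a slow verifier only on candidates that a cheap proxy judge scores at or above a threshold. Everything here is a stand-in: `gated_evaluate`, the `threshold` value, and the toy `proxy` and `verifier` functions are hypothetical, and a production gate would also need the drift monitoring mentioned in that item.

```python
def gated_evaluate(candidates, proxy, verifier, threshold=0.5):
    """Prefilter with a cheap proxy, then confirm with the slow verifier.

    Returns (verified_passes, verifier_calls). The verifier stays the
    final arbiter: nothing is accepted on the proxy score alone.
    """
    passes, verifier_calls = [], 0
    for candidate in candidates:
        if proxy(candidate) < threshold:
            continue                   # rejected cheaply; verifier is skipped
        verifier_calls += 1
        if verifier(candidate):
            passes.append(candidate)
    return passes, verifier_calls

# Toy stand-ins: candidates are integers, the proxy likes even numbers,
# and the "expensive" verifier accepts multiples of four.
candidates = list(range(8))
proxy = lambda x: 1.0 if x % 2 == 0 else 0.0
verifier = lambda x: x % 4 == 0

passes, verifier_calls = gated_evaluate(candidates, proxy, verifier)
print(passes, verifier_calls)
```

The trade-off in this design is that any candidate the proxy wrongly rejects is never verified, which is why the proxy's agreement with ground truth needs to be measured on an anchor before the threshold is trusted.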

Long-Term Applications

These applications require further research, scaling, domain integration, or governance frameworks before safe and reliable deployment.

  • Autonomous scientific discovery loops
    • Sector: Academia, Pharma, Materials
    • Description: End-to-end systems where generators propose hypotheses, methods, and analyses, while evaluators co-evolve to reflect scientific quality and novelty, anchored to expert panels or downstream success rates.
    • Tools/Products/Workflows: Multi-role workspaces (generator, reviewer, replicator, statistician); longitudinal anchors (replication outcomes); adaptive curricula via evaluator transitions.
    • Assumptions/Dependencies: High-quality, longitudinal anchors; wet-lab integration; strong guardrails for scientific validity and research ethics.
  • Editorial decision support at scale (conferences/journals)
    • Sector: Publishing, Policy
    • Description: Co-evolved reviewers triage submissions, with anchors from historical decisions and post-decision outcomes; adversarial debiasing to ensure parity across author groups and content sources.
    • Tools/Products/Workflows: Reviewer marketplaces with standardized anchor packs; epoch-governed evaluator updates; human-in-the-loop gating.
    • Assumptions/Dependencies: Governance for evolving criteria; transparency and appeal processes; anti-gaming safeguards.
  • Healthcare decision-support with evolving evaluators
    • Sector: Healthcare
    • Description: Co-evolve diagnostic/treatment agents with evaluators anchored to clinical outcomes, guidelines, and safety constraints, improving personalization while resisting reward hacking.
    • Tools/Products/Workflows: Evaluator slots aligned to guideline updates; outcome-tracking anchors; selective erasure and re-certification after evaluator transitions.
    • Assumptions/Dependencies: Regulatory approval; robust causal anchors; data privacy; post-market surveillance.
  • Autonomous robotics with learned reward evaluators
    • Sector: Robotics, Manufacturing
    • Description: Policies co-evolve against evaluators that capture implicit preferences (e.g., human comfort, wear-and-tear), anchored to safety tests and human feedback.
    • Tools/Products/Workflows: Sim-to-real pipelines with epoch checkpoints; reward model slots; replay of near-miss/incident logs as adversarial pools.
    • Assumptions/Dependencies: Safety certification; reliable anchors correlating with real-world outcomes; robust stationarity within epochs.
  • Finance: strategies co-evolving with risk evaluators
    • Sector: Finance
    • Description: Trading/optimization agents co-evolve with evaluators anchored to risk-adjusted returns and compliance constraints, reducing model overfitting to specific backtests.
    • Tools/Products/Workflows: Risk evaluator slots; rolling anchors (out-of-sample periods); bounded recovery to control re-scoring cost; compliance audit trails.
    • Assumptions/Dependencies: Strong anchors for risk and transaction costs; strict governance; latency-sensitive deployment constraints.
  • Adaptive regulatory and procurement evaluation frameworks
    • Sector: Government, Procurement
    • Description: Regulators adopt controlled utility evolution to keep AI evaluation criteria current, anchored to standardized tests and red-team findings; agencies procure systems with evaluator slots that can be updated under governance.
    • Tools/Products/Workflows: Public anchor registries; certified evaluator packs; epoch schedules aligned with policy cycles; mandatory selective erasure for auditability.
    • Assumptions/Dependencies: Legal standards; institutional capacity; interoperability norms for anchors/evaluators.
  • Energy systems optimization with safety evaluators
    • Sector: Energy, Infrastructure
    • Description: Grid/market optimization agents co-evolve with evaluators that encode stability, safety, and fairness constraints, anchored to regulatory and operational metrics.
    • Tools/Products/Workflows: Operator-in-the-loop anchors; fail-safe evaluator transitions; cross-season generalization checks.
    • Assumptions/Dependencies: High-fidelity anchors; reliability under rare events; regulatory approval.
  • Personalized assistants co-evolving with user preference evaluators
    • Sector: Consumer Software
    • Description: Assistants that adapt to individual preferences via co-evolved evaluators anchored to explicit feedback and implicit engagement signals, with epoch-based controls to prevent drift and reward hacking.
    • Tools/Products/Workflows: On-device or federated evaluator slots; privacy-preserving anchors; rollback after negative transitions.
    • Assumptions/Dependencies: Consent and privacy; robust preference elicitation; guardrails against manipulative optimization.
  • Training-time RLHF replacement with anchor-guided evaluator evolution
    • Sector: AI/ML
    • Description: Replace static reward models with co-evolved evaluators anchored to human preference datasets and adversarial test suites, reducing reward tampering and benchmark overfitting.
    • Tools/Products/Workflows: Evaluator slots in the RLHF loop; epoch checkpoints with adversarial replay; best-belief selection for reward model promotion.
    • Assumptions/Dependencies: High-quality, diverse human preference anchors; scalable human-in-the-loop; training stability under non-stationary utilities.

Notes on Cross-Cutting Assumptions and Dependencies

  • Anchors are pivotal: Each evaluator slot requires a stable, evaluator-independent ground-truth anchor (objective tests or human labels) for safe replacement; anchor quality determines reliability.
  • Epoch-local stationarity: Utilities must remain fixed within epochs; selective erasure is necessary to prevent mixing evidence across different utility functions.
  • Data isolation: Strict separation of training, validation, and test feedback is needed to avoid leakage and overfitting.
  • Compute and cost: Benefits rely on inexpensive LLM inference relative to task verification; token pricing and latency may affect feasibility.
  • Governance and safety: Changing evaluators requires auditability, rollback mechanisms, and clear policies—especially in high-stakes domains (healthcare, finance, critical infrastructure).
  • Bias and robustness: Anchors and evaluators must be checked for domain shift and bias; adversarial replay helps but does not eliminate the need for human oversight.
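
The first two notes (anchor-based replacement and selective erasure) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Record` and `Slot` types, the `utility_transition` signature, the uniform Beta(1, 1) prior, and the default `eps` are all assumptions made for illustration. A challenger is promoted only if its ϵ-quantile score on the anchor exceeds the incumbent's, and only records scored by the replaced evaluator are discarded.

```python
from dataclasses import dataclass
from math import comb
from typing import List, Optional, Tuple


def beta_cdf(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) for integer a, b >= 1, via the binomial-tail identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def best_belief(successes: int, failures: int, eps: float = 0.05) -> float:
    """eps-quantile of the Beta posterior (the Beta(1, 1) prior is an assumption)."""
    a, b = successes + 1, failures + 1
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


@dataclass
class Record:
    """One utility observation; evaluator is None for fixed-benchmark scores."""
    agent: str
    evaluator: Optional[str]
    success: bool


@dataclass
class Slot:
    """Hypothetical evaluator slot with the incumbent's anchor statistics."""
    evaluator: str
    anchor_successes: int
    anchor_failures: int


def utility_transition(
    slot: Slot,
    challenger: str,
    challenger_successes: int,
    challenger_failures: int,
    records: List[Record],
    eps: float = 0.05,
) -> Tuple[bool, List[Record]]:
    """Promote the challenger if it scores higher on the anchor, then erase selectively."""
    incumbent_score = best_belief(slot.anchor_successes, slot.anchor_failures, eps)
    challenger_score = best_belief(challenger_successes, challenger_failures, eps)
    if challenger_score <= incumbent_score:
        return False, records  # no transition: all evidence stays valid

    replaced = slot.evaluator
    slot.evaluator = challenger
    slot.anchor_successes = challenger_successes
    slot.anchor_failures = challenger_failures
    # Selective erasure: drop only records that depended on the replaced evaluator.
    kept = [r for r in records if r.evaluator != replaced]
    return True, kept
```

Records with `evaluator=None` (fixed-benchmark scores) survive every transition in this sketch, so evidence gathered under an unchanged criterion is never mixed with evidence from a replaced one.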

Glossary

  • Adversarial objectives: Training or evaluation goals that explicitly oppose or challenge the current model to reduce bias or overfitting. "debiasing over-lenient evaluators through adversarial objectives"
  • Adversarial-sample regularizer: An added penalty or training signal that uses adversarial examples to regularize evaluators or agents. "adding an adversarial-sample regularizer."
  • Agent-as-a-judge: A paradigm where an agent (often an LLM) evaluates artifacts or other agents’ outputs. "by adding an agent-as-a-judge code reviewer."
  • Amortized utility-transition cost: The idea that the overhead of switching evaluators and re-scoring is spread out over time to control total cost. "Amortized utility-transition cost."
  • Beta posterior: The posterior distribution (Beta) over a Bernoulli success probability given observed successes and failures. "the Beta posterior over the agent's successes $S_a$ and failures $F_a$"
  • Best-belief score: A conservative utility estimate defined via the lower quantile of a Beta posterior; here, the $\epsilon$-quantile. "A challenger evaluator is promoted only when it raises an $\epsilon$-best-belief score on ground truth"
  • Checkpoint schedule: A planned set of moments (checkpoints) when evaluator replacements are considered to control re-evaluation costs. "To control the re-evaluation costs we introduce a checkpoint schedule."
  • Clade: In this context, the subtree rooted at a node in the search tree, including all its descendants. "clade $C(a)$, defined as the subtree rooted at $a$"
  • Clade metaproductivity (CMP): A utility that pools success/failure statistics over a node’s entire clade to guide search. "adopt clade metaproductivity (CMP) as the search utility"
  • Controlled utility evolution: A mechanism that allows the evaluation objective to change only at defined epoch boundaries to preserve per-epoch stationarity. "controlled utility evolution, which divides search into evolutionary epochs."
  • Darwin–Gödel Machine (DGM): An empirical system that replaces the Gödel Machine’s proof search with archive search over observed utility for self-improvement. "the Darwin-Gödel Machine (DGM) [zhang2025darwin]"
  • Epoch-local stationarity: The property that evaluation criteria remain fixed within an epoch, enabling standard convergence guarantees. "Erasure preserves epoch-local stationarity:"
  • Evaluator-dependent role: A role whose performance is measured by a learned evaluator rather than an objective benchmark. "We call such roles evaluator-dependent, in contrast to evaluator-independent roles with a fixed benchmark."
  • Evaluator slot: A replaceable position in the system that holds the current evaluator for a role. "Each evaluator-dependent role is scored through an evaluator slot replaceable during search"
  • Evolutionary epoch: A time segment in search during which evaluators are frozen and the utility remains fixed. "divides search into evolutionary epochs."
  • Gödel Machine: A theoretical self-improving system that rewrites itself when it can prove the rewrite beneficial. "The Gödel Machine [good1966speculations, schmidhuber2003godel] is a theoretical construct"
  • Ground-truth anchor: A fixed, evaluator-independent dataset (e.g., with objective or human labels) used to compare and promote evaluators. "on a ground-truth anchor"
  • Huxley–Gödel Machine (HGM): A variant that scores nodes by the utility of their entire descendant clade rather than only the node itself. "the Huxley-Gödel Machine (HGM) [wang2025huxley]"
  • HyperAgents: A framework extending self-modification beyond coding by pairing meta-agents with task agents within each archive node. "HyperAgents lifts self-modification beyond coding by giving each archive node a meta-agent/task-agent pair."
  • Inverse regularized incomplete Beta function: The quantile function used to compute the best-belief score from a Beta posterior. "the inverse regularized incomplete Beta function"
  • Jeffreys intervals: Bayesian credible intervals for binomial proportions based on the Beta(1/2, 1/2) prior. "Uncertainty is reported with 95% central Beta (Jeffreys) intervals"
  • Learned evaluator: An adaptive, model-based judge trained to assess artifacts when objective benchmarks are unavailable or insufficient. "co-evolved learned evaluators"
  • Meta-agent: A higher-level agent responsible for editing or orchestrating the workspace and its component agents. "Each node is allocated a meta-agent"
  • Multi-agent workspace: A shared, editable environment at each node containing multiple roles (task agents and evaluators). "each archive node is a multi-agent workspace"
  • Open-ended search: Search aimed at continuously generating novel, increasingly complex artifacts without a fixed end goal. "Open-ended search aims to keep generating novel artifacts"
  • Quality-diversity archives: Archives that maintain a diverse set of high-quality solutions across behavior or feature dimensions. "maintaining quality-diversity archives"
  • Reward hacking: Exploiting flaws in evaluation metrics to achieve high scores without genuinely solving the intended task. "leaves them open to reward hacking"
  • Selective erasure: Deleting only those utility records that depended on an evaluator being replaced, preserving unrelated evidence. "then applies selective erasure, discarding only utility records that depended on the replaced evaluator."
  • Self-play: Training by repeatedly competing against past or current versions of oneself. "self-play pits an agent against stronger versions of itself"
  • Thompson sampling: A Bayesian exploration strategy that samples actions according to their posterior probability of being optimal. "selected by Thompson sampling over clade metaproductivity"
  • UCB-Air gate: A scheduling rule that balances expansions and evaluations using an upper-confidence-bound style criterion. "a UCB-Air gate [NIPS200849ae49a2] adds a new node"
  • Utility transition: A boundary operation that replaces an evaluator and erases dependent records, enabling controlled changes to the objective. "We define a utility transition as a procedure that replaces a slot's evaluator and performs selective erasure"
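
To make the clade, clade metaproductivity (CMP), Beta posterior, and Thompson sampling entries concrete, here is a small Python sketch under stated assumptions. The `Node` type, the function names, and the uniform Beta(1, 1) prior are hypothetical and chosen for illustration; the paper's actual selection rule may differ in its details. Each node's success and failure counts are pooled over its clade, one sample is drawn from the resulting Beta posterior per node, and the node with the largest sample is selected for expansion.

```python
import random
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical archive node with its own success/failure counts."""
    name: str
    successes: int = 0
    failures: int = 0
    children: List["Node"] = field(default_factory=list)


def clade_counts(node: Node) -> Tuple[int, int]:
    """Pool successes and failures over the subtree rooted at `node`."""
    s, f = node.successes, node.failures
    for child in node.children:
        cs, cf = clade_counts(child)
        s += cs
        f += cf
    return s, f


def iter_nodes(root: Node) -> Iterator[Node]:
    """Yield every node in the tree, root first."""
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def thompson_select(root: Node, rng: random.Random) -> Node:
    """Draw one posterior sample per clade and return the node with the largest draw."""
    best_node, best_draw = root, -1.0
    for node in iter_nodes(root):
        s, f = clade_counts(node)
        # Beta(1, 1) prior is an assumption, giving the posterior Beta(s + 1, f + 1).
        draw = rng.betavariate(s + 1, f + 1)
        if draw > best_draw:
            best_node, best_draw = node, draw
    return best_node
```

Because selection is by sampled value rather than posterior mean, clades with few observations still get chosen occasionally, which is the exploration behavior the Thompson sampling entry describes.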
