Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

Published 21 May 2026 in cs.MA, cs.AI, and cs.SE | (2605.22343v1)

Abstract: Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But executable workflows do not by themselves produce research judgment. We analyze where current systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains textual, and recurring process failures do not change later behavior. We introduce Sibyl-AutoResearch, a self-evolving AutoResearch framework built around Scientific Trial-and-Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route lessons into later planning, validation, claim scope, scheduling, critique, writing, and harness repair. We formalize this through two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file-backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, with a median latency of one iteration and a maximum latency of three iterations. A recovered-failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units are recoverable from realistic autonomous-research workspaces. The SIBYL framework and system are available at https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem.


Authors (6)

Summary

The paper introduces a framework that formalizes how trial signals are converted into later agent behavior and harness updates, so that trial experience shapes research judgment instead of being lost.
It details a methodology with structured trials, explicit evidence maturity stages, and role segmentation to ensure adaptive evolution and auditability.
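A retrospective audit of SIBYL workspaces identifies eight high-confidence conversion events, with a median latency of one iteration from trial signal to behavior change (maximum three), and a registry of five recovered failure classes that were blocked, downgraded, or routed into repair.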

Authoritative Essay on "Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators" (2605.22343)

Motivation and Problem Diagnosis

"Sibyl-AutoResearch" articulates an acute systems-level challenge in the emerging field of autonomous research: while present-day LLM-driven scientific agents can execute end-to-end workflows—hypothesis generation, code execution, experimentation, manuscript drafting—the critical bottleneck is not executable completeness, but the absence of systematic research judgment. The paper identifies pervasive failure modes in existing autonomous research agents: trial signals are not routed into subsequent constraints, weak evidence is elevated into confident claims, textual memory fails to alter future planning or validation, and repeated process failures do not trigger harness adaptations. This diagnosis reframes the problem as one of missing update paths from trial experience to future behavioral or infrastructural change.

Framework: Scientific Trial-and-Error Harnesses

The central contribution is the formalization of Scientific Trial-and-Error Harnesses as the enabling environment for agent-harness co-evolution. A harness is defined concretely: research state, roles, tools, memory, gates, artifact contracts, compute policies, and repair mechanisms. The harness is intended to facilitate bounded trials, preserve both positive and negative outcomes, and ensure trial history alters subsequent research actions and system mechanics.

Two auditable conversion units are introduced:

Trial-to-behavior conversion: A signal from a trial at iteration $t$ must produce a behavior change in a subsequent iteration $t+k$ (in planning, validation, claim boundary, resource allocation, critique, writing, and so on).
Trial-to-harness-behavior conversion: Recurring process failures (duplicate results, stale artifacts, unsupported statistics) must induce changes in harness parameters, gates, repair mechanisms, prompt overlays, or scheduler policies.
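
The two conversion units can be made concrete as a small record type. The sketch below is illustrative only: the class name `ConversionEvent`, its fields, the `latency_summary` helper, and the sample events are assumptions made for this example, not SIBYL's actual schema. Requiring k to be at least 1 is one reading of "subsequent iteration".

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class ConversionEvent:
    """One audited link from a trial signal to a later change (hypothetical schema)."""
    signal: str             # what the trial showed at iteration t
    signal_iteration: int   # t
    change: str             # the behavior or harness change it produced
    change_iteration: int   # t + k
    kind: str = "trial-to-behavior"  # or "trial-to-harness-behavior"

    def __post_init__(self):
        # Assumption: the change must land in a strictly later iteration (k >= 1).
        if self.change_iteration <= self.signal_iteration:
            raise ValueError("change must occur after the signal iteration")

    @property
    def latency(self) -> int:
        """k, the number of iterations between signal and change."""
        return self.change_iteration - self.signal_iteration


def latency_summary(events):
    """Median and maximum latency over a list of conversion events."""
    latencies = [e.latency for e in events]
    return {"median": median(latencies), "max": max(latencies)}


# Made-up events for illustration; these are not taken from the paper's audit.
example_events = [
    ConversionEvent("pilot effect vanishes at larger seed count", 1,
                    "claim narrowed to pilot scope", 2),
    ConversionEvent("duplicate result rows detected", 2,
                    "deduplication gate added", 3,
                    kind="trial-to-harness-behavior"),
    ConversionEvent("controller unstable under budget change", 2,
                    "stability test scheduled", 5),
]
```

Recording the signal and the change with their iteration indices is what makes the latency statistic auditable: each value of k can be traced back to two named artifacts.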

Sibyl-AutoResearch Design Commitments

The Sibyl-AutoResearch framework is distilled into seven harness functions corresponding to observable commitments:

Trial orchestration: Trials are structured with explicit dependencies, outputs, and abort criteria; evidence alters future trial plans.
Evidence maturity: Explicit states separate execution, pilot results, analysis-ready evidence, paper-ready evidence, and audited claims; claims only advance with maturity validation.
Traceability: Behavior updates are tied to artifacts, so the rationale for each update can be reconstructed.
Routed memory: Lessons are normalized and injected into the roles needing them; memory is actionable, not simply preserved text.
Perspective separation: Distinct agent roles (optimist, skeptic, methodologist, supervisor, writer) ensure disagreements become actionable validation, plan mutations, or claim downgrades.
Resource-aware trial policy: Wasteful trials reshape resource allocation and sanity-check ordering.
Harness self-evolution: Process failures alter the harness itself, producing more robust evidence integrity.

These commitments are operationalized in a file-backed autonomous research system, SIBYL. SIBYL records research artifacts, workspace traces, plan evolution, memory overlays, and review outputs, allowing for retrospective audit of conversion units.
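
The evidence maturity commitment reads naturally as a small state machine. The following is a minimal sketch under stated assumptions: the enum `Maturity`, the functions `advance`, `downgrade`, and `may_appear_in_paper`, and the one-step-at-a-time rule are hypothetical names and choices for illustration, not SIBYL's implementation.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The five evidence states named in the text, ordered from weakest to strongest."""
    EXECUTED = 0
    PILOT = 1
    ANALYSIS_READY = 2
    PAPER_READY = 3
    AUDITED = 4


def advance(current: Maturity, validation_passed: bool) -> Maturity:
    """Move up exactly one state, and only if validation passed (assumed rule)."""
    if not validation_passed or current is Maturity.AUDITED:
        return current
    return Maturity(current + 1)


def downgrade(current: Maturity, to: Maturity) -> Maturity:
    """Evidence can also lose maturity, e.g. after a failed replication."""
    return min(current, to)


def may_appear_in_paper(level: Maturity) -> bool:
    """Gate: only paper-ready or audited evidence may back a written claim."""
    return level >= Maturity.PAPER_READY
```

For instance, `advance(Maturity.PILOT, validation_passed=False)` leaves the evidence at `Maturity.PILOT`, so a pilot signal cannot reach the manuscript by default.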

Evidence and Implementation in SIBYL

The paper then reports a retrospective process audit of SIBYL workspaces. Eight high-confidence conversion events are hand-marked, with a median latency of one iteration and a maximum of three iterations between signal and behavior update. A recovered-failure registry covers five naturally occurring failure classes (duplicate results, confidence interval inversion, stale figures, feature-count mismatch, unsupported statistics); in each case, the harness blocks or downgrades the affected result, or routes the failure into repair tasks or claim downgrades. For example, in dynamic weight-decay experiments, controller instability and budget confounds trigger controller repair, stability tests, and narrowed claims in subsequent iterations.
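
Two of the failure classes above, duplicate results and confidence interval inversion, are simple enough to sketch as an integrity check. This is an assumed illustration: `ResultRow`, `check_results`, and the action assigned to each failure class are hypothetical and do not reflect how SIBYL's gates are actually written.

```python
from dataclasses import dataclass


@dataclass
class ResultRow:
    """One reported result with its confidence interval (hypothetical schema)."""
    name: str
    mean: float
    ci_low: float
    ci_high: float


def check_results(rows):
    """Return (failure_class, row_name, action) tuples for the rows that fail.

    The action assigned to each failure class is an assumption for illustration.
    """
    findings = []
    seen = {}  # maps (mean, ci_low, ci_high) to the first row that reported it
    for row in rows:
        # An inverted interval can never be valid, so the result is blocked.
        if row.ci_low > row.ci_high:
            findings.append(("confidence_interval_inversion", row.name, "block"))
        # Identical numbers under a different name suggest a copy or stale-cache bug.
        key = (row.mean, row.ci_low, row.ci_high)
        if key in seen:
            findings.append(("duplicate_results", row.name, "route_to_repair"))
        else:
            seen[key] = row.name
    return findings


rows = [
    ResultRow("baseline", 0.80, 0.70, 0.90),
    ResultRow("variant_a", 0.80, 0.70, 0.90),  # same numbers as baseline
    ResultRow("variant_b", 0.50, 0.60, 0.40),  # lower bound above upper bound
]
findings = check_results(rows)
```

Here `findings` flags `variant_a` as a duplicate and `variant_b` as an inverted interval; a harness could refuse to promote either row until the repair task closes.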

Critical insights emerge from analysis of reviewer and supervisor artifacts: review scores are poor progress metrics, but objections (validation strength, claim scope, baseline adequacy, synchronization errors) are systematically converted into refinement actions, validation gates, and harness changes. Across traces, negative results and disagreement are handled not as dead-ends or rhetorical theater, but as evidence-boundary signals that re-configure subsequent science.
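
One way to picture how objections become actions is a routing table from objection category to the role that must act on it. The sketch below is an assumption throughout: the `ROUTES` table, the action names, and `route_objections` are invented for illustration, and only the objection categories and role names come from the text.

```python
from collections import defaultdict

# Hypothetical routing table: objection category -> (receiving role, action type).
ROUTES = {
    "validation_strength": ("methodologist", "add_validation_gate"),
    "claim_scope": ("writer", "narrow_claim"),
    "baseline_adequacy": ("methodologist", "add_baseline_trial"),
    "synchronization_error": ("supervisor", "harness_repair"),
}


def route_objections(objections):
    """Turn (category, note) objections into per-role action items.

    Returns (inbox, unrouted): inbox maps each role to its action items, and
    unrouted keeps objections with no known route so they are not silently lost.
    """
    inbox = defaultdict(list)
    unrouted = []
    for category, note in objections:
        route = ROUTES.get(category)
        if route is None:
            unrouted.append((category, note))
            continue
        role, action = route
        inbox[role].append({"action": action, "note": note})
    return dict(inbox), unrouted


inbox, unrouted = route_objections([
    ("claim_scope", "result holds on one dataset only"),
    ("baseline_adequacy", "no tuned baseline"),
    ("tone", "abstract is too promotional"),
])
```

The review score never enters this function; only the content of each objection decides which role receives work, which mirrors the point that scores are poor progress metrics.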

Contrasting with Prior Work

The diagnosis distinguishes Sibyl-AutoResearch from prior literature on automated scientific discovery, agent laboratory systems, and metric-driven program evolution (e.g., Lu et al., 2024; Yamada et al., 10 Apr 2025; Gottweis et al., 26 Feb 2025; 2602.07040; Mitchener et al., 4 Nov 2025; Novikov et al., 16 Jun 2025). While these systems demonstrate executable workflow capacity, the Sibyl framework foregrounds traceable agent-harness feedback as the source of research judgment. This is an explicit operational difference, and it matters most in open-ended domains where objective functions are fragile or unreliable and evidence maturity is non-monotonic.

The paper also considers three alternatives: manuscript quality as the sole endpoint, objective-verifier systems as a sufficient solution, and human-in-the-loop operation as the default. It counters that high manuscript polish can mask evidence collapse, that verifiers are brittle under broken objectives, and that scalable auditability and integrity require system-level trace routes irrespective of ultimate human responsibility.

Implications and Speculation for Future Research

Practically, the Sibyl-AutoResearch paradigm advocates for auditable, self-evolving research infrastructures. In computational ML/AI domains, trace-backed harnesses are required for scalable research integrity, reproducibility, and negative result propagation. Theoretically, the agent-harness co-evolution model points toward autonomous scientific systems capable of defending evidence boundaries—blocking weak claims, routing negative results, and evolving procedures to minimize recurring failures. This approach suggests future autonomous researchers may function not merely as paper generators, but as systems whose trial experience is structurally constrained to improve both science and workflow reliability.

Key future developments may include:

Prospective evaluation protocols on held-out harnesses, annotated conversion events, and injected failures.
Harness-evolution tests in cross-project setups: failures in one project producing adaptive repair in subsequent projects.
Increasing granularity of role separation and memory routing to strengthen perspective authority and evidence maturity transitions.
Integration of process-driven governance constraints to mitigate scientific spam, metric gaming, and evidence overclaiming.

Conclusion

"Sibyl-AutoResearch" presents a rigorous rethinking of autonomous scientific research system design, prioritizing traceable, auditable feedback from trial-and-error experiences over mere workflow execution or manuscript generation. The SIBYL system substantiates the feasibility of agent-harness co-evolution: concrete trial signals and failure modes do alter subsequent agent behavior and harness infrastructure, preserving research judgment in autonomous environments. This framework lays critical groundwork for future auditable, self-improving AI research systems, emphasizing that evidential maturity and update-path integrity are foundational to autonomous scientific progress, rather than rhetorical polish or completion metrics.
