Toward Autonomous Long-Horizon Engineering for ML Research
Abstract: Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace: a top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
Explain it Like I'm 14
What is this paper about?
This paper introduces AiScientist, a computer system that tries to do a tough kind of work on its own: turning ML research ideas from papers into real, working code, running experiments, fixing problems, and improving results over many hours or even days. The authors call this “long‑horizon” ML research engineering because it takes many steps and a long time, not just a quick one‑shot action.
What questions were they trying to answer?
In simple terms, the paper asks:
- How can we build an AI system that makes steady, sensible progress on long, complex ML projects without forgetting what it did before?
- Is it better for AI agents to pass around chat messages, or to share and update real project files (like plans, code, logs, and results) as they work?
- Does organizing agents like a team—with a leader and specialists—help them do better over long projects?
How does their system work? (Everyday explanation)
Think of a school group project that lasts several days:
- You have a team leader who keeps track of the big picture and assigns tasks.
- You have specialists—someone good at reading and summarizing the source material, someone good at writing code, someone good at testing, and someone good at debugging.
- Instead of relying on long chat threads that get messy, your team uses a shared folder to store everything important: notes, plans, code, and experiment results. Everyone keeps reading and updating these files so the next person can pick up where you left off.
AiScientist works the same way.
Here are the key ideas:
- Thin control over thick state:
  - “Thin control” means the team leader (called the Orchestrator) keeps only a short, clean summary of what’s going on and what to do next.
  - “Thick state” means all the detailed material (paper notes, plans, code, logs, and results) lives in a shared workspace on disk. This way, the real project information survives across hours or days, not just in chat.
- File‑as‑Bus (shared files as the “message bus”):
  - paper_analysis/ holds what the paper means and what to aim for.
  - submission/ holds the runnable code, setup scripts, and the main “run this” script.
  - agent/ holds plans, to‑do lists, experiment logs, and detailed run outputs.
- Hierarchical team (like a small company):
  - Orchestrator (team lead): decides the next stage: understand the paper, plan, implement code, run experiments, or debug.
  - Specialists:
    - Paper Comprehension specialist: turns the paper into clear, actionable notes.
    - Prioritization specialist: makes a step‑by‑step plan and orders tasks by importance.
    - Implementation specialist: writes or fixes the code.
    - Experimentation specialist: runs the code, collects results, and reports what worked or broke.
  - Small helper agents: do focused, one‑off tasks when needed.
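The shared-folder idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three region names come from the text, while the `Workspace` class, the `WRITE_SCOPES` table, the role keys, and the exact scope given to each role are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical role-to-region write scopes. The region names come from the
# text; the role keys and which role may write where are assumptions.
WRITE_SCOPES = {
    "paper_comprehension": ("paper_analysis",),
    "prioritization": ("agent",),
    "implementation": ("submission", "agent"),
    "experimentation": ("agent",),
}


class Workspace:
    """Shared on-disk workspace: every agent may read, writes are scoped by role."""

    def __init__(self, root: Path):
        self.root = root
        for region in ("paper_analysis", "submission", "agent"):
            (root / region).mkdir(parents=True, exist_ok=True)

    def write(self, role: str, rel_path: str, text: str) -> Path:
        """Write a file only if its top-level region is in the role's scope."""
        parts = Path(rel_path).parts
        if ".." in parts:
            raise ValueError("path must not escape its region")
        if parts[0] not in WRITE_SCOPES.get(role, ()):
            raise PermissionError(f"{role} may not write to {parts[0]}/")
        target = self.root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        return target

    def read(self, rel_path: str) -> str:
        return (self.root / rel_path).read_text(encoding="utf-8")

    def workspace_map(self) -> str:
        """Compact index: one line per region listing the files it holds."""
        lines = []
        for region in sorted(p for p in self.root.iterdir() if p.is_dir()):
            files = sorted(
                f.relative_to(region).as_posix()
                for f in region.rglob("*")
                if f.is_file()
            )
            lines.append(f"{region.name}/: {', '.join(files) or '(empty)'}")
        return "\n".join(lines)


# Usage: two specialists leave durable artifacts, then anyone can read the map.
ws = Workspace(Path(tempfile.mkdtemp()))
ws.write("paper_comprehension", "paper_analysis/summary.md", "Goal: reproduce the main result")
ws.write("implementation", "submission/reproduce.sh", "python train.py")
print(ws.workspace_map())
```

The map stays short even when the files are long, which is the point of keeping control thin: the leader reads the index and only opens a file when a decision needs its details.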
Why this matters: Long projects often break because people (or agents) lose track of decisions and evidence. By storing everything important in files and organizing the team’s roles, AiScientist keeps progress steady and traceable.
What did they test and find?
The authors tested AiScientist on two challenging benchmarks:
- PaperBench: Tests whether an agent can reproduce the main results of real ML research papers from scratch—setting up environments, writing code, running experiments, and matching reported results.
- MLE‑Bench Lite: Tests whether an agent can take a competition‑style ML task and keep improving it over many experiment cycles.
Main results:
- On PaperBench, AiScientist beat the best matched baseline by 10.54 points on average and got closer to human PhD performance (note that the human baseline had 48 hours, while the agents had 24).
- On MLE‑Bench Lite, AiScientist achieved 81.82% “Any Medal,” meaning it reached medal‑level performance (bronze, silver, or gold) on about 82% of the tasks, again beating comparable systems they tested.
A concrete example: On a task about detecting insults in text, AiScientist ran 74 experiment cycles over about 23 hours and improved a key score (validation AUC) from 0.903 to 0.982, all without human help.
Why this is important: These results show that AiScientist can not only make a working version but also keep improving it over time, which is exactly what long‑horizon research work needs.
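The experiment cycles in the example above follow a propose, run, evaluate pattern with a running "best so far". The sketch below shows that pattern under loud assumptions: `run_cycles`, `propose`, and `evaluate` are hypothetical names, the JSON-lines log format is an assumption, and the synthetic score function stands in for a real training run.

```python
import json
import random
import tempfile
from pathlib import Path


def run_cycles(log_path, propose, evaluate, n_cycles):
    """Propose-run-evaluate loop that appends one JSON line per cycle."""
    best_score, best_config, updates = float("-inf"), None, 0
    for cycle in range(1, n_cycles + 1):
        config = propose(best_config)      # next candidate, based on the best so far
        score = evaluate(config)           # stand-in for training plus validation
        improved = score > best_score
        if improved:
            best_score, best_config, updates = score, config, updates + 1
        record = {
            "cycle": cycle,
            "config": config,
            "score": score,
            "best_so_far": best_score,
            "improved": improved,
        }
        # Append-only: earlier evidence is never rewritten.
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
    return best_config, best_score, updates


rng = random.Random(0)  # fixed seed so the toy run is repeatable


def propose(best):
    """Hypothetical proposal rule: perturb the best learning rate seen so far."""
    base = best["lr"] if best else 0.1
    return {"lr": max(1e-4, base * rng.uniform(0.5, 1.5))}


def evaluate(config):
    """Synthetic score that peaks at lr = 0.03; not a real model."""
    return 1.0 - abs(config["lr"] - 0.03)


log_path = Path(tempfile.mkdtemp()) / "exp_log.jsonl"
best_config, best_score, updates = run_cycles(log_path, propose, evaluate, n_cycles=20)
print(best_config, round(best_score, 3), updates)
```

Because every cycle is logged whether or not it helped, a later agent can re-read the file and see which ideas were already tried, rather than depending on anyone's memory of the chat.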
What made the biggest difference?
The authors did “ablation” tests—turning off certain features to see what breaks:
- Removing File‑as‑Bus (the shared‑files system) made performance drop a lot on both benchmarks. It especially hurt later‑stage improvements (like going from okay to great), which rely on carefully reusing logs and results across many runs.
- Using a simpler, non‑hierarchical agent setup (no clear team leader and specialists) also underperformed. So, both the shared workspace and the team structure matter.
Takeaway: Keeping durable project state in files and having a clear team structure both help AI agents stay organized and effective over long, multi‑step work.
Why does this matter?
In real ML research, you don’t succeed with one try. You read a paper, make choices, write code, set up the environment, run experiments, find bugs, fix them, and repeat—sometimes for days. Early mistakes often show up much later. AiScientist’s design—storing all important artifacts in a shared workspace and guiding work with a lightweight leader and focused specialists—helps AI agents handle this reality.
If systems like AiScientist continue to improve, they could:
- Speed up the testing and reproduction of new ML ideas.
- Make results more reliable and easier to verify, since all code and evidence are kept as files.
- Lower the barrier to doing serious ML research engineering, helping more people participate.
In short, the paper shows that long‑horizon ML research engineering is not just about smart reasoning in the moment—it’s a systems problem. You need the right teamwork and a solid way to remember and reuse everything you’ve done. AiScientist’s approach moves us meaningfully in that direction.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The following list synthesizes concrete gaps and unresolved questions to guide future research and engineering work on autonomous long-horizon ML research systems like AiScientist:
- External validity to real-world projects: Does performance translate to multi-week, multi-repo, multi-dependency research efforts with shifting goals, evolving specs, and human collaborators?
- Time- and scale-generalization: Results are reported under 24-hour, single-GPU (H20) budgets; how do outcomes change for longer horizons (week+), multi-GPU/multi-node training, HPC/Slurm queues, preemptions, and spot interruptions?
- Statistical robustness: No confidence intervals, repeated trials, or seed sweeps are reported; quantify run-to-run variance and statistical significance for all metrics on both benchmarks.
- Evaluation scale constraints: PaperBench full runs are expensive (≈$832 per 20-task eval), limiting sample sizes; can larger-scale evaluations or cheaper proxies validate generality without grading-model bias?
- Fairness of comparisons: The human baseline uses 48 hours vs. 24 for agents; isolate time budget effects and provide matched-budget human-agent comparisons.
- Sensitivity to backbone LLMs: Beyond Gemini-3-Flash and GLM-5, how does performance vary across open models, smaller models, or model-version drift? Quantify hallucinations, tool-use reliability, and degradation under weaker backbones.
- Orchestrator policy learning: The Orchestrator appears rule/prompt-based; can learned schedulers (RL/IL/bandits) optimize delegation, tool choice, and time allocation under budget constraints?
- Workspace map construction M(W_t): The map function’s design, size, update policy, and selection heuristics are unspecified; measure its fidelity, latency, and ablate map variants (flat index vs. hierarchical index vs. graph).
- Alternative state substrates: Compare File-as-Bus to database-backed state, knowledge graphs, vector-store memory, notebooks, or experiment trackers (e.g., MLflow/DVC/W&B) in terms of performance, traceability, and cost.
- File-as-Bus mechanics and robustness: Specify and test versioning, provenance, concurrent edits, conflict resolution, atomicity, rollback, corruption detection, and garbage collection for large artifact volumes.
- Permission scoping effects: Quantify how different write scopes, append-only logs, and sandbox levels affect throughput, interference, and error rates; ablate to identify minimal safe scoping.
- Cross-project memory and reuse: There is no mechanism for learning across tasks; evaluate reusable playbooks, case libraries, retrieval-augmented prior fixes, and meta-learning for faster convergence on new papers.
- Failure diagnosis quality: Current loops log failures but lack causal attribution; integrate and evaluate static/dynamic analysis, tracing, delta debugging, and fault localization to accelerate fixes.
- Experiment management and HPO: Introduce principled search (Bayesian optimization, ASHA, bandits) and early stopping; quantify gains per dollar/time under strict budgets.
- Verification and testing: Add automated unit/integration test synthesis, assertions, and invariant checks; study impact on regression prevention and reproduction fidelity.
- Reproducibility and determinism: Report seed control, dependency pinning, container digests, and supply-chain drift management; measure re-run reproducibility after days/weeks.
- Security posture: Assess risks from prompt injection via web/datasets, supply-chain attacks (pip/npm), credential handling, network egress, and sandbox isolation; define mitigations and red-team results.
- Compliance and licensing: Implement automated license checking for datasets/models and enforcement of blacklists; measure false positives/negatives and impact on task coverage.
- Domain and modality generalization: Evaluate on non-ML or less file-centric domains (systems, robotics, hardware, simulators), other languages (C++/Rust/Julia), and multimodal pipelines; identify necessary adapters.
- Resource-aware planning: Extend to multi-node training, memory-aware scheduling, data-parallel/ZeRO strategies, and mixed precision; measure orchestration quality under tight resource constraints.
- Human-in-the-loop efficacy: Quantify how minimal expert interventions (e.g., 15–30 minutes) affect outcomes; design escalation triggers, review checkpoints, and approval workflows.
- Cost–quality Pareto analysis: Provide end-to-end cost vs. quality trade-offs across models, orchestration settings, and ablations; identify regimes where simpler baselines suffice.
- Underspecification robustness: Stratify papers by ambiguity/severity and analyze failure rates; add mechanisms for uncertainty tracking and targeted information-seeking (e.g., literature pivoting).
- Long-horizon continuity and recovery: Evaluate pause/resume, checkpointing, crash recovery, and adaptation to environment drift (package updates, API deprecations) over multi-day runs.
- Benchmark grading reliability: GPT-5.4 grading may introduce bias; measure inter-rater agreement vs. humans and alternative graders; release adjudicated gold labels for a subset.
- Transparency of failure modes: Provide a taxonomy and qualitative analysis of common failure patterns with representative traces and artifacts to enable targeted system improvements.
- Concurrency control and scaling: Study scheduling of multiple Tier-2 workers, contention on shared artifacts, and parallel experiment orchestration; measure throughput vs. interference.
- Integration with MLOps/DevOps: Evaluate benefits of CI/CD, artifact registries, dataset versioning (DVC), experiment tracking (W&B/MLflow), and structured DAGs (Airflow/Prefect) vs. pure file-based coordination.
- Ethical and governance issues: Address authorship attribution, disclosure, data ethics, and misuse risks when autonomously reproducing or extending research; propose governance controls.
- Open-sourcing completeness: Ensure full release of prompts, code, Dockerfiles, seeds, evaluation harnesses, and raw traces to enable independent replication of results and ablations.
- Robustness to toolchain/model drift: Monitor performance over time as LLMs and packages update; define compatibility policies and drift alarms.
- Generality of File-as-Bus: Test whether artifact-mediated coordination remains effective when “state” is not naturally file-valued (e.g., interactive robotics), and identify required state abstractions.
- Hierarchical depth and role granularity: Systematically vary number of tiers and role specialization to find regimes where hierarchy helps or hurts; measure overhead from excessive decomposition.
- Budget-aware theoretical guarantees: Develop anytime planning policies with regret bounds or performance guarantees under fixed time/compute budgets; evaluate in stochastic environments.
- Privacy-preserving operation: Define procedures for handling proprietary datasets/code, including redaction, differential privacy, and secure enclaves; quantify performance impact.
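Several of the gaps above (atomicity, corruption detection) have well-known building blocks. As one possible direction, and not something the paper describes, the sketch below writes an artifact through a temporary file plus `os.replace` and stores a SHA-256 checksum beside it. The function names and the `.sha256` sidecar convention are assumptions for illustration.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write so that readers see either the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # atomic on POSIX within a single filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_artifact(path: Path, data: bytes) -> str:
    """Write the artifact and a checksum sidecar; return the hex digest."""
    digest = hashlib.sha256(data).hexdigest()
    atomic_write(path, data)
    atomic_write(path.with_name(path.name + ".sha256"), digest.encode("ascii"))
    return digest


def verify_artifact(path: Path) -> bool:
    """Return True if the artifact still matches its recorded checksum."""
    expected = path.with_name(path.name + ".sha256").read_text(encoding="ascii")
    return hashlib.sha256(path.read_bytes()).hexdigest() == expected


# Usage with a made-up artifact path and payload.
artifact = Path(tempfile.mkdtemp()) / "agent" / "results.json"
write_artifact(artifact, b'{"status": "ok"}')
print(verify_artifact(artifact))
```

This covers only single-writer atomicity and after-the-fact corruption checks; concurrent edits, conflict resolution, and rollback would need versioning or locking on top.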
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that leverage AiScientist’s artifact-mediated coordination (File-as-Bus), hierarchical orchestration, and evidence-driven experiment loops.
- Paper-to-Repository Reproduction Assistant (Software/ML; Academia; Publishing) — Automates from-scratch replication of ML papers, producing runnable repos, setup scripts, and experiment logs suitable for peer review or internal vetting; tools/products: GitHub Action/CLI, “Reproduce.sh” generator, workspace-map viewer for reviewers; assumptions/dependencies: access to strong LLMs and GPUs, permitted public resources (e.g., HuggingFace, GitHub), Docker sandboxing, journal/IP policies.
- Competition and Benchmarking Agent (Software/ML; Education) — A “continuous experimentation” agent that iteratively improves solutions for competition-style tasks (e.g., Kaggle-like or MLE-Bench), running propose–run–evaluate cycles and logging evidence; tools/products: MLflow/W&B integration, scheduler for experiment cycles, leaderboard adapters; assumptions/dependencies: compute budgets, stable data access, challenge rule compliance.
- MLOps Artifact Bus Layer (Software/DevOps) — Introduces a File-as-Bus “durable state” layer into existing MLOps stacks so every change, run, and failure is grounded in persistent artifacts; tools/products: Workspace Bus SDK, experiment ledger, workspace-map UI, connectors for MLflow/DVC; assumptions/dependencies: org adoption, filesystem/permissions design, security and PII controls.
- Enterprise Model Validation & Risk Audit (Finance) — Recreates and verifies internal models’ claims with an auditable trail to support model risk management and regulatory submissions; tools/products: “Model Reproducibility Auditor,” evidence-pack generator for governance; assumptions/dependencies: controlled access to production-like data, privacy/compliance requirements (e.g., SR 11-7), acceptance by risk committees.
- Healthcare AI Validation Sandbox (Healthcare) — Replicates reported clinical ML performance on de-identified/approved datasets with durable logs for audits and IRB reviews; tools/products: “Clinical AI Validator,” integration with secure sandboxes; assumptions/dependencies: HIPAA/GDPR compliance, IRB approvals, data-use agreements, compute isolation.
- Experiment Failure Triage for ML Pipelines (Software) — Uses logs and artifact evidence to diagnose failing training/evaluation runs and propose targeted patches to code/config; tools/products: “Experiment Triage Agent,” CI/CD plugin that opens PRs with fixes and impl_log/exp_log references; assumptions/dependencies: accessible logs/metrics, clean test harnesses, repository permissions.
- Course Lab Teaching Assistant for Reproducibility (Education) — Guides students through long-horizon projects with stepwise plans, artifact templates, and evidence-driven iteration; tools/products: LMS plugin, VS Code extension, standard workspace template (paper_analysis/, agent/); assumptions/dependencies: faculty oversight, LLM quotas, academic integrity policies.
- Pre-publication and Grant Reproducibility Checks (Publishing; Policy) — Journals/funders run standardized reproduction pipelines to verify key results before acceptance/funding; tools/products: “Replication-as-a-Service” portal, templated reports linked to durable artifacts; assumptions/dependencies: clear permitted-resources policy, compute budgets, authors’ cooperation when specs are underspecified.
- R&D Knowledge Continuity via Artifact Notebooks (Cross-sector) — Teams adopt File-as-Bus style lab notebooks so context and decisions persist across handoffs and staff changes; tools/products: artifact templates, workspace-map viewers, “decision logs”; assumptions/dependencies: change management, team workflow alignment, minimal required training.
- Automated Data/Environment Setup (Software/Cloud) — Generates idempotent environment setups, dataset/model acquisition scripts, and run entrypoints to reduce onboarding time; tools/products: Env-Setup Agent, IaC hooks (e.g., Terraform/Ansible), Dockerfile and reproduce.sh generators; assumptions/dependencies: network access, license compliance, cache/storage policies.
- Agent-as-Tool Orchestration SDK (Software) — Lets engineering teams call specialized agents as “tools” within existing orchestrators (Airflow, LangGraph, Flyte), keeping control thin and state thick; tools/products: orchestration adapters/operators, permission-scoped write APIs; assumptions/dependencies: stable LLM/tooling APIs, IT security approvals.
- Automated Model Card and Evidence Pack Builder (Software/AI Governance) — Compiles experiment logs, metrics, datasets, and code decisions into model cards and governance-ready evidence packs; tools/products: Model Card Builder backed by workspace artifacts; assumptions/dependencies: standardized metric definitions, traceable data lineage.
- Staging-Only Continuous Improvement for Production ML (Software/AIOps) — In a gated staging environment, the agent proposes/evaluates improvements and attaches evidence before human approval; tools/products: “Shadow Trainer” workflows, promotion gates based on exp_log; assumptions/dependencies: safe rollout policies, drift monitoring, strict separation from prod data.
- Open-Source Replication Templates and Leaderboards (Academia/OSS) — Community uses shared File-as-Bus templates to replicate papers and compare artifact-backed outcomes; tools/products: template repos, standardized rubric, public dashboards; assumptions/dependencies: maintainer bandwidth, governance and code-of-conduct, dataset licenses.
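Several of the use cases above rest on the Agent-as-Tool pattern, where specialists are exposed to the orchestrator as callable tools. The sketch below is a minimal, hypothetical rendering of that pattern: the `Orchestrator` class, the `register` and `delegate` methods, and the stub specialist are assumptions, and a real specialist would call an LLM and edit workspace files instead of returning a canned string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A specialist takes a directive and returns a short summary of what it did.
Specialist = Callable[[str], str]


@dataclass
class Orchestrator:
    """Keeps only tool names and short summaries: the 'thin control' state."""

    tools: Dict[str, Specialist] = field(default_factory=dict)
    summaries: List[str] = field(default_factory=list)

    def register(self, name: str, agent: Specialist) -> None:
        self.tools[name] = agent

    def delegate(self, name: str, directive: str) -> str:
        """Invoke a specialist like a tool call and record only its summary."""
        if name not in self.tools:
            raise KeyError(f"unknown specialist: {name}")
        summary = self.tools[name](directive)
        self.summaries.append(f"{name}: {summary}")
        return summary


def implementation_agent(directive: str) -> str:
    # Stub: a real specialist would write code into the shared workspace.
    return f"done ({directive})"


orch = Orchestrator()
orch.register("implementation", implementation_agent)
orch.delegate("implementation", "add data loader")
print(orch.summaries)
```

The design choice worth noting is what the orchestrator does not keep: it never stores the specialist's full transcript, only a one-line summary, leaving the detailed state to the files.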
Long-Term Applications
The following applications need additional research, scaling, standardization, or integration with regulated and cross-organizational processes.
- Cross-Discipline Autonomous Research (Materials, Biology, Robotics) — Extend artifact-mediated loops from ML-only to lab or simulation workflows (e.g., robot policy tuning, materials discovery), integrating instruments/simulators as evidence sources; tools/products: instrument/simulator connectors, ELN integrations; assumptions/dependencies: hardware/simulator APIs, safety/regulatory approvals, high-fidelity feedback.
- Standardized Artifact Bus Protocol (Software; Policy) — An open standard for artifact-centric agent coordination (schemas, permissions, provenance) enabling interoperability across vendors and tools; tools/products: spec, validators, reference SDKs; assumptions/dependencies: community consensus, backing by standards bodies and cloud providers.
- Regulatory-Grade Provenance and Certification (Finance; Healthcare; Policy) — Cryptographically signed artifacts and tamper-evident audit trails to support “AI Reproducibility Certification” and regulatory audits; tools/products: provenance signers, secure artifact registries, compliance mappers; assumptions/dependencies: cryptographic infrastructure, auditor acceptance, mapped controls (e.g., ISO, FDA, EMA).
- Autonomous Replication Offices (Publishing; Academia) — Journals and funders run large-scale, semi-automated replication services for submissions and funded projects; tools/products: managed compute plus standardized pipelines integrated into submission systems; assumptions/dependencies: incentives, legal/IP policies, author community buy-in.
- Organization-Scale “Research OS” (Enterprise) — An internal operating system orchestrating hundreds of specialized agents over durable workspaces with quota, identity, and compute schedulers; tools/products: resource manager, workspace governance, multi-tenant security; assumptions/dependencies: cloud costs, strong IAM, central governance.
- Safety-Assured Autonomic Code and Experiment Changes (Software) — Agents integrate static analysis, type systems, and formal verification for patches and configs before execution; tools/products: verified compilers/analyzers, policy-based execution gates; assumptions/dependencies: formal specs/test oracles, verified toolchains.
- Federated/Privacy-Preserving Research Agents (Healthcare; Finance) — Artifact buses that operate across data silos via federated orchestration, secure enclaves, and differential privacy; tools/products: cross-site orchestrators, secure artifact exchange, audit overlays; assumptions/dependencies: legal DPAs, enclave infrastructure, standardization of cross-org protocols.
- Lifelong Institutional Memory for ML Systems (Enterprise) — Versioned artifact stores and knowledge graphs provide multi-year continuity for models and decisions; tools/products: versioned workspace DBs, semantic search over artifacts, lineage graphs; assumptions/dependencies: storage/retention policies, metadata hygiene, discovery UX.
- Automated Model Governance & Change Management (Enterprise; Policy) — Maps exp_log/impl_log and artifact diffs to change tickets, approvals, and controls for continuous compliance; tools/products: GRC integrations (e.g., ServiceNow, Archer), control mapping libraries; assumptions/dependencies: standardized controls, workflow integration, auditor sign-off.
- General-Purpose Autonomic MLOps (Cloud/SaaS) — End-to-end self-healing pipelines where agents implement–run–diagnose–patch under SLOs with rollback and safety nets; tools/products: autonomic controllers, failure taxonomy libraries, staged deployments; assumptions/dependencies: robust guardrails, reliable failure classification, high observability.
- Human–AI Mixed Teams with Progressive-Disclosure UX (Software) — Interfaces where humans quickly navigate workspace maps, spot evidence, and approve/override agent steps; tools/products: “Workspace Map Studio,” diff-and-evidence viewers, approval workflows; assumptions/dependencies: UX research, change management, training.
- Personalized Research Apprenticeships at Scale (Education) — Long-horizon agents act as individualized research mentors for capstone projects with faculty oversight; tools/products: course-level controls, workload budgeting, plagiarism/ethics guardrails; assumptions/dependencies: institutional policies, compute quotas, ethical frameworks.
- Cross-Organization Open Science Hubs (Academia; Policy) — Shared, searchable repositories of artifact-backed replications enable meta-analyses and evidence-driven policy; tools/products: open registries, query APIs, reproducibility badges; assumptions/dependencies: funding, IP/data-sharing agreements, curator communities.
- Sector-Specific Long-Horizon Optimizers (Energy; Robotics; Logistics) — Agents iteratively optimize controllers/schedules in high-fidelity simulators (e.g., grid ops, robot control), transitioning to real-world deployment with safety gating; tools/products: simulator connectors (Gazebo, CARLA, grid simulators), digital twins; assumptions/dependencies: simulator realism, safe sim-to-real transfer, regulatory constraints.
Notes on Feasibility and Dependencies Across Applications
- LLM Quality and Cost: Results depend on access to capable LLMs (e.g., Gemini-3-Flash, GLM-5) and sustained budgets for multi-hour runs.
- Compute and Isolation: GPU availability, containerization (Docker), and secure sandboxes are prerequisites, especially for regulated data.
- Data/Resource Access: Many use cases require permitted external resources; licensing and IP constraints may limit full automation.
- Security and Permissions: Permission-scoped writes and auditability are core to safe deployment; enterprise IAM integration is often mandatory.
- Underspecification and Human-in-the-Loop: Papers and specs are often incomplete; human oversight may be necessary for ambiguous decisions, especially in regulated settings.
- Standards and Adoption: Long-term value grows with standardization of artifact schemas, provenance, and audit practices across tools and organizations.
Glossary
- Above Median%: A benchmark metric indicating the proportion of tasks where performance exceeds the median baseline. "The gains are mirrored in Above Median%, with a consistent 9.09-point improvement under both backbones."
- Ablation: An experimental method that removes or disables a system component to measure its effect on performance. "Ablation studies further show that File-as-Bus protocol is a key driver of performance, reducing PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points when removed."
- Agent-as-Tool: A design pattern where specialized agents are exposed as callable tools within an orchestrator’s action space, making delegation a first-class operation. "The key design choice is Agent-as-Tool."
- Any Medal%: In MLE-Bench, the percentage of tasks achieving any medal tier (Bronze, Silver, or Gold). "and achieves 81.82 Any Medal% on MLE-Bench Lite."
- Artifact-mediated coordination: Coordination that relies on durable files and artifacts in a shared workspace rather than transient conversation. "AiScientist implements artifact-mediated coordination through a File-as-Bus protocol."
- AUC: Area Under the Receiver Operating Characteristic Curve; a scalar performance metric for classifiers. "raising validation AUC from 0.903 to 0.982 through 18 best-so-far updates."
- Backbone LLM: The primary LLM that underlies and powers an agentic system. "We instantiate AiScientist with two backbone LLMs, Gemini-3-Flash and GLM-5,"
- Controlled evaluation: A matched, standardized comparison setting that isolates the effect of system design choices. "but treat them separately from the controlled evaluation because they are not matched comparisons."
- Evidence-driven research-engineering loop: An iterative workflow where implementation changes are guided by experimental evidence and diagnostics. "AiScientist runs an evidence-driven research-engineering loop over the evolving workspace rather than a rigid one-pass pipeline."
- Execution contract: A concrete, ordered plan specifying milestones, dependencies, and priorities for implementation. "Prioritization Specialist: converts paper understanding into an ordered execution contract."
- File-as-Bus: A coordination protocol where shared files serve as the primary communication and state substrate across agents. "AiScientist implements artifact-mediated coordination through a File-as-Bus protocol."
- Hierarchical orchestration: A control strategy where a top-level orchestrator delegates tasks to specialists across stages of the workflow. "AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace"
- MLE-Bench Lite: A benchmark that evaluates sustained improvement on competition-style ML tasks. "MLE-Bench Lite complements PaperBench by focusing on sustained experiment improvement on competition-style ML tasks."
- PaperBench: A benchmark for from-scratch replication of published ML papers under time and resource constraints. "We formulate the long-horizon ML research engineering task through PaperBench"
- Paper-to-code: The task of translating scientific papers into executable code or repositories. "paper-to-code tasks study how agents can translate papers into repositories or initial implementations with higher fidelity"
- Paper-to-repository synthesis: Automated creation of runnable repositories from research papers. "adjacent research-agent tasks such as paper-to-repository synthesis and optimization of runnable ML pipelines"
- Permission-scoped coordination: A policy where agents have write access only to specific workspace regions aligned with their roles. "AiScientist further enforces permission-scoped coordination: each Tier-1 specialist receives write access only to the regions required by its role, while shared logs remain append-only and iteration-structured."
- Progressive disclosure: Providing a lightweight overview first (e.g., a map), with details retrieved on demand. "This File-as-Bus design enables progressive disclosure: agents start from the map, read task-relevant artifacts on demand, and write back durable analyses, code, and logs, preserving continuity across long-horizon research-engineering loops."
- Replication rubric: A structured scoring schema used to assess the fidelity of reproducing a paper’s results. "the best reported agent achieves only 21% of the replication rubric"
- State continuity: The preservation of evolving project state across iterations so later decisions remain coherent. "State Continuity: Each round of implementation and experimentation produces code, configurations, logs, results, and diagnostic evidence that later decisions must correctly interpret and build on."
- System of record: The authoritative store of project state and artifacts used for coordination and resumption. "The workspace is not passive storage; it is the system of record."
- Tier-0 Orchestrator: The top-level controller that maintains stage-level directives and coordinates specialists. "A Tier-0 Orchestrator keeps thin control through stage-level directives, concise summaries, and a workspace map,"
- Tier-1 specialist: A primary specialist agent responsible for a major stage (e.g., comprehension, implementation, experimentation). "while Tier-1 specialists and optional Tier-2 subagents coordinate through a permission-scoped workspace that serves as the system of record."
- Tier-2 subagent: A tightly scoped helper agent invoked by a specialist for focused subtasks within a local horizon. "Tier-2 subagents are tightly scoped leaf workers created within a specialist's local horizon for focused subtasks such as structure extraction, algorithm and baseline analysis, environment setup, resource download, or exploratory investigation."
- Underspecification: A condition where the research specification omits important details, requiring inference or external sourcing. "Underspecification: In practice, the research specification is typically underspecified rather than a complete blueprint."
- Valid Submission: A benchmark metric indicating that a solution executes successfully end-to-end. "Removing File-as-Bus leaves Valid Submission and Bronze largely intact, but causes much larger losses on stronger outcome metrics, including Above Median, Silver, Gold, and Any Medal."
- Workspace map: A compact textual index of the shared workspace’s major artifact regions and roles. "AiScientist constructs a compact workspace map:"
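The percentage metrics in this glossary (Any Medal%, Above Median%, Valid Submission) are proportions over tasks, so the reported figures can be sanity-checked with simple arithmetic. The sketch below assumes a 22-task benchmark, which this summary does not state; under that assumption, 18, 7, and 2 tasks reproduce the quoted 81.82, 31.82, and 9.09. The outcome lists are illustrative placeholders, not the paper's per-task results.

```python
def percent(outcomes):
    """Share of tasks with a truthy outcome, as a percentage rounded to 2 places."""
    return round(100 * sum(bool(o) for o in outcomes) / len(outcomes), 2)


N_TASKS = 22  # assumption: not stated in this summary

# Illustrative outcome lists: True means the task met the criterion.
any_medal = [True] * 18 + [False] * (N_TASKS - 18)
print(percent(any_medal))                              # 81.82

# A drop of 7 tasks or 2 tasks corresponds to the quoted point differences.
print(percent([True] * 7 + [False] * (N_TASKS - 7)))   # 31.82
print(percent([True] * 2 + [False] * (N_TASKS - 2)))   # 9.09
```

Read this way, each single task is worth about 4.5 percentage points, so small differences between systems on this benchmark correspond to one or two tasks.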
