EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Published 11 Jun 2026 in cs.AI and cs.CL | (2606.13662v1)

Abstract: LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.


Authors (8)

Summary

The paper demonstrates that environment engineering, rather than agent micro-management, enables reproducible and state-of-the-art autonomous scientific discovery.
It organizes agent activity into a three-stage loop (prepare, propose, implement) and enforces permission, artifact, and budget controls to maintain research integrity.
The system achieves SOTA results in tasks such as circle packing and triangular matrix multiplication, highlighting efficiency and reproducibility without the need for additional training.

EurekAgent: An Environment-Engineered System for Autonomous Scientific Discovery

Motivation and Problem Statement

The paper introduces EurekAgent, a system for autonomous scientific discovery with LLM-based agents, explicitly shifting focus from agent workflow prescription to agent environment engineering. It posits that as general-purpose agent capabilities mature, the critical bottleneck shifts from prescribing workflows to engineering environments: defining the resources, constraints, and interfaces that amplify productive agent behaviors (exploration, artifact management, collaboration) while suppressing reward hacking, evaluation contamination, and procedural violations.

The central thesis is that reliable, reproducible, and traceable autonomous research depends on environment engineering, not micro-management of agent reasoning: permissions, artifact, budget, and human-in-the-loop mechanisms are pivotal for preserving research integrity when leveraging strong, off-the-shelf CLI (command-line interface) agents.

System Architecture and Environment Engineering Dimensions

EurekAgent orchestrates agent activity via a three-stage loop (prepare, propose, implement) but delegates methodological choices to individual agent sessions within a rigorously engineered environment. The system architecture encompasses the following:

1. Permissions Engineering: The system imposes fine-grained control over agents' access to compute, test data, the file system, and internet resources through Docker sandboxing, read-only evaluator interfaces, and GPU allocation APIs. These controls prevent agents from tampering with evaluation scripts, copying peer solutions in the same round, or monopolizing hardware.

2. Artifact Engineering: The system leverages the filesystem (backed by Git history) as cumulative memory, tracking code evolution, solution manifests, experimental logs, and evaluator scores. This artifact persistence supports traceability and recovery, and lets agents learn from prior iterations.

3. Budget Engineering: Exploration is bounded by both wall-clock and API usage budgets. Agents are made time-aware via helper APIs and warnings, and API cost thresholds abort runs to enforce operational resource limits.

4. Human-in-the-loop Engineering: Transparent agent outputs, per-session logs, and visualized score evolution are exposed to users via both terminal and web interfaces, enabling real-time supervision, intervention, and review.
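The budget controls described in item 3 can be illustrated with a minimal sketch. This is not EurekAgent's implementation: the class name `Budget`, its fields, the 80% warning fraction, and the `BudgetExceeded` exception are all hypothetical, chosen only to show how a time-awareness helper and a cost-based abort might fit together.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised when accumulated API cost passes the configured threshold."""


@dataclass
class Budget:
    """Hypothetical budget tracker; names and defaults are assumptions."""
    wall_clock_limit_s: float
    api_cost_limit_usd: float
    warn_fraction: float = 0.8                    # warn once 80% of the time is used
    clock: Callable[[], float] = time.monotonic   # injectable for testing
    spent_usd: float = 0.0
    started_at: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def remaining_seconds(self) -> float:
        """Helper an agent could call to stay time-aware."""
        elapsed = self.clock() - self.started_at
        return max(0.0, self.wall_clock_limit_s - elapsed)

    def time_warning(self) -> Optional[str]:
        """Return a warning string once most of the time budget is gone."""
        used = 1.0 - self.remaining_seconds() / self.wall_clock_limit_s
        if used >= self.warn_fraction:
            return f"warning: {self.remaining_seconds():.0f}s of wall-clock budget left"
        return None

    def charge(self, cost_usd: float) -> None:
        """Record an API charge and abort the run if the cap is exceeded."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.api_cost_limit_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} > limit ${self.api_cost_limit_usd:.2f}"
            )
```

In a setup like this, an orchestrator would call `charge` after each model request and surface `time_warning()` to the agent session, so the agent can wrap up before the wall-clock limit rather than being cut off mid-experiment.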

Experimental Evaluation

EurekAgent was evaluated across three domains, each with a metric-driven framework conducive to agentic optimization:

Mathematics Optimization

Tasks included circle packing, Erdős' minimum overlap, and autocorrelation inequality. EurekAgent established new SOTA results in all tasks, outperforming previous best AI systems (including training-based methods) while remaining training-free. For the 26-circle packing problem, EurekAgent improved the best-known packing sum ($2.635999$) at an API cost below $11, demonstrating both the cost efficiency and the effectiveness of environment engineering.
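To make the "optimizable metric" concrete, here is a rough sketch of what an executable evaluator for a circle-packing task could look like. It assumes the common formulation in which circles are packed inside the unit square and the score is the sum of radii; the container, the tolerance, and the function name `packing_score` are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Circle = Tuple[float, float, float]  # (center_x, center_y, radius)


def packing_score(circles: Sequence[Circle], tol: float = 1e-9) -> float:
    """Return the sum of radii if the packing is valid, else raise ValueError.

    Assumes the container is the unit square [0, 1] x [0, 1]; that choice
    and the tolerance are illustrative assumptions.
    """
    # Every circle must have positive radius and lie inside the square.
    for i, (x, y, r) in enumerate(circles):
        if r <= 0:
            raise ValueError(f"circle {i} has non-positive radius")
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            raise ValueError(f"circle {i} leaves the unit square")
    # No two circles may overlap (touching is allowed).
    for i in range(len(circles)):
        xi, yi, ri = circles[i]
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                raise ValueError(f"circles {i} and {j} overlap")
    return sum(r for _, _, r in circles)
```

An evaluator of this kind rejects invalid packings outright instead of scoring them, which is one simple way to close off a reward-hacking route: an agent cannot raise its score by submitting overlapping or out-of-bounds circles.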

Kernel Engineering

On the GPUMODE TriMul competition (triangular matrix multiplication, runtime minimization), EurekAgent generated multiple solutions surpassing the previous leaderboard champions and the test-time training-based TTT-Discover system, which suggests consistent quality rather than a single lucky run. The best discovered kernel improved median runtime by 4.3% over the top prior solutions.
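As a rough illustration of a median-runtime metric, the sketch below times a callable repeatedly and computes a relative improvement. It is a CPU-side stand-in under stated assumptions: the function names, the repeat and warmup counts, and the definition of improvement as (baseline - candidate) / baseline are all hypothetical, and timing a real GPU kernel would also require device synchronization, which is omitted here.

```python
import statistics
import time
from typing import Callable, List


def median_runtime(fn: Callable[[], object], repeats: int = 21, warmup: int = 3) -> float:
    """Median wall-clock time of fn() in seconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()  # warm caches so the first timed call is not an outlier
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_improvement(baseline_s: float, candidate_s: float) -> float:
    """Fraction by which the candidate is faster (positive means faster)."""
    return (baseline_s - candidate_s) / baseline_s
```

Under this assumed definition, a baseline of 100.0 time units and a candidate of 95.7 give an improvement of about 0.043, that is, 4.3%. The median is less sensitive to occasional slow runs than the mean, which is presumably why it is a common choice for kernel leaderboards.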

Machine Learning Engineering (MLE-Bench)

EurekAgent was evaluated on a seven-task subset of MLE-Bench Lite (real-world ML competitions with held-out test sets). Using the open-source model GLM-5.1, it achieved an 85.71% "any medal" rate and a 71.43% gold-medal rate, outperforming agents running commercial closed models. This suggests that environment engineering carries over to competitive ML modeling, even without proprietary model access.
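The two percentages are consistent with 6 of 7 tasks earning any medal and 5 of 7 earning gold. These counts are inferred from the arithmetic and are not stated in the source; the helper name `rate_percent` is illustrative.

```python
def rate_percent(successes: int, total: int) -> float:
    """Success rate as a percentage, rounded to two decimals."""
    return round(100 * successes / total, 2)


any_medal = rate_percent(6, 7)   # inferred count: 6 of 7 tasks
gold_medal = rate_percent(5, 7)  # inferred count: 5 of 7 tasks
print(any_medal, gold_medal)     # prints: 85.71 71.43
```

With only seven tasks, each task moves the rate by roughly 14 percentage points, so differences between systems on this subset are best read as differences of one or two tasks.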

Implications and Theoretical Considerations

The results strongly suggest that environment engineering—rather than prescriptive workflows or specialized agent behaviors—can unlock SOTA performance in metric-driven scientific discovery, leveraging general-purpose CLI agents. The architecture achieves research integrity, reproducibility, and rigorous evaluation without resorting to agent fine-tuning or reward shaping.

Practical implications for the AI community include:

Deployability: Environment-engineered frameworks allow safe integration of increasingly powerful LLM agents into scientific workflows without compromising evaluator protocols.
Extensibility: Artifact engineering and persistent memory enable long-horizon, collaborative, and recoverable agentic research, laying groundwork for broader, more open-ended AI-driven science.
Efficiency: Budget engineering features support operational continuity and optimize resource consumption, which is critical for large-scale, cost-sensitive research automation.

Theoretically, this paradigm may catalyze new research directions where autonomous agent systems are evaluated not by internal reasoning process but by their ability to reliably operate in constrained yet supportive environments. As agent capabilities approach human-level open-ended exploration, environment engineering becomes the primary lever for amplifying productive research and suppressing emergent failure modes such as reward hacking or procedural violations.

Limitations and Future Directions

While EurekAgent achieves impressive results on metric-driven tasks with executable evaluators, its environment-engineering framework may require adaptation for tasks requiring subjective judgments or open-ended question-answering. The system's effectiveness hinges on well-defined metrics and structured evaluation protocols.

Future developments are anticipated in:

Broader scientific domains, including less formally defined tasks
More complex environments supporting multi-agent collaboration, long-horizon research, and dynamic resource allocation
Robustness evaluation against sophisticated reward-hacking strategies and adversarial manipulations
Integration with emerging agent benchmarks for open-ended scientific discovery

Conclusion

EurekAgent demonstrates that environment engineering is sufficient to enable reliable, reproducible, and state-of-the-art autonomous scientific discovery using general-purpose LLM-based agents. By rigorously structuring permissions, artifacts, budgets, and human interfaces, the system transforms agent capability into scientific progress without relying on workflow prescription or agent-specific fine-tuning. This is a strong argument for prioritizing environment engineering research as the foundation for future autonomous research agents and scalable AI-driven scientific exploration.