- The paper demonstrates that environment engineering, rather than agent micro-management, enables reproducible and state-of-the-art autonomous scientific discovery.
- It organizes agent activity in a three-stage loop (prepare, propose, implement) while enforcing fine-grained permission, artifact, and budget controls to maintain research integrity.
- The system achieves SOTA results in tasks such as circle packing and triangular matrix multiplication, highlighting efficiency and reproducibility without the need for additional training.
EurekAgent: An Environment-Engineered System for Autonomous Scientific Discovery
Motivation and Problem Statement
The paper introduces EurekAgent, a system for autonomous scientific discovery with LLM-based agents, explicitly shifting focus from agent workflow prescription to agent environment engineering. It posits that as general-purpose agent capabilities mature, the critical bottleneck transitions from algorithmic design to the engineering of environments—defining resources, constraints, and interfaces that amplify productive agent behaviors (exploration, artifact management, collaboration) while suppressing reward hacking, evaluation contamination, and procedural violations.
The central thesis is that reliable, reproducible, and traceable autonomous research depends on environment engineering, not on micro-managing agent reasoning. Permission, artifact, budget, and human-in-the-loop mechanisms are pivotal for preserving research integrity when leveraging strong, off-the-shelf CLI (command-line interface) agents.
System Architecture and Environment Engineering Dimensions
EurekAgent orchestrates agent activity via a three-stage loop (prepare, propose, implement) but delegates methodological choices to individual agent sessions within a rigorously engineered environment. The system architecture encompasses the following:
1. Permissions Engineering: The system imposes fine-grained control over agents' access to compute, test data, the file system, and internet resources (sandboxing via Docker, read-only evaluator interfaces, GPU allocation APIs). These controls prevent agents from tampering with evaluation scripts, copying peer solutions in the same round, or monopolizing hardware.
2. Artifact Engineering: The system leverages the filesystem (backed by Git history) as cumulative memory, tracking code evolution, solution manifests, experimental logs, and evaluator scores. This artifact persistence supports traceability and recovery, and allows agents to learn from prior iterations.
3. Budget Engineering: Exploration is bounded by both wall-clock and API usage budgets. Agents are made time-aware via helper APIs and warnings, and API cost thresholds abort runs to enforce operational resource limits.
4. Human-in-the-loop Engineering: Transparent agent outputs, per-session logs, and visualized score evolution are exposed to users via both terminal and web interfaces, enabling real-time supervision, intervention, and review.
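To make the budget-engineering dimension above more concrete, here is a minimal Python sketch of a combined wall-clock and API-cost guard. It is an illustration under stated assumptions, not EurekAgent's actual implementation: the names `BudgetTracker`, `BudgetExceeded`, `remaining_seconds`, `record_cost`, and `warn_fraction` are all hypothetical, and the 80% warning threshold is an arbitrary choice.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a run exhausts its wall-clock or API-cost budget."""


class BudgetTracker:
    """Hypothetical helper (not from the paper): tracks elapsed time and API spend."""

    def __init__(self, max_seconds, max_cost_usd, warn_fraction=0.8,
                 clock=time.monotonic):
        self.max_seconds = max_seconds
        self.max_cost_usd = max_cost_usd
        self.warn_fraction = warn_fraction  # assumed warning threshold
        self._clock = clock                 # injectable so the logic is testable
        self._start = clock()
        self.spent_usd = 0.0

    def remaining_seconds(self):
        """Time-awareness helper an agent could query before starting long work."""
        return max(0.0, self.max_seconds - (self._clock() - self._start))

    def record_cost(self, usd):
        """Add the cost of one API call to the running total."""
        self.spent_usd += usd

    def check(self):
        """Return a list of warnings; raise BudgetExceeded once a limit is hit."""
        elapsed = self._clock() - self._start
        if elapsed >= self.max_seconds:
            raise BudgetExceeded(f"wall-clock budget exhausted after {elapsed:.0f}s")
        if self.spent_usd >= self.max_cost_usd:
            raise BudgetExceeded(f"API budget exhausted at ${self.spent_usd:.2f}")
        warnings = []
        if elapsed >= self.warn_fraction * self.max_seconds:
            warnings.append("time budget nearly used")
        if self.spent_usd >= self.warn_fraction * self.max_cost_usd:
            warnings.append("API budget nearly used")
        return warnings
```

In this sketch an orchestrator would call `check()` between agent steps, surface any warnings to the agent, and treat `BudgetExceeded` as the signal to abort the run. Using a monotonic clock avoids surprises if the system time changes during a long session.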
Experimental Evaluation
EurekAgent was evaluated across three domains, each with a metric-driven framework conducive to agentic optimization:
Mathematics Optimization
Tasks included circle packing, Erdős' minimum overlap problem, and an autocorrelation inequality. EurekAgent established new SOTA results in all tasks, outperforming previous best AI systems (including training-based methods) while remaining training-free. For the 26-circle packing problem, EurekAgent improved the best-known packing sum ($2.635999$) at an API cost below 11 USD, demonstrating both the computational efficiency and the effectiveness of environment engineering.
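For readers unfamiliar with the circle packing task, the following Python sketch shows what an executable evaluator for it might look like. It assumes the common formulation in which non-overlapping circles are placed inside the unit square and the score is the sum of their radii; the summary does not spell out the container or the scoring rule, so both are assumptions. The function name `packing_score` and the tolerance `tol` are hypothetical.

```python
import math


def packing_score(circles, tol=1e-9):
    """Return the sum of radii if the packing is valid, else raise ValueError.

    circles: list of (x, y, r) tuples.
    Assumption: the container is the unit square [0, 1] x [0, 1].
    """
    for i, (x, y, r) in enumerate(circles):
        if r <= 0:
            raise ValueError(f"circle {i} has a non-positive radius")
        # Each circle must lie fully inside the unit square.
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            raise ValueError(f"circle {i} leaves the unit square")
    for i in range(len(circles)):
        xi, yi, ri = circles[i]
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            # Circles may touch but must not overlap.
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                raise ValueError(f"circles {i} and {j} overlap")
    return sum(r for _, _, r in circles)


# Toy example: four circles of radius 0.25 tiling the square's quadrants.
four = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25),
        (0.25, 0.75, 0.25), (0.75, 0.75, 0.25)]
print(packing_score(four))  # prints 1.0
```

A checker like this is the kind of component that benefits from being exposed to agents as read-only: an agent can call it to score a candidate but cannot relax the validity conditions to inflate its result.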
Kernel Engineering
On the GPUMODE TriMul competition (triangular matrix multiplication, runtime minimization), EurekAgent generated multiple solutions surpassing the previous leaderboard champions and the test-time training-based TTT-Discover system. The best discovered kernel improved median runtime by 4.3% over top prior solutions, evidencing stable quality rather than stochastic luck.
Machine Learning Engineering (MLE-Bench)
EurekAgent was evaluated on a seven-task subset of MLE-Bench Lite (real-world ML competitions with held-out test sets). Using the open-source model GLM-5.1, it achieved an 85.71% "any medal" rate and a 71.43% gold-medal rate, outperforming agents running commercial closed models. This underscores that environment engineering carries over to competitive ML engineering, even without proprietary model access.
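The two percentages are easier to interpret as task counts. With seven tasks, they are consistent with medals on 6 tasks and gold medals on 5. This is an inference from the arithmetic; the per-task breakdown is not given in this summary. A short check, with the hypothetical helper `medal_rate`:

```python
def medal_rate(medal_count, task_count):
    """Percentage of tasks that earned a medal, rounded to two decimals."""
    return round(100 * medal_count / task_count, 2)


TASKS = 7  # size of the MLE-Bench Lite subset stated in the text

# The counts 6 and 5 are inferred from the reported rates, not quoted from the paper.
any_medal = medal_rate(6, TASKS)
gold = medal_rate(5, TASKS)
print(any_medal, gold)  # prints 85.71 71.43
```

With so few tasks, each one moves the rate by roughly 14 percentage points, which is worth keeping in mind when comparing against other agents.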
Implications and Theoretical Considerations
The results strongly suggest that environment engineering—rather than prescriptive workflows or specialized agent behaviors—can unlock SOTA performance in metric-driven scientific discovery, leveraging general-purpose CLI agents. The architecture achieves research integrity, reproducibility, and rigorous evaluation without resorting to agent fine-tuning or reward shaping.
Practical implications for the AI community include:
- Deployability: Environment-engineered frameworks allow safe integration of increasingly powerful LLM agents into scientific workflows without compromising evaluator protocols.
- Extensibility: Artifact engineering and persistent memory enable long-horizon, collaborative, and recoverable agentic research, laying groundwork for broader, more open-ended AI-driven science.
- Efficiency: Budget engineering features support operational continuity and optimize resource consumption, which is critical for large-scale, cost-sensitive research automation.
Theoretically, this paradigm may catalyze new research directions in which autonomous agent systems are evaluated not by their internal reasoning processes but by their ability to operate reliably in constrained yet supportive environments. As agent capabilities approach human-level open-ended exploration, environment engineering becomes the primary lever for amplifying productive research and suppressing emergent failure modes such as reward hacking or procedural violations.
Limitations and Future Directions
While EurekAgent achieves impressive results on metric-driven tasks with executable evaluators, its environment-engineering framework may require adaptation for tasks requiring subjective judgments or open-ended question-answering. The system's effectiveness hinges on well-defined metrics and structured evaluation protocols.
Future developments are anticipated in:
- Broader scientific domains, including less formally defined tasks
- More complex environments supporting multi-agent collaboration, long-horizon research, and dynamic resource allocation
- Robustness evaluation against sophisticated reward-hacking strategies and adversarial manipulations
- Integration with emerging agent benchmarks for open-ended scientific discovery
Conclusion
EurekAgent demonstrates that environment engineering is sufficient to enable reliable, reproducible, and state-of-the-art autonomous scientific discovery using general-purpose LLM-based agents. By rigorously structuring permissions, artifacts, budgets, and human interfaces, the system transforms agent capability into scientific progress without relying on workflow prescription or agent-specific fine-tuning. This argues strongly for prioritizing environment engineering research as the foundation for future autonomous research agents and scalable AI-driven scientific exploration.