EurekAgent: Engineered LLM Research Agent
- EurekAgent is an environment-engineered system that builds secure, metric-driven execution environments for autonomous scientific discovery using off-the-shelf LLM agents.
- It systematically integrates permissions, artifact, budget, and human-in-the-loop engineering to optimize research workflows and ensure reproducibility.
- Empirical results show EurekAgent surpassing both human and AI benchmarks across tasks in mathematical optimization, kernel engineering, and applied machine learning.
EurekAgent is an environment-engineered agent system for metric-driven autonomous scientific discovery built upon LLM-based agents. It operationalizes the paradigm of “environment engineering,” focusing on the systematic construction of agent execution environments to facilitate, constrain, and optimize autonomous research workflows. EurekAgent demonstrates state-of-the-art results across mathematical optimization, kernel engineering, and applied machine learning tasks using general-purpose off-the-shelf coding agents when embedded in a deliberately designed sandbox that shapes and amplifies productive agent behaviors while actively suppressing harmful ones (Xin et al., 11 Jun 2026).
1. Motivation: Environment Engineering as Primary Bottleneck
Environment engineering encompasses the design of the agent’s resources, constraints, interfaces, and persistent artifacts. Drawing from Gibson’s theory of affordances, the agent’s environment determines not merely what is possible, but what is easy or hard for an agent to enact. The observed shift is that as LLM capabilities approach and surpass manual scripting, the key bottleneck in autonomous scientific discovery is migrating from agent-centric workflow design to environment-centric engineering.
Typical failures in ad hoc environments—such as score tampering, file system sabotage, evaluator leakage, and GPU contention—result not from inherent agent limitations but from environmental fragility. Well-specified metrics and secure, reproducible execution settings enable off-the-shelf LLM coding agents (e.g., Claude Code, Codex) to outperform bespoke solutions. Thus, the current technical frontier is the principled construction of environments that support open exploration and system integrity with minimal human intervention (Xin et al., 11 Jun 2026).
2. The Four Pillars of Environment Engineering
EurekAgent structures environment engineering along four orthogonal axes: permissions, artifacts, budget, and human-in-the-loop channels.
2.1. Permissions Engineering
Agents require access precisely to those primitives necessary for experimentation, with all other operations forbidden.
- Research runs execute within Docker containers; only a singular workspace volume is write-accessible.
- Evaluator binaries and ground truths reside externally—the agent interacts with them exclusively via a secure gRPC API (“hidden evaluator”). Direct visibility of evaluator code and test data is precluded, eliminating evaluator leakage and reward hacking.
- Parallel implementations in the same round are mutually isolated, inheriting artifacts only from preceding rounds.
- GPU access is mediated through an explicit locking API: for agents and GPUs , the lock relation enforces exclusivity, i.e., .
- Denied actions include host filesystem access, network traffic not proxied through managed search/browser endpoints, container termination, and evaluator reading.
2.2. Artifact Engineering
The agent’s actions and discoveries are persistently tracked and auditable via integrated filesystem and Git routines:
- Each experimental run maintains a structured directory containing session metadata, proposals, implementation results, and a history Git repository (auto-committed at agent checkpoints).
- Long-term memory is engineered through mandatory Git commits at each solution milestone, with commit messages required to detail both solution properties and change summaries.
- Artifacts include automatic caching of web-search outcomes and browser snapshots, supporting full reproducibility.
Example directory structure:
1
Artifact consolidation after each round involves replaying evaluator feedback (from results.json), scoring, updating ranked solutions, and generating human-readable summaries.
2.3. Budget Engineering
The environment dictates explicit cost and time boundaries for each research process along two axes: wall-clock and LLM API usage.
- Per run: = max rounds, = max parallel implementations, , , = stage budgets, = API-usage cost ceiling.
- The cumulative API cost 0 must satisfy 1, where 2 is the cost of the 3-th LLM call.
- Time is monitored per stage, e.g., 4.
- Budget monitoring is both passive (prompt-level warnings at 90% usage) and active (agent-callable queries for time/cost remaining).
- Budget exhaustion results in immediate abortion, state preservation, and supports safe resumption upon intervention.
2.4. Human-in-the-Loop Engineering
Supervision and intervention are robustly supported through:
- Terminal UI: live CLI feeds per agent session, with real-time inspection and interaction capabilities.
- Web Monitor: global run aggregation, including evolution plots, transcripts, and budget visualizations, provides direct control over run execution.
- Interaction loop: the agent is pause-able at any stage for supervisor review and clarification, with human inputs injected into subsequent agent prompts at the system-message level.
3. System Architecture and Agent Workflow
EurekAgent’s architecture comprises the following tightly integrated components:
- Agent Sessions: off-the-shelf CLI LLM agents (e.g., Claude Code) orchestrated as containerized processes.
- Environment Controller: Python daemon managing Docker lifecycle, cost/time budgets, artifact states, and inter-process communication.
- Hidden Evaluator: microservice for secure, shielded grading, accessible only by API.
- Artifact Store: a composite of run directories and Git repositories.
- Budget Monitor: tracks real-time consumption on both axes.
- UI Layer: provides both terminal-based and web-based monitoring and intervention.
Main Agent Loop
The core agent loop is formalized as Algorithm 1 in the source:
- Launch preparation session (in Docker)
- Agent performs “prepare” stage (with install, evaluator probing)
- For 5:
- Propose stage: agent returns up to 6 hypotheses
- Implement stage (parallel for 7): hypotheses are coded/tested in isolation
- Submissions are validated and scored, with results ranked and persisted
- Early stop if improvement stagnates or budgets exhausted
- Return best-scoring solution and full artifact log
Pseudocode corresponds directly to resource enforcement and artifact capture semantics outlined above.
4. Empirical Results and Performance Analysis
4.1. Task Domains
EurekAgent is validated across three canonical scientific domains:
- Mathematics Optimization: including 26-circle packing (maximize 8, subject to non-overlap), Erdős’ minimum overlap, and first autocorrelation inequality tasks.
- Kernel Engineering: triangular matrix-multiplication (TriMul), with objectives centered on minimizing geometric-mean runtime on an A100 GPU.
- Machine Learning Engineering: tasks drawn from the MLE-Bench Lite suite (e.g., cancer detection, plant pathology, image and text categorization).
4.2. Evaluation Metrics and Setup
- All experiments use the CLI variant of Claude Code underpinned by GLM-5.1, with web-based knowledge retrieval enabled.
- Task-specific hyperparameters are configured to maintain an API cost below $17 per task.
- Evaluation utilizes task-native metrics (optimization objective, geometric-mean runtime, or Kaggle-style medal rates).
- TriMul benchmarking follows a reproducible protocol: 3 warmup, 10 measured rounds, median/mean runtime reporting.
4.3. Results Summary
For the flagship 26-circle packing problem, EurekAgent achieved $G$9, surpassing both the best human (approx. 2.634000) and previous AI (2.635986) results, with a total API cost of approximately $\operatorname{Lock}\subseteq A\times G$0. Optimization emerged over 5 rounds, with initial rapid progress and final convergence via agent-discovered local optimization strategies (e.g., joint SLSQP and heuristics).
Key results across tasks are summarized as:
| Task | EurekAgent Score | Prior Best | Improvement |
|---|---|---|---|
| Circle Packing | 2.635999 | 2.635986 (AI) | New state-of-the-art |
| Erdős Overlap | 0.380870 | 0.380876 | Lower is better |
| 1st Autocorr (C*) | 1.502861 | 1.502863 | Lower is better |
| TriMul Runtime (median, μs) | 2005.03 | 2096.04 | 4.3% faster |
| MLE-Bench Lite (Any Medal) | 85.71% | 71.43% | Higher is better |
| MLE-Bench Lite (Gold Rate) | 71.43% | 57.14% | Higher is better |
All reported gains are substantiated under fixed evaluators and strict computational budgets, with substantial improvements both statistically and in reproducibility (Xin et al., 11 Jun 2026).
5. Open Source Infrastructure and Reproducibility
EurekAgent is released as open-source software for full reproducibility. The repository (https://github.com/THU-Team-Eureka/EurekAgent) contains:
- Modular Python package with controller, sandbox, evaluator client, artifact manager, and UI layers.
- Examples illustrating all benchmarks and scripts for controlled experiment execution and resumption.
- Reproducible artifact layout (as above), exhaustive logs, and detailed solution provenance (via Git).
- Simple setup: clone, install dependencies, configure Docker, and execute benchmark scripts (with API key). Both terminal and web-based interfaces are provided for real-time inspection and control.
Consistent with the environment engineering paradigm, EurekAgent demonstrates that environment-focused design enables robust, cost-effective, and high-performance autonomous research agents using only general-purpose LLM tools—without reliance on task-specific workflow scripting or fine-tuning (Xin et al., 11 Jun 2026).