EnConda-Bench: Config Diagnosis Benchmark
- EnConda-Bench is an environment configuration benchmark that evaluates agents using detailed process trajectories and fine-grained, phase-specific metrics.
- It systematically injects realistic README errors, records complete agent trajectories, and enables diagnosis of planning, error detection, and repair capabilities.
- The automated, Docker-validated corpus supports evaluations of LLM agents, providing actionable insights into performance gaps from error perception to successful repairs.
EnConda-Bench is the Environment Configuration Diagnosis Benchmark, designed to provide process-level trajectory evaluation of agent capabilities during environment configuration for software engineering tasks. Unlike conventional benchmarks that reduce evaluation to binary build/test outcomes, EnConda-Bench systematically injects realistic README errors, records full agent trajectories, and decomposes agent performance into fine-grained, phase-specific metrics. This enables diagnosis of module-level capabilities across the phases of environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and final environment execution. The benchmark corpus is automatically constructed and validated in Docker for reproducibility at scale, supporting nuanced evaluation across LLM-based agents and agent frameworks (Kuang et al., 29 Oct 2025).
1. Motivation and Objectives
Environment configuration comprises critical steps such as system package installation, Python dependency pinning, virtual environment activation, and test execution. Existing LLM-agent benchmarks simplify evaluation to build/test pass rates, which obscure the root causes of agent failures by not isolating breakdowns in planning, error diagnosis, repair, or command execution. EnConda-Bench addresses this limitation by:
- Injecting realistic, human-like errors into correct READMEs.
- Recording agent process trajectories including shell plan, error perception, feedback, and subsequent actions.
- Providing process-level metrics alongside aggregate outcomes to identify specific deficiencies and areas for targeted improvement.
This approach facilitates granular diagnosis (e.g., agents frequently identify errors yet cannot translate them into correct shell commands), and supports automated scaling to thousands of Docker-validated instances.
2. Task Definition and Corpus Construction
EnConda-Bench tasks are defined for agents given a pinned repository (fixed Git commit), full file tree, and a minimal base Docker image such as Ubuntu 22.04 with Conda. The README.md file is synthetically modified to contain between 1 and 10 injected errors, sourced from six canonical categories:
- E1: Dependency Installation Error
- E2: Command Usage Error
- E4: File Path Error
- E6: Logical Order Error
- E7: Version Compatibility Error
- E8: Other
Each erroneous README is accompanied by a gold-standard JSON annotation, specifying the injected error types, natural-language descriptions, candidate fixes, and the true fix command sequence.
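To make the annotation format concrete, the sketch below shows what a gold annotation for a single injected E1 error could look like. Apart from error_type, error_description, and fix_suggestion, which appear elsewhere in this article, the field names are illustrative assumptions rather than the benchmark's exact schema.

```python
# Hypothetical gold annotation for one instance with a single injected error.
# Only "error_type", "error_description", and "fix_suggestion" are field names
# mentioned in this article; the remaining keys are illustrative assumptions.
gold_annotation = {
    "repo": "example-org/example-repo",   # pinned repository (illustrative)
    "commit": "abc123",                   # fixed Git commit (illustrative)
    "errors": [
        {
            "error_type": "E1",           # Dependency Installation Error
            "error_description": "README instructs users to run `pip install request` "
                                 "instead of the correct package name `requests`.",
            "fix_suggestion": ["pip install requests"],
            "true_fix_commands": ["pip install requests"],  # gold fix sequence
        }
    ],
}
```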
Agent responsibilities for each instance include:
- Planning an initial shell command sequence by reading the flawed README.
- Executing commands in Docker, perceiving and localizing failures, and predicting error type.
- Producing repair suggestions based on error feedback.
- Iteratively updating and executing a shell script until environment setup completes.
Process trajectories (plan, perception, diagnosis, repairs, and final execution) are recorded for subsequent evaluation.
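Conceptually, these responsibilities form a plan-execute-diagnose-repair loop. The following Python sketch illustrates that loop under the assumption that planning, Docker execution, diagnosis, and repair are supplied as callables; it is a schematic of the recorded trajectory, not the benchmark's actual harness.

```python
def configure_environment(readme_text, plan, run, diagnose, propose_fix, max_rounds=10):
    """Schematic plan-execute-diagnose-repair loop for one benchmark instance.
    `plan`, `run`, `diagnose`, and `propose_fix` are caller-supplied callables
    (hypothetical here); the real harness records the same trajectory phases."""
    trajectory = []
    script = plan(readme_text)                       # phase 1: setup planning from the README
    trajectory.append({"phase": "plan", "script": list(script)})
    for _ in range(max_rounds):
        result = run(script)                         # execute the current script in Docker
        if result["success"]:
            trajectory.append({"phase": "final_execution", "status": "success"})
            break
        error = diagnose(result)                     # phase 2: error perception and diagnosis
        fix = propose_fix(error)                     # phase 3: feedback-driven repair suggestion
        trajectory.append({"phase": "diagnose_and_repair",
                           "error_type": error["error_type"],
                           "error_description": error["error_description"],
                           "fix_suggestion": fix})
        script = script + fix                        # phase 4: naive update, then retry
    return trajectory
```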
The benchmark corpus is constructed via a fully automated pipeline:
- Repository Selection: 323 high-quality GitHub repositories (≥10 stars, ≥1 000 commits, ≥10 closed issues), each pinned to a specific commit.
- README Error Injection: Six error categories injected by prompting Claude-4-sonnet and Gemini 2.5-Pro to minimally edit README and produce structured JSON error annotations.
- Scaling: Initial synthesis produces 1 772 two-error READMEs, split and merged to yield 4 201 READMEs covering 1–10+ errors (9 471 total errors).
- Automatic Validation: GPT-4.1-mini generates strict bash scripts per README; scripts are tested in Docker, and invalid injections are discarded.
- LLM Filtering and Human Verification: GPT-4.1-mini and human annotators confirm error validity, type, and fix correctness; final corpus retains high-quality, reproducible instances.
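As one illustration of the validation step, a generated setup script can be executed in a throwaway container and the injection discarded if validation fails. The sketch below uses the standard docker CLI through Python's subprocess module; the image tag, mount layout, and timeout are placeholders rather than the pipeline's actual configuration (the real base image also ships Conda).

```python
import pathlib
import subprocess
import tempfile

def validates_in_docker(setup_script: str, image: str = "ubuntu:22.04") -> bool:
    """Run a candidate setup script in a throwaway container and report whether
    it exits cleanly. Mirrors the validation idea only; image tag, mount layout,
    and timeout are illustrative assumptions."""
    with tempfile.TemporaryDirectory() as tmp:
        script_path = pathlib.Path(tmp) / "setup.sh"
        script_path.write_text("set -euo pipefail\n" + setup_script)
        try:
            proc = subprocess.run(
                ["docker", "run", "--rm", "-v", f"{tmp}:/work", image,
                 "bash", "/work/setup.sh"],
                capture_output=True, text=True, timeout=3600,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0
```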
3. Process Phases and Evaluation Metrics
EnConda-Bench decomposes the environment configuration process into four successive phases, each with corresponding evaluation metrics and data recording:
3.1 Environment Setup Planning
- Task: Infer an executable sequence of shell commands from the (possibly erroneous) README.
- Recording: Log the initial plan as a command sequence $P_0 = (c_1, c_2, \dots, c_n)$.
- Evaluation: No explicit planning-accuracy label is assigned; downstream impact is measured via first-round success. A plausible implication is that immediately reproducing the error at the correct step indicates adequate planning coverage.
3.2 Perception-Driven Error Diagnosis
- Task: Upon command failure, the agent must localize the error (step index $i$), predict its error type ($t_i$), and provide a concise description ($d_i$).
- Recording: Extract the agent's JSON error block {"error_type": $t_i$, "error_description": $d_i$}.
- Metrics: Precision, recall, and F1-score of the predicted error types are computed against the ground-truth annotations:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Additional metric: error description accuracy.
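A minimal way to compute these perception metrics, assuming a micro-averaged comparison of predicted versus gold error types per instance (the exact matching rule, e.g., whether the step index must also match, is not restated here), is:

```python
from collections import Counter

def perception_prf(predicted_types, gold_types):
    """Micro-averaged precision/recall/F1 over predicted vs. gold error types
    for one README instance. The matching rule is an assumption of this sketch."""
    pred, gold = Counter(predicted_types), Counter(gold_types)
    tp = sum(min(pred[t], gold[t]) for t in pred)    # correctly typed errors
    fp = sum(pred.values()) - tp                      # spurious predictions
    fn = sum(gold.values()) - tp                      # missed gold errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```

For example, perception_prf(["E1", "E8"], ["E1", "E4"]) yields precision, recall, and F1 of 0.5 each.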
3.3 Feedback-Driven Repair
- Task: For each ground-truth error ($e_j$), produce one or more shell commands ($r_j$) designed to resolve it.
- Recording: Store the agent's fix_suggestion fields.
- Metric: Repair Success Rate (RSR)

$$\text{RSR} = \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{D}} \frac{1}{|E_k|} \sum_{e_j \in E_k} \mathbb{1}\left[r_j \text{ resolves } e_j\right]$$

where $E_k$ is the set of ground-truth errors in instance $k$ and $\mathcal{D}$ is the set of benchmark instances. A successful repair requires the suggested command to resolve the error when rerun in isolation.
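Given per-error rerun outcomes, the per-instance averaging written above can be computed as in this sketch; the isolated rerun itself is assumed to have already produced a boolean resolved flag per error.

```python
def repair_success_rate(instances):
    """Average, over instances, of the fraction of ground-truth errors whose
    suggested fix resolves the error when rerun in isolation. Each instance is
    a list of error dicts carrying a precomputed "resolved" flag (assumed)."""
    per_instance = []
    for errors in instances:
        if not errors:
            continue
        fixed = sum(1 for e in errors if e["resolved"])
        per_instance.append(fixed / len(errors))
    return sum(per_instance) / len(per_instance) if per_instance else 0.0
```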
3.4 Final Environment Execution
- Task: The agent emits a shell script integrating all repairs and remaining steps.
- Recording: Capture the final script $S_{\text{final}}$ and execute it in Docker.
- Metric: End-to-End Success Rate (E2E)

$$\text{E2E} = \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{D}} \mathbb{1}\left[S_{\text{final}}^{(k)} \text{ builds the environment successfully}\right]$$

where $\mathbb{1}[\cdot]$ is the indicator function.
The suite of metrics (Perception F1, repair success, end-to-end success, error description accuracy) enables multi-dimensional capability profiling.
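Putting the pieces together, a per-agent capability profile can be assembled from the sketches above; the instance field names used here (predicted_types, gold_types, description_correct, final_setup_succeeded, errors) are hypothetical stand-ins for the harness's recorded outputs.

```python
def capability_profile(instances):
    """Aggregate phase-specific metrics into one profile per agent, reusing
    perception_prf and repair_success_rate from the sketches above. All
    instance field names are hypothetical."""
    if not instances:
        return {}
    f1s, desc_acc, e2e = [], [], []
    for inst in instances:
        _, _, f1 = perception_prf(inst["predicted_types"], inst["gold_types"])
        f1s.append(f1)
        desc_acc.append(inst["description_correct"])   # boolean per instance
        e2e.append(inst["final_setup_succeeded"])      # end-to-end indicator
    n = len(instances)
    return {
        "perception_f1": sum(f1s) / n,
        "error_description_accuracy": sum(desc_acc) / n,
        "repair_success_rate": repair_success_rate([inst["errors"] for inst in instances]),
        "end_to_end_success": sum(e2e) / n,
    }
```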
4. Agent Benchmarking and Empirical Findings
Four LLMs are evaluated across three agent settings:
- LLMs: GPT-4.1, Claude-4-sonnet, Gemini 2.5-Pro, DeepSeek-V3
- Agent Types:
- Zero-Shot (direct prompting)
- Code Agents (OpenHands, SWE-Agent)
- Environment-Setup Agents (INSTALLAMATIC, Repo2Run)
Empirical results show:
| Agent Type | Perception F1 | Repair Success (%) | End-to-End Success (Pass@1, %) |
|---|---|---|---|
| Zero-Shot (GPT-4.1) | 48.8 | <30 | <3 |
| Code Agent (OpenHands) | ~60.1 | ~36 | ~10 |
| Env-Setup (Repo2Run) | ~60.6 | 47.3 | 22.9 |
- Zero-Shot agents achieve high recall (>70%) but low precision (<40%) in error typing, and poor end-to-end rates.
- Code Agents moderately improve perception F1 (~60%) but struggle to convert accurate diagnosis into effective executable fixes.
- Environment-Setup Agents have the highest repair and end-to-end performance, but a substantial gap persists from perception F1 through fix accuracy to shell script execution.
Systematic tendencies include overuse of the catch-all E8 “Other” category and frequent under-detection of E4 “File Path Error.” Increasing output token counts improves error description accuracy but does not substantially affect end-to-end success. Case studies show agents may classify errors correctly but fail to execute fixes or introduce new errors.
5. Automated Instance Pipeline: Scalability and Validation
The benchmark’s large scale is achieved through a rigorous automated pipeline:
- Repository selection and pinning ensure stability.
- README error injection uses models (Claude-4-sonnet, Gemini 2.5-Pro) to generate minimal edits and error annotations.
- Strict shell script validation in Docker (Ubuntu 22.04 + Conda) discards invalid injections and ensures reproducibility.
- LLM-assisted filtering (GPT-4.1-mini) and human verification yield high-quality error annotations (LLM–human agreement 98.5%).
- Final dataset comprises 4 201 unique README instances with 9 471 distinct errors.
A plausible implication is that such large-scale, validated datasets offer a reproducible foundation for future research and iterative agent improvement.
6. Recommendations and Directions
Based on process-level analysis, actionable recommendations for the field include:
- Trajectory-Aware Fine-Tuning: Enhancing LLMs with process trajectory data (planning ⇾ perception ⇾ feedback ⇾ action) to improve feedback-to-action mapping.
- Explicit Error-Focused Chain-of-Thought: Training agents to reason explicitly about environment state changes and error corrections to reduce reliance on generic error categories.
- Multi-Round Interactive Looping: Allowing agents to iteratively re-probe the environment after each repair to validate its effectiveness.
- Tooling for Environment Introspection: Integrating system query tools (“conda list,” “which gcc”) so repair actions are better informed; a minimal sketch follows this list.
- Broader Ecosystem Coverage: Expanding coverage beyond Python (Java, Node.js) and diversifying error typology.
- Reinforcement Learning with Environment Rewards: Using sparse build/test pass signals as multi-step trajectory rewards.
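Referencing the introspection recommendation above, the following minimal sketch wraps the kind of system queries mentioned (“conda list,” “which gcc”); it illustrates the recommendation and is not tooling shipped with the benchmark.

```python
import shutil
import subprocess

def introspect_environment():
    """Minimal environment-introspection helper of the kind recommended above:
    query compiler availability and installed Conda packages before repairing."""
    report = {}
    report["gcc_path"] = shutil.which("gcc")            # None if gcc is absent
    if shutil.which("conda"):
        out = subprocess.run(["conda", "list"], capture_output=True, text=True)
        report["conda_packages"] = out.stdout.splitlines()
    else:
        report["conda_packages"] = None
    return report
```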
This suggests a research direction focused on agent architectures and training regimes that improve action-taking fidelity following diagnosis.
7. Significance and Availability
EnConda-Bench introduces the first process-level, large-scale benchmark for evaluating environment configuration agents. By leveraging controlled error injection, Docker-based validation, and trajectory recording, it surfaces fine-grained diagnostics not visible in traditional benchmarks. Findings indicate that state-of-the-art LLM agents excel at error detection but substantially lag in executing repairs, with proficiency gaps between perception and action phases. The publicly released dataset, evaluation suite, and automated data construction pipeline are positioned to drive advances in software engineering automation (Kuang et al., 29 Oct 2025).