EnConda-Bench: Config Diagnosis Benchmark
- EnConda-Bench is an environment configuration benchmark that evaluates agents using detailed process trajectories and fine-grained, phase-specific metrics.
- It systematically injects realistic README errors, records complete agent trajectories, and enables diagnosis of planning, error detection, and repair capabilities.
- The automated, Docker-validated corpus supports evaluations of LLM agents, providing actionable insights into performance gaps from error perception to successful repairs.
EnConda-Bench is the Environment Configuration Diagnosis Benchmark, designed to provide process-level trajectory evaluation of agent capabilities during environment configuration for software engineering tasks. Unlike conventional benchmarks that reduce evaluation to binary build/test outcomes, EnConda-Bench systematically injects realistic README errors, records full agent trajectories, and decomposes agent performance into fine-grained, phase-specific metrics. This enables diagnosis of module-level capabilities across the phases of environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and final environment execution. The benchmark corpus is automatically constructed and validated in Docker for reproducibility at scale, supporting nuanced evaluation across LLM-based agents and agent frameworks (Kuang et al., 29 Oct 2025).
1. Motivation and Objectives
Environment configuration comprises critical steps such as system package installation, Python dependency pinning, virtual environment activation, and test execution. Existing LLM-agent benchmarks simplify evaluation to build/test pass rates, which obscure the root causes of agent failures by not isolating breakdowns in planning, error diagnosis, repair, or command execution. EnConda-Bench addresses this limitation by:
- Injecting realistic, human-like errors into correct READMEs.
- Recording agent process trajectories including shell plan, error perception, feedback, and subsequent actions.
- Providing process-level metrics alongside aggregate outcomes to identify specific deficiencies and areas for targeted improvement.
This approach facilitates granular diagnosis (e.g., agents frequently identify errors yet cannot translate them into correct shell commands), and supports automated scaling to thousands of Docker-validated instances.
2. Task Definition and Corpus Construction
EnConda-Bench tasks are defined for agents given a pinned repository (fixed Git commit), full file tree, and a minimal base Docker image such as Ubuntu 22.04 with Conda. The README.md file is synthetically modified to contain between 1 and 10 injected errors, sourced from six canonical categories:
- E1: Dependency Installation Error
- E2: Command Usage Error
- E4: File Path Error
- E6: Logical Order Error
- E7: Version Compatibility Error
- E8: Other
Each erroneous README is accompanied by a gold-standard JSON annotation, specifying the injected error types, natural-language descriptions, candidate fixes, and the true fix command sequence.
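To make the annotation format concrete, the sketch below shows what a gold annotation for a single injected E1 error could look like. Apart from error_type, error_description, and fix_suggestion, which appear elsewhere in this article, the field names are illustrative assumptions rather than the benchmark's exact schema.

```python
# Hypothetical gold annotation for one instance with a single injected error.
# Only "error_type", "error_description", and "fix_suggestion" are field names
# mentioned in this article; the remaining keys are illustrative assumptions.
gold_annotation = {
    "repo": "example-org/example-repo",   # pinned repository (illustrative)
    "commit": "abc123",                   # fixed Git commit (illustrative)
    "errors": [
        {
            "error_type": "E1",           # Dependency Installation Error
            "error_description": "README instructs users to run `pip install request` "
                                 "instead of the correct package name `requests`.",
            "fix_suggestion": ["pip install requests"],
            "true_fix_commands": ["pip install requests"],  # gold fix sequence
        }
    ],
}
```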
Agent responsibilities for each instance include:
- Planning an initial shell command sequence by reading the flawed README.
- Executing commands in Docker, perceiving and localizing failures, and predicting error type.
- Producing repair suggestions based on error feedback.
- Iteratively updating and executing a shell script until environment setup completes.
Process trajectories (plan, perception, diagnosis, repairs, and final execution) are recorded for subsequent evaluation.
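Conceptually, these responsibilities form a plan-execute-diagnose-repair loop. The following Python sketch illustrates that loop under the assumption that planning, Docker execution, diagnosis, and repair are supplied as callables; it is a schematic of the recorded trajectory, not the benchmark's actual harness.

```python
def configure_environment(readme_text, plan, run, diagnose, propose_fix, max_rounds=10):
    """Schematic plan-execute-diagnose-repair loop for one benchmark instance.
    `plan`, `run`, `diagnose`, and `propose_fix` are caller-supplied callables
    (hypothetical here); the real harness records the same trajectory phases."""
    trajectory = []
    script = plan(readme_text)                       # phase 1: setup planning from the README
    trajectory.append({"phase": "plan", "script": list(script)})
    for _ in range(max_rounds):
        result = run(script)                         # execute the current script in Docker
        if result["success"]:
            trajectory.append({"phase": "final_execution", "status": "success"})
            break
        error = diagnose(result)                     # phase 2: error perception and diagnosis
        fix = propose_fix(error)                     # phase 3: feedback-driven repair suggestion
        trajectory.append({"phase": "diagnose_and_repair",
                           "error_type": error["error_type"],
                           "error_description": error["error_description"],
                           "fix_suggestion": fix})
        script = script + fix                        # phase 4: naive update, then retry
    return trajectory
```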
The benchmark corpus is constructed via a fully automated pipeline:
- Repository Selection: 323 high-quality GitHub repositories (≥10 stars, ≥1 000 commits, ≥10 closed issues), each pinned to a specific commit.
- README Error Injection: Six error categories injected by prompting Claude-4-sonnet and Gemini 2.5-Pro to minimally edit README and produce structured JSON error annotations.
- Scaling: Initial synthesis produces 1 772 two-error READMEs, split and merged to yield 4 201 READMEs covering 1–10+ errors (9 471 total errors).
- Automatic Validation: GPT-4.1-mini generates strict bash scripts per README; scripts are tested in Docker, and invalid injections are discarded.
- LLM Filtering and Human Verification: GPT-4.1-mini and human annotators confirm error validity, type, and fix correctness; final corpus retains high-quality, reproducible instances.
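As one illustration of the validation step, a generated setup script can be executed in a throwaway container and the injection discarded if validation fails. The sketch below uses the standard docker CLI through Python's subprocess module; the image tag, mount layout, and timeout are placeholders rather than the pipeline's actual configuration (the real base image also ships Conda).

```python
import pathlib
import subprocess
import tempfile

def validates_in_docker(setup_script: str, image: str = "ubuntu:22.04") -> bool:
    """Run a candidate setup script in a throwaway container and report whether
    it exits cleanly. Mirrors the validation idea only; image tag, mount layout,
    and timeout are illustrative assumptions."""
    with tempfile.TemporaryDirectory() as tmp:
        script_path = pathlib.Path(tmp) / "setup.sh"
        script_path.write_text("set -euo pipefail\n" + setup_script)
        try:
            proc = subprocess.run(
                ["docker", "run", "--rm", "-v", f"{tmp}:/work", image,
                 "bash", "/work/setup.sh"],
                capture_output=True, text=True, timeout=3600,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0
```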
3. Process Phases and Evaluation Metrics
EnConda-Bench decomposes the environment configuration process into four successive phases, each with corresponding evaluation metrics and data recording:
3.1 Environment Setup Planning
- Task: Infer an executable sequence of shell commands from the (possibly erroneous) README.
- Recording: Log the initial plan as a command sequence $P_0 = (c_1, c_2, \dots, c_n)$.
- Evaluation: No explicit planning-accuracy label is assigned; downstream impact is measured via first-round success. A plausible implication is that immediately reproducing the error at the correct step indicates adequate planning coverage.
3.2 Perception-Driven Error Diagnosis
- Task: Upon command failure, the agent must localize the error (step index $i$), predict its error type ($t_i$), and provide a concise description ($d_i$).
- Recording: Extract the agent's JSON error block {"error_type": $t_i$, "error_description": $d_i$}.
- Metrics: Precision, recall, and F1-score of the predicted error types are computed against the ground-truth annotations:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Additional metric: error description accuracy.
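A minimal way to compute these perception metrics, assuming a micro-averaged comparison of predicted versus gold error types per instance (the exact matching rule, e.g., whether the step index must also match, is not restated here), is:

```python
from collections import Counter

def perception_prf(predicted_types, gold_types):
    """Micro-averaged precision/recall/F1 over predicted vs. gold error types
    for one README instance. The matching rule is an assumption of this sketch."""
    pred, gold = Counter(predicted_types), Counter(gold_types)
    tp = sum(min(pred[t], gold[t]) for t in pred)    # correctly typed errors
    fp = sum(pred.values()) - tp                      # spurious predictions
    fn = sum(gold.values()) - tp                      # missed gold errors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```

For example, perception_prf(["E1", "E8"], ["E1", "E4"]) yields precision, recall, and F1 of 0.5 each.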
3.3 Feedback-Driven Repair
- Task: For each ground-truth error ($e_j$), produce one or more shell commands ($r_j$) designed to resolve it.
- Recording: Store the agent's fix_suggestion fields.
- Metric: Repair Success Rate (RSR)

$$\text{RSR} = \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{D}} \frac{1}{|E_k|} \sum_{e_j \in E_k} \mathbb{1}\left[r_j \text{ resolves } e_j\right]$$

where $E_k$ is the set of ground-truth errors in instance $k$ and $\mathcal{D}$ is the set of benchmark instances. A successful repair requires the suggested command to resolve the error when rerun in isolation.
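Given per-error rerun outcomes, the per-instance averaging written above can be computed as in this sketch; the isolated rerun itself is assumed to have already produced a boolean resolved flag per error.

```python
def repair_success_rate(instances):
    """Average, over instances, of the fraction of ground-truth errors whose
    suggested fix resolves the error when rerun in isolation. Each instance is
    a list of error dicts carrying a precomputed "resolved" flag (assumed)."""
    per_instance = []
    for errors in instances:
        if not errors:
            continue
        fixed = sum(1 for e in errors if e["resolved"])
        per_instance.append(fixed / len(errors))
    return sum(per_instance) / len(per_instance) if per_instance else 0.0
```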
3.4 Final Environment Execution
- Task: The agent emits a shell script integrating all repairs and remaining steps.
- Recording: Capture the final script $S_{\text{final}}$ and execute it in Docker.
- Metric: End-to-End Success Rate (E2E)

$$\text{E2E} = \frac{1}{|\mathcal{D}|} \sum_{k \in \mathcal{D}} \mathbb{1}\left[S_{\text{final}}^{(k)} \text{ builds the environment successfully}\right]$$

where $\mathbb{1}[\cdot]$ is the indicator function.
The suite of metrics (Perception F1, repair success, end-to-end success, error description accuracy) enables multi-dimensional capability profiling.
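Putting the pieces together, a per-agent capability profile can be assembled from the sketches above; the instance field names used here (predicted_types, gold_types, description_correct, final_setup_succeeded, errors) are hypothetical stand-ins for the harness's recorded outputs.

```python
def capability_profile(instances):
    """Aggregate phase-specific metrics into one profile per agent, reusing
    perception_prf and repair_success_rate from the sketches above. All
    instance field names are hypothetical."""
    if not instances:
        return {}
    f1s, desc_acc, e2e = [], [], []
    for inst in instances:
        _, _, f1 = perception_prf(inst["predicted_types"], inst["gold_types"])
        f1s.append(f1)
        desc_acc.append(inst["description_correct"])   # boolean per instance
        e2e.append(inst["final_setup_succeeded"])      # end-to-end indicator
    n = len(instances)
    return {
        "perception_f1": sum(f1s) / n,
        "error_description_accuracy": sum(desc_acc) / n,
        "repair_success_rate": repair_success_rate([inst["errors"] for inst in instances]),
        "end_to_end_success": sum(e2e) / n,
    }
```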
4. Agent Benchmarking and Empirical Findings
Four LLMs are evaluated across three agent settings:
- LLMs: GPT-4.1, Claude-4-sonnet, Gemini 2.5-Pro, DeepSeek-V3
- Agent Types:
- Zero-Shot (direct prompting)
- Code Agents (OpenHands, SWE-Agent)
- Environment-Setup Agents (INSTALLAMATIC, Repo2Run)
Empirical results show:
| Agent Type | Perception F1 | Repair Success (%) | End-to-End Success (Pass@1, %) |
|---|---|---|---|
| Zero-Shot (GPT-4.1) | 48.8 | <30 | <3 |
| Code Agent (OpenHands) | ~60.1 | ~36 | ~10 |
| Env-Setup (Repo2Run) | ~60.6 | 47.3 | 22.9 |
- Zero-Shot agents achieve high recall (>70%) but low precision (<40%) in error typing, and poor end-to-end rates.
- Code Agents moderately improve perception F1 (~60%) but struggle to convert accurate diagnosis into effective executable fixes.
- Environment-Setup Agents have the highest repair and end-to-end performance, but a substantial gap persists from perception F1 through fix accuracy to shell script execution.
Systematic tendencies include overuse of the catch-all E8 “Other” category and frequent under-detection of E4 “File Path Error.” Increasing output token counts improves error description accuracy but does not substantially affect end-to-end success. Case studies show agents may classify errors correctly but fail to execute fixes or introduce new errors.
5. Automated Instance Pipeline: Scalability and Validation
The benchmark’s large scale is achieved through a rigorous automated pipeline:
- Repository selection and pinning ensure stability.
- README error injection uses models (Claude-4-sonnet, Gemini 2.5-Pro) to generate minimal edits and error annotations.
- Strict shell script validation in Docker (Ubuntu 22.04 + Conda) discards invalid injections and ensures reproducibility.
- LLM-assisted filtering (GPT-4.1-mini) and human verification yield high-quality error annotations (LLM–human agreement 98.5%).
- Final dataset comprises 4 201 unique README instances with 9 471 distinct errors.
A plausible implication is that such large-scale, validated datasets offer a reproducible foundation for future research and iterative agent improvement.
6. Recommendations and Directions
Based on process-level analysis, actionable recommendations for the field include:
- Trajectory-Aware Fine-Tuning: Enhancing LLMs with process trajectory data (planning ⇾ perception ⇾ feedback ⇾ action) to improve feedback-to-action mapping.
- Explicit Error-Focused Chain-of-Thought: Training agents to reason explicitly about environment state changes and error corrections to reduce reliance on generic error categories.
- Multi-Round Interactive Looping: Allowing agents to iteratively re-probe the environment after each repair to validate its effectiveness.
- Tooling for Environment Introspection: Integrating system query tools (“conda list,” “which gcc”) so repair actions are better informed; a minimal sketch follows this list.
- Broader Ecosystem Coverage: Expanding coverage beyond Python (Java, Node.js) and diversifying error typology.
- Reinforcement Learning with Environment Rewards: Using sparse build/test pass signals as multi-step trajectory rewards.
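Referencing the introspection recommendation above, the following minimal sketch wraps the kind of system queries mentioned (“conda list,” “which gcc”); it illustrates the recommendation and is not tooling shipped with the benchmark.

```python
import shutil
import subprocess

def introspect_environment():
    """Minimal environment-introspection helper of the kind recommended above:
    query compiler availability and installed Conda packages before repairing."""
    report = {}
    report["gcc_path"] = shutil.which("gcc")            # None if gcc is absent
    if shutil.which("conda"):
        out = subprocess.run(["conda", "list"], capture_output=True, text=True)
        report["conda_packages"] = out.stdout.splitlines()
    else:
        report["conda_packages"] = None
    return report
```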
This suggests a research direction focused on agent architectures and training regimes that improve action-taking fidelity following diagnosis.
7. Significance and Availability
EnConda-Bench introduces the first process-level, large-scale benchmark for evaluating environment configuration agents. By leveraging controlled error injection, Docker-based validation, and trajectory recording, it surfaces fine-grained diagnostics not visible in traditional benchmarks. Findings indicate that state-of-the-art LLM agents excel at error detection but substantially lag in executing repairs, with proficiency gaps between perception and action phases. The publicly released dataset, evaluation suite, and automated data construction pipeline are positioned to drive advances in software engineering automation (Kuang et al., 29 Oct 2025).