EnvBench-Python: A Benchmark Ecosystem
- EnvBench-Python is a benchmarking ecosystem that evaluates Python environments, focusing on complex dependency setups, runtime analysis, and automated debugging.
- It employs precise metrics like pass@1 (up to 6.69% success) and avgErrs (approximately 52 errors) to assess environment setup quality.
- The suite integrates multiple resources, including repositories, executable projects, bug databases, and profiling tools, to support reproducible research in Python engineering.
EnvBench-Python is a benchmarking and testing ecosystem specializing in Python program environments, configuration, runtime analysis, optimization, and debugging. This concept covers benchmarks for environment setup, executable codebases, numerical optimization test functions, profiling tools, and real-world bug databases, addressing the complex landscape of Python reproducibility and automation in contemporary software engineering and ML workflows.
1. Benchmark Scope and Composition
EnvBench-Python, as defined by the recent benchmarking initiative, centers on repositories and test environments that present genuine configuration and operational challenges for Python. The current benchmark catalog includes:
- 329 Python repositories selected explicitly for nontrivial setup requirements, such as complex or multi-manager dependency specifications (requirements.txt, setup.py, pyproject.toml/poetry).
- Exclusion of trivial repositories that can be set up via deterministic scripts, ensuring that the benchmarks reflect real-world challenges (Eliseeva et al., 18 Mar 2025).
- Supplementary assets, including large-scale executable project sets (50 projects, 681k lines of code), bug databases (493 curated bugs), and standard optimization function libraries.
This aggregation enables evaluation of agentic, automated, and ML-powered workflows under rigorous, domain-representative conditions.
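To make the selection criterion above concrete, here is a minimal sketch of a filter that flags repositories whose dependency declarations span more than one manifest/manager format; the helper name, file lists, and threshold are illustrative assumptions rather than EnvBench's actual selection pipeline.

```python
from pathlib import Path

# Manifest files associated with distinct dependency/packaging managers.
# The grouping and threshold below are illustrative assumptions, not
# EnvBench's exact selection rule.
MANIFESTS = {
    "pip": ("requirements.txt", "requirements-dev.txt"),
    "setuptools": ("setup.py", "setup.cfg"),
    "poetry/pep621": ("pyproject.toml",),
}

def is_multi_manager(repo_root: str) -> bool:
    """Flag repositories whose dependency declarations span several managers."""
    root = Path(repo_root)
    managers_present = {
        manager
        for manager, files in MANIFESTS.items()
        if any((root / name).exists() for name in files)
    }
    return len(managers_present) >= 2

if __name__ == "__main__":
    print(is_multi_manager("."))
```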
2. Environment Setup and Evaluation Metrics
Automated setup evaluation in EnvBench-Python relies on static analysis and error tracking:
- Environment setup scripts are executed within Docker containers, providing isolation and reproducibility.
- Static analysis via pyright determines unresolved Python imports ("reportMissingImports"), while shell exit codes signal installation success.
- Two formal metrics are employed:
  - pass@1: an environment setup attempt counts as successful only if the shell script returns exit code zero and pyright reports zero missing imports:
    $$\text{pass@1} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\text{exit}_i = 0 \;\wedge\; e_i = 0\right],$$
    where $e_i$ is the number of missing-import errors reported for repository $i$ and $N$ is the number of repositories.
  - avgErrs: the mean number of unresolved imports across successfully executed scripts:
    $$\text{avgErrs} = \frac{1}{|S|} \sum_{i \in S} e_i, \qquad S = \{\, i : \text{exit}_i = 0 \,\},$$
    where $e_i$ is the error count in repository $i$, restricted to cases with zero exit code (a computational sketch of both metrics follows below).
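The following is a minimal computational sketch of the two metrics, assuming per-repository results have already been collected; the SetupResult container is illustrative, and the missing-import counts are assumed to come from filtering pyright diagnostics on the "reportMissingImports" rule. It is not EnvBench's actual evaluation code.

```python
from dataclasses import dataclass

@dataclass
class SetupResult:
    repo: str
    exit_code: int        # shell exit code of the generated setup script
    missing_imports: int  # pyright diagnostics matching "reportMissingImports"

def pass_at_1(results: list[SetupResult]) -> float:
    """Fraction of repositories with a zero exit code AND zero missing imports."""
    ok = sum(1 for r in results if r.exit_code == 0 and r.missing_imports == 0)
    return ok / len(results)

def avg_errs(results: list[SetupResult]) -> float:
    """Mean missing-import count over scripts that exited with code zero."""
    succeeded = [r for r in results if r.exit_code == 0]
    return sum(r.missing_imports for r in succeeded) / len(succeeded)

if __name__ == "__main__":
    demo = [
        SetupResult("repo-a", 0, 0),    # counts toward pass@1
        SetupResult("repo-b", 0, 37),   # executes, but imports unresolved
        SetupResult("repo-c", 1, 0),    # script failed outright
    ]
    print(f"pass@1  = {pass_at_1(demo):.2%}")
    print(f"avgErrs = {avg_errs(demo):.1f}")
```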
In recent evaluations:
- The Bash Agent (an iterative ReAct-based LLM strategy) with GPT-4o achieved the best pass@1 for Python at 6.69%.
- Even among "successful" runs, avgErrs remains substantial (∼52), illustrating the persistent complexity of Python environment setup.
3. Agentic and ML-Based Setup Strategies
Three principal configuration workflows have been assessed:

| Approach | LLM Backbone | pass@1 (%) | avgErrs (errors) |
|---|---|---|---|
| Zero-shot LLM | GPT-4o | 5.47 | higher |
| Installamatic Agent | GPT-4o | 4.86 | highest |
| Bash Agent (ReAct) | GPT-4o | 6.69 | lowest |
- Zero-shot LLM: Single request strategy; baseline configuration quality.
- Installamatic Agent: Two-phase search and assembly using context and installation instructions.
- Bash Agent: Iterative refinement, allowing the LLM to execute and revise shell commands in response to real-time feedback.
These results suggest that iterative, feedback-driven agentic approaches are incrementally more effective for complex Python environments (Eliseeva et al., 18 Mar 2025).
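To illustrate the contrast with the single-shot strategy, the sketch below shows a heavily simplified, Bash-Agent-style loop: a model proposes a shell command, the command is executed, and its output is fed back for the next turn. The propose_command placeholder, the 'DONE' termination token, and the turn limit are illustrative assumptions, not the benchmark's actual agent implementation.

```python
import subprocess

MAX_TURNS = 10  # illustrative budget; the real agent's limits may differ

def propose_command(transcript: list[str]) -> str:
    """Placeholder for an LLM call (e.g., GPT-4o) that reads the transcript of
    previous commands and outputs, and returns the next shell command, or the
    token 'DONE' once it believes the environment is set up."""
    raise NotImplementedError("plug in an LLM client here")

def run_bash_agent(repo_path: str) -> list[str]:
    transcript: list[str] = [f"Task: set up the Python environment in {repo_path}"]
    for _ in range(MAX_TURNS):
        command = propose_command(transcript)
        if command.strip() == "DONE":
            break
        # Execute the proposed command inside the repository checkout and
        # capture stdout/stderr so the model can react to real feedback.
        proc = subprocess.run(
            command, shell=True, cwd=repo_path,
            capture_output=True, text=True, timeout=600,
        )
        transcript.append(f"$ {command}")
        transcript.append(f"(exit {proc.returncode})\n{proc.stdout}{proc.stderr}")
    return transcript
```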
4. Ancillary Benchmark Ecosystem
EnvBench-Python draws upon multiple, compatible benchmarking suites:
- DyPyBench (Bouzenia et al., 1 Mar 2024): A suite of 50 real-world executable projects designed for dynamic analysis (call graph extraction, value-use logging, and API usage mining). Instrumentation hooks (e.g., sys.settrace) generate dynamic traces, enabling empirical comparisons with static analyses; a minimal tracing sketch follows this list.
- BugsInPy (Widyasari et al., 27 Jan 2024): Database of 493 manually curated bugs from 17 popular projects, each paired with exposing test cases and patch diffs. The database abstraction layer harmonizes test invocation, mutation analysis, and code coverage tooling, supporting controlled, reproducible debugging research.
- Python Benchmark Functions Framework (Baronti et al., 23 Jun 2024): Library of multi-modal, continuous optimization functions (e.g., Ackley, Schwefel, Easom) instantiable in arbitrary dimensions. Functions encapsulate meta-information—search boundaries, optima locations, LaTeX formulas, and bibliographic references. Interactive visualization and CI routines for local optima validation provide robust reliability for optimization studies.
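As a concrete illustration of the instrumentation-hook approach named above, the following minimal sketch uses sys.settrace to record caller/callee pairs while a workload runs; it captures only call events and is far simpler than DyPyBench's actual instrumentation.

```python
import sys

call_edges: set[tuple[str, str]] = set()

def _tracer(frame, event, arg):
    # 'call' fires whenever a new Python frame is entered; the parent frame
    # identifies the caller, giving one edge of a dynamic call graph.
    if event == "call" and frame.f_back is not None:
        caller = frame.f_back.f_code.co_name
        callee = frame.f_code.co_name
        call_edges.add((caller, callee))
    return _tracer  # keep tracing inside nested frames

def traced(func, *args, **kwargs):
    """Run func under the tracer and return its result."""
    sys.settrace(_tracer)
    try:
        return func(*args, **kwargs)
    finally:
        sys.settrace(None)

# Example: trace a tiny workload and print the observed call edges.
def helper(x):
    return x * 2

def workload():
    return sum(helper(i) for i in range(3))

if __name__ == "__main__":
    traced(workload)
    for caller, callee in sorted(call_edges):
        print(f"{caller} -> {callee}")
```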
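The benchmark-function library above pairs each function with its formula and known optimum; as a standalone illustration (not using the framework's own API), here is the Ackley function in arbitrary dimension with the customary parameters a = 20, b = 0.2, c = 2π.

```python
import math

def ackley(x: list[float], a: float = 20.0, b: float = 0.2,
           c: float = 2 * math.pi) -> float:
    """Ackley test function in len(x) dimensions; global minimum f(0,...,0) = 0."""
    n = len(x)
    sum_sq = sum(xi * xi for xi in x)
    sum_cos = sum(math.cos(c * xi) for xi in x)
    return (-a * math.exp(-b * math.sqrt(sum_sq / n))
            - math.exp(sum_cos / n)
            + a + math.e)

if __name__ == "__main__":
    print(ackley([0.0, 0.0]))        # ~0 at the global optimum
    print(ackley([1.5, -2.0, 3.0]))  # a non-optimal point in 3 dimensions
```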
5. Runtime Analysis, Profiling, and Performance
Optimization of Python environments and workflows extends beyond basic setup. Profiling tools such as Scalene (Berger et al., 2022) provide multidimensional insight:
- Precision profiling of CPU, memory, and GPU usage.
- Sampling-based attribution of interpreter versus native library execution, supporting informed optimization decisions.
- Detection of memory leaks using threshold-based sampling and Laplace's Rule of Succession.
- Quantification of copy volume (MB/sec) across Python-native and CPU-GPU data boundaries, identifying costly conversions and unintended bottlenecks.
These capabilities facilitate fine-grained, quantitative assessment of environments set up by agentic or automated workflows.
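As a small illustration of the kind of pattern such profilers surface, the script below repeatedly crosses the NumPy/Python boundary by converting an array to a list inside a loop; run under a sampling profiler such as Scalene (typically invoked as `scalene script.py`), most of the time and copy volume would be expected to land on that conversion line. The workload is an arbitrary example, not one taken from the Scalene paper.

```python
import numpy as np

def slow_mean(values: np.ndarray, iterations: int = 200) -> float:
    total = 0.0
    for _ in range(iterations):
        # Converting the array to a Python list on every iteration forces a
        # copy across the native/Python boundary -- exactly the kind of line
        # a copy-volume profiler is designed to highlight.
        as_list = values.tolist()
        total += sum(as_list) / len(as_list)
    return total / iterations

def fast_mean(values: np.ndarray, iterations: int = 200) -> float:
    # Staying in native NumPy code avoids the repeated conversions.
    return float(sum(values.mean() for _ in range(iterations)) / iterations)

if __name__ == "__main__":
    data = np.random.default_rng(0).random(1_000_000)
    print(slow_mean(data))
    print(fast_mean(data))
```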
6. Significance, Limitations, and Future Directions
The EnvBench-Python collection, its static analysis metrics, and companion benchmarks underscore several points:
- Automated setup remains challenging: Even with advanced LLM-based workflows, success rates for full environment setup (zero missing imports, working installations) remain below 7%. Scalable, robust Python reproducibility is still an open research problem.
- Complexity originates from dependency specifications, Python versioning, and dynamic imports—which are not reliably handled by naive baseline or single-shot strategies.
- Static analysis (e.g., pyright) provides scalable means for benchmark extension, offering lightweight and automated grading of setup completeness.
A plausible implication is that further improvements will require hybrid approaches that combine static analysis, dynamic execution feedback, multi-turn agentic interaction, and perhaps integration with bug databases and runtime profiling. Future work may focus on extending framework taxonomies, providing derivative data for optimization, modeling more sophisticated bug scenarios, and integrating agentic and feedback-driven configuration workflows at larger scale.
7. Accessibility and Benchmark Infrastructure
EnvBench and its Python suite are publicly accessible:
- Benchmark suite, dataset, and experiment logs: https://github.com/JetBrains-Research/EnvBench, https://jb.gg/envbench
- Docker-based infrastructure ensures reproducibility and standardized execution environments.
- Ancillary frameworks (DyPyBench, BugsInPy, Python Benchmark Functions) are also available, offering integrated solutions for profiling, analysis, optimization, and controlled testing.
This openly available, rigorously constructed ecosystem enables reproducible research, model tuning, and systematic comparison across environment setup and performance optimization in Python software engineering.