DI-BENCH: LLM Dependency Inference Benchmark

Updated 14 June 2026

DI-BENCH is a benchmark that assesses LLMs’ ability to accurately reconstruct masked dependency declarations in real-world code repositories.
It utilizes a diverse, stratified dataset across Python, C#, Rust, and JavaScript, ensuring each repository's CI-based testability.
The benchmark highlights LLM limitations in handling long-context reasoning, versioning precision, and dependency hallucinations, impacting automated code generation.

DI-BENCH is a comprehensive benchmark for evaluating LLMs on the task of dependency inference—specifically, reconstructing all internal and external package dependencies required to build and test real-world software repositories. This problem has emerged as fundamental for automated code generation, with dependency inference errors accounting for over 40% of observed runtime failures in LLM-generated repositories. DI-BENCH combines scale, diversity, and execution-based assessment, uniquely measuring whether a candidate dependency list permits full, automated verification via each repository’s original continuous integration (CI) environment (Zhang et al., 23 Jan 2025).

1. Benchmark Composition and Dataset Construction

DI-BENCH is constructed from a curated collection of 581 open-source software repositories spanning four major programming languages: Python, C#, Rust, and JavaScript. Repository selection emphasizes both real-world relevance and CI-testability. The inclusion criteria are as follows:

Minimum 100 GitHub stars;
Repository size ≤ 10 MB;
Presence of a .github/workflows directory for GitHub Actions support.
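
The three criteria above amount to a simple predicate over repository metadata. A minimal sketch follows, assuming a hypothetical `RepoInfo` record whose field names are illustrative only; the source does not describe how the metadata was actually gathered.

```python
from dataclasses import dataclass


@dataclass
class RepoInfo:
    """Hypothetical metadata record; field names are illustrative."""
    name: str
    stars: int
    size_mb: float
    has_workflows_dir: bool  # True if .github/workflows exists


def meets_inclusion_criteria(repo: RepoInfo) -> bool:
    """Apply the three stated criteria: stars, size, GitHub Actions support."""
    return repo.stars >= 100 and repo.size_mb <= 10 and repo.has_workflows_dir


# Made-up candidates, purely for illustration.
candidates = [
    RepoInfo("popular-small", 250, 3.2, True),
    RepoInfo("too-big", 900, 48.0, True),
    RepoInfo("no-ci", 120, 1.0, False),
]
selected = [r.name for r in candidates if meets_inclusion_criteria(r)]
print(selected)
```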

The dataset is stratified into two size tiers:

Subset	Number of repos	Avg. files/repo	Avg. dependencies	Avg. tokens/repo
Regular	387	33.9	11.9	29 K
Large	194	218.0	29.7	385 K

Repositories were selected to reflect a long-tail real-world distribution. Python and JavaScript dominate the small-to-medium regime, while Rust and C# represent more complex, dependency-heavy codebases.

Each repository is rendered testable by leveraging its native CI pipeline. An automated harness uses the open-source act runner to execute the original GitHub Actions test jobs, ensuring no manual reconfiguration or environment hacking. A semi-automated pipeline first locates repo test jobs using LLMs, then locally validates that these jobs run in a sandboxed environment, only retaining successful runs within the benchmark.
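
One way such a harness could invoke act is sketched below. It assumes act's `--job` and `--workflows` command-line options and treats a zero exit code as a passing test job. The function names, the timeout, and the absence of sandboxing are assumptions made for illustration, not the benchmark's actual harness.

```python
import subprocess


def build_act_command(job_id: str, workflow_path: str) -> list[str]:
    """Build an act invocation for a single job in a single workflow file."""
    return ["act", "--job", job_id, "--workflows", workflow_path]


def run_ci_job(repo_dir: str, job_id: str, workflow_path: str,
               timeout_s: int = 1800) -> bool:
    """Run the job inside repo_dir; True means act exited with status 0.

    Requires act and Docker to be installed locally. The timeout is an
    arbitrary illustrative value.
    """
    result = subprocess.run(
        build_act_command(job_id, workflow_path),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0
```

Under this sketch, a repository would be retained only if `run_ci_job` returns `True` with the original, unmasked build files in place.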

Dependency masking is performed by removing or blanking out dependency-declaration sections (e.g., the dependencies array under [project] in pyproject.toml, <PackageReference> elements in .csproj files). The dependency inference task is then defined as reconstructing these masked sections such that the repository passes all tests under its original CI workflow (Zhang et al., 23 Jan 2025).
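
To make the masking step concrete, here is a minimal sketch for a pyproject.toml file. It is a line-based illustration rather than the benchmark's tooling. It assumes the array is written as `dependencies = [...]`, either on one line or with the closing bracket alone on its own line, and it ignores other declaration sites such as optional dependencies.

```python
import re


def mask_dependencies(pyproject_text: str) -> str:
    """Replace the `dependencies = [...]` array with an empty array."""
    out, skipping = [], False
    for line in pyproject_text.splitlines():
        stripped = line.strip()
        if not skipping and re.match(r"dependencies\s*=\s*\[", stripped):
            out.append("dependencies = []")
            # A single-line array closes on the same line; otherwise skip
            # the following lines until the closing bracket.
            skipping = not stripped.endswith("]")
            continue
        if skipping:
            if stripped == "]":
                skipping = False
            continue
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "[project]\n"
    'name = "demo"\n'
    "dependencies = [\n"
    '  "requests[socks]>=2.0",\n'
    '  "numpy",\n'
    "]\n"
    'version = "0.1"\n'
)
masked = mask_dependencies(sample)
print(masked)
```

Scanning line by line rather than matching up to the first `]` avoids truncating at extras such as `requests[socks]`.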

2. Dependency Inference Task Formalization

Given as input a software repository $R$ and its build configuration files $\{b_i^m\}$ with dependency-declaration regions masked, the dependency inference problem is to output the set $\{b_i\}$ , where each $b_i$ restores the correct dependency definitions:

$\mathcal{F} \colon (R, \{b_i^m\}_{i=1}^k) \rightarrow \{b_i\}_{i=1}^k$

Here, each $b_i^m$ (e.g., pyproject.toml with dependencies removed) must be reconstructed so that inserting the predicted dependencies yields a fully functional build, passing all repository tests. The ground-truth is precisely the dependency list that enables successful test completion in the original environment.

This formulation captures all sources of dependency error: missing/extraneous packages, incorrect version constraints, local/internal modules, extras/features, and ecosystem consistency.
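
Read as an interface, $\mathcal{F}$ takes the repository plus the masked build files and returns one restored file per masked file. The sketch below expresses that shape with type aliases. The alias names and the trivial baseline are hypothetical and serve only to show the input/output contract.

```python
from typing import Callable, Dict

# Hypothetical aliases: file path -> file contents.
RepoFiles = Dict[str, str]
BuildFiles = Dict[str, str]

# F: (R, {b_i^m}) -> {b_i}
InferenceFn = Callable[[RepoFiles, BuildFiles], BuildFiles]


def is_well_formed(masked: BuildFiles, predicted: BuildFiles) -> bool:
    """Each masked build file b_i^m needs exactly one restored counterpart b_i."""
    return set(masked) == set(predicted)


def identity_baseline(repo: RepoFiles, masked: BuildFiles) -> BuildFiles:
    """Trivial F that adds no dependencies: well-formed, but expected to fail CI."""
    return dict(masked)


masked_files = {"pyproject.toml": "[project]\ndependencies = []\n"}
prediction = identity_baseline({"app.py": "import requests\n"}, masked_files)
print(is_well_formed(masked_files, prediction))
```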

3. Evaluation Protocol and Performance Metrics

DI-BENCH applies both textual and execution-based evaluation. The primary axes are:

Precision, Recall, F₁: These are computed per repository by comparing the set of predicted dependencies $D_\text{pred}$ to the ground-truth $D_\text{gt}$ :

$\text{Precision} = \frac{|D_\text{pred} \cap D_\text{gt}|}{|D_\text{pred}|}$

$\text{Recall} = \frac{|D_\text{pred} \cap D_\text{gt}|}{|D_\text{gt}|}$

$\text{F}_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Execution (Executability) Rate: The proportion of instances where, after inserting $D_\text{pred}$ into the masked build files, all CI tests pass successfully. This metric captures holistic correctness: any omission, version error, hallucination (“fake” dependency), or misspecification results in failure.
Fake Rate: The fraction of generated dependencies not found in either the relevant package ecosystem or the local project tree.
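
The metrics above can be sketched in a few lines. This illustration treats dependencies as plain name sets and ignores version constraints. The registry and local-module sets are stand-ins for a real ecosystem lookup, and the package names are made up.

```python
def precision_recall_f1(pred: set, gt: set) -> tuple:
    """Set-based precision, recall, and F1 for one repository."""
    overlap = len(pred & gt)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gt) if gt else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def fake_rate(pred: set, registry: set, local_modules: set) -> float:
    """Fraction of predicted names found in neither the registry nor the repo."""
    if not pred:
        return 0.0
    fake = [d for d in pred if d not in registry and d not in local_modules]
    return len(fake) / len(pred)


def execution_rate(ci_passed: list) -> float:
    """Proportion of instances whose CI tests all passed."""
    return sum(ci_passed) / len(ci_passed) if ci_passed else 0.0


# Illustrative example; "fastjsonx" stands in for a hallucinated package.
d_pred = {"requests", "numpy", "fastjsonx"}
d_gt = {"requests", "numpy", "pandas", "click"}
p, r, f1 = precision_recall_f1(d_pred, d_gt)
fr = fake_rate(d_pred, registry=d_gt, local_modules=set())
print(p, r, f1, fr, execution_rate([True, False, False, True]))
```

In this example two of three predictions are correct (precision 2/3), two of four true dependencies are recovered (recall 1/2), and one of three predictions is fake.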

Execution-based metrics are prioritized, as successful repository build and test serves as the definitive end-to-end verification (Zhang et al., 23 Jan 2025).

4. Experimental Setup and Prompting Paradigms

DI-BENCH supports several prompting and inference strategies for LLM evaluation:

All-In-One: Concatenate all source files plus masked build files into a single input prompt. The model predicts all dependency sections in one forward pass.
File-Iterate: For each source file, prompt the model for required dependencies, aggregate results, then prompt for the merged final build.
Imports-Only: Extract only import/use statements (via tree-sitter), provide these and the masked build files as model context.
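
As an illustration of the Imports-Only idea for Python sources, the sketch below collects import statements and assembles a prompt. It uses the standard-library `ast` module instead of tree-sitter to stay self-contained, and the prompt wording and function names are assumptions rather than the benchmark's actual template.

```python
import ast


def extract_import_lines(source: str) -> list[str]:
    """Return every import statement in a Python source string."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append(ast.unparse(node))  # needs Python 3.9+
    return lines


def build_imports_only_prompt(sources: dict, masked_build: dict) -> str:
    """Combine per-file import lines with the masked build files (hypothetical format)."""
    parts = ["Infer the missing dependency declarations."]
    for path, text in sources.items():
        parts.append(f"# imports in {path}")
        parts.extend(extract_import_lines(text))
    for path, text in masked_build.items():
        parts.append(f"# masked build file {path}")
        parts.append(text)
    return "\n".join(parts)


code = (
    "import os\n"
    "import numpy as np\n"
    "from flask import Flask\n"
    "\n"
    "def load():\n"
    "    import yaml\n"
    "    return yaml\n"
)
imports = extract_import_lines(code)
prompt = build_imports_only_prompt({"app.py": code}, {"pyproject.toml": "dependencies = []"})
print(imports)
```

Dropping everything except import lines shrinks the context considerably, at the cost of losing usage details that might hint at versions or optional features.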

A diverse set of models was benchmarked, including proprietary models (GPT-4o, GPT-4o-mini) and open-source models (Llama-3.1-8B-Instruct, DeepSeek-Coder-V2-Lite-Instruct, Qwen2.5-Coder-Instruct), all supporting context windows of up to 128 K tokens. The reported hardware is 4 × A100 GPUs.

Ablations include oracle substitution of model-predicted versions with ground-truth constraints (quantifying the specific impact of version errors), and hallucination filtering (removal of fake dependencies before test execution) (Zhang et al., 23 Jan 2025).
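
The two ablations can be pictured as post-processing steps on a predicted name-to-constraint mapping. The sketch below is one plausible reading, not the paper's code. The dictionary representation, function names, and package names are all assumptions.

```python
def substitute_oracle_versions(pred: dict, gt: dict) -> dict:
    """Keep the predicted package names but swap in ground-truth constraints.

    Packages the model missed are NOT added, so this isolates version errors.
    """
    return {name: gt.get(name, constraint) for name, constraint in pred.items()}


def filter_hallucinations(pred: dict, registry: set, local_modules: set) -> dict:
    """Drop predicted packages found in neither the registry nor the repo."""
    return {
        name: constraint
        for name, constraint in pred.items()
        if name in registry or name in local_modules
    }


# Illustrative data; "fastjsonx" stands in for a hallucinated package.
predicted = {"requests": ">=1.0", "numpy": "==0.1", "fastjsonx": "*"}
ground_truth = {"requests": ">=2.31", "numpy": ">=1.26", "pandas": ">=2.0"}

with_oracle = substitute_oracle_versions(predicted, ground_truth)
filtered = filter_hallucinations(predicted, set(ground_truth), set())
print(with_oracle)
print(filtered)
```

Neither step adds the missed `pandas` entry, so neither ablation can repair an omission.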

5. Core Results and Observations

The state-of-the-art GPT-4o achieves only a 42.9% execution pass rate on the regular Python subset (≤120 K tokens) under the All-In-One strategy; all models perform substantially worse on larger repositories, with frequent context-window overflows or severe performance degradation.

Selected cross-model, cross-language results (All-In-One, regular repos):

Model	Python Exec	Rust Exec	JS Exec	Precision (Py)	Recall (Py)	F₁ (Py)	Fake Rate (Py)
GPT-4o	42.9%	11.2%	43.2%	61.8%	73.6%	67.2%	2.8%
GPT-4o-mini	24.5%	4.7%	24.9%	56.5%	57.5%	57.0%	2.0%
Llama-3.1-8B	13.3%	5.0%	21.9%	28.8%	38.4%	32.9%	4.3%
DeepSeek-16B (MoE)	17.3%	8.9%	26.8%	48.0%	48.6%	48.3%	18.4%
Qwen2.5-7B	22.4%	6.5%	16.7%	55.4%	44.7%	49.5%	5.3%
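
As a sanity check on the table, the short script below recomputes each F₁ as the harmonic mean of the reported Python precision and recall. The values are copied from the rows above. That the reported F₁ was derived this way is an assumption, and the check only shows the numbers are mutually consistent to within rounding.

```python
# (precision %, recall %, reported F1 %) for Python, copied from the table.
rows = {
    "GPT-4o": (61.8, 73.6, 67.2),
    "GPT-4o-mini": (56.5, 57.5, 57.0),
    "Llama-3.1-8B": (28.8, 38.4, 32.9),
    "DeepSeek-16B (MoE)": (48.0, 48.6, 48.3),
    "Qwen2.5-7B": (55.4, 44.7, 49.5),
}


def harmonic_f1(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between recomputed and reported F1, in percentage points.
max_gap = max(abs(harmonic_f1(p, r) - f1) for p, r, f1 in rows.values())
for model, (p, r, f1) in rows.items():
    print(f"{model}: recomputed {harmonic_f1(p, r):.1f}, reported {f1}")
```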

For the large subset, the execution rate is even lower; most All-In-One runs exceed the available context window, and modular strategies (e.g., File-Iterate, Imports-Only) incur compound inference errors, with execution rates frequently below 10% for all languages except JavaScript (Zhang et al., 23 Jan 2025).

The dominant failure mode is missing dependencies, followed by erroneous or hallucinated package names and incorrect version specifications. Versioning errors are especially damaging—substituting ground-truth constraints (“oracle metadata”) boosts Python execution from 42.9% to 55.1%, and Rust from 11.2% to 38.8%. Hallucination removal provides only incremental improvement, suggesting that most failures arise from omission rather than commission.
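
As a quick worked check on the oracle-version figures (simple arithmetic on the numbers quoted above, not an additional result from the paper):

$\Delta_\text{Python} = 55.1\% - 42.9\% = 12.2 \text{ percentage points}$

$\Delta_\text{Rust} = 38.8\% - 11.2\% = 27.6 \text{ percentage points}$

In relative terms, the Rust pass rate more than triples ($38.8 / 11.2 \approx 3.46$), while the Python pass rate rises by a factor of about $55.1 / 42.9 \approx 1.28$.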

All models display negative scalability: as repository size (tokens, files, dependencies) grows, execution pass rates drop monotonically, exposing current LLMs’ limitations in long-context reasoning and holistic codebase understanding.

6. Challenges Identified and Prospective Solutions

DI-BENCH reveals critical gaps in current LLMs for end-to-end software synthesis:

Long-context fidelity: All-In-One methods fail outright on projects with token counts near current context limits (≥120K tokens), while modularized splitting (File-Iterate or Imports-Only) fails to preserve global dependency integrity.
Metadata and version constraint reasoning: Precise versions, extras, features, and inter-package conditions remain a severe bottleneck.
Hallucination and ecosystem grounding: Even low fake rates (2–6%) materially impact real-world CI performance; models lack robust external validation mechanisms.
Open-source vs. proprietary performance disparity: Proprietary models (e.g., GPT-4o) continue to outperform open-source LLMs by large margins, particularly on execution-critical tasks.

Suggested improvement avenues include hybrid pipelines that combine static analysis (e.g., AST import resolution) with LLM completion, dynamic retrieval of registry metadata, retrieval-augmented and multi-agent inference procedures for iterative constraint refinement, and continued scaling of LLM context capability (Zhang et al., 23 Jan 2025).

7. Significance and Impact

DI-BENCH is the first benchmark to isolate dependency inference as a stand-alone, execution-verified capability in LLM-based code generation. By combining a large, multi-language corpus with stringent, reproducible CI-based assessment, DI-BENCH sets a clear standard against which future model improvements can be measured. The observed maximum execution pass rate of 42.9% (GPT-4o on the regular Python subset) underscores that reliable automated software construction remains unsolved. Given the centrality of dependency inference to real-world repository usability, DI-BENCH positions this problem as a core blocking challenge for robust, end-to-end generative software engineering (Zhang et al., 23 Jan 2025).


References

Zhang et al. (23 Jan 2025). DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale.
