
DSI-Bench: Multi-Domain Benchmarking

Updated 12 February 2026
  • DSI-Bench is an umbrella term for three distinct benchmark frameworks, targeting, respectively, software dependency inference, dynamic spatial reasoning, and distributed systems performance testing.
  • The dependency inference track focuses on reconstructing masked manifest files from codebases to ensure CI executability, highlighting challenges in precision and contextual limitations.
  • The dynamic spatial reasoning and distributed systems tracks evaluate multi-agent motion understanding in vision-language models and automate performance testing on cloud/CI infrastructures respectively.

DSI-Bench is an overloaded term within computational research and engineering, referencing three distinct, high-impact benchmarking frameworks across software dependency inference, dynamic spatial reasoning, and automated distributed systems testing. Each instantiation targets a specific research bottleneck by curating rigorous datasets, controlled evaluation protocols, and reproducible analysis workflows.

1. Software Dependency Inference and CI-Verified Repository Synthesis

In the context of LLM evaluation, DSI-Bench (also appearing in the literature as “DI-Bench”) refers to a benchmark for dependency inference—the task of inferring internal and external library requirements from source code, a prerequisite for end-to-end repository synthesis. The framework was introduced by Xia et al. and comprises 581 GitHub repositories spanning Python, Rust, C#, and JavaScript, each validated for out-of-the-box executability via custom GitHub Actions workflows (Zhang et al., 23 Jan 2025).

The dependency inference challenge is formalized as reconstructing masked manifest files (e.g., pyproject.toml, Cargo.toml, .csproj, package.json) given the codebase $R$ and masked build files $\{b_1^m, \ldots, b_k^m\}$ with all dependency declarations removed:

\mathcal{F}:\bigl(R,\{b_1^m,\ldots,b_k^m\}\bigr)\;\longrightarrow\;\{b_1,\ldots,b_k\}

where the goal is that, upon restoration, $R$ passes all original CI jobs. The curation pipeline is fully automated: repositories are filtered for stars, size, and extant workflows, then executed under a local GitHub Actions runner (Act), and only those passing CI checks are retained. All dependency declarations are masked in both manifest and lock files, creating a realistic inference task for LLMs.
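As a simplified sketch of the masking step (the function name and regex are illustrative, not DI-Bench's actual tooling), the dependency array of a PEP 621 manifest can be blanked while the rest of the file is preserved:

```python
import re

def mask_manifest(manifest_text: str) -> str:
    """Replace the contents of a PEP 621 `dependencies` array with an
    empty list, mimicking (in simplified form) how the benchmark masks
    dependency declarations before asking a model to restore them."""
    return re.sub(
        r"dependencies\s*=\s*\[[^\]]*\]",
        "dependencies = []",
        manifest_text,
    )

original = """\
[project]
name = "demo"
dependencies = [
    "requests>=2.31",
    "numpy<2.0",
]
"""
masked = mask_manifest(original)
```

A real pipeline must also handle lock files and language-specific manifest formats (Cargo.toml, .csproj, package.json), but the principle is the same: strip the declarations, keep the surrounding structure intact.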

The primary motivation is rooted in the observation that dependency errors cause over 40% of runtime failures in LLM-generated repositories, significantly impeding downstream testing and deployment.

2. Benchmark Design, Metrics, and Model Evaluation Protocols

DSI-Bench evaluates dependency inference via dual textual and execution-based scoring:

  • Textual Accuracy: Let $G$ be the ground-truth dependency set and $P$ the model's predicted set. Then

\mathrm{Precision} = \frac{|G\cap P|}{|P|}, \quad \mathrm{Recall} = \frac{|G\cap P|}{|G|}, \quad F_1 = 2\cdot\frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}

  • Execution Pass Rate (EPR): The fraction of repositories that build and pass CI with model-generated manifests:

\mathrm{EPR} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\{\text{repo}_i \text{ builds and tests pass}\}

  • Fake Rate: Proportion of predicted dependencies absent from all registries, measuring hallucination.
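The textual metrics and fake rate can be computed directly from sets of package names; the sketch below uses illustrative function and parameter names (not the benchmark's API) and treats the package registry as a simple set:

```python
def dependency_metrics(ground_truth: set[str], predicted: set[str],
                       registry: set[str]) -> dict[str, float]:
    """Precision/recall/F1 over dependency names, plus the fraction of
    predictions absent from the registry (hallucination / "fake" rate)."""
    tp = len(ground_truth & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fake = len(predicted - registry) / len(predicted) if predicted else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "fake_rate": fake}

m = dependency_metrics(
    ground_truth={"requests", "numpy", "pandas"},
    predicted={"requests", "numpy", "leftpadx"},  # "leftpadx" is fabricated
    registry={"requests", "numpy", "pandas", "flask"},
)
```

Note that EPR cannot be computed from name sets alone: it requires actually restoring the manifests and running CI, which is why textual and execution scores diverge.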

Prompting strategies include All-In-One (entire codebase as context), File-Iterate (per-file reasoning followed by merge), and Imports-Only (parsing import/use statements). On the regular-sized subset, GPT-4o with the All-In-One strategy attains 42.9% EPR in Python (Precision = 61.8%, Recall = 73.6%, Fake = 2.8%), with similar EPR in JavaScript. Compiled languages (Rust/C#) yield much lower EPR (11.2–13.5%) despite higher textual precision, indicating manifest schema complexity and stringent version constraints as primary barriers.
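For Python, the signal behind the Imports-Only strategy can be extracted with the standard `ast` module. This sketch collects top-level module names only; real tooling must additionally map import names to distribution names (e.g., `cv2` → `opencv-python`), which is one source of the hallucination errors measured above:

```python
import ast

def top_level_imports(source: str) -> set[str]:
    """Collect top-level module names from import statements in a
    Python source file (simplified Imports-Only signal)."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # skip relative imports (level > 0): they are internal modules
            modules.add(node.module.split(".")[0])
    return modules

code = (
    "import numpy as np\n"
    "from requests.adapters import HTTPAdapter\n"
    "import os\n"
)
```

Here `top_level_imports(code)` yields `{"numpy", "requests", "os"}`; filtering out standard-library names like `os` is a further step the strategy requires.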

Textual metrics systematically overestimate success: in Python, GPT-4o's 73.6% recall translates to only 42.9% actual executability, largely due to errors in version bounds or missing transitive dependencies. “Oracle” version constraints dramatically increase EPR: +28% (Python) and +246% (Rust). For large repositories (>120 K tokens), context-length limitations cause a >50% drop in execution success, underscoring the difficulty of maintaining a global dependency graph.

3. Dynamic Spatial Intelligence Benchmark for Vision-LLMs

A separate, independently developed DSI-Bench evaluates vision-language models and spatial-expertise models on reasoning about dynamic spatial relationships in which the observer (camera/agent) and objects move simultaneously (Zhang et al., 21 Oct 2025). The dataset consists of 943 dynamic video clips and more than 1,700 manually annotated QA pairs, covering nine decoupled observer–object motion patterns (combinations of translation, rotation, or both).

Spatial and temporal symmetry is enforced by augmenting each clip into four variants (original, horizontal flip, time reversal, and both combined), eliminating left/right and forward/backward semantic biases. The taxonomy spans six question types covering object–scene and observer–scene relations, distance changes, and orientation changes, in both static and dynamic scenes.
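Assuming clips are stored as arrays of shape (frames, height, width, channels), the four-variant augmentation reduces to axis reversals. This is a sketch of the idea, not the benchmark's actual pipeline:

```python
import numpy as np

def symmetry_variants(clip: np.ndarray) -> dict[str, np.ndarray]:
    """Produce the four symmetry variants of a video clip with shape
    (frames, height, width, channels)."""
    return {
        "original": clip,
        "hflip": clip[:, :, ::-1, :],            # mirror left/right per frame
        "treverse": clip[::-1],                  # reverse frame order
        "hflip_treverse": clip[::-1, :, ::-1, :],  # both combined
    }

# Tiny synthetic clip: 2 frames, 2x3 pixels, 1 channel.
clip = np.arange(2 * 2 * 3 * 1).reshape(2, 2, 3, 1)
variants = symmetry_variants(clip)
```

Because the flips are pure index reversals, answer labels (left/right, forward/backward) can be remapped deterministically for each variant, which is what cancels the semantic biases described above.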

The benchmark exposes limitations in state-of-the-art VLMs and expertise models: models often conflate observer and object motion, display forward motion biases (choosing “forward” 60–70% of the time when correct only ~30%), and fail to robustly track relative distance and orientation in dynamic frames. Sample-wise accuracy for VLMs is 35–47% (vs. 25% random); expertise models like SpatialTrackerV2 achieve ~42%. Free-form reasoning strategies (stepwise chain of thought) yield at most marginal gains, as visual perception errors dominate.

4. Distributed Systems Infrastructure Benchmarking in Database CI

In a third context, DSI-Bench denotes the Distributed Systems Infrastructure (DSI) for automated system performance benchmarking at MongoDB (Ingo et al., 2020). DSI operationalizes end-to-end performance testing in CI environments by automating:

  1. Bootstrap
  2. Infrastructure provisioning (via YAML-to-Terraform translation)
  3. System and workload setup
  4. MongoDB cluster deployment (parameterized via YAML)
  5. Test control (YAML-driven, no CLI flags)
  6. Metrics analysis
  7. Infrastructure teardown
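A hypothetical sketch of that stage ordering (stage names mirror the list above; the runner, config keys, and signatures are illustrative, not DSI's actual API). Teardown belongs in a `finally` block so cloud resources are reclaimed even if an earlier stage fails:

```python
def run_dsi(config: dict) -> list[str]:
    """Run the DSI stages in order; `config` stands in for the parsed
    YAML experiment configuration. Each stage here only records its
    name -- a real stage would provision, deploy, or measure."""
    executed: list[str] = []
    stages = [
        "bootstrap",
        "provision_infrastructure",  # YAML -> Terraform in real DSI
        "setup_workload",
        "deploy_cluster",
        "run_tests",
        "analyze_metrics",
    ]
    try:
        for stage in stages:
            executed.append(stage)
    finally:
        executed.append("teardown")  # always reclaim infrastructure
    return executed

order = run_dsi({"cluster": "replica_set", "workload": "ycsb"})
```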

The framework supports >200 benchmarks × 20 MongoDB configurations in daily CI, using YAML for all experiment configuration, guaranteeing reproducibility. DSI integrates best practices for public cloud benchmarking (e.g., CPU isolation, cache clearing) and employs an external statistical change-point detection pipeline for alerting on significant performance regressions or improvements.

Metrics captured include MongoDB FTDC (1000+ server metrics), system-level statistics (CPU, disk, network, memory), and detailed YCSB/Linkbench percentiles. Data are archived, visualized in Evergreen CI, and used for automatic alerting (Jira integration). DSI was open-sourced under Apache 2.0, with modular documented configuration and turn-key cloud/on-prem support.

5. Comparative Analysis and Ongoing Challenges

The three DSI-Bench frameworks each address a fundamental challenge in their domain:

| Context | Key Task | Principal Obstacles |
| --- | --- | --- |
| Dependency inference (software repos) | Manifest completion, CI-verified execution | Long-context reasoning, metadata precision, hallucination, schema complexity |
| Dynamic spatial intelligence (VLMs) | Multi-agent motion VQA | Observer–object decoupling, spatio-temporal memory, semantic bias |
| Distributed systems infrastructure (DB CI) | Automated performance benchmarking | Cloud noise, reproducible infrastructure, real-time diagnostics |

This suggests that DSI-Bench, though a shared name rather than a single framework, marks an archetype: a turn-key, scalable, adversarially designed benchmark that exposes systemic deficiencies in automated reasoning systems.

6. Future Directions and Open Problems

Future improvements across DSI-Bench instantiations are centered on:

  • Retrieval-augmented prompting (dependency inference) to resolve version and registry accuracy;
  • Manifest schema and multi-task training for LLMs, and dynamic analysis feedback;
  • Integration of 3D-geometric modules, spatio-temporal attention, and geometric consistency losses for dynamic spatial reasoning;
  • Support for extended motion patterns, multi-object interactions, and longer dynamic horizons;
  • In distributed systems, further cloud/on-prem abstraction, automated tuning, and more granular real-time alerting.

Each track underscores the need for benchmarks that are (i) large-scale, (ii) adversarially precise, and (iii) systematically updatable, enabling both model-centric and system-centric progress (Zhang et al., 23 Jan 2025, Zhang et al., 21 Oct 2025, Ingo et al., 2020).
