AInsteinBench: Scientific Software Benchmark
- AInsteinBench is a large-scale benchmark designed to assess LLM agents' ability to perform authentic, end-to-end scientific software development tasks.
- It evaluates feature implementation and bug-fixing tasks while ensuring the preservation of physical invariants and numerical accuracy in high-performance computing environments.
- Developed from high-impact open-source codebases and rigorous test methodologies, it highlights LLM limitations in synthesizing cross-file scientific logic.
AInsteinBench is a large-scale benchmark designed to rigorously evaluate LLMs acting as scientific computing development agents in authentic, production-grade scientific software environments. Unlike traditional scientific reasoning benchmarks, which focus on conceptual knowledge, or standard software engineering benchmarks, which emphasize routine bug fixing and feature implementation, AInsteinBench assesses the capacity of LLM agents to undertake end-to-end scientific software development: implementing new scientific modules, diagnosing and repairing numerical or physical defects, and preserving scientific and numerical correctness within heterogeneous, performance-critical research codebases (Duston et al., 24 Dec 2025).
1. Motivation and Context
Contemporary scientific research increasingly relies on large, multidisciplinary, and performance-sensitive codebases—including, for example, density-functional theory packages, quantum-chemistry toolkits, and fluid dynamics solvers. Existing scientific reasoning benchmarks, such as GPQA, SciEval, and TPBench, predominantly target conceptual understanding and do not involve interaction with real-world repositories or demand preservation of physical and numerical invariants. Conversely, benchmarks in software engineering, such as SWE-Bench and Multi-SWE-Bench, primarily address generic bug fixes, often within monolingual and application-agnostic code, and typically underrepresent challenges crucial for scientific computing, such as multi-language stack integration, high-performance computing (HPC) patterns, numerical stability, and adherence to domain-specific conventions.
AInsteinBench directly addresses this gap by establishing a testbed in which LLMs must function as credible agents under the demands of actual scientific computing development, spanning the implementation of physically meaningful features and the resolution of subtle numerical failures in authentic scientific software repositories (Duston et al., 24 Dec 2025).
2. Benchmark Construction and Repository Selection
AInsteinBench is constructed through rigorous selection and analysis of six high-impact open-source scientific codebases, chosen for their disciplinary breadth, active maintenance, and robust test coverage. The selected projects are:
- PySCF (quantum chemistry)
- Qiskit (quantum computing)
- OpenMM (molecular dynamics)
- Einstein Toolkit (numerical relativity)
- AMReX (adaptive mesh refinement, fluid dynamics)
- RDKit (cheminformatics)
Repositories are selected according to criteria including community impact, technological diversity (including mixed Python/C/C++/Fortran stacks), evidence of active development, and the presence of self-contained build and test workflows.
Task Sourcing and Curation
Tasks are sourced by:
- Crawling maintainer-authored pull requests via the GitHub API.
- Extracting before/after code states, corresponding test updates, and detailed PR descriptions.
- Filtering PRs to include only those addressing substantive scientific or numerical issues, with corresponding, merged test updates that create a reproducible fail-to-pass transition.
- Reconstructing the precise software environment using Docker, replicating dependency stacks at specific commits, and verifying reproducibility.
- Filtering test suites to correct for under- and over-coverage, thereby refining ground-truth correctness signals.
- Supplementing with automatically synthesized tasks (notably, for advanced Einstein Toolkit modules) that require implementing features validated via simulation-driven, physical-behavior-based tests rather than strict code matching.
- Involving domain experts for final vetting, task clarity, test audit, and assignment of empirical difficulty tiers.
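The crawling-and-filtering steps above can be sketched as follows. The GitHub REST endpoint is real, but the keyword heuristic and the `is_candidate` helper are illustrative assumptions, not the benchmark's actual filter:

```python
# Sketch of PR sourcing: list closed PRs via the GitHub REST API and keep
# merged ones that plausibly touch scientific/numerical logic AND update
# tests (the fail-to-pass signal). The keyword list is an assumption.
import json
import urllib.request

API = "https://api.github.com/repos/{owner}/{repo}/pulls?state=closed&per_page=100&page={page}"

SCIENCE_HINTS = ("conservation", "numerical", "convergence", "tolerance",
                 "sign error", "units", "boundary condition")

def fetch_closed_prs(owner, repo, page=1):
    """Fetch one page of closed pull requests for a repository."""
    with urllib.request.urlopen(API.format(owner=owner, repo=repo, page=page)) as resp:
        return json.load(resp)

def is_candidate(pr, changed_files):
    """Keep merged PRs whose description hints at a scientific issue
    and whose diff updates at least one test file."""
    if not pr.get("merged_at"):
        return False
    text = ((pr.get("title") or "") + " " + (pr.get("body") or "")).lower()
    touches_science = any(hint in text for hint in SCIENCE_HINTS)
    updates_tests = any("test" in path.lower() for path in changed_files)
    return touches_science and updates_tests
```

Candidates surviving this coarse filter would then go through the environment reconstruction and expert vetting steps described above.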
3. Task Taxonomy and Difficulty Calibration
AInsteinBench distinguishes two principal task types:
- Feature-implementation tasks, which require the addition of new scientific modules (such as PDE source terms or new APIs).
- Bug-fixing tasks, demanding resolution of errors that compromise physical or numerical correctness (e.g., violations of conservation laws, incorrect invariants, or improper domain conventions).
Task difficulty is quantified along two axes (each scored 1–5):
- Scientific Depth: The degree of domain-specific expertise required, ranging from trivial to advanced research-level understanding.
- Engineering Complexity: The implementation challenge, from simple tweaks to multi-file, cross-API changes.
The composite difficulty score for each task combines the two axis scores, and tasks are binned into three difficulty tiers (Easy, Medium, and Hard) according to ranges of this composite score.
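One way such two-axis scoring could be operationalized is sketched below; the averaging rule and tier cutoffs here are illustrative assumptions, not the benchmark's published formula or thresholds:

```python
# Hypothetical composite-difficulty scoring: the mean of the two 1-5 axis
# scores and the tier cutoffs below are illustrative assumptions, not
# AInsteinBench's published values.
def composite_difficulty(scientific_depth: int, engineering_complexity: int) -> float:
    """Combine the two 1-5 axis scores into a single composite score."""
    for score in (scientific_depth, engineering_complexity):
        if not 1 <= score <= 5:
            raise ValueError("axis scores must lie in 1..5")
    return (scientific_depth + engineering_complexity) / 2

def tier(score: float) -> str:
    """Bin a composite score into a difficulty tier (illustrative cutoffs)."""
    if score <= 2.0:
        return "Easy"
    if score <= 3.5:
        return "Medium"
    return "Hard"
```

Whatever the exact rule, the key property is monotonicity: raising either axis score never lowers the assigned tier.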
Test coverage is meticulously validated. Under-coverage (false positives) prompts test strengthening or regression-case addition; over-coverage (false negatives) leads to the removal of prescriptive checks or loosening of tolerances.
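The over-coverage fix, replacing a prescriptive exact check with a tolerance-based one, can be illustrated with a minimal example; the computation and tolerance here are hypothetical:

```python
import math

# An over-covering test asserts exact equality on a floating-point result,
# so a physically correct patch can fail it on benign rounding differences.
# Loosening the check to a relative tolerance keeps the scientific signal
# without prescribing one bit-exact value. The computation is a stand-in.
def overlap_integral():
    """Hypothetical stand-in for a scientific computation."""
    return sum(0.1 for _ in range(10))  # accumulates rounding error

result = overlap_integral()

# Prescriptive check (over-coverage): fails although the result is correct.
exact_ok = (result == 1.0)

# Tolerance-based check: accepts any value within a relative tolerance.
tolerant_ok = math.isclose(result, 1.0, rel_tol=1e-9)
```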
4. Evaluation Methodology
AInsteinBench evaluates agent submissions within isolated, containerized environments mirroring the exact build chain, runtime dependencies (including MPI/GPU where feasible), and repository state. Agents interface with the environment through a restricted shell, granting capabilities to view, edit files, execute test suites, and commit changes.
Agent Protocol
A plan–act loop (CodeAct framework) orchestrates agent behavior:
- Inspect code and test results for failure signals.
- Synthesize edit proposals with context awareness.
- Iterate until success (all local tests pass) or exhaustion of a step/iteration budget.
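A skeletal version of such a plan–act loop is sketched below; the environment interface (`run_tests`, `apply_edit`) is a simplification of what CodeAct-style agents actually expose, and all names are illustrative:

```python
# Minimal plan-act loop skeleton. `env` is any object exposing
# run_tests() -> (passed, report) and apply_edit(edit); `propose_edit`
# maps a failure report to a candidate edit.
def plan_act_loop(env, propose_edit, budget=60):
    """Iterate inspect -> propose -> apply until tests pass or the budget runs out."""
    for step in range(1, budget + 1):
        passed, report = env.run_tests()        # inspect failure signals
        if passed:
            return {"solved": True, "steps": step - 1}
        env.apply_edit(propose_edit(report))    # context-aware edit proposal
    passed, _ = env.run_tests()
    return {"solved": passed, "steps": budget}

class FakeEnv:
    """Toy environment whose tests pass after a fixed number of edits."""
    def __init__(self, edits_needed):
        self.edits, self.needed = 0, edits_needed
    def run_tests(self):
        return (self.edits >= self.needed, f"{self.needed - self.edits} failures")
    def apply_edit(self, edit):
        self.edits += 1
```

Running `plan_act_loop(FakeEnv(3), lambda report: "patch", budget=10)` terminates after three edits, mirroring how real episodes end as soon as the local suite passes.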
Core Evaluation Metrics
Formally defined metrics include:
- Solve Rate (SR): the fraction of tasks for which the agent's final patch makes all associated fail-to-pass tests pass, SR = N_solved / N_total.
- Localization Accuracy (LocAcc): the fraction of tasks in which the files edited by the agent cover the required files, where "required files" originate from the gold PR patch.
- Efficiency (E): the number of plan–act iterations a run consumes relative to its budget.
- (Optional) Precision/recall on file localization: precision is the fraction of edited files that are required; recall is the fraction of required files that are edited.
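These metrics reduce to simple set arithmetic over per-task result records; a sketch follows, in which the record field names (`solved`, `edited`, `required`) are illustrative:

```python
# Sketch of the core metrics over per-task result records. Each record
# holds: solved (bool), edited (set of file paths touched by the agent),
# and required (set of files from the gold PR patch). Names are illustrative.
def solve_rate(records):
    """Fraction of tasks whose final patch passed all fail-to-pass tests."""
    return sum(r["solved"] for r in records) / len(records)

def localization_accuracy(records):
    """Fraction of tasks whose edits cover every required file."""
    hits = sum(r["required"] <= r["edited"] for r in records)
    return hits / len(records)

def file_precision_recall(record):
    """Per-task precision/recall of edited files against required files."""
    overlap = len(record["edited"] & record["required"])
    precision = overlap / len(record["edited"]) if record["edited"] else 0.0
    recall = overlap / len(record["required"]) if record["required"] else 1.0
    return precision, recall
```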
5. Experimental Results and Error Analysis
The benchmark includes a comprehensive evaluation of both proprietary (e.g., GPT-4o, GPT-5.1/5-high, Claude 4.5, Grok-4, Gemini 2.5/3, Doubao 1.8) and open-source (Kimi K2, GLM 4.6, DeepSeek V3) coding agents. Among these, Claude 4.5 Opus demonstrates the highest overall solve rate at 44.0%; GPT-5.x models achieve 32–35%, while GPT-4o attains only 7.0%. Domain variation is nontrivial: quantum-chemistry (PySCF) and numerical-relativity (Einstein Toolkit) tasks prove most tractable, whereas cheminformatics (RDKit) and parameter-rich AMReX tasks exhibit the lowest agent solve rates.
Difficulty-stratified performance indicates:
- Easy: ~50–60%
- Medium: ~30–40%
- Hard: ~10–30%
Frontier LLM agents show localization accuracy exceeding 70%, suggesting that file search/navigation is not the principal bottleneck. Rather, the synthesis and propagation of consistent scientific logic across multiple files—particularly in AMReX (which has a median of 5 required files per task)—remains a primary unsolved challenge.
Agent efficiency analysis shows successful solutions clustering within 20–60 plan–act iterations; extended episode length beyond this window yields diminishing returns, with most additional steps corresponding to nonproductive effort.
Scientific Failure Modes
Qualitative error analysis (see Table below) reveals distinct scientific failure patterns not commonly captured by conventional software benchmarks:
| Failure Mode | Description |
|---|---|
| Mathematics | Plausible code missing analytic subtleties (e.g., omitted limit terms, wrong sign) |
| Conservation Laws | Fixes violating strict physical invariants (particle/energy non-conservation) |
| Spatial Reasoning | Inadequate reasoning about 3D/global code structure (e.g., box padding errors) |
| Scientific Convention | Neglect of domain-specific conventions (e.g., phase treatment in quantum tasks) |
| Scientific Knowledge | Superficial test-passing solutions that discard critical domain constraints |
Illustrative tasks include: the neutralizing plasma term fix in OpenMM, Solovay–Kitaev global-phase repair for Qiskit, Misner wormhole metric implementation in Einstein Toolkit, PySCF Roothaan smearing correction, and order-independent tautomer hashing in RDKit.
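Conservation-law violations of this kind are exactly what simulation-driven, physical-behavior-based tests can catch. A toy regression check in that style is sketched below; the harmonic-oscillator setup is illustrative and not drawn from the benchmark:

```python
# Toy behavior-based test: a symplectic (leapfrog) integrator of a unit
# harmonic oscillator must keep total energy bounded near its initial
# value. A naive patch that breaks the update order would silently
# introduce energy drift, which this check detects.
def leapfrog_energy_drift(steps=1000, dt=0.01):
    """Relative energy drift after integrating x'' = -x with leapfrog."""
    x, v = 1.0, 0.0                      # initial state, E0 = 0.5
    e0 = 0.5 * v * v + 0.5 * x * x
    for _ in range(steps):
        v += 0.5 * dt * (-x)             # half kick
        x += dt * v                      # drift
        v += 0.5 * dt * (-x)             # half kick
    e = 0.5 * v * v + 0.5 * x * x
    return abs(e - e0) / e0
```

Asserting that this drift stays below a small tolerance tests the physical invariant itself rather than prescribing any particular implementation, which is the style of oracle the synthesized Einstein Toolkit tasks rely on.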
6. Contributions, Limitations, and Future Directions
AInsteinBench establishes scientific computing development as an evaluation paradigm distinct from generic software engineering, emphasizing correctness as encoded by physical laws, numerical stability, and adherence to scientific conventions. The benchmark comprises 244 instances, each grounded in maintainer PRs or expert-synthesized tasks, spanning a representative cross-section of scientific domains with verified, test-driven scoring and calibrated difficulty levels.
Key experimental findings expose current LLM limitations in cross-file scientific logic synthesis, geometric and spatial inference, and strict adherence to physical invariants. Notable limitations include persistent imperfections in test oracles (false positives/negatives unresolved by auditing), overrepresentation of well-tested repositories, exclusion of GPU-intensive or monolithic scientific codes, and computational constraints restricting task complexity to CPU-viable test cases.
Future research directions include adapting the benchmark framework for heavier HPC environments, enabling long-running simulations, integrating richer multi-agent or tool-coupled workflows, automating the generation of more rigorous scientific oracles, and establishing a community-facing public leaderboard with continuous integration for real-time tracking of domain-aware coding agent progress (Duston et al., 24 Dec 2025).