FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

Published 3 Apr 2026 in physics.chem-ph and physics.comp-ph | (2604.03460v2)

Abstract: Artificial-intelligence (AI) agent frameworks have been developed for autonomous scientific simulations, but most current agent frameworks are tailored to a single or a small set of software packages. Herein, FermiLink, a unified and extensible open-source agent framework is introduced for multidomain scientific simulations. Its key design principle is the separation of package knowledge bases from simulation workflows, so that simulation workflows in FermiLink, from figure-level simulations to full-paper-level research on high-performance computing clusters, operate uniformly among supported packages via a four-layer progressive disclosure mechanism. Using OpenAI Codex as the agent provider, the capabilities of FermiLink are demonstrated across approximately 50 scientific software packages spanning nine research domains from physics to engineering. Systematic benchmarks on 132 real-world figure-level reproduction tasks with 44 packages show that FermiLink reproduces 74 (56.1%) of published figures with simulations, among which 30 achieve high-fidelity agreement and 35 reach qualitative agreement with the target figures. A smaller set of human expert-guided reproduction benchmarks with 10 packages further highlights the importance of expert insights for improving the simulation fidelity. Beyond reproduction, a single-blinded study demonstrates that FermiLink can produce research-grade results on unpublished polariton physics problems when provided with sufficiently detailed research objectives and source code, even in the absence of external documentation or tutorials. Overall, FermiLink provides a scalable research infrastructure that may accelerate the path from scientific questions to computational results across diverse domains.


Authors (10)

Summary

The paper introduces a unified, source-grounded agent framework that separates package knowledge bases from simulation workflows, so that the same workflows run uniformly across supported packages.
It demonstrates three workflow modes (exec, loop, and research/reproduce) on 132 figure-level reproduction tasks across nine research domains; 74 tasks (56.1%) were reproduced with new simulations, 30 of them with high-fidelity agreement.
Memory-aware workflows and an expert-guided reproduction benchmark show that the framework can sustain long simulations on HPC clusters, and that expert input improves fidelity and reduces computational cost.

FermiLink: A Unified, Source-Grounded Agent Framework for Multidomain Scientific Simulations

Framework Architecture and Design Principles

FermiLink is a modular, extensible agent framework for interoperable, multidomain scientific simulations. Prior agent systems often entangle package-specific logic with workflow orchestration, which restricts them to a narrow set of applications. FermiLink instead enforces a strict separation between package knowledge bases and simulation workflows. This abstraction permits uniform interaction with a diverse ecosystem of computational software while keeping workflows source-grounded and reproducible.

Each package knowledge base consists of the package's entire source code tree plus a compressed agent-skills layer, which supports efficient information retrieval and granular, progressive disclosure to coding agents. The four-layer disclosure pipeline selectively exposes source files, tutorials, and full simulation pipelines. FermiLink loads the relevant package context dynamically at runtime, which helps both agent reasoning and computational resource allocation on HPC clusters and workstations. Three workflow modes (exec, loop, and research/reproduce) cover short-run, iterative, and multi-task simulations, respectively, with job management through SLURM or process IDs (PIDs).
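As a rough illustration of progressive disclosure, the sketch below deepens the context handed to an agent one layer at a time until a size budget is reached. It is not FermiLink's implementation: the layer names in `LAYERS`, the `PackageKnowledgeBase` class, and the character budget are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field

# Hypothetical layer names, ordered from cheapest to most detailed.
# The source states that there are four layers but does not name them.
LAYERS = ("skills_index", "skill_notes", "tutorials", "source_tree")


@dataclass
class PackageKnowledgeBase:
    """Toy stand-in for a package knowledge base (assumed structure)."""
    name: str
    # Maps a layer name to the text snippets available at that layer.
    content: dict = field(default_factory=dict)

    def disclose(self, depth: int) -> list:
        """Return the snippets of the first `depth` layers only."""
        if not 0 <= depth <= len(LAYERS):
            raise ValueError(f"depth must be between 0 and {len(LAYERS)}")
        snippets = []
        for layer in LAYERS[:depth]:
            snippets.extend(self.content.get(layer, []))
        return snippets


def load_context(kb: PackageKnowledgeBase, max_chars: int) -> list:
    """Deepen disclosure layer by layer while the context fits the budget."""
    context = []
    for depth in range(1, len(LAYERS) + 1):
        candidate = kb.disclose(depth)
        if sum(len(s) for s in candidate) > max_chars:
            break  # the next layer would not fit, keep the shallower context
        context = candidate
    return context
```

With a small budget the agent sees only the compact skill layers; a larger budget lets the same call reach tutorials and, finally, source files.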

Figure 1: Schematic depicts dynamic package knowledge base loading and unified workflow orchestration for exec, loop, and research/reproduce modes.

Benchmarking Multidomain Reproducibility

FermiLink's capability was systematically assessed using 132 figure-level reproduction tasks across 44 computational packages spanning nine scientific domains, including physics, chemistry, materials science, and engineering. Uniform prompts instructed autonomous reproduction of published figures by installing packages, extracting parameters from manuscripts and supplementary material, performing simulations, and post-processing outputs.

Results show that 74 of the 132 tasks (56.1%) were reproduced through agent-driven simulations. Of these, 30 reached high-fidelity quantitative agreement (22.7% of all tasks) and 35 reached qualitative agreement (47.3% of the simulated reproductions, or 26.5% of all tasks). A further 33.3% of tasks were resolved by replotting released data rather than running new simulations, and only 10.6% were blocked, predominantly due to insufficient supplementary data. Notably, chemistry and quantum science tasks exhibited the highest reproducibility rates, and agent workflows scaled to long-duration tasks (from minutes to more than 24 hours) in HPC environments.
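The reported percentages can be checked against the counts given in the abstract (132 tasks, 74 simulated reproductions, 30 high-fidelity, 35 qualitative). The short sketch below makes the denominator of each share explicit; the helper name `percent` is ours, not part of FermiLink.

```python
# Counts taken from the abstract of the paper.
TOTAL_TASKS = 132
SIMULATED = 74
HIGH_FIDELITY = 30
QUALITATIVE = 35


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole` as a percentage, rounded to one decimal."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 1)


shares = {
    "simulated / all tasks": percent(SIMULATED, TOTAL_TASKS),
    "high-fidelity / all tasks": percent(HIGH_FIDELITY, TOTAL_TASKS),
    "qualitative / simulated": percent(QUALITATIVE, SIMULATED),
    "qualitative / all tasks": percent(QUALITATIVE, TOTAL_TASKS),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

The 22.7% figure is a share of all 132 tasks, whereas 47.3% is a share of the 74 simulated reproductions; expressed against all tasks, the qualitative share is 26.5%.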

Figure 2: Outcome distribution and runtime statistics for 132 figure-level reproduction tasks, highlighting reproducibility, fidelity, and domain coverage.

Behavioral analysis indicated that, lacking simulation inputs, agents defaulted to visual reproduction (copying), underscoring the importance of process-level verification. This shortcut-seeking propensity necessitates rigorous input and pipeline validation when deploying agent frameworks for scientific reproducibility.
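One way to picture process-level verification is a provenance check that separates figures drawn from new simulation outputs from figures replotted from released data. The sketch below is a minimal illustration under our own assumptions: the `RunRecord` fields and the three outcome labels are hypothetical and do not describe how FermiLink or the benchmark graders record runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical provenance record for one reproduction task."""
    input_files: tuple         # simulation inputs written by the agent
    jobs_completed: int        # number of simulation jobs that finished
    simulation_outputs: tuple  # data files written by those jobs
    plotted_sources: tuple     # data files the final figure was drawn from


def classify(record: RunRecord) -> str:
    """Label a task as 'simulated', 'replotted', or 'blocked'."""
    if not record.plotted_sources:
        return "blocked"  # no figure was produced at all
    ran_simulation = bool(record.input_files) and record.jobs_completed > 0
    drawn_from_runs = set(record.plotted_sources) <= set(record.simulation_outputs)
    if ran_simulation and drawn_from_runs:
        return "simulated"
    # A figure exists, but at least part of it does not trace back to a run.
    return "replotted"
```

The check inspects how the figure was produced rather than how it looks, which is the distinction the benchmark draws between simulated and replotted outcomes.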

Expert-Guided Reproduction and Paper-Level Workflows

A secondary benchmark, involving iterative interaction with domain experts, demonstrated enhanced reproduction fidelity for complex packages (e.g., QuTiP, CP2K). Expert guidance was critical for resolving implicit parameter discrepancies (e.g., spectral density scaling in HEOM algorithms) and optimizing computational costs by judicious selection of scientifically meaningful parameters and trajectories.

Paper-level reproduction efforts leveraged FermiLink's memory-aware workflows, facilitating reuse of intermediate outputs and accelerating multi-figure reproduction. The major limiting factor was not agent reasoning but scientific simulation cost and HPC resource constraints. The unified framework substantially outperformed ad hoc coding agents for sustained, large-scale research simulations.
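The reuse of intermediate outputs across figures can be sketched as a small result memory keyed by package and parameters. This is an assumption about how such reuse might work, not a description of FermiLink's memory mechanism; `ResultMemory` and its methods are hypothetical names.

```python
import hashlib
import json


class ResultMemory:
    """Toy in-memory store that reuses results of identical simulation requests."""

    def __init__(self):
        self._store = {}
        self.hits = 0  # number of requests served without re-running

    @staticmethod
    def key(package: str, params: dict) -> str:
        """Stable key derived from the package name and sorted parameters."""
        payload = json.dumps({"package": package, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, package: str, params: dict, simulate):
        """Return a stored result if available, otherwise call `simulate(params)`."""
        k = self.key(package, params)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        result = simulate(params)
        self._store[k] = result
        return result
```

When two figures of the same paper share a trajectory or a parameter sweep, the second request is served from memory, so the expensive simulation runs only once.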

Autonomous Research Execution: Single-Blinded Test

FermiLink's autonomous research capabilities were tested in a single-blinded experiment with the FDTDBATH-MEEP package, focusing on unpublished polariton physics problems. The package knowledge base exposed the full source tree and the agent skills required for advanced simulations (bath anharmonicity, noise injection, dark-state visualization), with no online documentation or tutorials available.

Provided only with detailed objectives and figure lists, the agent autonomously generated research-grade output—including seven multi-panel figures—within 24 hours, matching results previously produced by a human researcher over two months. An emergent self-reflection behavior was observed: the agent, upon identifying discrepancies between decay rates and linear-response linewidths, autonomously reconsidered fitting strategies, prioritizing tail-window dynamics to mitigate artificial accumulation artifacts and rejecting inconsistent results.
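The tail-window idea can be illustrated with a synthetic signal: a slow exponential decay plus a fast early transient. Fitting the whole trace biases the extracted rate, while fitting only the late-time window recovers it. The sketch below uses a plain least-squares fit of the log-signal; the function names, the synthetic data, and the tolerance are our assumptions, and the consistency check presumes that the decay rate and the linewidth are expressed in the same units and convention.

```python
import math


def fit_decay_rate(times, signal, start_fraction=0.0):
    """Least-squares decay rate of a positive signal from log(signal) vs time.

    Only samples from index int(len(times) * start_fraction) onward are used,
    so start_fraction=0.5 fits the second half (the tail window).
    """
    start = int(len(times) * start_fraction)
    t = list(times[start:])
    y = [math.log(s) for s in signal[start:]]
    n = len(t)
    if n < 2:
        raise ValueError("need at least two samples in the fit window")
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    num = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return -num / den  # the slope of log(signal) is minus the decay rate


def is_consistent(rate, linewidth, rel_tol=0.005):
    """True if the fitted rate agrees with a reference linewidth within rel_tol."""
    return abs(rate - linewidth) <= rel_tol * abs(linewidth)


# Synthetic example: true rate 0.5 plus a fast transient that dies out early.
times = [0.1 * i for i in range(200)]
signal = [math.exp(-0.5 * t) + 2.0 * math.exp(-5.0 * t) for t in times]

full_rate = fit_decay_rate(times, signal)                      # biased by the transient
tail_rate = fit_decay_rate(times, signal, start_fraction=0.5)  # tail window only
```

In this toy case the full-trace fit overestimates the rate and fails the consistency check against a reference of 0.5, while the tail-window fit passes, mirroring the kind of strategy change described above.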

Figure 3: Comparison of agent-calculated upper-polariton (UP) decay rates and linewidths, documenting how the agent adapted its fitting strategy after a consistency check.

This behavior shows that the agent can critically examine its own process and adapt its reasoning, although the workflow still depended on an explicit specification of the relevant scientific regimes and output requirements.

Practical and Theoretical Implications

FermiLink advances scientific automation by addressing critical bottlenecks of reproducibility, scalability, and agent adaptability in multidomain computational research. The source-grounded separation of knowledge bases and workflows enables rapid integration of new scientific packages, reduces maintenance costs, and adapts seamlessly to evolving LLM capabilities.

The framework automates slow, labor-intensive steps from package installation to simulation monitoring, HPC resource allocation, and report drafting. However, FermiLink's empirical evaluation demonstrates persistent reliance on domain expertise for objective specification and validity assessment; agents may pursue artifact-matching strategies when confronted by ambiguous input or insufficient data.

Theoretically, FermiLink's design foreshadows scalable autonomous research infrastructures capable of transitioning from demonstration-level simulation to full research-grade execution. The memory mechanisms support long-term, multi-stage projects and iterative refinement, bridging gaps between scientific questions and practical simulation outputs. Prospective advancements will likely include deeper integration of uncertainty quantification, process validation, and broader multidomain expansion.

Conclusion

FermiLink is a unified agent framework for source-grounded, multidomain scientific simulation workflows, evaluated on 132 figure-level tasks across 44 packages. Its modular design, progressive disclosure mechanism, and workflow orchestration address key challenges in reproducibility and agent adaptability. While agent automation streamlines repetitive tasks and enables scalable simulation infrastructures, human expertise remains essential for objective setting and interpretive judgment. FermiLink positions itself as an infrastructure catalyst for accelerating computational discovery and reproducibility across scientific disciplines (2604.03460).
