Bridging the Gap on AI-Assisted Scientific Software Development Through Transparency and Traceability

Published 17 May 2026 in cs.SE and cond-mat.mtrl-sci | (2605.17675v1)

Abstract: The widespread adoption of AI-assisted development in scientific software is not a future concern -- it is a present reality. Researchers are already using LLMs to write code, generate test cases, and draft documentation, yet this practice remains largely unacknowledged and unguided in formal workflows and published work. This ad hoc, ungoverned use of AI represents a systemic risk to scientific software quality, particularly in safety-relevant modeling and simulation tools subject to strict Software Quality Assurance (SQA), or even Nuclear Quality Assurance Level 1 (NQA-1) standards, for which traceability, independent verification, and documented procedures are paramount. The question facing the scientific software community is, therefore, not whether to permit AI-assisted development, but how to govern it responsibly. This paper proposes guidance for AI-assisted code development in the context of strict software quality assurance. Using TMAP8 -- an open-source tritium migration code for fusion energy -- as a demonstration platform, we propose a structured framework for AI-assisted verification and validation (V&V) case development. V&V case development represents the ideal proving ground for establishing that governance: because validation cases have known solutions, correctness is objectively measurable, errors are identifiable by design, and the artifacts are fully auditable. The proposed guidance, developed based on practical experience described herein, operates within NQA-1 requirements, preserves human accountability, and establishes the disclosure and review standards that responsible AI-assisted scientific software development demands.


Authors (6)

Summary

The paper presents a novel governance framework integrating explicit metadata, provenance, and dedicated human oversight to manage AI-assisted software development.
It demonstrates the framework's viability through two TMAP8 case studies, reporting reduced root-mean-square percentage error (RMSPE) and improved reproducibility in safety-critical domains.
The study emphasizes that rapid AI-driven code generation must be coupled with rigorous SQA and adaptive policy enforcement to maintain scientific rigor.

Governance for AI-Assisted Scientific Software Development: Transparency, Traceability, and SQA Integration

Introduction

The incorporation of large language models (LLMs) and agentic AI systems into scientific software development pipelines is rapidly transforming established practices. In regulated environments governed by Software Quality Assurance (SQA) protocols, especially fields adhering to the ASME NQA-1 standard, this transformation introduces substantial systemic risks if left unmanaged. The paper "Bridging the Gap on AI-Assisted Scientific Software Development Through Transparency and Traceability" (2605.17675) addresses these risks by presenting a structured governance framework for AI-assisted scientific code development, anchored in the principles of transparency, traceability, and rigorous validation.

Motivations and Problem Statement

The ad hoc, frequently undisclosed use of LLMs and agentic systems in scientific workflows raises critical questions regarding provenance, defect propagation, and reproducibility, directly impacting safety-relevant domains including nuclear modeling and fusion energy simulation. Empirical evidence documents elevated defect rates, logical and semantic errors ("hallucinations"), and reproducibility failures in AI-generated code, with further risk amplification due to correlated failure modes when both implementation and testing code are produced by related models [14, 15, 16].

Traditional SQA and NQA-1 frameworks presuppose human authorship and review, creating a governance vacuum for AI-in-the-loop workflows. Overly restrictive policies risk driving AI usage underground, while unregulated adoption erodes trust and undermines the reproducibility and safety of scientific software artifacts.

Proposed Governance Framework

The framework operationalizes the following key principles:

Explicit Metadata and Provenance: All code contributions must record the degree and nature of AI involvement at the commit level, with links to development rationale and session logs. Session logs must be both human- and machine-readable, so that AI-developer interactions can be reconstructed after the fact even though the model outputs themselves are stochastic and may not be reproducible exactly.
Independence and Human Accountability: Consistent with NQA-1 and SQA best practices, AI agents are relegated to an assistive role; only human reviewers are qualified to perform independent verification. The human principal remains the accountable author of any contribution, irrespective of the extent of mechanical generation.
Automation and Infrastructure Support: Non-negotiable requirements—such as session log inclusion, metadata completeness, and adherence to repository contribution guidelines—are enforced using automated pre-commit hooks, rather than left to the discretion or context window limits of LLM agents.
Iterative Policy Artifact (AGENTS.md): Governance requirements are encoded in an infrastructure-level AGENTS.md file that is version-controlled alongside the codebase, treated as a living document, and updated periodically as new agentic failure modes are observed.
V&V Case as the Test Bed: Verification and validation (V&V) test cases provide a domain of objective, measurable correctness criteria on which to benchmark both AI- and human-generated artifacts, serving as a proving ground for policy refinement in a low-risk, high-auditability environment.
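As an illustration of commit-level metadata, the sketch below parses Git-style trailers from a commit message and reports missing fields. The trailer names (AI-Assisted, Session-Log) and the involvement levels are hypothetical, chosen only for this example; the schema actually used in the paper is not given in this summary.

```python
"""Sketch: validate AI-provenance trailers in a commit message.

The trailer names and allowed values are assumptions for illustration,
not the schema used by the paper or by TMAP8.
"""
import re

# Hypothetical involvement levels for the "AI-Assisted" trailer.
ALLOWED_LEVELS = {"none", "partial", "substantial"}

# A trailer line looks like "Key-Name: value".
TRAILER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):[ \t]*(.+)$")


def parse_trailers(message):
    """Return trailers found in the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    # A message with only a subject line has no trailer paragraph.
    if len(paragraphs) < 2:
        return {}
    trailers = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match:
            trailers[match.group(1)] = match.group(2).strip()
    return trailers


def validate_commit_message(message):
    """Return a list of problems; an empty list means the message complies."""
    trailers = parse_trailers(message)
    problems = []
    level = trailers.get("AI-Assisted")
    if level is None:
        problems.append("missing trailer: AI-Assisted")
    elif level not in ALLOWED_LEVELS:
        problems.append(f"unknown AI-Assisted level: {level}")
    # A session log is only required when AI was actually involved.
    if level in ALLOWED_LEVELS - {"none"} and "Session-Log" not in trailers:
        problems.append("missing trailer: Session-Log")
    return problems
```

A check of this kind could run from a commit-msg hook, which Git invokes with the path to the message file; returning a list of problems rather than raising keeps the function easy to reuse in continuous integration.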

Demonstration: TMAP8 Validation Cases

The framework's viability is substantiated through two V&V case studies in TMAP8:

Case 1: Implementation of an Established Model

A mechanistic model for tritium release from neutron-irradiated Li₂TiO₃ [45] is transcribed and parameterized within TMAP8 using a Claude/Opus-based agentic workflow. The AI system pre-processes literature, extracts governing equations, proposes implementation plans, and automates repetitive tasks including Bayesian calibration and test artifact construction. The human developer functions as prompter and reviewer, guiding physical model selection and parameter range specification.

A key observation is that the agent faithfully reproduced errors present in the reference publication, notably the initial omission of a defect-annihilation term, which highlights the need for adversarial human validation beyond automated checks. Bayesian parameter optimization reduces the RMSPE from 23.14% (reference parameters) to 9.20% (optimized parameters). This shows that AI can accelerate complex calibration routines, while underlining that scientific oversight cannot be automated away.
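For readers unfamiliar with the metric, the sketch below computes RMSPE under two common conventions. This summary does not state which normalization the paper uses, so both function names and the choice of conventions are assumptions, and the numbers reported above should not be expected to match either one without checking the source.

```python
"""Sketch: two common RMSPE conventions.

Which convention the paper uses is not stated in this summary; both
functions are illustrative assumptions.
"""
import math


def _check(observed, predicted):
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")


def rmspe_pointwise(observed, predicted):
    """Scale each residual by its own observed value; result in percent."""
    _check(observed, predicted)
    squared = [((p - o) / o) ** 2 for o, p in zip(observed, predicted)]
    return 100.0 * math.sqrt(sum(squared) / len(squared))


def rmspe_mean_normalized(observed, predicted):
    """Divide the RMSE by the mean observed value; result in percent."""
    _check(observed, predicted)
    mse = sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    mean_observed = sum(observed) / len(observed)
    return 100.0 * math.sqrt(mse) / mean_observed
```

The two conventions can differ noticeably when the observed signal spans a wide range: the pointwise form weights errors near small values heavily, while the mean-normalized form is dominated by errors near the peak.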

Case 2: Hypothesis-Driven Model Construction

Leveraging experimental data on deuterium release from self-irradiated tungsten with oxide films [46], a Codex/ChatGPT-based AI agent supports construction of a novel model formulation—here, the absence of a published mechanistic model necessitates genuine scientific inference and exploration rather than transcription. The agent is prompted to hypothesize mechanisms, prototype competing formulations (e.g., distinct versus continuous oxide layer treatments), run iterative V&V cycles, and automate documentation and plotting.

The agent accelerates exploration of a hypothesis space, not merely the mechanical translation of known models, and so enables prototype-comparison cycles that would be cost-prohibitive or impractical in a purely human workflow. RMSPEs in D₂ release rates across oxide film configurations range from 26.61% to 46.45%, which the work presents as reasonable physical fidelity for a first-principles-driven approach.

Lessons Learned and Refined Best Practices

AI as an Accelerator, Not an Autonomous Developer

Substantial productivity gains are observed (e.g., reducing development timelines from days to hours), but the cognitive burden shifts: task-level costs fall, while the cost of global scientific oversight, hypothesis selection, and interpretation increases. Agents tend toward over-complication and may drift or hallucinate unless constrained by explicit repository policies and session scoping.

Hallucination, Context Limitations, and Enforcement

LLM "hallucination" manifests in subtle, plausible coding errors, often mirroring ambiguities or defects in the scientific literature itself. Agentic systems' limited context windows amplify the risk of missed requirements, necessitating that provenance and validation enforcement be infrastructural (pre-commit hooks, protected branches), rather than instruction-level or dependent on agent recall.
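To make the distinction between infrastructural and instruction-level enforcement concrete, the sketch below shows a repository-side check that inspects the files staged for a commit. The directory name session_logs and the way the ai_assisted flag is obtained are hypothetical; the only external command used is git diff --cached --name-only, which lists staged paths.

```python
"""Sketch: a staged-file gate that does not rely on the agent recalling rules.

The directory name and the source of the `ai_assisted` flag are
assumptions for illustration only.
"""
import subprocess
from pathlib import PurePosixPath

SESSION_LOG_DIR = "session_logs"  # hypothetical location for session logs


def staged_files():
    """Return the paths currently staged in the Git index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def check_staged_files(staged_paths, ai_assisted):
    """Return a list of violations for the files staged in a commit."""
    violations = []
    paths = [PurePosixPath(p) for p in staged_paths]
    # Look for at least one file under the top-level session log directory.
    has_log = any(p.parts and p.parts[0] == SESSION_LOG_DIR for p in paths)
    if ai_assisted and not has_log:
        violations.append(
            f"AI-assisted commit stages no file under {SESSION_LOG_DIR}/"
        )
    return violations
```

In a pre-commit hook, a script would call check_staged_files(staged_files(), ai_assisted=...) and exit with a non-zero status when the list is non-empty, which makes Git abort the commit. Because the check runs in the repository tooling, it applies even when the requirement has fallen out of the agent's context window.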

Provenance and Long-Term Ecosystem Resilience

Explicit, machine-readable provenance is essential—not only for present quality assurance, but to mitigate long-term risks of model collapse and defect amplification as AI-generated artifacts and failure patterns are recursively incorporated into training sets [52]. Repositories must provide mechanisms for provenance filtering and auditability.
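One way provenance filtering could work is sketched below: each file carries a machine-readable record, and a consumer (for example, someone assembling a training corpus or an audit sample) selects files by AI involvement and review status. The JSON Lines layout and the field names path, ai_involvement, and human_reviewed are hypothetical and not taken from the paper.

```python
"""Sketch: filter files by machine-readable provenance records.

The record schema (path, ai_involvement, human_reviewed) is a
hypothetical example, not a format defined by the paper.
"""
import json


def load_records(jsonl_text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def select_paths(records, exclude_levels=("substantial",), require_human_review=True):
    """Return paths whose provenance passes the filter."""
    selected = []
    for rec in records:
        # Records without an explicit level are treated as "unknown".
        if rec.get("ai_involvement", "unknown") in exclude_levels:
            continue
        if require_human_review and not rec.get("human_reviewed", False):
            continue
        selected.append(rec["path"])
    return selected
```

Treating a missing ai_involvement field as "unknown" rather than "none" is a deliberate choice in this sketch: a stricter audit can add "unknown" to exclude_levels so that undocumented files are never assumed to be human-written.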

Theoretical and Practical Implications

The framework demonstrates that existing SQA and NQA-1 principles are largely compatible with AI-assisted development when human agency and accountability are structurally preserved, and systematic enforcement mechanisms are employed. The key boundary is the irreducibility of human epistemic oversight in scientific judgment and the design of validation regimes for high-assurance domains. As agentic systems become more deeply embedded in scientific workflows, policy and standards bodies must address whether to treat LLMs and coding agents as configuration items with their own qualification and versioning requirements.

Iterative governance artifacts such as AGENTS.md enable adaptation as agentic tools and best practices evolve, and provide a foundation for broader application across scientific software ecosystems as well as other regulated domains.

Conclusion

The paper establishes that the binary choice between AI-assisted and human-only scientific software development is obsolete; the urgent distinction is between governed and ungoverned AI involvement. By integrating explicit provenance, independent review, and infrastructure-level enforcement, the proposed framework preserves the foundational requirements for traceable, auditable, reproducible scientific computing, even as code production accelerates and the locus of creativity shifts toward prompt engineering and review.

V&V cases provide an effective domain for initial policy refinement and demonstration, with lessons directly extensible to broader applications. As AI-generated code is increasingly incorporated into open-source and scientific repositories, robust governance becomes not merely a matter of immediate quality assurance but a bulwark against the long-term degradation of scientific and engineering software ecosystems.

The primary implication is clear: responsible, standards-compliant AI-assisted software development is achievable today, without sacrificing productivity or scientific rigor—provided that explicit, evolving governance artifacts and enforceable infrastructure safeguards are in place.