Formal Definition and Implementation of Reproducibility Tenets for Computational Workflows (2406.01146v2)
Abstract: Computational workflow management systems power contemporary data-intensive sciences. The slowly resolving reproducibility crisis presents both a sobering warning and an opportunity to iterate on what science and data processing entail. The Square Kilometre Array (SKA), the world's largest radio telescope, is among the most extensive scientific projects underway and presents grand scientific collaboration and data-processing challenges. In this work, we aim to improve the ability of workflow management systems to facilitate reproducible, high-quality science. We present a scale- and system-agnostic computational workflow model and extend five well-known reproducibility concepts into seven well-defined tenets for this model. Additionally, we present a method to construct workflow execution signatures using cryptographic primitives in amortized constant time. We combine these three concepts in a concrete implementation in the Data Activated Flow Graph Engine (DALiuGE), a workflow management system for the SKA, which embeds specific provenance information into workflow signatures and demonstrates that automatic formal verification of scientific quality is possible in amortized constant time. We validate our approach with a simple yet representative astronomical processing task: filtering a noisy signal with a lowpass filter using CPU and GPU methods. This example shows the practicality and efficacy of combining formal tenet definitions with a workflow signature generation mechanism. Our framework, spanning formal UML specification, principled provenance information collection based on reproducibility tenets, and a concrete example implementation in DALiuGE, illuminates otherwise obscure scientific discrepancies and similarities between principally identical workflow executions.
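To make the idea of a workflow execution signature concrete, the following is a minimal sketch of one way such a signature could be built with a cryptographic hash. It is an illustration under stated assumptions, not the construction used in DALiuGE: the abstract does not specify the hashing scheme, so the Merkle-style chaining, the function names (`leaf_hash`, `node_signature`, `workflow_signature`, `select`), the provenance field names (`app`, `seed`, `cutoff_hz`, `device`), and the two field selections (`logical`, `full`) are all hypothetical. Each component is signed once from its own fields and the already-computed signatures of its direct inputs, so the work per component depends only on its number of inputs, not on the size of the whole workflow.

```python
import hashlib
import json
from typing import Dict, List, Set


def leaf_hash(fields: Dict) -> str:
    """Hash one component's provenance fields in a canonical (sorted-key) form."""
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def node_signature(fields: Dict, parent_signatures: List[str]) -> str:
    """Combine a component's own hash with the signatures of its direct inputs."""
    h = hashlib.sha256()
    h.update(leaf_hash(fields).encode("ascii"))
    # Sorting makes the result independent of the order in which inputs are listed.
    for sig in sorted(parent_signatures):
        h.update(sig.encode("ascii"))
    return h.hexdigest()


def workflow_signature(nodes: Dict[str, Dict], edges: Dict[str, List[str]]) -> str:
    """Sign a workflow. `edges` maps a node name to its parent names (assumed acyclic)."""
    signatures: Dict[str, str] = {}

    def visit(name: str) -> str:
        # Each node is signed exactly once; later lookups reuse the cached value.
        if name not in signatures:
            parents = [visit(p) for p in edges.get(name, [])]
            signatures[name] = node_signature(nodes[name], parents)
        return signatures[name]

    for name in nodes:
        visit(name)
    # Nodes that feed nothing else are the workflow's outputs (sinks).
    feeds_something = {p for parents in edges.values() for p in parents}
    sinks = sorted(n for n in nodes if n not in feeds_something)
    return node_signature({"workflow": "root"}, [signatures[s] for s in sinks])


def select(nodes: Dict[str, Dict], keep: Set[str]) -> Dict[str, Dict]:
    """Keep only the provenance fields relevant to one (hypothetical) tenet."""
    return {n: {k: v for k, v in f.items() if k in keep} for n, f in nodes.items()}


# Hypothetical lowpass-filter workflow run once on CPU and once on GPU.
edges = {"filter": ["signal"], "output": ["filter"]}
cpu_run = {
    "signal": {"app": "noisy_signal", "seed": 42},
    "filter": {"app": "lowpass", "cutoff_hz": 50, "device": "cpu"},
    "output": {"app": "writer"},
}
gpu_run = {**cpu_run, "filter": {**cpu_run["filter"], "device": "gpu"}}

logical = {"app", "seed", "cutoff_hz"}  # ignores where the code ran
full = logical | {"device"}             # also records the execution device

same_logic = (workflow_signature(select(cpu_run, logical), edges)
              == workflow_signature(select(gpu_run, logical), edges))
same_full = (workflow_signature(select(cpu_run, full), edges)
             == workflow_signature(select(gpu_run, full), edges))
print(same_logic, same_full)
```

In this sketch, the choice of which fields enter the hash plays the role that a tenet plays in the abstract: the CPU and GPU runs share a signature when only the logical description is hashed, and diverge once the execution device is included. Comparing two executions then reduces to comparing two fixed-length digests.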