AiiDA 1.0, a scalable computational infrastructure for automated reproducible workflows and data provenance (2003.12476v1)

Published 24 Mar 2020 in cs.DC and cond-mat.mtrl-sci

Abstract: The ever-growing availability of computing power and the sustained development of advanced computational methods have contributed much to recent scientific progress. These developments present new challenges driven by the sheer amount of calculations and data to manage. Next-generation exascale supercomputers will harden these challenges, such that automated and scalable solutions become crucial. In recent years, we have been developing AiiDA (http://www.aiida.net), a robust open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording. Here, we introduce developments and capabilities required to reach sustained performance, with AiiDA supporting throughputs of tens of thousands processes/hour, while automatically preserving and storing the full data provenance in a relational database making it queryable and traversable, thus enabling high-performance data analytics. AiiDA's workflow language provides advanced automation, error handling features and a flexible plugin model to allow interfacing with any simulation software. The associated plugin registry enables seamless sharing of extensions, empowering a vibrant user community dedicated to making simulations more robust, user-friendly and reproducible.

Summary

The paper introduces an event-driven engine that improves task distribution and scalability for high-throughput computations.
The paper presents a provenance graph that records the inputs, outputs, and dependencies of each computational process, supporting reproducibility and data integrity.
The paper details database optimizations using PostgreSQL’s JSONB format to enhance query performance while reducing storage overhead.

Overview of AiiDA 1.0: A Scalable Computational Infrastructure for Automated Reproducible Workflows and Data Provenance

The paper introduces AiiDA 1.0, an open-source infrastructure for high-throughput computations with a focus on automation and data provenance. AiiDA provides a framework for managing complex workflows and supports reproducibility by recording detailed provenance data for every computational process. The paper also describes the architectural overhaul of AiiDA, which was aimed specifically at the challenges posed by exascale computing systems.

Key Features and Improvements

AiiDA 1.0 introduces significant architectural improvements that strengthen its ability to manage high-throughput workloads. The primary changes include:

Event-Based Engine: The transition from a polling-based to an event-driven architecture is a pivotal change. Using RabbitMQ as the message broker, AiiDA distributes tasks efficiently across many processes and reacts to state changes within workflows as they occur, rather than waiting for the next polling interval.
Provenance Graph: AiiDA's provenance framework tracks both logical and data provenance within computations. It establishes a structured, queryable database that documents input-output relationships and process dependencies using separate directed acyclic graphs.
Database Optimization: Moving from an entity-attribute-value storage model to PostgreSQL's JSONB format improves query efficiency and reduces storage demands. In addition, transitive closures are computed dynamically at query time, which mitigates storage bloat while maintaining query performance.
Plugin System and Interoperability: A workflow-centric plugin system extends AiiDA’s functionality, facilitating integration with a wide range of simulation codes and allowing for the development of custom data types and workflows which are made publicly sharable via an online registry.
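To make the provenance graph and the query-time transitive closure more concrete, here is a minimal sketch in plain Python. It does not use AiiDA's API: the `ProvenanceGraph` class, its methods, and the node names are all hypothetical, chosen only for illustration. The idea shown is that only direct links are stored, and the full set of ancestors of a node is derived by traversal when it is requested.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Toy directed acyclic graph of data and process nodes (hypothetical, not AiiDA's API)."""

    def __init__(self):
        # Only direct links are stored: target node -> set of its direct parents.
        self._parents = defaultdict(set)

    def add_link(self, source, target):
        """Record that `source` is a direct input of (or was produced into) `target`."""
        self._parents[target].add(source)

    def ancestors(self, node):
        """Compute the transitive closure of parents at query time via breadth-first search."""
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


# Hypothetical two-step workflow: relax a structure, then compute its bands.
graph = ProvenanceGraph()
graph.add_link("structure", "relax_calc")
graph.add_link("relax_calc", "relaxed_structure")
graph.add_link("relaxed_structure", "bands_calc")
graph.add_link("kpoints", "bands_calc")
graph.add_link("bands_calc", "band_structure")

# Everything that contributed to the final result, derived on demand.
history = graph.ancestors("band_structure")
```

Storing only direct links keeps the stored data proportional to the number of links, while the traversal cost is paid only for the nodes a query actually asks about. A production system would run the equivalent traversal inside the database rather than in application memory.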

Implications for Scientific Computing

By catering to the intensive computational demands of modern scientific research, AiiDA 1.0 makes a compelling case for automated workflow management in computational science. The sophisticated approach to data provenance not only ensures reproducibility but also enables efficient use of computational resources through caching and scalable process management.
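The caching idea can be sketched in a few lines of plain Python. The summary only states that caching is enabled by the recorded provenance; the hashing scheme below, and the names `fingerprint`, `run_cached`, and `add`, are assumptions made for illustration and are not taken from AiiDA. The sketch fingerprints a process together with its inputs and reuses stored outputs when the same fingerprint is seen again.

```python
import hashlib
import json

# In-memory stand-in for a store of completed calculations (assumption for this sketch).
_cache = {}


def fingerprint(process_name, inputs):
    """Hash the process name and its inputs in a key-order-independent way."""
    payload = json.dumps({"process": process_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(process_name, inputs, compute):
    """Return (outputs, was_cached); only call `compute` on a cache miss."""
    key = fingerprint(process_name, inputs)
    if key in _cache:
        return _cache[key], True
    outputs = compute(**inputs)
    _cache[key] = outputs
    return outputs, False


calls = []  # records how often the expensive function really runs


def add(x, y):
    """Hypothetical stand-in for an expensive simulation."""
    calls.append((x, y))
    return {"sum": x + y}


first, first_cached = run_cached("add", {"x": 1, "y": 2}, add)
# Same inputs in a different key order produce the same fingerprint.
second, second_cached = run_cached("add", {"y": 2, "x": 1}, add)
```

In this sketch the second call returns the stored result without running `add` again, which is the resource saving the paragraph above refers to.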

The practical implications include:

Enhanced Reproducibility: Automated tracking of computations makes scientific results easier to verify, which is a prerequisite for credible scientific discourse.
Future-Proof Infrastructure: The emphasis on extensibility and interoperability positions AiiDA as a central framework that can evolve alongside scientific needs, facilitating collaboration and data sharing within and across research disciplines.
Performance Scaling: Moving away from legacy design limitations allows AiiDA to support next-generation supercomputing, accommodating the rapid growth in data and computation in fields such as materials science.

In conclusion, AiiDA 1.0 represents a strategic advance in tackling three foundational challenges in computational science: reproducibility, scalability, and interoperability. As scientific research relies increasingly on computational methods, frameworks like AiiDA will play an essential role in supporting data-driven discovery. Further development and widespread adoption of such infrastructures could speed up scientific innovation while maintaining the rigor and reproducibility that contemporary science demands.
