ExaWorks SDK: Exascale Workflow Toolkit
- ExaWorks SDK is a modular toolkit unifying heterogeneous workflow engines, resource managers, and execution systems to manage exascale scientific workflows.
- It enables portable and scalable workflow strategies through API-driven abstractions and a layered design that supports diverse scheduling systems.
- Demonstrated at DOE Leadership Facilities, the SDK achieves high resource utilization, robust fault tolerance, and streamlined deployment across HPC environments.
The ExaWorks Software Development Kit (SDK) is a modular, composable collection of interoperable workflow technologies devised to provide portable, scalable workflow management services across Department of Energy (DOE) Leadership Class Facilities (LCFs) and other high-performance computing (HPC) environments. Conceived as a "toolkit of toolkits," the SDK unifies a wide array of workflow engines, pilot-job systems, resource managers, and standardized interfaces under a common abstraction layer. ExaWorks enables users to author, adapt, and execute heterogeneous scientific workflows—combining simulation, data analysis, machine learning, and resource coordination—at exascale. Design priorities include interoperability, code reuse, continuous integration, and performance portability. The SDK is a major output of the U.S. Exascale Computing Project, and is engineered from the outset for extreme scale, heterogeneous resource targets, and open standards-based sustainability (Turilli et al., 2024, Alsaadi et al., 2024, Titov et al., 2024, Al-Saadi et al., 2021).
1. Architecture and Layered Design
ExaWorks SDK is structured as a multilayered stack, where each layer supports interchangeable and pluggable components. The conceptual layers are:
- Workflow-composition layer: High-level APIs for user-space description of directed-acyclic-graph (DAG), dataflow, ensemble, or machine learning–steered workflows. Example components include Parsl, RADICAL Cybertools (e.g., RADICAL-EnTK), Swift/T, SmartSim, and MaestroWF.
- Resource-management abstraction layer: Common abstractions encapsulated in interfaces such as PSI/J (Portable Submission Interface for Jobs) and provider/launcher APIs. These decouple workflow logic from scheduler-specific details.
- Execution-management layer: Pilot-job systems and hierarchical schedulers enable fine-grained placement, dynamic resource partitioning, and high-throughput task management. Notable engines are RADICAL-Pilot and Flux.
- HPC-integration layer: Backends for platform-native schedulers (Slurm, PBSPro, LSF, Cobalt, Flux) and support for container runtimes, network fabrics, and data staging facilities (Turilli et al., 2024, Alsaadi et al., 2024).
Across these layers, well-defined interfaces and adapters support cross-component interoperability (e.g., Parsl+Flux, Parsl+PSI/J, RADICAL-Pilot+Flux). Figure 1 in (Turilli et al., 2024) and the corresponding ASCII sketches in (Alsaadi et al., 2024) illustrate this reference architecture, which facilitates mix-and-match assembly of middleware for science teams.
| Layer | Example Components | Interface Standard |
|---|---|---|
| Workflow composition | Parsl, EnTK, Swift/T, MaestroWF | Python API, Dataflow |
| Resource management | PSI/J, Flux | JobSpec, Provider API |
| Execution management | RADICAL-Pilot, Flux | TaskSched., PilotJob |
| HPC integration | Slurm, PBS, LSF, Cobalt | Scheduler APIs |
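The decoupling the table summarizes can be sketched in plain Python: a single canonical JobSpec passes through interchangeable scheduler adapters. This is an illustrative sketch, not the SDK's actual classes; `SlurmAdapter`, `LSFAdapter`, and the flag choices are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Canonical job descriptor (cf. Section 3): executable plus resources."""
    executable: str
    nodes: int = 1
    cores_per_node: int = 1
    walltime_minutes: int = 30


class SchedulerAdapter:
    """Resource-management abstraction: translates a JobSpec into a
    scheduler-native submission, decoupling workflow logic from the backend."""
    def render(self, spec: JobSpec) -> str:
        raise NotImplementedError


class SlurmAdapter(SchedulerAdapter):
    def render(self, spec: JobSpec) -> str:
        return (f"sbatch --nodes={spec.nodes} "
                f"--ntasks-per-node={spec.cores_per_node} "
                f"--time={spec.walltime_minutes} --wrap '{spec.executable}'")


class LSFAdapter(SchedulerAdapter):
    def render(self, spec: JobSpec) -> str:
        return (f"bsub -nnodes {spec.nodes} "
                f"-W {spec.walltime_minutes} {spec.executable}")


# The same workflow-layer JobSpec targets either backend unmodified.
spec = JobSpec(executable="./sim.exe", nodes=4, cores_per_node=56)
print(SlurmAdapter().render(spec))
print(LSFAdapter().render(spec))
```

The workflow-composition layer only ever constructs JobSpecs; swapping the adapter is the entire cost of moving between facilities.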
2. Component Technologies and Abstractions
ExaWorks SDK curates and integrates established workflow, pilot, and resource management systems:
- Flux: A distributed, hierarchical resource manager and scheduler supporting dynamic user-space partitioning, nested scheduling (parent/child instances), and concurrency controls for HPC workloads (Al-Saadi et al., 2021, Turilli et al., 2024).
- RADICAL Cybertools: A suite including RADICAL-Pilot (a pilot-job system for decoupling resource allocation from task execution) and RADICAL-EnTK (an ensemble-based workflow engine utilizing the Pipeline-Stage-Task or PST abstraction). These enable dynamic pipeline execution, cross-scheduler portability, and advanced policies (dynamic task creation, retries, multi-scheduler support) (Titov et al., 2024).
- Parsl: A Python-based dataflow workflow library. Users annotate functions with the @python_app or @bash_app decorators, which Parsl converts into a DAG of TaskSpecs and dispatches through executor and provider interfaces. It supports backend execution via built-ins or through integration layers such as PSI/J and RADICAL-Pilot (Turilli et al., 2024).
- SmartSim and MaestroWF: Specialized systems for, respectively, ML-augmented simulation campaigns (co-locating in-memory DBs and ML workers) and YAML-based user-space DAG specification (Turilli et al., 2024).
- PSI/J: A minimal, vendor-neutral job management API for portable submission and monitoring of jobs across heterogeneous schedulers. The API is designed with bulk-aware asynchronous operations and state machine models for job tracking (Alsaadi et al., 2024, Al-Saadi et al., 2021).
- Swift/T: A dataflow language and MPI runtime for implicit large-scale parallel workflows, compiled from C/Java-like code (Turilli et al., 2024, Al-Saadi et al., 2021).
Connector and adaptor layers (e.g., FluxExecutor, RPEXExecutor) expose unified APIs, enabling abstract workflow code to target diverse engines and schedulers with minimal modification. Integration examples such as Parsl+Flux, Parsl+RADICAL-Pilot, and Swift/T+Flux exemplify API-driven composition (Turilli et al., 2024, Alsaadi et al., 2024).
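The dataflow pattern that Parsl implements can be illustrated with a stdlib-only sketch (this is not Parsl's actual machinery; the `python_app` decorator below is a hand-rolled stand-in that resolves upstream futures eagerly, where real Parsl defers them):

```python
from concurrent.futures import ThreadPoolExecutor, Future

_pool = ThreadPoolExecutor(max_workers=4)


def python_app(fn):
    """Sketch of a Parsl-style decorator: calling the function submits a
    task to a pool and returns a Future instead of executing inline."""
    def wrapper(*args, **kwargs):
        # Resolving any upstream Futures first forms an implicit DAG edge.
        # (Real Parsl tracks these dependencies without blocking the caller.)
        resolved = [a.result() if isinstance(a, Future) else a for a in args]
        return _pool.submit(fn, *resolved, **kwargs)
    return wrapper


@python_app
def simulate(x):
    return x * x


@python_app
def double(y):
    return 2 * y


# Passing one app's Future into another chains them into a dataflow graph.
result = double(simulate(3))
print(result.result())  # 18
```

The decorated call sites read like ordinary Python, which is what lets ML-steered and ensemble logic stay in user space while executors and providers handle placement.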
3. Interfaces and Interoperability
Interoperability is architected via strict adherence to a set of small, language-neutral abstractions:
- TaskSpec / JobSpec: Canonical descriptors of workflow tasks and jobs, specifying executable, resource requirements (cores, GPUs, nodes, walltime), environment, and arguments.
- Executor / Provider interfaces: Objects or classes exposing submit(JobSpec) → JobID, status, cancel, bulk_submit, and wait operations, typically asynchronous for high throughput.
- PSI/J: Defines core function signatures and job state transitions. For example, the job state machine includes transitions such as NEW → QUEUED, QUEUED → ACTIVE, and ACTIVE → COMPLETED (Alsaadi et al., 2024).
Adapters translate between workflow models and execution engines (e.g., Parsl’s Executor and Provider interfaces calling PSI/J’s JobExecutor or RADICAL-Pilot’s API). The connector ecosystem prevents lock-in and supports deep API-driven aggregation across facilities and platforms.
Levels of interoperability are specified from Level 0 (packaging co-location) through Level 2 (deep API composition across systems) (Al-Saadi et al., 2021).
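A minimal sketch of these abstractions, assuming illustrative names (`LocalExecutor`, the `TRANSITIONS` table, and the dict-based spec are inventions for the example, and the state set mirrors but does not reproduce PSI/J's API):

```python
import enum
import itertools
from dataclasses import dataclass


class JobState(enum.Enum):
    # PSI/J-style lifecycle: each job advances to a terminal state.
    NEW = "new"
    QUEUED = "queued"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"


# Legal transitions of the sketched state machine.
TRANSITIONS = {
    JobState.NEW: {JobState.QUEUED},
    JobState.QUEUED: {JobState.ACTIVE, JobState.FAILED},
    JobState.ACTIVE: {JobState.COMPLETED, JobState.FAILED},
}


@dataclass
class Job:
    spec: dict
    id: int
    state: JobState = JobState.NEW

    def advance(self, new_state: JobState) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


class LocalExecutor:
    """Bulk-aware executor sketch: many jobs submitted in one call, as the
    asynchronous, high-throughput APIs described above encourage."""
    _ids = itertools.count()

    def bulk_submit(self, specs):
        jobs = [Job(spec=s, id=next(self._ids)) for s in specs]
        for job in jobs:
            job.advance(JobState.QUEUED)
        return jobs


jobs = LocalExecutor().bulk_submit([{"executable": "/bin/date"}] * 3)
print([j.state.name for j in jobs])  # ['QUEUED', 'QUEUED', 'QUEUED']
```

Enforcing transitions in one place is what lets adapters on either side of the interface agree on job status without sharing scheduler internals.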
4. Performance, Scaling, and Reliability
ExaWorks SDK and its constituent technologies have demonstrated high scalability and resource efficiency in DOE leadership-class HPC environments:
- RADICAL-EnTK–driven ExaAM UQ campaign on Frontier: Achieved 90% resource utilization (RU = 0.90) on 8,000 nodes (448,000 CPU cores, 64,000 GPUs), orchestrating 7,875 concurrent tasks via a single RP pilot; total execution time, pilot overhead, and scheduling and launch throughput are reported in detail in (Titov et al., 2024, Alsaadi et al., 2024).
- Parsl + RADICAL-Pilot scaling: Demonstrated near-linear weak and strong scaling to 65,536 cores, with ~99% utilization on up to 256 nodes (Turilli et al., 2024, Alsaadi et al., 2024).
- Swift/T workflows: Enabled hundreds of concurrent MPI jobs with low-overhead task launch (e.g., COVID-19 agent-based simulations), sustaining high task-launch throughput using hierarchical and pilot-job scheduling (Al-Saadi et al., 2021, Alsaadi et al., 2024).
- Fault Tolerance: EnTK and RP support both task-level retry (user-configurable retries on failures such as MPI aborts, library errors, or node drops) and application-level restart (automatic recomputation of incomplete tasks, skipping completed ones, if a pilot is preempted), preserving workflow ordering and minimizing manual intervention (Titov et al., 2024).
- Portability: All platform-specific resource details (schedulers, launchers, core/gpu/node mappings) are captured in adaptors, enabling the same workflow scripts to execute unmodified across Summit (LSF), Crusher (Slurm), and Frontier (Slurm) (Titov et al., 2024).
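The portability point above can be sketched as a small adaptor table; the dict contents and flag spellings here are illustrative assumptions, not the SDK's actual configuration:

```python
# Illustrative platform adaptor table: scheduler and launcher details live
# here so workflow scripts stay unmodified across machines.
PLATFORMS = {
    "summit":   {"scheduler": "lsf",   "launcher": "jsrun", "gpus_per_node": 6},
    "crusher":  {"scheduler": "slurm", "launcher": "srun",  "gpus_per_node": 8},
    "frontier": {"scheduler": "slurm", "launcher": "srun",  "gpus_per_node": 8},
}


def launch_command(platform: str, executable: str, nodes: int) -> str:
    """Render a platform-native launch line from one portable request."""
    p = PLATFORMS[platform]
    if p["launcher"] == "srun":
        return (f"srun -N {nodes} "
                f"--gpus-per-node={p['gpus_per_node']} {executable}")
    return f"jsrun --nrs {nodes} -g {p['gpus_per_node']} {executable}"


# The same call targets LSF's launcher on Summit and Slurm's on Frontier.
print(launch_command("summit", "./sim.exe", 8))
print(launch_command("frontier", "./sim.exe", 8))
```

Everything machine-specific is data, not code, so adding a new facility means adding a table entry rather than editing workflows.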
5. Continuous Integration, Testing, and Documentation
Best practices in ExaWorks development enforce rigorous CI, test coverage, and documentation:
- Continuous Integration (CI): Pipelines hosted at major DOE centers (ALCF, LLNL, NERSC, OLCF) run both component-level and cross-component tests daily, with coverage of unit and smoke tests, integration flows, and historical tracking of pass rates (>95%) (Turilli et al., 2024, Alsaadi et al., 2024).
- Testing Dashboard: Backend REST service and frontend dashboard (calendar and drilldown views) catalog test runs, outcomes, and artifacts per site.
- Documentation: Central "SDK hub" on ReadTheDocs with per-component references, API/guides, and governance; dynamically generated and tested Jupyter tutorials (bundled via Docker, accessible through Binder and JupyterHub) serve as hands-on, reproducible pedagogical resources (Turilli et al., 2024).
- Packaging: Spack, pip, conda, Docker, and Singularity recipes ensure ease of installation and deployment across platforms (Al-Saadi et al., 2021).
- Sustainability: Open governance, DOE E4S stack inclusion, community summits, and vendor partnerships underpin long-term viability (Al-Saadi et al., 2021).
6. Application Domains and Representative Workflows
ExaWorks SDK underpins a spectrum of science workflows with heterogeneous, multi-stage, and multi-scheduler requirements. Key exemplars include:
- ExaAM (Exascale Additive Manufacturing): Multistage UQ workflows with stages spanning varied resource requirements (e.g., Stage 1: 4 nodes × 56 CPUs, Stage 2: 1 node × 7 CPU + 1 GPU, Stage 3: 8 nodes × 7 CPU + 1 GPU/task), managed by EnTK and RP with dynamic task placement and environment configuration (Titov et al., 2024).
- CANDLE (Cancer Distributed Learning Environment): Distributed hyperparameter optimization running 1,000+ jobs on leadership systems via Swift/T and Python interoperability (Al-Saadi et al., 2021).
- Colmena: Adaptive AI-steered molecular simulation campaigns leveraging Parsl, PSI/J, and dynamic decision logic, supporting dispatch rates of ~100 tasks/sec on 10,000 nodes (Al-Saadi et al., 2021).
- COVID-19 Simulations: Multi-agent models and ML-augmented workflows with Swift/T and Parsl+RP+Flux ensembles, achieving high concurrency and throughput (Alsaadi et al., 2024).
These case studies show that by encapsulating best-of-breed middleware under minimal, stable abstractions, ExaWorks enables complex, heterogeneous workloads to fully utilize exascale systems without bespoke engineering (Titov et al., 2024, Al-Saadi et al., 2021, Alsaadi et al., 2024).
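The ExaAM-style multistage campaign above maps naturally onto EnTK's Pipeline-Stage-Task (PST) abstraction. The following is a pure-Python model of that abstraction under stated assumptions: it is not the EnTK API, and the task executables (`./melt_pool` etc.) are hypothetical names standing in for the real stage codes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    executable: str
    nodes: int
    cpus_per_task: int
    gpus_per_task: int = 0


@dataclass
class Stage:
    # Tasks within a stage may run concurrently.
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Pipeline:
    """Stages run in order; this mirrors the PST ordering guarantee."""
    stages: List[Stage] = field(default_factory=list)

    def max_nodes(self) -> int:
        # Peak node demand across stages: a first-order pilot-sizing estimate.
        return max(sum(t.nodes for t in s.tasks) for s in self.stages)


# Heterogeneous stage shapes taken from the ExaAM example above.
uq_pipeline = Pipeline(stages=[
    Stage([Task("./melt_pool", nodes=4, cpus_per_task=56)]),
    Stage([Task("./microstructure", nodes=1, cpus_per_task=7, gpus_per_task=1)]),
    Stage([Task("./properties", nodes=8, cpus_per_task=7, gpus_per_task=1)]),
])
print(uq_pipeline.max_nodes())  # 8
```

Expressing resource shapes per task, rather than per allocation, is what lets EnTK and RP pack such stages dynamically inside a single pilot.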
7. Best Practices, Lessons Learned, and Future Directions
Key best practices in SDK design and operations include:
- Well-defined, versioned APIs for core abstractions (TaskSpec, JobSpec, Executor, Provider).
- Explicit separation of concerns: workflow composition, scheduling, resource management, data staging.
- Containerized documentation and reproducible tutorials.
- High modularity and loose coupling, permitting independent maintenance and evolution of components.
- Funding and community focus on integration layers (adaptors/connectors), not redundant reimplementation.
- Recommendations for DOE and facility workflow services: official CI resources, container-based training sandboxes, community-driven dashboards, and extension of PSI/J and associated abstractions to new languages (Rust, Go, Julia) (Turilli et al., 2024, Alsaadi et al., 2024, Al-Saadi et al., 2021).
A plausible implication is that the connector/adaptor-centric approach adopted by ExaWorks, with rigorous adherence to bulk-aware asynchronous APIs and open standards, constitutes a scalable blueprint for future-generation HPC and data-driven science workflows in heterogeneous environments.
References:
- Scaling on Frontier: Uncertainty Quantification Workflow Applications using ExaWorks to Enable Full System Utilization (Titov et al., 2024)
- ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies (Turilli et al., 2024)
- Exascale Workflow Applications and Middleware: An ExaWorks Retrospective (Alsaadi et al., 2024)
- ExaWorks: Workflows for Exascale (Al-Saadi et al., 2021)