StreamFlow: Hybrid Workflow Execution
- StreamFlow is a workflow management approach that couples DAG-based logic with declarative environment specifications to enable execution across heterogeneous platforms.
- It supports hybrid, multi-container deployments and policy-driven data transfers, ensuring reproducibility and efficiency in distributed computing.
- Real-world applications, like single-cell RNA-seq pipelines, demonstrate its capacity to orchestrate complex tasks across disjoint, federated computational sites.
StreamFlow is an approach to workflow execution that facilitates the deployment and management of scientific workflows across heterogeneous computational environments, including both cloud-based and high-performance computing (HPC) platforms. Its architectural and functional advances enable coordinated execution of workflows on multiple, potentially disjoint sites that do not share a common data space, addressing core challenges in scientific computing, bioinformatics, and data-intensive application scenarios.
1. Execution Model and Design Principles
StreamFlow distinguishes itself from traditional workflow management systems (WMS) by coupling the workflow graph (typically represented as a Directed Acyclic Graph, or DAG) with a declarative specification of the execution environment. This dual specification enables:
- Hybrid Execution: Each workflow step can be assigned to arbitrary combinations of computational backends, such as cloud clusters (e.g., Kubernetes), on-prem HPC (batch clusters), or multi-container setups (e.g., Docker Compose).
- Atomic Environments: Workflow tasks may execute within multi-container “models” that encapsulate complex software stacks, moving beyond the classic one-container-per-task mapping.
- Explicit Data Management: In contrast to systems that presume globally shared filesystems, StreamFlow orchestrates input and output data transfers with policy-driven rules, ensuring correct function in federated or disconnected environments.
- Modular Orchestration: The framework employs “connectors” that abstract the specifics of job submission and management for each supported orchestration engine, permitting extensibility and portability.
The core system is composed of several interacting components: a DeploymentManager (manages environment lifecycles), a Scheduler (task/resource matcher), a DataManager (tracks and transfers data), and an integration layer (currently with the Common Workflow Language, CWL).
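To make the multi-container “model” concept concrete, the sketch below shows how such an environment could be described with a Docker Compose file that a model might reference. This is a minimal illustration only: the service names, images, and volume are assumptions, not configuration drawn from StreamFlow's documentation or the original evaluation.

```yaml
# Hypothetical docker-compose.yml describing one multi-container "model".
# A single workflow step could run its command in the analysis-driver
# container while relying on the metadata-db service, illustrating the
# Single-Task-Multiple-Containers (STMC) pattern discussed later.
version: "3.8"
services:
  analysis-driver:                        # container where the task command runs
    image: example.org/rnaseq-tools:1.0   # illustrative image name
    volumes:
      - shared-data:/data                 # data shared among the coordinated containers
    depends_on:
      - metadata-db
  metadata-db:                            # auxiliary service in the task's software stack
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
volumes:
  shared-data: {}
```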
2. Architectural Workflow and Algorithms
A StreamFlow-managed workflow proceeds through these generalized steps:
- Specification: The workflow graph is described in a coordination language (such as CWL), while platforms, container stacks, and resource requirements are specified in a separate declarative YAML model (commonly named streamflow.yml); a minimal CWL sketch follows this list.
- Deployment: Upon execution, the workflow manager reads the workflow graph and, for each step, deploys the designated execution environment (model) on the requested backend.
- Scheduling and Dispatch: The Scheduler component assigns ready tasks to available services or container instances, taking into account policies for resource utilization and data locality.
- Data Preparation: The DataManager ensures all required input data is present at the execution site by performing direct or staged transfers (two-step if needed: source → management node → execution node).
- Execution and Monitoring: Workflow tasks execute within their properly configured environments; upon completion, outputs are registered and their locations are updated in the data tracking system.
- Teardown: Once an environment (model) is no longer needed for any scheduled task, StreamFlow can deallocate associated resources to reduce cost and contention.
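For the Specification step, the workflow graph itself stays in the coordination language. The sketch below is a minimal, hypothetical CWL workflow with two steps; the step names, tool files (align.cwl, analyze.cwl), and input/output identifiers are placeholders for illustration rather than content from the original pipeline.

```yaml
# Hypothetical two-step CWL workflow: the coordination-language half of the
# specification. Referenced tool descriptions (align.cwl, analyze.cwl) are assumed.
cwlVersion: v1.2
class: Workflow
inputs:
  raw_reads: File
outputs:
  final_report:
    type: File
    outputSource: analyze/report
steps:
  align:
    run: align.cwl           # assumed CommandLineTool for alignment/quantification
    in:
      reads: raw_reads
    out: [counts]
  analyze:
    run: analyze.cwl         # assumed downstream statistical analysis tool
    in:
      matrix: align/counts
    out: [report]
```

Note that nothing in this file says where the steps run; that binding lives entirely in the separate streamflow.yml model described in the next section.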
Key operational requirements enforced by StreamFlow include atomic deployment of each model, a unique task-to-environment association for every step, guaranteed data access across the containers of a model, and avoidance of redundant data movement.
3. Execution Environments: Heterogeneity and Declarative Modeling
StreamFlow introduces a separation between workflow logic and environment configuration, allowing for maximum flexibility:
- Declarative Binding: The streamflow.yml specification binds workflow steps to named environment services, enabling users to re-target steps to different infrastructure simply by changing the YAML, not the workflow itself.
- Multi-Container Support: Environments (“models”) may consist of several coordinated containers, supporting patterns such as Single-Task-Multiple-Containers (STMC) and Multiple-Tasks-Multiple-Containers (MTMC).
- Connectors: Each platform (e.g., Kubernetes, HPC batch scheduler) implements a connector interface, abstracting submission and resource management logic.
This setup facilitates reproducibility and rapid adaptation to platform changes, making it feasible to execute a workflow identically on different clusters or clouds by simply modifying model bindings.
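A minimal sketch of such a streamflow.yml is given below. The overall shape (a workflow entry, per-step bindings, and named models handled by connectors) follows the description above, but the exact field names are assumptions and may not match the schema of a given StreamFlow release.

```yaml
# Hypothetical streamflow.yml: binds CWL steps to named models via connectors.
# Field names are illustrative; consult the StreamFlow documentation for the
# authoritative schema of your version.
version: v1.0
workflows:
  rnaseq:
    type: cwl
    config:
      file: main.cwl          # the CWL workflow graph
      settings: inputs.yml    # workflow-level inputs
    bindings:
      - step: /align
        target:
          model: hpc-compose          # re-targeting a step means changing only this binding
          service: analysis-driver
models:
  hpc-compose:
    type: docker-compose              # connector managing a multi-container stack
    config:
      files:
        - docker-compose.yml
  cloud-helm:
    type: helm                        # assumed connector name for a Kubernetes deployment
    config:
      chart: ./charts/r-analysis     # illustrative chart path
```

Re-targeting the /align step from the Compose-based model to the Kubernetes one would only require changing the model (and service) names under target, leaving main.cwl untouched.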
4. Real-World Case Study: Single-Cell Transcriptomic Pipeline
As an illustration, StreamFlow was evaluated on a bioinformatics pipeline for single-cell RNA-seq, comprising:
- Data conversion and demultiplexing (CellRanger suite, requiring parallel/BAM tools).
- Alignment and quantification (STAR/CellRanger).
- Downstream statistical analysis (Seurat, SingleR in R).
Two distinct models were employed:
- Model A (for early steps): CellRanger/STAR in a container, deployed on an HPC cluster.
- Model B (for later steps): R-based analysis containers, potentially deployed on a cloud-hosted Kubernetes cluster.
Subjects (single-cell samples) were processed as parallel workflow “splits”, each traversing the above steps.
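A hypothetical bindings fragment for this two-model layout is sketched below; the step paths, model names, and service labels are illustrative and are not taken from the evaluated pipeline's actual configuration.

```yaml
# Hypothetical bindings fragment: data-intensive steps go to an HPC-hosted
# model (Model A), downstream R analysis to a Kubernetes-hosted model (Model B).
bindings:
  - step: /demultiplex
    target:
      model: hpc-cellranger       # Model A: CellRanger/STAR container on the HPC cluster
      service: cellranger
  - step: /align
    target:
      model: hpc-cellranger
      service: cellranger
  - step: /seurat-analysis
    target:
      model: k8s-r-analysis       # Model B: R containers on cloud Kubernetes
      service: r-analysis
```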
Execution configurations and outcomes:
- Full HPC: All tasks run in containers on an HPC cluster with a shared NFS/Lustre filesystem for data; overhead is minimal and the workflow completes in roughly three hours.
- Hybrid HPC/Cloud: Upstream data-intensive stages execute on the HPC cluster, then the containerized R analysis runs on a cloud Kubernetes cluster, transferring only lightweight result files; no runtime penalty was observed compared to monolithic execution.
StreamFlow's overhead, notably for environment bootstrapping and data shuttling, was found to be negligible compared to total task time, validating its effectiveness for complex multi-site deployments.
5. Innovative Capabilities and Differentiation
StreamFlow's principal innovations include:
- Cross-environment orchestration without requiring a unified, global data space.
- Atomic, multi-container environment deployment, enabling sophisticated service patterns and closer parity with modern containerized and cloud-native application stacks.
- Policy-driven data management that avoids unnecessary movement and duplication, essential for performance in distributed environments.
- Environment portability and workflow repurposing at declarative binding granularity.
- Support for complex, parallel, and federated workflow execution patterns that are difficult or impossible to manage in traditional one-task/one-container models.
6. Potential Applications and Future Work
Beyond the bioinformatics use case, StreamFlow is applicable wherever complex, reproducible, multi-service workflows must be executed across hybrid environments. Notable domains include:
- Large-scale scientific simulations (e.g., climate, genomics) requiring federated or partitioned computation.
- Distributed machine learning (training/inference) on hybrid cloud–HPC resources.
- Data analytics or privacy-sensitive workflows where compliance and data locality require distributed, policy-governed execution.
- Edge-cloud pipelines operating in remote or data-movement-constrained scenarios.
Planned extensions include support for alternative coordination languages beyond CWL, connectors for additional orchestration engines (SLURM, commercial clouds), enhanced inter-container data/communication abstractions, and formal expression of MTMC scheduling/deployment. Improved dynamic resource management and a range of production features (preemption, UI, scaling, community support) are also targeted.
7. Impact on Workflow Management and Scientific Computing
StreamFlow’s hybrid workflow paradigm advances the state of workflow management by enabling transparent, declarative, and resource-efficient execution over a diverse and evolving computing landscape. It provides a systematic, modular, and reproducible foundation for data-intensive, parallelizable, and portable workflows critical to contemporary scientific and analytics workloads. As computational environments diversify, StreamFlow’s design principles and abstractions offer a robust path forward for scalable, flexible, and federated scientific workflow execution.