FaaS Workflow Platforms
- FaaS Workflow Platforms are cloud-based systems that orchestrate fine-grained, event-driven functions using DAGs or state machines for scalable and efficient workflows.
- They address challenges like inter-function communication latency, cold start delays, and orchestration overhead through optimized architectural designs.
- They support diverse applications—including scientific workflows, microservices, and data analytics—by effectively balancing cost, performance, and elasticity.
Function-as-a-Service (FaaS) workflow platforms enable developers to compose, execute, and scale workflows of fine-grained, event-driven cloud functions. These platforms abstract away server management and facilitate the orchestration of complex, distributed computations by chaining independent FaaS functions into Directed Acyclic Graphs (DAGs) or state machines. Fundamental characteristics of FaaS workflow platforms—such as elasticity, high concurrency, built-in orchestration, and metered resource usage—make them attractive for scientific workflows, microservices, data analytics, and a growing suite of event-driven applications. However, the design, performance, scalability, and cost behavior of such platforms are subject to distinctive challenges, including inter-function communication overheads, orchestration bottlenecks, cold-start latencies, and the intricacies of platform-specific resource management.
1. Execution and Orchestration Models
FaaS workflow platforms adopt orchestration models that define how functions are composed and triggered. On leading public clouds, orchestrators such as AWS Step Functions and Azure Durable Functions allow developers to declaratively specify workflows as state machines or DAGs, in which each node is a standalone FaaS function (Kulkarni et al., 27 Sep 2025). Workflow execution proceeds by passing the output from one function as the input to the next, with orchestration logic (including error handling and retries) managed by the platform.
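For concreteness, the sketch below expresses such a chained workflow as a minimal Amazon States Language (ASL) definition, built as a Python dict. The function ARNs and state names are hypothetical, and retry and error-handling clauses are omitted; it is an illustrative sketch rather than a production definition.

```python
import json

# Minimal two-step workflow in Amazon States Language (ASL), expressed as a
# Python dict. Each state is a standalone FaaS function; the output of
# "Preprocess" becomes the input of "Analyze". ARNs and names are hypothetical.
state_machine = {
    "Comment": "Chain two functions; output of one is the input of the next.",
    "StartAt": "Preprocess",
    "States": {
        "Preprocess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
            "Next": "Analyze",
        },
        "Analyze": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:analyze",
            "End": True,
        },
    },
}

print(json.dumps(state_machine, indent=2))
```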
The platforms differ in orchestration architecture: AWS Step Functions uses built-in message queues (Amazon SQS) and lightweight Firecracker microVMs per function; Azure Durable Functions employs orchestrator functions and storage backends (Azure Storage Queues, Blobs, or Event Hubs) to coordinate activity functions (Kulkarni et al., 27 Sep 2025). This induces architectural trade-offs: AWS employs per-function isolation and scaling, while Azure shares worker VMs across multiple functions and relies on explicit concurrency limits for orchestrator and activity functions (MO and MA; see Section 5) to govern scaling.
Empirical evaluations demonstrate that architectural choices impact workflow performance, orchestration delay, and elasticity. For example, AWS’s per-function scaling delivers predictable launch latency, but orchestration costs (state transition and message passing) dominate billing, accounting for up to 99% of the total cost for workflows with many state transitions (Kulkarni et al., 27 Sep 2025).
2. Inter-Function Communication and Data Handling
Inter-function communication incurs measurable latency and platform-specific variability. In AWS Step Functions, handoffs between functions leverage internal SQS with a strict 256 KB payload limit, providing low and highly predictable inter-function latencies (Kulkarni et al., 27 Sep 2025). By contrast, Azure Durable Functions differentiate between small and large payload transfers: small messages pass through storage queues, while messages exceeding queue limits are persisted as blobs or routed via Event Hubs (in Netherite), introducing steep increases in inter-function latency for large payloads (2×–7×) (Kulkarni et al., 27 Sep 2025). The process of setting up storage objects and reassembling split files for large messages further compounds delays.
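The following sketch illustrates this small-versus-large payload split (a claim-check pattern) in plain Python. The in-memory blob store, message format, and size threshold are placeholders for illustration, not Azure's internal implementation; the point is the extra storage round trips that large payloads incur.

```python
import uuid

QUEUE_PAYLOAD_LIMIT = 64 * 1024  # illustrative threshold; real queue limits are platform-specific

# Placeholder in-memory store standing in for a blob service.
blob_store: dict[str, bytes] = {}

def enqueue_payload(payload: bytes) -> dict:
    """Return a queue message: inline if small, otherwise a blob reference."""
    if len(payload) <= QUEUE_PAYLOAD_LIMIT:
        return {"kind": "inline", "data": payload}
    blob_id = str(uuid.uuid4())
    blob_store[blob_id] = payload          # extra write to storage for large payloads
    return {"kind": "blob-ref", "blob_id": blob_id}

def dequeue_payload(message: dict) -> bytes:
    """Resolve a queue message back into the original payload."""
    if message["kind"] == "inline":
        return message["data"]
    return blob_store[message["blob_id"]]  # extra read adds inter-function latency

small = enqueue_payload(b"x" * 1024)
large = enqueue_payload(b"x" * (256 * 1024))
assert dequeue_payload(large) == b"x" * (256 * 1024)
print(small["kind"], large["kind"])  # inline blob-ref
```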
The orchestration approach also shapes dataflow: platforms built on task hubs (e.g., Azure) exhibit higher sensitivity to data size and state update overhead compared to systems employing direct function invocation patterns with efficient in-memory handoff or pipelined data sharing (Kulkarni et al., 27 Sep 2025).
3. Cold Start and Scaling Behavior
Cold start phenomena play a pronounced role in FaaS workflow performance. Workflow-level effects arise when multiple functions are invoked for the first time in rapid succession, yielding a cascade of cold starts that manifest as compounded delays along the execution chain (Kulkarni et al., 27 Sep 2025). Specifically, the cold start latency in Azure may consist of: (1) container spin-up, (2) function runtime initialization, and (3) dataflow/storage system setup. Even warm containers may exhibit function-level cold starts (e.g., lazy import of dependencies or model loading for ML workloads).
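The sketch below illustrates a function-level cold start in an already warm container: a lazily loaded dependency makes the first invocation slow even though the runtime is up. The handler and the simulated two-second load are illustrative placeholders.

```python
import time

_model = None  # loaded lazily, so even a warm container pays this cost once

def _load_model():
    # Stand-in for an expensive dependency import or ML model load.
    time.sleep(2.0)
    return object()

def handler(event, context=None):
    """FaaS-style handler: the first call on a warm container still pays the load."""
    global _model
    if _model is None:
        _model = _load_model()   # function-level cold start inside a warm container
    return {"status": "ok", "input_size": len(str(event))}

for i in range(2):
    start = time.perf_counter()
    handler({"request": i})
    print(f"invocation {i}: {time.perf_counter() - start:.2f}s")
# The first invocation pays ~2 s of lazy initialization; the second is nearly instant.
```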
AWS, using Firecracker microVMs, exhibits cold start overheads below 200 ms, amortized across function reuse windows. Azure’s overheads are higher: cold-start workflow executions may incur up to 74% longer execution times than warm starts, and heavy parallelism can magnify cold start-related delays by roughly 50% (Kulkarni et al., 27 Sep 2025). The paper further finds that AzN (Azure Durable Functions with the Netherite backend) offers better cold start times (51.5 ms) than the classical Azure Storage-based backend AzS (290 ms), owing to Netherite's optimized Event Hubs and blob I/O.
4. Monetary Cost Modeling and Performance Characteristics
Total cost for FaaS workflows factors in function execution (measured in GB-seconds), inter-function data transfer, and orchestration (state transitions and storage operations). On AWS, cost is dominated by orchestration (state transitions), with nearly all spend attributable to this component in workflows with complex control flow or high fan-in/fan-out. The cost model is given as

$$C_{\text{total}} = \sum_{i} p_{\text{exec}} \, t_i \, m_i \;+\; p_{\text{tr}} \, N_{\text{tr}},$$

where $p_{\text{exec}}$ is the function execution unit price (per GB-second), $t_i$ and $m_i$ are the execution time and allocated memory of the $i$-th function invocation, $p_{\text{tr}}$ is the per-state-transition cost, and $N_{\text{tr}}$ is the number of state transitions (Kulkarni et al., 27 Sep 2025). Azure workflow costs, in both AzS and AzN variants, increase with storage operation frequency and data volume, with higher charges when function outputs exceed queue size limits and must be shunted through blobs or Event Hubs.
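As a rough illustration of this model, the sketch below computes execution and orchestration cost for a chain of short functions. The unit prices are illustrative placeholders, not current AWS list prices, and serve only to show how state transitions can dominate the bill.

```python
def workflow_cost(exec_times_s, memories_gb, n_transitions,
                  p_exec=0.0000167,   # illustrative price per GB-second
                  p_tr=0.000025):     # illustrative price per state transition
    """Estimate workflow cost as execution (GB-seconds) plus state transitions."""
    execution = sum(t * m * p_exec for t, m in zip(exec_times_s, memories_gb))
    orchestration = n_transitions * p_tr
    return execution, orchestration

# Ten short functions (100 ms at 128 MB) chained with ~30 state transitions.
exec_cost, orch_cost = workflow_cost([0.1] * 10, [0.128] * 10, 30)
print(f"execution: ${exec_cost:.8f}, orchestration: ${orch_cost:.8f}")
print(f"orchestration share: {orch_cost / (exec_cost + orch_cost):.0%}")
```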
These insights dispel the notion that function execution is the dominant cost driver; orchestration and data transfer costs can eclipse function resource metering, especially as workflows grow in scale and complexity (Kulkarni et al., 27 Sep 2025).
5. Platform Configuration Insights and Developer Guidance
Parameter tuning is critical for achieving stable and performant behavior in FaaS workflow platforms. For Azure Durable Functions, settings such as maxConcurrentOrchestratorFunctions and maxConcurrentActivityFunctions (MO, MA) directly affect system throughput and latency; suboptimal values can manifest as increased queuing, high-latency spikes, or even platform instability under load (Kulkarni et al., 27 Sep 2025). Correctly configuring partition counts (in AzN) is necessary for achieving optimal parallelism and minimizing bottlenecks in high-throughput workloads.
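As an illustration, the fragment below shows how these knobs might appear in a Durable Functions host.json. The two concurrency setting names match the MO and MA parameters discussed above; the values, and the exact placement of the partition-count setting under the storage provider, are illustrative assumptions rather than recommended configurations.

```python
import json

# Illustrative host.json fragment for Azure Durable Functions (values are
# placeholders; partitionCount placement assumes a Netherite storage provider).
host_json = """
{
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 8,
      "maxConcurrentActivityFunctions": 16,
      "storageProvider": {
        "partitionCount": 4
      }
    }
  }
}
"""

durable = json.loads(host_json)["extensions"]["durableTask"]
print("MO =", durable["maxConcurrentOrchestratorFunctions"],
      "MA =", durable["maxConcurrentActivityFunctions"])
```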
Developers must be aware that Azure’s inter-function latency and orchestration overheads are highly sensitive to payload size and storage backend selection, requiring careful design to minimize unnecessary or large state passing. For AWS, the intrinsic isolation of function containers yields consistent scaling but may impose higher orchestration costs for workflows with deep nesting or abundant state transitions.
Key platform-specific trade-offs should inform workflow decomposition strategies, with consideration given to dataflow patterns, concurrency, and data volume.
6. Research Gaps and Future Optimization Directions
Current research highlights several underexplored challenges and potential advances. Among them:
- Workflow-aware pre-warming strategies that target entire dependency chains rather than only individual functions could mitigate cascading cold start delays (Kulkarni et al., 27 Sep 2025); a minimal sketch of this idea follows the list below.
- Dynamic partitioning and scheduling algorithms that leverage real-time metrics (performance, cost, hardware heterogeneity) could enable workflow distribution across different cloud service providers (CSPs) or even hybrid public/private topologies, optimizing for both latency and cost (Kulkarni et al., 27 Sep 2025).
- Economic modeling refinement to better capture trade-offs between orchestration charges, raw resource usage, and data movement, especially as cloud billing models evolve.
- Further investigation of hardware-level heterogeneity (e.g., Intel vs. AMD SKUs on Azure) is required to understand its impact on both function execution times and resource scheduling efficiency.
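As a sketch of the first of these directions, the snippet below pre-warms every function in a workflow DAG in dependency order. The DAG, function names, and warm-up call are hypothetical placeholders for whatever lightweight "ping" invocation a platform exposes; the point is warming the whole chain rather than a single function.

```python
from graphlib import TopologicalSorter

def prewarm_workflow(dag: dict[str, set[str]], prewarm_fn) -> list[str]:
    """Pre-warm every function in a workflow DAG in dependency order.

    dag maps each function name to the set of functions it depends on;
    prewarm_fn is a placeholder for a platform-specific warm-up invocation.
    """
    order = list(TopologicalSorter(dag).static_order())
    for fn in order:
        prewarm_fn(fn)   # issue a lightweight warm-up call before real traffic arrives
    return order

# Toy workflow: ingest -> {transform_a, transform_b} -> aggregate
dag = {
    "ingest": set(),
    "transform_a": {"ingest"},
    "transform_b": {"ingest"},
    "aggregate": {"transform_a", "transform_b"},
}
print(prewarm_workflow(dag, prewarm_fn=lambda name: print(f"warming {name}")))
```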
The data suggest that there is significant room for architectural innovation and workflow-level optimizations to bridge the gap between the theoretical elasticity of FaaS and the practical performance/costs observed in real-world use (Kulkarni et al., 27 Sep 2025).
Table: Key Differences in Orchestration and Performance on Public FaaS Workflow Platforms
| Platform | Orchestration Mechanism | Inter-Function Data Path | Cost and Performance Notes |
|---|---|---|---|
| AWS Step Functions | Per-function state machine, SQS handoff | SQS (payload ≤ 256 KB) | Orchestration (state transitions) up to 99% of total cost |
| Azure Durable Functions (AzS) | Orchestrator + activity functions, task hub | Storage queues (small payloads) / blobs (large payloads) | Orchestration cost high; scales with workflow complexity and data volume |
| Azure Durable Functions (AzN, Netherite) | Orchestrator + activity functions, Event Hubs, page/block blobs | Event Hubs (up to 9 MB), blobs | Lower cold-start latency, better scaling |
Summary
Extensive empirical characterization reveals that while FaaS workflow platforms offer high concurrency, elasticity, and abstraction, their real-world performance and cost-efficiency are circumscribed by (a) orchestration and inter-function data transfer overheads, (b) cascading cold start delays, and (c) intricate platform-specific behavioral parameters (Kulkarni et al., 27 Sep 2025). Effective utilization requires insight into resource management, appropriate platform selection, and workflow decomposition that minimizes orchestration and inter-function costs. Open research remains in developing fine-grained, cross-layer workflow optimization algorithms, system-level pre-warming and scheduling strategies, and cost models that fully capture the workflow-level economic realities of public serverless platforms.