GATES: Cost-aware Dynamic Workflow Scheduling via Graph Attention Networks and Evolution Strategy (2505.12355v2)

Published 18 May 2025 in cs.AI

Abstract: Cost-aware Dynamic Workflow Scheduling (CADWS) is a key challenge in cloud computing, focusing on devising an effective scheduling policy to efficiently schedule dynamically arriving workflow tasks, represented as Directed Acyclic Graphs (DAG), to suitable virtual machines (VMs). Deep reinforcement learning (DRL) has been widely employed for automated scheduling policy design. However, the performance of DRL is heavily influenced by the design of the problem-tailored policy network and is highly sensitive to hyperparameters and the design of reward feedback. Considering the above-mentioned issues, this study proposes a novel DRL method combining Graph Attention Networks-based policy network and Evolution Strategy, referred to as GATES. The contributions of GATES are summarized as follows: (1) GATES can capture the impact of current task scheduling on subsequent tasks by learning the topological relationships between tasks in a DAG. (2) GATES can assess the importance of each VM to the ready task, enabling it to adapt to dynamically changing VM resources. (3) Utilizing Evolution Strategy's robustness, exploratory nature, and tolerance for delayed rewards, GATES achieves stable policy learning in CADWS. Extensive experimental results demonstrate the superiority of the proposed GATES in CADWS, outperforming several state-of-the-art algorithms. The source code is available at: https://github.com/YaShen998/GATES.

Summary

Overview of GATES: Cost-aware Dynamic Workflow Scheduling via Graph Attention Networks and Evolution Strategy

The paper presents GATES, a deep reinforcement learning framework for Cost-aware Dynamic Workflow Scheduling (CADWS) in cloud computing. In CADWS, workflows arrive dynamically, each modeled as a Directed Acyclic Graph (DAG) of tasks, and must be scheduled onto heterogeneous virtual machines (VMs) while minimizing operational costs and the penalties incurred for Service Level Agreement (SLA) violations. Traditional scheduling strategies struggle with the dynamic and stochastic nature of this problem because they often rely on static heuristics or poorly calibrated models, which can lead to suboptimal decisions.
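To make the objective concrete, the following is a minimal sketch of a CADWS-style total cost, assuming it is the sum of VM rental fees and SLA penalties. The per-started-hour billing rule, the linear penalty for finishing after the deadline, and all names (`VMUsage`, `WorkflowResult`, `penalty_rate`) are illustrative assumptions, not definitions taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class VMUsage:
    hourly_price: float   # rental price per hour (hypothetical unit)
    busy_seconds: float   # how long the VM was rented


@dataclass
class WorkflowResult:
    completion_time: float  # seconds from arrival to completion
    deadline: float         # SLA deadline in seconds
    penalty_rate: float     # assumed penalty per second of delay


def vm_rental_cost(usages):
    # Assumption: every started hour is billed in full.
    return sum(u.hourly_price * math.ceil(u.busy_seconds / 3600) for u in usages)


def sla_penalty(results):
    # Assumption: the penalty grows linearly with lateness, and is zero if on time.
    return sum(r.penalty_rate * max(0.0, r.completion_time - r.deadline)
               for r in results)


def total_cost(usages, results):
    # The quantity a scheduling policy would try to minimize.
    return vm_rental_cost(usages) + sla_penalty(results)
```

A scheduler that rents faster, more expensive VMs lowers the penalty term but raises the rental term, which is the trade-off a learned policy has to balance.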

Methodological Contributions

GATES combines a policy network based on Graph Attention Networks (GAT) with Evolution Strategy (ES) training. Key contributions of the framework include:

Graph-Based Task Embedding: Using GAT, the policy network learns the topological relationships among tasks in a DAG, so that it can account for how scheduling the current task affects subsequent tasks.
VM Importance Estimation: GATES assesses the importance of each VM to the ready task, which lets the policy adapt to dynamically changing VM resources.
Robust Policy Training: By training with ES, GATES avoids common pitfalls of gradient-based reinforcement learning, such as sensitivity to reward design and hyperparameter settings. Together with ES's exploratory nature and tolerance for delayed rewards, this yields more stable policy learning in the dynamic environments encountered in CADWS.
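As a rough illustration of the graph-based task embedding described above, here is a minimal single-head graph attention layer in plain Python, following the standard GAT formulation (a shared linear projection, LeakyReLU-scored attention logits, and a softmax over each node's neighborhood). It is a sketch only: the actual GATES architecture, its feature set, and its hyperparameters may differ, and the function and parameter names here are hypothetical.

```python
import math


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gat_layer(features, neighbors, W, a):
    """One attention head over a task graph.

    features:  dict task -> feature vector
    neighbors: dict task -> list of adjacent tasks (e.g. successors in the DAG)
    W:         projection matrix (out_dim x in_dim)
    a:         attention vector of length 2 * out_dim
    """
    projected = {n: matvec(W, h) for n, h in features.items()}
    embeddings, attention = {}, {}
    for i, z_i in projected.items():
        nbrs = [i] + list(neighbors.get(i, []))  # include a self-loop
        # Attention logit: LeakyReLU(a . [z_i || z_j])
        logits = [leaky_relu(sum(c * v for c, v in zip(a, z_i + projected[j])))
                  for j in nbrs]
        alphas = softmax(logits)
        attention[i] = dict(zip(nbrs, alphas))
        # New embedding: attention-weighted sum of projected neighbor features.
        embeddings[i] = [sum(al * projected[j][k] for al, j in zip(alphas, nbrs))
                         for k in range(len(z_i))]
    return embeddings, attention
```

The same attention pattern could in principle score candidate VMs against a ready task, with the softmax weights read as relative importance; whether GATES does exactly this is not stated in this summary.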

The design of GATES targets the specific needs of CADWS, balancing policy robustness against computational efficiency. The learned scheduling policies account for both VM rental costs and SLA penalties.
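The gradient-free training idea behind ES can be sketched as follows. This is a generic, simplified Evolution Strategy loop (Gaussian parameter perturbations, a mean-fitness baseline, and a score-weighted update), not the exact procedure used in GATES. In CADWS the fitness would presumably be the negative total cost of a simulated scheduling episode; here a toy quadratic stands in for it, and `toy_fitness`, `TARGET`, and all default settings are hypothetical.

```python
import random


def evolution_strategy(fitness, theta, sigma=0.1, lr=0.05,
                       population=20, iterations=200, seed=0):
    """Maximize `fitness` over a parameter vector without backpropagation."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample one Gaussian perturbation per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        # Evaluate each perturbed parameter vector (one "episode" each).
        scores = [fitness([t + sigma * e for t, e in zip(theta, eps)])
                  for eps in noises]
        baseline = sum(scores) / population  # reduces estimator variance
        for k in range(dim):
            # Score-weighted average of the noise approximates the gradient.
            grad_k = sum((s - baseline) * eps[k]
                         for s, eps in zip(scores, noises)) / (population * sigma)
            theta[k] += lr * grad_k
    return theta


# Toy stand-in for "negative total cost": best value is 0 at TARGET.
TARGET = [1.0, -2.0, 0.5]


def toy_fitness(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


if __name__ == "__main__":
    learned = evolution_strategy(toy_fitness, [0.0, 0.0, 0.0])
    print(learned, toy_fitness(learned))
```

Because the update only needs a scalar score per episode, the reward can arrive at the very end of a scheduling run, which is one reason ES is described as tolerant of delayed rewards.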

Experimental Validation

In the experimental evaluation, GATES outperforms state-of-the-art approaches, including heuristic schedulers and learning-based methods such as SPN-CWS and ES-RL, across a range of workflow patterns, workflow sizes, and SLA stringency levels. The results consistently show lower total cost for GATES, attributed to its resource allocation decisions and its handling of dynamic workflow patterns.

GATES achieves lower average total cost with a smaller standard deviation, which indicates stable performance under varying workloads. These outcomes are consistent with the expectation that attention mechanisms in the policy network improve adaptability and decision quality.

Implications and Future Directions

The implications of this work extend beyond immediate cost savings for cloud service providers. More adaptive scheduling and better resource utilization can also improve cloud service reliability and performance. The results further suggest that the approach may apply to other graph-structured optimization problems with similar dynamic dependencies, such as logistics and network flow management.

Future research may explore hybrid models that integrate additional machine learning techniques to refine the policy network and to scale to larger workflows. The trade-off between real-time responsiveness and computational overhead in changing cloud environments remains an open question.

In summary, GATES combines a policy network based on graph attention networks with Evolution Strategy training for cloud workflow scheduling, and reports improved cost efficiency and robustness in dynamic, cost-sensitive settings.

GitHub

YaShen998/GATES: https://github.com/YaShen998/GATES