- The paper proposes an adaptive VM scheduling technique that leverages real-time performance metrics and CDF-based thresholds.
- It demonstrates that dynamic and statistical scheduling methods significantly improve job success rates and reduce deadline misses.
- The study underscores the potential of these adaptive strategies for enhancing resource utilization and efficiency in grid computing.
Dynamic Scheduling of Virtual Machines Running HPC Workloads in Scientific Grids
The paper investigates the use of virtualization technology for High-Performance Computing (HPC) workloads on scientific grids. Its focus is the dynamic scheduling of virtual machines (VMs) to optimize resource usage while still meeting job deadlines.
Overview
Virtualization offers significant advantages such as flexibility, security, and resource control; however, it introduces performance overheads that can affect deadline-critical HPC jobs. The paper targets the challenge of unpredictably varying workloads (CPU-intensive, memory-intensive, or network I/O-bound) and the overhead virtualization adds, both of which make job deadlines hard to estimate.
The research introduces an intelligent scheduling technique designed to handle diverse workloads by monitoring workload types and deadlines in real time. The proposed scheduler adapts dynamically to the system's observed performance metrics, aiming to maximize the number of jobs completed within their agreed deadlines.
Methodology
The methodology centers on an adaptive scheduling approach that uses cumulative distribution function (CDF) models to dynamically adjust job-acceptance thresholds based on real-time success rates. The thresholds are expressed through an "x-factor" that weighs a job's remaining duration against the time left until its deadline, so that incoming jobs can be accepted or rejected dynamically.
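To make the acceptance logic concrete, the sketch below shows one way such an x-factor admission test with a CDF-driven threshold could look. It assumes (my reading, not the paper's exact formulation) that the x-factor is the ratio of time remaining until the deadline to the job's estimated remaining run time, and that the threshold is re-fitted from the empirical success rates of already-finished jobs; all names (`AdaptiveAdmission`, `target_success_rate`, and so on) are hypothetical.

```python
class AdaptiveAdmission:
    """Hedged sketch of x-factor admission with a CDF-driven threshold.

    Assumption: the x-factor is time-to-deadline divided by estimated
    remaining run time, so values below 1.0 mean the deadline cannot be
    met even with zero virtualization overhead.
    """

    def __init__(self, target_success_rate=0.9, initial_threshold=1.2):
        self.target = target_success_rate
        self.threshold = initial_threshold
        self.history = []  # (x_factor, met_deadline) for finished jobs

    @staticmethod
    def x_factor(time_to_deadline, est_remaining_runtime):
        return time_to_deadline / est_remaining_runtime

    def accept(self, time_to_deadline, est_remaining_runtime):
        """Accept a job only if its x-factor clears the current threshold."""
        return self.x_factor(time_to_deadline, est_remaining_runtime) >= self.threshold

    def record(self, x_factor, met_deadline):
        """Feed back the outcome of a finished job and re-fit the threshold."""
        self.history.append((x_factor, met_deadline))
        self._update_threshold()

    def _update_threshold(self):
        # Scan candidate cut-offs from the empirical distribution of observed
        # x-factors and keep the lowest one whose historical success rate
        # (fraction of jobs at or above it that met their deadline) reaches
        # the target rate.
        candidates = sorted({x for x, _ in self.history})
        for t in candidates:
            outcomes = [ok for x, ok in self.history if x >= t]
            if outcomes and sum(outcomes) / len(outcomes) >= self.target:
                self.threshold = t
                return


# Example usage with made-up numbers.
admission = AdaptiveAdmission()
if admission.accept(time_to_deadline=6.0, est_remaining_runtime=4.5):
    pass  # dispatch the job to a VM
# Later, when the job finishes, report whether it met its deadline.
admission.record(x_factor=6.0 / 4.5, met_deadline=True)
```

The key design point is the feedback loop: as finished jobs report their outcomes, the acceptance threshold tightens or relaxes so that the observed success rate tracks the target.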
The paper employs a simulation-based approach to evaluate the proposed scheduling techniques under different configurations:
- Physical Baseline (alg_1): workloads executed directly on physical machines.
- Virtual Static (alg_2): virtualized execution with no intelligent management of virtualization overhead.
- Virtual Dynamic (alg_3): virtualized execution with dynamic management of virtualization overhead.
- Virtual Dynamic Adaptive (alg_4): dynamic management extended with an adaptive algorithm that tracks varying success probabilities.
- Virtual Dynamic Statistical (alg_5): dynamic management extended with statistical models that adjust acceptance thresholds.
The efficacy of these algorithms was measured across extensive simulated workloads to determine the impact on job success rates and deadline adherence.
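A minimal harness along the following lines could be used to compare such configurations. The overhead fractions, job distributions, and the fixed admission threshold are illustrative assumptions and do not reproduce the paper's actual simulation; in particular, alg_4 and alg_5 are collapsed into a single admission test here, whereas alg_5 would keep re-fitting the threshold from the observed success CDF as sketched earlier.

```python
import random
from dataclasses import dataclass
from enum import Enum, auto


class Variant(Enum):
    PHYSICAL_BASELINE = auto()            # alg_1
    VIRTUAL_STATIC = auto()               # alg_2
    VIRTUAL_DYNAMIC = auto()              # alg_3
    VIRTUAL_DYNAMIC_ADAPTIVE = auto()     # alg_4
    VIRTUAL_DYNAMIC_STATISTICAL = auto()  # alg_5


@dataclass
class Job:
    runtime: float   # estimated run time on bare metal (hours)
    deadline: float  # wall-clock budget until the deadline (hours)


def run(variant, jobs, static_overhead=0.20, managed_overhead=0.08, x_threshold=1.1):
    """Return (completed_on_time, missed, rejected) under a toy overhead model:
    static virtualization adds a fixed fraction, dynamic management a smaller one."""
    on_time = missed = rejected = 0
    for job in jobs:
        if variant is Variant.PHYSICAL_BASELINE:
            effective = job.runtime
        elif variant is Variant.VIRTUAL_STATIC:
            effective = job.runtime * (1 + static_overhead)
        else:
            effective = job.runtime * (1 + managed_overhead)

        # alg_4/alg_5 additionally screen jobs with an x-factor admission test;
        # alg_5 would re-fit x_threshold from the observed CDF instead of
        # using a constant.
        if variant in (Variant.VIRTUAL_DYNAMIC_ADAPTIVE,
                       Variant.VIRTUAL_DYNAMIC_STATISTICAL):
            if job.deadline / effective < x_threshold:
                rejected += 1
                continue

        if effective <= job.deadline:
            on_time += 1
        else:
            missed += 1
    return on_time, missed, rejected


random.seed(1)
jobs = [Job(runtime=random.uniform(1.0, 10.0), deadline=random.uniform(4.0, 12.0))
        for _ in range(10_000)]
for variant in Variant:
    print(variant.name, run(variant, jobs))
```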
Key Findings
The results demonstrate that while the physical baseline (alg_1) remains superior, the developed adaptive algorithms (alg_4 and alg_5) yield substantial improvements in success rates and reduced deadline misses over static configurations (alg_2). Specifically, the dynamic statistical approach (alg_5) achieves a high completion rate with minimized misses, highlighting the benefits of incorporating CDF-based dynamic threshold adjustments.
Implications and Future Directions
This research contributes significantly to the understanding of virtualization in HPC grids by demonstrating that adaptive, statistically driven VM scheduling can substantially improve performance metrics even in the presence of virtualization overheads. The implications extend to enhanced resource utilization and potential applications in data center management, particularly for improving energy efficiency and reducing operational costs through optimized VM migration and load balancing.
Future work could explore integrating these scheduling techniques with broader grid management systems, potentially enhancing real-world applications in environments such as the Large Hadron Collider's computing grid. Further investigation into the interplay between dynamic VM scheduling and other optimization strategies, such as job reshaping and dependency-aware live migration, would also be valuable.
This research provides a pivotal stepping stone towards more adaptive and intelligent workload management in cloud-based HPC environments, vital for harnessing the full potential of virtualized resources in scientific computing.