- The paper proposes a hierarchical DRL framework that jointly optimizes cloud VM allocation and power management.
- It employs a global DRL tier with autoencoders and weight-sharing to efficiently handle high-dimensional state spaces for VM allocation.
- Empirical tests on real Google cluster traces show a roughly 54% reduction in power and energy consumption while balancing energy savings against job latency.
An Examination of a Hierarchical Framework for Cloud Resource Allocation and Power Management via Deep Reinforcement Learning
The paper "A Hierarchical Framework of Cloud Resource Allocation and Power Management Using Deep Reinforcement Learning" addresses a central challenge in cloud computing: the joint optimization of resource allocation and power management. The authors propose a hierarchical framework built on Deep Reinforcement Learning (DRL), comprising a global tier for virtual machine (VM) resource allocation and a local tier for distributed power management. The framework targets the scalability issues inherent in cloud systems with high-dimensional state and action spaces.
Architectural Design
The framework utilizes two distinct tiers to manage resource allocation and power consumption. The global tier employs DRL to process VM allocation over a cluster of servers, leveraging the ability of DRL to handle complex decision-making with large state spaces. Notably, an autoencoder and a novel weight-sharing approach are incorporated to manage these high-dimensional spaces efficiently and accelerate convergence.
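The weight-sharing idea in the global tier can be illustrated with a minimal sketch: one shared parameter vector scores every server group, so the decision model does not grow with the number of servers, and a VM request is routed to the highest-scoring group. The linear scorer, function names, and toy state features below are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of the global tier's weight-sharing idea: the same
# parameters score every server group, so model size is independent of
# cluster size. The linear scorer stands in for the shared sub-network.

def score_group(shared_w, group_state):
    # Shared (sub-)network applied identically to each group's state.
    return sum(w * s for w, s in zip(shared_w, group_state))

def allocate_vm(shared_w, group_states):
    # Event-driven decision: on each VM request, pick the server group
    # with the highest Q-estimate under the shared weights.
    scores = [score_group(shared_w, g) for g in group_states]
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy example: 3 groups, state = (free CPU fraction, free RAM fraction).
shared_w = [0.6, 0.4]
groups = [(0.2, 0.1), (0.9, 0.8), (0.5, 0.5)]
print(allocate_vm(shared_w, groups))  # group 1 has the most headroom
```

Because the same weights are reused across groups, training experience from any group updates the one shared model, which is what accelerates convergence at scale.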
The local tier, in turn, handles distributed power management on individual servers, using a Long Short-Term Memory (LSTM) network for workload prediction and model-free reinforcement learning for adaptive power control. The LSTM is particularly noteworthy because it captures long-term dependencies in time-series data, providing workload forecasts that inform server power-state decisions.
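The local tier's control loop can be sketched as a predictor feeding a power-state decision for one server. A moving average stands in for the paper's LSTM here, and the sleep threshold and class names are assumptions for illustration only.

```python
from collections import deque

# Hedged sketch of the local tier's loop: a workload predictor feeds a
# per-server power-state decision. A moving average replaces the LSTM.

class WorkloadPredictor:
    def __init__(self, window=4):
        self.history = deque(maxlen=window)

    def observe(self, arrivals):
        # Record the number of job arrivals in the latest interval.
        self.history.append(arrivals)

    def predict(self):
        # LSTM stand-in: mean of the recent window.
        return sum(self.history) / len(self.history) if self.history else 0.0

def power_action(predicted_arrivals, queue_len, sleep_threshold=0.5):
    # Sleep only when the queue is empty and little work is predicted;
    # otherwise stay active to meet latency requirements.
    if queue_len == 0 and predicted_arrivals < sleep_threshold:
        return "sleep"
    return "active"

pred = WorkloadPredictor()
for a in [0, 1, 0, 0]:
    pred.observe(a)
print(power_action(pred.predict(), queue_len=0))  # quiet period -> "sleep"
```

The point of the forecast is exactly this asymmetry: with an empty queue, the predicted arrival rate decides whether the energy saved by sleeping outweighs the wake-up latency penalty.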
Methodology and Innovations
The critical innovation of this work lies in the application of hierarchical DRL to cloud computing. The global tier adopts a continuous-time, event-driven decision framework in which a decision is made at the arrival of each VM request. This significantly reduces the action space, making the problem tractable. The inclusion of an autoencoder captures the essential features of the server state, while a weight-sharing mechanism ensures scalability and efficient training across server groups.
The local tier's integration of LSTM-based workload prediction with continuous-time Q-learning for semi-Markov decision processes (SMDPs) yields a responsive power management policy. This combination drives servers between active and sleep states based on predicted workloads, aligning power consumption with performance requirements.
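The SMDP flavor of Q-learning differs from the discrete-time version in how it discounts: a reward rate accrued over a variable sojourn time tau is integrated with continuous discounting before bootstrapping. The sketch below follows the standard continuous-time SMDP Q-learning update; the state and action names, reward rate, and parameter values are illustrative assumptions, not the paper's exact formulation.

```python
import math

# Hedged sketch of a continuous-time Q-learning update for an SMDP.
# A constant reward rate r accrued over sojourn time tau is discounted
# at rate beta: integral of r * e^{-beta*t} dt from 0 to tau
#             = r * (1 - e^{-beta*tau}) / beta.

def smdp_q_update(Q, s, a, r_rate, tau, s_next, actions, alpha=0.1, beta=0.5):
    discounted_reward = r_rate * (1.0 - math.exp(-beta * tau)) / beta
    # Bootstrap from the best action in the next state, discounted over tau.
    target = discounted_reward + math.exp(-beta * tau) * max(
        Q[(s_next, a2)] for a2 in actions
    )
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Toy example: two power states, two actions, zero-initialized table.
actions = ["active", "sleep"]
Q = {(s, a): 0.0 for s in ["busy", "idle"] for a in actions}
new_q = smdp_q_update(Q, "idle", "sleep", r_rate=1.0, tau=2.0,
                      s_next="idle", actions=actions)
```

Because sojourn times between decision epochs vary (e.g., how long a server stays asleep), tying the discount to elapsed time rather than to a fixed step count is what makes the update suited to event-driven power control.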
Empirical Validation
The framework was evaluated on real Google cluster traces and demonstrated substantial improvements over baseline models. Specifically, in a 30-server cluster processing 95,000 job requests, the hierarchical framework achieved a 53.97% reduction in power and energy consumption while maintaining a favorable balance between power savings and latency. In contrast, conventional round-robin VM allocation showed markedly higher power usage, underscoring the effectiveness of the proposed method.
Implications and Future Directions
The research provides theoretical and practical implications for the deployment of hierarchical DRL in cloud data centers, indicating significant advancements in energy efficiency and system reliability. By reducing job latency and optimizing power consumption simultaneously, this framework addresses both cost and environmental impacts, key considerations for data center operations.
Moving forward, the implications for the advancement of AI-driven cloud management are profound. Future work could explore the application of this hierarchical DRL approach across diverse cloud environments and incorporate additional types of resource constraints and objectives. Moreover, continuous evaluation with varied real-world trace inputs and conditions would bolster the robustness and adaptability of the proposed solution.
Conclusion
This paper makes a significant contribution to cloud computing management by introducing a scalable hierarchical DRL framework that integrates resource allocation with dynamic power management. The results not only show strong numerical performance but also pave the way for further innovations in AI-based cloud optimization, enhancing both the economic feasibility and the ecological sustainability of cloud services.