An Optimal Resource Allocator of Elastic Training for Deep Learning Jobs on Cloud (2109.03389v1)

Published 8 Sep 2021 in eess.SY, cs.DC, and cs.SY

Abstract: Cloud training platforms, such as Amazon Web Services and Huawei Cloud, provide users with computational resources to train their deep learning jobs. Elastic training is a service embedded in cloud training platforms that dynamically scales the resources allocated to a job up or down. The core technique of an elastic training system is allocating limited resources among heterogeneous jobs so as to reduce queueing delay and improve training efficiency. This paper presents an optimal resource allocator for an elastic training system that leverages a mixed-integer programming (MIP) model to maximize the training progress of deep learning jobs. We take advantage of real-world job data obtained from ModelArts, the deep learning training platform of Huawei Cloud, and conduct simulation experiments to compare the optimal resource allocator with a greedy one as a benchmark. Numerical results show that the proposed allocator can reduce queueing time by up to 32% and accelerate training efficiency by up to 24% relative to the greedy resource allocator, thereby greatly improving the user experience of Huawei ModelArts and potentially enabling higher profits for the product. The optimal resource allocator is also fast in decision-making, taking merely 0.4 seconds on average.

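To make the MIP-based allocation idea concrete, below is a minimal sketch of how such an allocator could be formulated and solved. It assumes a single GPU pool, a small set of hypothetical jobs, and a per-job table mapping candidate worker counts to estimated training throughput; the paper's actual ModelArts formulation, job data, and objective are not reproduced here.

```python
# Minimal MIP allocation sketch (assumed formulation, not the paper's exact model).
# Requires the PuLP library: pip install pulp
import pulp

TOTAL_GPUS = 8
# Hypothetical jobs: job -> {candidate GPU count: estimated throughput}.
# A count of 0 means the job stays queued this round.
jobs = {
    "job_a": {0: 0.0, 1: 1.0, 2: 1.9, 4: 3.4},
    "job_b": {0: 0.0, 1: 1.0, 2: 1.7},
    "job_c": {0: 0.0, 2: 2.0, 4: 3.8, 8: 7.0},
}

prob = pulp.LpProblem("elastic_allocation", pulp.LpMaximize)

# Binary choice variable x[j, k] = 1 if job j is assigned k GPUs.
x = {
    (j, k): pulp.LpVariable(f"x_{j}_{k}", cat=pulp.LpBinary)
    for j, opts in jobs.items()
    for k in opts
}

# Each job picks exactly one allocation level (possibly 0, i.e. queued).
for j, opts in jobs.items():
    prob += pulp.lpSum(x[j, k] for k in opts) == 1

# Total GPUs handed out must not exceed the pool size.
prob += (
    pulp.lpSum(k * x[j, k] for j, opts in jobs.items() for k in opts)
    <= TOTAL_GPUS
)

# Objective: maximize aggregate training progress (sum of estimated throughputs).
prob += pulp.lpSum(
    thr * x[j, k] for j, opts in jobs.items() for k, thr in opts.items()
)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for j, opts in jobs.items():
    chosen = next(k for k in opts if pulp.value(x[j, k]) > 0.5)
    print(f"{j}: {chosen} GPU(s)")
```

A greedy benchmark, by contrast, would hand GPUs to jobs one at a time in some fixed priority order until the pool is exhausted, which can strand capacity that the MIP formulation is free to reassign.
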
Authors (6)
  1. Liang Hu (64 papers)
  2. Jiangcheng Zhu (14 papers)
  3. Zirui Zhou (32 papers)
  4. Ruiqing Cheng (2 papers)
  5. Xiaolong Bai (8 papers)
  6. Yong Zhang (660 papers)
Citations (3)
