
Elastic deep learning in multi-tenant GPU cluster

Published 26 Sep 2019 in cs.DC (arXiv:1909.11985v2)

Abstract: We study how to support elasticity, i.e., the ability to dynamically adjust the parallelism (number of GPUs), for deep neural network (DNN) training. Elasticity can benefit multi-tenant GPU cluster management in many ways, such as achieving various scheduling objectives (e.g., job throughput, job completion time, GPU efficiency) according to cluster load variations, maximizing the use of transient idle resources, performance profiling, job migration, and straggler mitigation. However, existing parallelism adjustment strategies incur high overheads, which hinder many applications from making effective use of elasticity. We propose EDL to enable low-overhead elastic deep learning with a simple API. We present techniques that are necessary to reduce the overhead of parallelism adjustments, such as stop-free scaling and a dynamic data pipeline. We also demonstrate that EDL brings significant benefits to the above applications in GPU cluster management.
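
The abstract does not spell out EDL's API, but as a rough illustration of the ideas it names, the sketch below shows one way an elasticity-aware training loop could poll for parallelism changes between iterations (loosely mirroring stop-free scaling) and re-shard its input data when the worker count changes (loosely mirroring a dynamic data pipeline). All names here (ElasticContext, poll_resize, shard_batch) are hypothetical assumptions for illustration and are not taken from the paper.

```python
import random


class ElasticContext:
    """Tracks the number of workers (GPUs) currently assigned to a job.

    A scheduler may grow or shrink this set between iterations; the training
    loop polls for changes instead of stopping and restarting the job.
    """

    def __init__(self, num_workers):
        self.num_workers = num_workers

    def poll_resize(self):
        """Return a new worker count if the scheduler resized us, else None.

        Scheduler decisions are faked with randomness here; a real system
        would receive them from the cluster manager.
        """
        if random.random() < 0.2:
            candidate = max(1, self.num_workers + random.choice([-1, 1]))
            if candidate != self.num_workers:
                return candidate
        return None


def shard_batch(batch, num_workers):
    """Split one global batch into per-worker shards, so inputs are
    repartitioned whenever the degree of parallelism changes."""
    return [batch[i::num_workers] for i in range(num_workers)]


def train(num_iterations=10, global_batch=list(range(32))):
    ctx = ElasticContext(num_workers=4)
    for step in range(num_iterations):
        # Check for a resize between iterations; no restart is needed.
        new_size = ctx.poll_resize()
        if new_size is not None:
            print(f"step {step}: resizing {ctx.num_workers} -> {new_size}")
            ctx.num_workers = new_size
        shards = shard_batch(global_batch, ctx.num_workers)
        # Each worker would process its shard and all-reduce gradients here.
        print(f"step {step}: {ctx.num_workers} workers, "
              f"shard sizes {[len(s) for s in shards]}")


if __name__ == "__main__":
    train()
```

The design point this toy loop illustrates is that resize decisions are absorbed at iteration boundaries, so adjusting parallelism costs only a data repartition rather than a full checkpoint-and-restart.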
