
Optimal Resource Efficiency with Fairness in Heterogeneous GPU Clusters (2403.18545v1)

Published 27 Mar 2024 in cs.DC

Abstract: Ensuring the highest training throughput to maximize resource efficiency, while maintaining fairness among users, is critical for deep learning (DL) training in heterogeneous GPU clusters. However, current DL schedulers provide only limited fairness properties and suboptimal training throughput, impeding tenants from effectively leveraging heterogeneous resources. The underlying design challenge stems from inherent conflicts between efficiency and fairness properties. In this paper, we introduce OEF, a new resource allocation framework specifically developed for achieving optimal resource efficiency and ensuring diverse fairness properties in heterogeneous GPU clusters. By integrating resource efficiency and fairness within a global optimization framework, OEF is capable of providing users with maximized overall efficiency, as well as various guarantees of fairness, in both cooperative and non-cooperative environments. We have implemented OEF in a cluster resource manager and conducted large-scale experiments, showing that OEF can improve the overall training throughput by up to 32% while improving fairness compared to state-of-the-art heterogeneity-aware schedulers.

