GPU Cluster Scheduling for Network-Sensitive Deep Learning (2401.16492v1)

Published 29 Jan 2024 in cs.PF, cs.DC, and cs.LG

Abstract: We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler can provide improvement of up to 69% in end-to-end Makespan for training all jobs compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and minimizing the communication overheads by up to 98% under congested networking conditions.

Authors (6)
  1. Aakash Sharma
  2. Vivek M. Bhasi
  3. Sonali Singh
  4. George Kesidis
  5. Mahmut T. Kandemir
  6. Chita R. Das
Citations (1)

Summary

  • The paper introduces Dally, a scheduler that minimizes communication overhead by deferring job placement until optimal GPU proximity is achieved.
  • It integrates a network-sensitive preemption mechanism and an auto-tuner that adjusts delay timers based on real-time network metrics.
  • Experiments demonstrate that Dally reduces training makespan by up to 69% and average job completion times by up to 83% under congested conditions.

GPU Cluster Scheduling for Network-Sensitive Deep Learning

The paper entitled "GPU Cluster Scheduling for Network-Sensitive Deep Learning" addresses the challenge of efficiently scheduling distributed deep learning (DDL) jobs in GPU clusters by considering the communication overheads inherent in training large-scale deep neural networks (DNNs). The authors introduce "Dally," a GPU-cluster scheduler designed to minimize these communication delays by making intelligent scheduling decisions based on network topology and the network sensitivity of jobs.

Key Contributions

The primary contribution of this research is the development of a novel GPU-cluster scheduler, Dally, incorporating three integral components:

  1. Delay Scheduling Mechanism: The scheduler employs a classical delay scheduling algorithm that defers job placement in anticipation of consolidated GPU resources, thereby minimizing communication overhead. This contrasts with traditional schedulers, which are largely agnostic to GPU proximity.
  2. Network-Sensitive Job Preemption: A network-sensitive preemption strategy gives jobs that are highly sensitive to communication latency priority in GPU placement decisions, further reducing variation in training time caused by inter-GPU communication costs.
  3. Auto-Tuner Mechanism: A mechanism that automatically optimizes the delay timers used by the scheduling algorithm. It adjusts the wait times for resource allocation based on observed network usage patterns and job requirements, improving the scheduler's adaptability to varying network conditions. A simplified sketch of how the three components fit together is given after this list.
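
To make the interplay of these components concrete, below is a minimal Python sketch of delay scheduling with network-sensitive priority and a toy auto-tuned delay timer. It is an illustration under simplifying assumptions, not the paper's implementation: all names (Job, schedule_round, autotune_delay_timer, the congestion thresholds) are hypothetical, and real cluster state, preemption, and network measurement are abstracted away.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    gpus_needed: int
    net_sensitivity: float   # higher = more sensitive to communication delay
    waited_rounds: int = 0   # rounds already spent waiting for a consolidated slot

def schedule_round(queue, free_gpus_per_node, delay_timer):
    """One scheduling round: delay scheduling with network-sensitive priority.

    queue: list of pending Jobs (modified in place as jobs are placed).
    free_gpus_per_node: dict node-name -> free GPU count (modified in place).
    Returns a list of (job, [nodes]) placement decisions made this round.
    """
    placements = []
    # Network-sensitive jobs are considered first, so they get first pick
    # of consolidated (single-node) slots.
    for job in sorted(list(queue), key=lambda j: -j.net_sensitivity):
        # Preferred outcome: all of the job's GPUs on a single node.
        node = next((n for n, g in free_gpus_per_node.items()
                     if g >= job.gpus_needed), None)
        if node is not None:
            free_gpus_per_node[node] -= job.gpus_needed
            placements.append((job, [node]))
            queue.remove(job)
        elif (job.waited_rounds >= delay_timer
              and sum(free_gpus_per_node.values()) >= job.gpus_needed):
            # Delay timer expired: accept a placement spread across nodes.
            needed, used = job.gpus_needed, []
            for n in sorted(free_gpus_per_node,
                            key=free_gpus_per_node.get, reverse=True):
                take = min(needed, free_gpus_per_node[n])
                if take > 0:
                    free_gpus_per_node[n] -= take
                    used.append(n)
                    needed -= take
                if needed == 0:
                    break
            placements.append((job, used))
            queue.remove(job)
        else:
            job.waited_rounds += 1   # keep waiting for a consolidated slot
    return placements

def autotune_delay_timer(current, congestion, lo=1, hi=32):
    """Toy auto-tuner: wait longer for consolidation when the network is
    congested (consolidation pays off more), shorten the timer when idle."""
    if congestion > 0.7:
        return min(hi, current * 2)
    if congestion < 0.3:
        return max(lo, current // 2)
    return current

# Example: two 8-GPU nodes, a network-sensitive 8-GPU job and a 4-GPU job.
queue = [Job(1, 8, net_sensitivity=0.9), Job(2, 4, net_sensitivity=0.2)]
print(schedule_round(queue, {"node-a": 8, "node-b": 8}, delay_timer=4))
```

The design point mirrored here is that a job prefers to wait for a consolidated slot and only falls back to a spread-out placement once its delay timer, itself adjusted to observed congestion, expires.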

Experimental Evaluation

The authors present a comprehensive evaluation of Dally using ArtISt-sim, a specially developed high-fidelity DDL cluster simulator. By simulating various workloads with real-world traces, the researchers demonstrate that Dally can achieve substantial reductions in model training makespan and average job completion times compared to prevailing methods. Specifically:
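
As a point of reference for how these metrics are typically derived from a simulated trace, here is a small, hypothetical sketch of the bookkeeping involved; the record fields ("submit", "finish") and the function name are illustrative and not taken from the simulator described in the paper.

```python
def evaluate_trace(records):
    """Compute end-to-end makespan and average job completion time (JCT)
    from simulated per-job records.  Each record is a dict with hypothetical
    fields 'submit' and 'finish' (timestamps in seconds)."""
    makespan = max(r["finish"] for r in records) - min(r["submit"] for r in records)
    avg_jct = sum(r["finish"] - r["submit"] for r in records) / len(records)
    return makespan, avg_jct

# Toy three-job trace (times in seconds).
trace = [
    {"submit": 0,   "finish": 3600},
    {"submit": 60,  "finish": 5400},
    {"submit": 120, "finish": 2000},
]
makespan, avg_jct = evaluate_trace(trace)
print(f"makespan = {makespan} s, average JCT = {avg_jct:.0f} s")
```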

  • Dally improves the end-to-end makespan by up to 69% compared to traditional schedulers like Tiresias that employ consolidation strategies based on DNN model skew.
  • It reduces the average job completion time by up to 83%.
  • Notably, the communication overhead is minimized by up to 98% in congested networking conditions.

Theoretical and Practical Implications

The work has significant theoretical and practical implications for AI and machine learning:

  • Theoretical Insights: This paper provides insights into how communication delays in DDL can be mitigated through strategic scheduling based on network awareness.
  • Practical Applications: By reducing communication overhead and improving resource allocation efficiency, organizations leveraging shared GPU clusters in both public and private clouds can significantly cut down on infrastructure costs.

Furthermore, this research highlights the importance of evolving traditional scheduling algorithms to adapt to modern high-performance networking technologies. Future research could explore more sophisticated machine learning-based prediction models for resource availability and integrate these into scheduling decisions.

Conclusion

In conclusion, the paper effectively showcases the potential impact of network-aware GPU scheduling on the efficiency of DNN model training. By jointly addressing job placement and communication overhead, Dally represents a step forward in optimizing resource usage within multi-tenant GPU clusters. Future directions might include exploring heterogeneous cluster environments and extending the high-fidelity simulation platform to further refine scheduling heuristics.
