GPU Cluster Scheduling for Network-Sensitive Deep Learning (2401.16492v1)
Abstract: We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources according to the DDL jobs' sensitivities to anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay-scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job-preemption strategy; and (iii) an "auto-tuner" mechanism that optimizes delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster-simulation platform. Using this platform, we compare our design against several state-of-the-art alternatives on real-world workload traces to demonstrate its benefits. Our scheduler improves end-to-end makespan for training all jobs by up to 69% compared to prevailing consolidation-based scheduling methods, while reducing average job completion time by up to 83% and cutting communication overheads by up to 98% under congested networking conditions.
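To make the delay-scheduling idea in the abstract concrete, below is a minimal Python sketch: a job waits up to a bounded number of scheduling rounds for a consolidated, single-node placement before accepting a fragmented, multi-node one, and the delay budget can be scaled by the job's network sensitivity. All names (`Job`, `Node`, `delay_schedule`, `tuned_delay`, `base_delay`) and the heuristics are illustrative assumptions, not the paper's actual scheduler, preemption policy, or auto-tuner.

```python
# Illustrative sketch of delay scheduling for network-sensitive DDL job placement.
# Data structures and heuristics are assumptions, not the paper's implementation.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: int
    gpus_needed: int
    network_sensitivity: float   # higher => benefits more from consolidation
    waited_rounds: int = 0       # scheduling rounds spent waiting for locality


@dataclass
class Node:
    node_id: int
    free_gpus: int


def consolidated_placement(job: Job, nodes: List[Node]) -> Optional[Node]:
    """Return a single node that can host the whole job (best locality), if any."""
    candidates = [n for n in nodes if n.free_gpus >= job.gpus_needed]
    # Best-fit: pick the feasible node with the fewest free GPUs.
    return min(candidates, key=lambda n: n.free_gpus, default=None)


def delay_schedule(job: Job, nodes: List[Node], delay_timer: int) -> Optional[List[Node]]:
    """Prefer a consolidated placement; wait up to `delay_timer` rounds before
    falling back to a fragmented (multi-node) placement."""
    node = consolidated_placement(job, nodes)
    if node is not None:
        node.free_gpus -= job.gpus_needed
        return [node]

    if job.waited_rounds < delay_timer:
        job.waited_rounds += 1      # keep waiting for locality
        return None

    # Delay expired: accept a fragmented placement if total capacity suffices.
    if sum(n.free_gpus for n in nodes) < job.gpus_needed:
        return None                 # not enough GPUs anywhere; retry next round
    placement, remaining = [], job.gpus_needed
    for n in sorted(nodes, key=lambda n: -n.free_gpus):
        if remaining <= 0:
            break
        take = min(n.free_gpus, remaining)
        n.free_gpus -= take
        remaining -= take
        placement.append(n)
    return placement


def tuned_delay(job: Job, base_delay: int = 3) -> int:
    """Toy stand-in for the auto-tuner: scale the delay budget with the job's
    network sensitivity (purely illustrative)."""
    return max(1, round(base_delay * job.network_sensitivity))
```

In this sketch a highly network-sensitive job receives a larger delay budget and therefore waits longer for a consolidated slot, which mirrors the abstract's goal of consolidating the jobs that suffer most from communication delays.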
- Aakash Sharma
- Vivek M. Bhasi
- Sonali Singh
- George Kesidis
- Mahmut T. Kandemir
- Chita R. Das