How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study (2306.03163v4)
Abstract: This paper aims to answer the question: Can deep learning models be cost-efficiently trained on a global market of spot VMs spanning different data centers and cloud providers? To provide guidance, we extensively evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models. To expand the current training options further, we compare the scalability potential of hybrid-cloud scenarios by adding cloud resources to on-premise hardware to improve training throughput. Finally, we show how leveraging spot instance pricing enables a new, cost-efficient way to train models with multiple cheap VMs, outperforming both more centralized, more powerful hardware and even on-demand cloud offerings at competitive prices.
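As a rough illustration of the cost-efficiency comparison the abstract refers to, the sketch below normalizes the hourly price of a training setup by its training throughput, yielding a "USD per million samples" figure that can be compared across spot, on-demand, and on-premise configurations. This metric, the `Setup` class, and every price and throughput value in the example are hypothetical placeholders chosen for illustration; they are not the paper's measurements or its exact methodology.

```python
# Minimal sketch: compare training setups by price normalized to throughput.
# NOTE: the metric (USD per million samples) and all numbers below are
# hypothetical placeholders, not values taken from the paper.
from dataclasses import dataclass


@dataclass
class Setup:
    name: str
    hourly_price_usd: float  # combined price of all VMs in the setup, USD/hour
    throughput_sps: float    # aggregate training throughput, samples/second

    def usd_per_million_samples(self) -> float:
        # Convert samples/second to samples/hour, then normalize the hourly price.
        samples_per_hour = self.throughput_sps * 3600
        return self.hourly_price_usd / samples_per_hour * 1_000_000


# Hypothetical example configurations (prices and throughputs are made up).
setups = [
    Setup("1x on-demand multi-GPU server",       32.00, 2500),
    Setup("8x single-GPU spot VMs, one zone",     8.80, 1800),
    Setup("8x single-GPU spot VMs, multi-cloud",  7.20, 1200),
]

# Rank setups from cheapest to most expensive per million training samples.
for s in sorted(setups, key=Setup.usd_per_million_samples):
    print(f"{s.name:40s} {s.usd_per_million_samples():7.2f} USD per 1M samples")
```

Under this (assumed) normalization, a cluster of cheap spot VMs can come out ahead of a single powerful on-demand machine even when its aggregate throughput is lower, as long as its combined hourly price is proportionally lower still.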