DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud (2403.05861v2)

Published 9 Mar 2024 in cs.DC

Abstract: Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However, the high cost of such resources makes them inaccessible to many users. Public cloud services, particularly Spot Virtual Machines (VMs), offer a cost-effective alternative, but their unpredictable availability poses a significant challenge to the crucial checkpointing process in DDL. To address this, we introduce DeepVM, a novel solution that recommends cost-effective cluster configurations by intelligently balancing the use of Spot and On-Demand VMs. DeepVM leverages a four-stage process that analyzes instance performance using the FLOPP (FLoating-point Operations Per Price) metric, performs architecture-level analysis with linear programming, and identifies the optimal configuration for user-specific needs. Extensive simulations and real-world deployments in the AWS environment demonstrate that DeepVM consistently outperforms other policies, reducing training costs and overall makespan. By enabling cost-effective checkpointing with Spot VMs, DeepVM opens up DDL to a wider range of users and facilitates more efficient training of complex DNNs.
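
The abstract describes FLOPP as floating-point operations per price, i.e. compute throughput normalized by instance cost. The sketch below is an illustrative reading of that metric, not the paper's implementation: the instance names, TFLOPS figures, and hourly prices are hypothetical placeholders, and the actual DeepVM pipeline adds further architecture-level analysis beyond this ranking step.

```python
# Illustrative sketch (not DeepVM's code): rank candidate instance types by a
# FLOPP-style metric, i.e. floating-point throughput per unit price.
# All instance names, TFLOPS values, and prices below are hypothetical.

from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    peak_tflops: float      # peak GPU throughput of the instance, in TFLOPS
    price_per_hour: float   # hourly price in USD (Spot or On-Demand)


def flopp(inst: Instance) -> float:
    """Floating-point operations per price: TFLOPS delivered per dollar-hour."""
    return inst.peak_tflops / inst.price_per_hour


candidates = [
    Instance("spot-gpu-small", peak_tflops=8.1, price_per_hour=0.30),
    Instance("spot-gpu-large", peak_tflops=65.0, price_per_hour=1.10),
    Instance("ondemand-gpu-large", peak_tflops=65.0, price_per_hour=3.06),
]

# Higher FLOPP means more compute per dollar; a metric of this form can be
# used to shortlist cost-efficient instance types before deeper analysis.
for inst in sorted(candidates, key=flopp, reverse=True):
    print(f"{inst.name:22s} FLOPP = {flopp(inst):6.1f} TFLOPS per $/hr")
```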

