
MalleTrain: Deep Neural Network Training on Unfillable Supercomputer Nodes (2404.15668v1)

Published 24 Apr 2024 in cs.DC

Abstract: First-come first-serve scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well suited for deep neural network (DNN) training, owing to the flexible nature of DNN training tasks, Liu et al. proposed formulating the re-scaling of DNN training tasks to fit gaps in schedules as a mixed-integer linear programming (MILP) problem, and demonstrated via simulation the potential benefits of the approach. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and generalizes it to DNN training applications for which model information is unknown before runtime. Key to this latter innovation is a lightweight online job profiling advisor (JPA) that collects critical scalability information for DNN jobs; it then employs this information to optimize resource allocations dynamically, in real time. We describe the MalleTrain architecture and present the results of a detailed experimental evaluation on a supercomputer GPU cluster with several representative DNN training workloads, including neural architecture search and hyperparameter optimization. Our results not only confirm the practical feasibility of leveraging idle supercomputer nodes for DNN training but also improve significantly on prior results, increasing training throughput by up to 22.3% without requiring users to provide job scalability information.
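The allocation idea in the abstract can be illustrated with a small sketch. The code below is not the authors' MILP implementation; it is a hypothetical greedy stand-in that assumes concave (diminishing-returns) scaling curves of the kind a job profiling advisor might measure online. The job names and throughput numbers are invented for illustration.

```python
# Toy illustration (not MalleTrain itself): distribute a pool of transiently
# idle nodes across malleable DNN training jobs to maximize aggregate
# throughput. MalleTrain solves a MILP; with concave per-job scaling curves,
# a greedy marginal-gain allocator reaches the same optimum and is simpler
# to show here.

from typing import Dict, List

# PROFILES[job][n] = hypothetical samples/sec for `job` on n nodes
# (index 0 means the job is paused on 0 nodes).
PROFILES: Dict[str, List[float]] = {
    "resnet":      [0.0, 100.0, 185.0, 250.0, 295.0],
    "transformer": [0.0,  80.0, 155.0, 220.0, 270.0],
    "nas-search":  [0.0,  60.0, 118.0, 172.0, 220.0],
}

def allocate(profiles: Dict[str, List[float]], free_nodes: int) -> Dict[str, int]:
    """Hand out idle nodes one at a time to the job whose measured
    throughput would increase the most from one extra node."""
    alloc = {job: 0 for job in profiles}
    for _ in range(free_nodes):
        best_job, best_gain = None, 0.0
        for job, curve in profiles.items():
            n = alloc[job]
            if n + 1 < len(curve):  # profile covers one more node
                gain = curve[n + 1] - curve[n]
                if gain > best_gain:
                    best_job, best_gain = job, gain
        if best_job is None:  # no job benefits from another node
            break
        alloc[best_job] += 1
    return alloc

if __name__ == "__main__":
    print(allocate(PROFILES, 6))  # → {'resnet': 3, 'transformer': 3, 'nas-search': 0}
```

Note that the allocator stops early if every job has exhausted its measured profile, so leftover nodes are simply returned to the scheduler, mirroring the paper's premise that these nodes would otherwise sit idle.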

References (39)
  1. 2023. https://aws.amazon.com/ec2/spot/. Accessed: 2023-10-23.
  2. 2023. https://cloud.google.com/spot-vms. Accessed: 2023-10-23.
  3. 2023. https://azure.microsoft.com/en-us/products/virtual-machines/spot/. Accessed: 2023-10-23.
  4. 2023. https://www.alcf.anl.gov/polaris. Accessed: 2023-10-23.
  5. 2023. https://www.top500.org/lists/top500/2023/11/. Accessed: 2023-11-15.
  6. Parallel programming with migratable objects: Charm++ in practice. In SC’14: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 647–658.
  7. A scalable, commodity data center network architecture. ACM SIGCOMM computer communication review 38, 4 (2008), 63–74.
  8. fairDMS: Rapid model training by data and model reuse. In 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 394–405.
  9. Experience and practice of batch scheduling on Leadership Supercomputers at Argonne. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1–24.
  10. Distributed optimization with tunable learned priors for robust ptycho-tomography. arXiv preprint arXiv:2009.09498 (2020).
  11. Malleable Invasive Applications. In Software Engineering (Workshops). 123–126.
  12. The evolution of the Pegasus workflow management software. Computing in Science & Engineering 21, 4 (2019), 22–36.
  13. Malleable applications for scalable high performance computing. Cluster Computing 10, 3 (2007), 323–337.
  14. BOHB: Robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning. PMLR, 1437–1446.
  15. Dror G Feitelson and Larry Rudolph. 1996. Toward convergence in job schedulers for parallel supercomputers. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 1–26.
  16. PBS: a unified priority-based scheduler. In Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and Modeling of Computer Systems. 203–214.
  17. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
  18. Workload analysis of Blue Waters. arXiv preprint arXiv:1703.00924 (2017).
  19. Predicting disruptive instabilities in controlled fusion plasmas through deep learning. Nature 568, 7753 (2019), 526–531.
  20. Technology-driven, highly-scalable dragonfly topology. ACM SIGARCH Computer Architecture News 36, 3 (2008), 77–88.
  21. Full Waveform Inversion-Based Ultrasound Computed Tomography Acceleration Using Two-Dimensional Convolutional Neural Networks. Journal of Nondestructive Evaluation, Diagnostics and Prognostics of Engineering Systems 6, 4 (2023), 041004.
  22. A system for massively parallel hyperparameter tuning. Proceedings of Machine Learning and Systems 2 (2020), 230–246.
  23. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
  24. Deep learning accelerated light source experiments. In IEEE/ACM Third Workshop on Deep Learning on Supercomputers. IEEE, 20–28.
  25. FreeTrain: A Framework to Utilize Unused Supercomputer Nodes for Training Neural Networks. In IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 299–310.
  26. BraggNN: Fast X-ray Bragg Peak Analysis Using Deep Learning. arXiv preprint arXiv:2008.08198 (2020).
  27. Ahuva W. Mu’alem and Dror G. Feitelson. 2001. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems 12, 6 (2001), 529–543.
  28. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
  29. Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification and Implications. In 2020 SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 1186–1202.
  30. Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning.. In OSDI, Vol. 21. 1–18.
  31. Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning. arXiv preprint arXiv:2008.12260 (2020).
  32. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.
  33. Alexander Sergeev and Mike Del Balso. 2018. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
  34. Sathish S Vadhiyar and Jack J Dongarra. 2003. SRS: A framework for developing malleable and migratable parallel applications for distributed systems. Parallel Processing Letters 13, 02 (2003), 291–312.
  35. NAS-Bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning. PMLR, 7105–7114.
  36. Slurm: Simple Linux utility for resource management. In Workshop on job scheduling strategies for parallel processing. Springer, 44–60.
  37. Haihang You and Hao Zhang. 2012. Comprehensive workload analysis and modeling of a petascale supercomputer. In Workshop on Job Scheduling Strategies for Parallel Processing. Springer, 253–271.
  38. ImageNet training in minutes. In 47th International Conference on Parallel Processing. 1–10.
  39. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8697–8710.