Towards providing reliable job completion time predictions using PCS (2401.10354v1)

Published 18 Jan 2024 in cs.DC and cs.LG

Abstract: In this paper we build a case for providing job completion time predictions to cloud users, similar to the delivery date of a package or the arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing cloud scheduling systems optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical. To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of different WFQ parameters (e.g., class weights) that meets specific goals for predictability. It uses a simulation-aided search strategy to efficiently discover WFQ configurations that lie on the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a small-scale GPU testbed and larger-scale simulations, shows that PCS can provide accurate completion time estimates while marginally compromising on performance and fairness.
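
The abstract's core mechanism pairs WFQ scheduling with a simulation-aided search over WFQ parameters for Pareto-optimal configurations. The Python sketch below illustrates the general shape of such a search under stated assumptions: the simulate() stub, the three objectives (average JCT, prediction error, unfairness), and the random-sampling strategy are illustrative placeholders, not PCS's actual implementation, which the abstract does not detail.

```python
import random


def simulate(weights):
    """Placeholder for a fast scheduling simulation: a real system would
    replay a job trace under WFQ with the given per-class weights and
    return the objective vector (avg_jct, prediction_error, unfairness),
    all lower-is-better. This stub returns deterministic pseudo-random
    scores so the search loop is runnable on its own."""
    rng = random.Random(hash(tuple(weights)))
    return tuple(rng.uniform(0.0, 1.0) for _ in range(3))


def dominates(a, b):
    """a Pareto-dominates b if a is no worse on every objective and
    strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_search(num_classes=3, num_samples=200, seed=0):
    """Sample WFQ class-weight vectors, score each with simulate(), and
    keep only the non-dominated (Pareto-front) configurations."""
    rng = random.Random(seed)
    front = []  # list of (weights, objectives) pairs
    for _ in range(num_samples):
        weights = [rng.randint(1, 100) for _ in range(num_classes)]
        objs = simulate(weights)
        # Discard candidates dominated by an existing point; otherwise
        # insert the candidate and evict any points it now dominates.
        if any(dominates(o, objs) for _, o in front):
            continue
        front = [(w, o) for w, o in front if not dominates(objs, o)]
        front.append((weights, objs))
    return front


if __name__ == "__main__":
    for weights, (jct, err, unfair) in sorted(pareto_search(), key=lambda p: p[1]):
        print(f"weights={weights}  avg_jct={jct:.2f}  "
              f"pred_error={err:.2f}  unfairness={unfair:.2f}")
```

The resulting front is the menu a cloud operator would choose from: each surviving weight configuration trades predictability against performance and fairness in a way no other sampled configuration strictly improves on.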

