Towards providing reliable job completion time predictions using PCS (2401.10354v1)
Abstract: In this paper, we build a case for providing job completion time predictions to cloud users, similar to the delivery date of a package or the arrival time of a booked ride. Our analysis reveals that providing predictability can come at the expense of performance and fairness. Existing cloud scheduling systems optimize for extreme points in the trade-off space, making them either extremely unpredictable or impractical. To address this challenge, we present PCS, a new scheduling framework that aims to provide predictability while balancing other traditional objectives. The key idea behind PCS is to use Weighted-Fair-Queueing (WFQ) and find a suitable configuration of WFQ parameters (e.g., class weights) that meets specific goals for predictability. It uses a simulation-aided search strategy to efficiently discover WFQ configurations that lie on the Pareto front of the trade-off space between these objectives. We implement and evaluate PCS in the context of DNN job scheduling on GPUs. Our evaluation, on a small-scale GPU testbed and in larger-scale simulations, shows that PCS can provide accurate completion time estimates while only marginally compromising performance and fairness.
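To make the notion of a Pareto front over WFQ configurations concrete, the following is a minimal sketch, not the search procedure used by PCS. It assumes each candidate configuration has already been scored (for instance by a simulator) on three metrics that are all minimized. The names `Candidate`, `pred_error`, `avg_jct`, and `unfairness`, as well as every number below, are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A hypothetical WFQ configuration with made-up simulated scores."""
    weights: Tuple[float, ...]  # assumed per-class WFQ weights
    pred_error: float           # completion-time prediction error (lower is better)
    avg_jct: float              # average job completion time (lower is better)
    unfairness: float           # deviation from fair share (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    a_metrics, b_metrics = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(a_metrics, b_metrics))
    strictly_better = any(x < y for x, y in zip(a_metrics, b_metrics))
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative scores only; in practice these would come from simulation.
candidates = [
    Candidate((1, 1, 1), pred_error=0.40, avg_jct=100.0, unfairness=0.10),
    Candidate((4, 2, 1), pred_error=0.10, avg_jct=120.0, unfairness=0.30),
    Candidate((8, 1, 1), pred_error=0.10, avg_jct=130.0, unfairness=0.35),
    Candidate((2, 2, 1), pred_error=0.25, avg_jct=105.0, unfairness=0.15),
]
front = pareto_front(candidates)
for c in front:
    print(c.weights, c.pred_error, c.avg_jct, c.unfairness)
```

In this toy data the `(8, 1, 1)` configuration is dropped because `(4, 2, 1)` matches its prediction error while improving on both other metrics. The remaining three configurations each represent a different trade-off, which is the set an operator would choose from.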