
Learning Interpretable Scheduling Algorithms for Data Processing Clusters (2405.19131v1)

Published 29 May 2024 in cs.DC

Abstract: Workloads in data processing clusters are often represented as DAG (Directed Acyclic Graph) jobs. Scheduling DAG jobs is challenging, and production data centres commonly adopt simple heuristic scheduling algorithms, leaving much room for performance optimisation and cost saving. Recently, reinforcement learning (RL) approaches such as Decima have been applied to DAG job scheduling and demonstrate clear performance gains over traditional algorithms. However, RL approaches face their own problems in real-world deployment. In particular, their black-box decision-making processes and limited generalisability to unseen workloads may place a non-trivial burden on cluster administrators. Moreover, adapting RL models to unseen workloads often requires a significant amount of training data, which leaves edge cases running in a sub-optimal mode. To fill this gap, we propose a new method that distills a simple scheduling policy from observations of the behaviour of a complex deep learning model. The simple model not only makes scheduling decisions interpretable but can also be easily adapted to edge cases through tuning. We show that our method achieves high fidelity to the decisions made by deep learning models and outperforms those models when additional heuristics are taken into account.
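
The core idea the abstract describes is to treat the deep model's scheduling decisions as supervision for a simple, inspectable surrogate. Below is a minimal Python sketch of that distillation loop; the `distill_policy` helper, the feature set, and the decision-tree surrogate are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only: names, features, and the tree model are
# assumptions, not the paper's implementation.
from sklearn.tree import DecisionTreeClassifier, export_text

def distill_policy(teacher_policy, states, feature_names, max_depth=3):
    """Fit a small decision tree that imitates a black-box scheduler.

    teacher_policy: callable mapping a per-decision feature vector to the
        action the learned scheduler would take (e.g., which job to serve).
    states: feature vectors observed while the teacher schedules workloads,
        e.g., [remaining_work_A, remaining_work_B, free_executors].
    """
    actions = [teacher_policy(s) for s in states]       # label states with teacher decisions
    tree = DecisionTreeClassifier(max_depth=max_depth)  # shallow => human-readable rules
    tree.fit(states, actions)
    print(export_text(tree, feature_names=feature_names))  # auditable IF-THEN rules
    return tree

# Toy usage: a stand-in "teacher" that prefers the job with less remaining work.
teacher = lambda s: int(s[1] < s[0])  # 1 = pick job B, 0 = pick job A
states = [[w_a, w_b, e] for w_a in range(0, 100, 10)
                        for w_b in range(0, 100, 10) for e in (1, 4)]
distill_policy(teacher, states, ["work_A", "work_B", "free_executors"])
```

A shallow tree trades some fidelity for rules an operator can read and hand-tune, which is the interpretability/adaptability property the abstract claims for the distilled policy.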

References (36)
  1. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018. doi: 10.1109/ACCESS.2018.2870052.
  2. Alibaba. Cluster data collected from production clusters in Alibaba for cluster management research, 2018. URL https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2018.
  3. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pages 1383–1394, 2015.
  4. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 8(3):1–154, 2013.
  5. Interpretable deep models for ICU outcome prediction. In AMIA annual symposium proceedings, volume 2016, page 371. American Medical Informatics Association, 2016.
  6. Multi-processor scheduling to minimize flow time with ε resource augmentation. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 363–372, 2004.
  7. This looks like that: deep learning for interpretable image recognition. Advances in neural information processing systems, 32, 2019.
  8. Are visual explanations useful? a case study in model-in-the-loop prediction. arXiv preprint arXiv:2007.12248, 2020.
  9. Reinforcement learning is supervised learning on optimized data, 2020.
  10. Dominant resource fairness: Fair allocation of multiple resource types. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11), 2011.
  11. Graphene: Packing and dependency-aware scheduling for data-parallel clusters. In Proceedings of OSDI16: 12th USENIX Symposium on Operating Systems Design and Implementation, page 81, 2016a.
  12. GRAPHENE: Packing and Dependency-Aware scheduling for Data-Parallel clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 81–97, 2016b.
  13. Hadoop. YARN Fair Scheduler, 2022. URL https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/FairScheduler.html.
  14. Mesos: A platform for Fine-Grained resource sharing in the data center. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11), 2011.
  15. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
  16. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59–72, 2007.
  17. Neural DAG scheduling via one-shot priority sampling. In The Eleventh International Conference on Learning Representations, 2022.
  18. RL-Cache: Learning-based cache admission for content delivery. In Proceedings of the 2019 Workshop on Network Meets AI & ML, pages 57–63, 2019.
  19. SageDB: A learned database system. 2021.
  20. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
  21. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  22. Invariant random forest: Tree-based model solution for OOD generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13772–13781, 2024.
  23. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964):257–263, 2023. doi: 10.1038/s41586-023-06004-9. URL https://doi.org/10.1038/s41586-023-06004-9.
  24. Learning scheduling algorithms for data processing clusters. In Proceedings of the ACM Special Interest Group on Data Communication, pages 270–288, 2019.
  25. Neo: a learned query optimizer. Proceedings of the VLDB Endowment, 12(11):1705–1718, 2019.
  26. Interpreting deep learning-based networking systems. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pages 154–171, 2020.
  27. DL2: A deep learning-driven scheduler for deep learning clusters. IEEE Transactions on Parallel and Distributed Systems, 32(8):1947–1960, 2021.
  28. Apache TEZ project, 2019. URL https://tez.apache.org/.
  29. Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
  30. LSched: A workload-aware learned query scheduler for analytical database systems. In Proceedings of the 2022 International Conference on Management of Data, pages 1228–1242, 2022.
  31. Apache Capacity Scheduler, 2022. URL https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html.
  32. Improved approximation algorithms for shop scheduling problems. SIAM Journal on Computing, 23(3):617–632, 1994.
  33. TPC-H. The TPC-H Benchmarks, 2018. URL https://www.tpc.org/tpch/.
  34. Fine-tuning reinforcement learning models is secretly a forgetting mitigation problem. arXiv preprint arXiv:2402.02868, 2024.
  35. Resilient distributed datasets: A Fault-Tolerant abstraction for In-Memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 15–28, 2012.
  36. Network planning with deep reinforcement learning. In Proceedings of the ACM SIGCOMM 2021 Conference, pages 258–271, 2021.
