Deep Back-Filling: a Split Window Technique for Deep Online Cluster Job Scheduling (2401.09910v1)
Abstract: Job scheduling is a critical component of workload management systems and can significantly influence system performance, e.g., in HPC clusters. Scheduling objectives are often mixed, such as maximizing resource utilization and minimizing job waiting time. An increasing number of researchers are moving from heuristic-based approaches to Deep Reinforcement Learning (DRL) approaches in order to optimize scheduling objectives. However, the job scheduler's state space is only partially observable to a DRL-based agent because the job queue is practically unbounded, while the agent's observation must be constant in size since the input size of the neural network is predefined. All existing solutions to this problem intuitively allow the agent to observe a fixed-size window of jobs at the head of the job queue. In our research, we have found that such an approach can lead to "window staleness", where the window fills with jobs that cannot be scheduled until the cluster has completed sufficient work. In this paper, we propose a novel general technique that we call \emph{split window}, which allows the agent to observe both the head \emph{and tail} of the queue. With this technique, the agent observes every arriving job at least once, which completely eliminates the window staleness problem. By leveraging the split window, the agent can significantly reduce the average job waiting time and average queue length, or alternatively use much smaller windows and therefore train faster. We present a range of simulation results using HPC job scheduling trace data that support the effectiveness of our technique.
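The split-window idea can be sketched concretely: the observation is assembled from a fixed number of slots at the head of the queue plus a fixed number at the tail, so newly arriving jobs are visible to the agent immediately even when the head slots are stale. The Python sketch below is a minimal illustration under assumed conventions (window sizes of 8+8, four per-job features, zero-padding for empty slots); these values and the function name `split_window_observation` are illustrative, not taken from the paper.

```python
import numpy as np

def split_window_observation(queue, head_size=8, tail_size=8, n_features=4):
    """Build a constant-size observation from both ends of the job queue.

    queue: list of per-job feature vectors (each of length n_features),
           ordered by arrival time (oldest first).
    Returns an array of shape (head_size + tail_size, n_features); unused
    slots are zero-padded so the policy network's input size never changes.
    """
    window = np.zeros((head_size + tail_size, n_features), dtype=np.float32)

    # Head slots: the oldest waiting jobs.
    head = queue[:head_size]
    # Tail slots: the most recently arrived jobs, skipping any already in the head.
    tail = queue[max(head_size, len(queue) - tail_size):]

    for i, job in enumerate(head):
        window[i] = job
    for i, job in enumerate(tail):
        window[head_size + i] = job
    return window


# Example: 20 queued jobs, each described by 4 illustrative features
# (requested nodes, requested walltime, time waited so far, arrival order).
queue = [np.random.rand(4).astype(np.float32) for _ in range(20)]
obs = split_window_observation(queue)  # shape (16, 4), ready to feed a policy network
```

With a single head-only window, jobs arriving behind a full, unschedulable head would remain invisible to the agent; reserving tail slots is what guarantees each job is observed at least once.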