UniFaaS: Programming across Distributed Cyberinfrastructure with Federated Function Serving (2403.19257v1)
Abstract: Modern scientific applications are increasingly decomposable into individual functions that may be deployed across distributed and diverse cyberinfrastructure such as supercomputers, clouds, and accelerators. Such applications call for new approaches to programming, distributed execution, and function-level management. We present UniFaaS, a parallel programming framework that relies on a federated function-as-a-service (FaaS) model to enable composition of distributed, scalable, and high-performance scientific workflows, and to support fine-grained function-level management. UniFaaS provides a unified programming interface for composing dynamic task graphs with transparent wide-area data management. UniFaaS exploits an observe-predict-decide approach to efficiently map workflow tasks to heterogeneous and dynamic target resources. We propose a dynamic, heterogeneity-aware scheduling algorithm that employs a delay mechanism and a re-scheduling mechanism to accommodate dynamic resource capacity. Our experiments show that UniFaaS can efficiently execute workflows across computing resources with minimal scheduling overhead. Compared to using a single cluster, UniFaaS improves the performance of a real-world drug screening workflow by as much as 22.99% when employing an additional 19.48% of resources, and of a Montage workflow by 54.41% when employing an additional 47.83% of resources across multiple distributed clusters.
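The observe-predict-decide idea in the abstract can be made concrete with a small sketch: observed queue state and predicted per-endpoint runtimes combine with an estimated wide-area transfer cost into an estimated finish time, and a task is assigned to the endpoint that minimizes it. The data model below is illustrative only, not UniFaaS's actual scheduler; the Endpoint fields are assumed names, and the paper's delay and re-scheduling mechanisms are omitted.

```python
# Minimal sketch of an estimated-completion-time heuristic in the spirit of
# the observe-predict-decide approach described in the abstract. All field
# names are hypothetical; UniFaaS's real scheduler is more sophisticated.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    predicted_runtime_s: float  # "predict": modeled task runtime on this resource
    queued_work_s: float        # "observe": outstanding work already queued there
    transfer_s: float           # estimated wide-area input-transfer time

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """'Decide': choose the endpoint with the smallest estimated finish time,
    trading faster compute against queueing and data-movement cost."""
    return min(
        endpoints,
        key=lambda e: e.queued_work_s + e.transfer_s + e.predicted_runtime_s,
    )

if __name__ == "__main__":
    candidates = [
        Endpoint("hpc-cluster", predicted_runtime_s=20, queued_work_s=300, transfer_s=60),
        Endpoint("cloud",       predicted_runtime_s=45, queued_work_s=0,   transfer_s=15),
    ]
    # The slower but idle endpoint with cheap transfer wins: prints "cloud".
    print(pick_endpoint(candidates).name)
```

A heuristic of this shape explains why adding a modest fraction of extra (even slower) resources can cut makespan: tasks flow to whichever endpoint finishes them soonest once queueing and transfer costs are counted, rather than piling up on the nominally fastest cluster.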