Infrastructure Engineering: A Still Missing, Undervalued Role in the Research Ecosystem (2405.10473v2)
Abstract: Research has become increasingly reliant on software, which serves as the driving force behind bioinformatics, high performance computing, physics, machine learning, and artificial intelligence, to name a few. While substantial progress has been made in advocating for the research software engineer, a kind of software engineer who typically works directly on software and the associated assets that go into research, little attention has been paid to the workforce behind research infrastructure and innovation: compiler and compatibility tool development, orchestration and scheduling infrastructure, developer environments, container technologies, and workflow managers. As economic incentives shift toward different models of cloud computing and innovation is required to develop new paradigms that combine the best of both worlds, an effort called "converged computing," such a role is not merely desirable but essential for the continued success of science. While scattered staff in non-traditional roles have found time to work on some facets of this space, the lack of a larger workforce, and of incentives to support one, has left the scientific community falling behind. In this article we highlight the importance of this missing layer, providing examples of how the absence of an infrastructure engineer role has led to inefficiencies in the interoperability, portability, and reproducibility of science. We suggest that an inability to allocate resources for, and to sustain, individuals working explicitly on these technologies could lead to futures that are sub-optimal for the continued success of our scientific communities.