Reducing the Impact of I/O Contention in Numerical Weather Prediction Workflows at Scale Using DAOS (2404.03107v1)
Abstract: Operational Numerical Weather Prediction (NWP) workflows are highly data-intensive. Data volumes have increased by many orders of magnitude over the last 40 years and are expected to keep growing, especially given the upcoming adoption of Machine Learning in forecast processes. Parallel POSIX-compliant file systems have been the dominant paradigm in data storage and exchange in HPC workflows for many years. This paper presents ECMWF's move beyond the POSIX paradigm, implementing a backend for its storage library to support DAOS -- a novel high-performance object store designed for massively distributed Non-Volatile Memory. This system is shown to outperform the highly mature and optimised POSIX backend under high load and contention, as is typical of forecast workflow I/O patterns. This work constitutes a significant step forward, beyond the performance constraints imposed by POSIX semantics.