ClimSim-Online: A Large Multi-scale Dataset and Framework for Hybrid ML-physics Climate Emulation (2306.08754v6)

Published 14 Jun 2023 in cs.LG and physics.ao-ph

Abstract: Modern climate projections lack adequate spatial and temporal resolution due to computational constraints, leading to inaccuracies in representing critical processes like thunderstorms that occur on the sub-resolution scale. Hybrid methods combining physics with ML offer faster, higher fidelity climate simulations by outsourcing compute-hungry, high-resolution simulations to ML emulators. However, these hybrid ML-physics simulations require domain-specific data and workflows that have been inaccessible to many ML experts. As an extension of the ClimSim dataset (Yu et al., 2024), we present ClimSim-Online, which also includes an end-to-end workflow for developing hybrid ML-physics simulators. The ClimSim dataset includes 5.7 billion pairs of multivariate input/output vectors, capturing the influence of high-resolution, high-fidelity physics on a host climate simulator's macro-scale state. The dataset is global and spans ten years at a high sampling frequency. We provide a cross-platform, containerized pipeline to integrate ML models into operational climate simulators for hybrid testing. We also implement various ML baselines, alongside a hybrid baseline simulator, to highlight the ML challenges of building stable, skillful emulators. The data (https://huggingface.co/datasets/LEAP/ClimSim_high-res) and code (https://leap-stc.github.io/ClimSim and https://github.com/leap-stc/climsim-online) are publicly released to support the development of hybrid ML-physics and high-fidelity climate simulations.

An Expert Review of "ClimSim: A Large Multi-Scale Dataset for Hybrid Physics-ML Climate Emulation"

The paper "ClimSim: A Large Multi-Scale Dataset for Hybrid Physics-ML Climate Emulation" introduces ClimSim, a dataset designed to support hybrid climate simulation that combines ML with physics-based models. Conventional climate simulators are limited by computational cost, so they cannot reach the resolution needed to represent critical processes such as storms, convective cloud systems, and extreme rainfall accurately. ClimSim is intended to help close that gap.

Overview and Dataset Construction

ClimSim comprises 5.7 billion multivariate input-output pairs derived from multi-scale climate simulations. Its design addresses the need for high-resolution data that captures how local, nested small-scale physics influences the macro-scale state variables of a host climate simulator. The dataset is global, spans ten years, and is sampled at a high temporal frequency, which makes it better suited to operational coupling with climate simulators.

The data were generated by running a high-resolution multi-scale climate simulator, E3SM-MMF, on GPU-based systems, at a cost of thousands of GPU-hours. ClimSim is intended to be more than a medium for developing ML models: it includes expanded input and output vectors that cover the full range of atmospheric processes integral to climate simulation, with operational use in mind.

Baseline Models and Performance Evaluation

To demonstrate the dataset's applicability, the paper reports experiments with several ML baselines, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs), encoder-decoder networks, heteroskedastic regression, and random ensemble methods, among others. These baselines illustrate the difficulty of the ClimSim task, particularly the emulation of temperature tendencies (dT/dt) and humidity tendencies (dq/dt), which are central to representing convection and cloud processes below the grid scale of the host simulator.
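A tendency such as dT/dt can be read as a finite difference of the state before and after the sub-grid physics acts over one time step. The snippet below is a minimal sketch of that idea, not the dataset's own preprocessing code: the function name `tendency`, the three-level temperature profile, and the 1200 s step are illustrative assumptions.

```python
import numpy as np

def tendency(state_before, state_after, dt_seconds):
    """Finite-difference tendency: (after - before) / dt, per vertical level."""
    before = np.asarray(state_before, dtype=float)
    after = np.asarray(state_after, dtype=float)
    return (after - before) / dt_seconds

# Hypothetical temperature profile (K) on three levels,
# before and after the sub-grid physics has acted.
T_before = np.array([220.0, 250.0, 290.0])
T_after = np.array([220.6, 249.4, 290.0])

# Assumed time step of 1200 s (an illustrative value); result is in K/s.
dT_dt = tendency(T_before, T_after, dt_seconds=1200.0)
print(dT_dt)
```

The same function applies to humidity (dq/dt) by passing specific-humidity profiles instead of temperature. An emulator trained on such targets predicts the tendency, and the host model then adds tendency times dt back to its state.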

Quantitatively, the paper reports mean absolute error (MAE) and R² across model architectures, showing how well each ML approach captures the deterministic and stochastic components of sub-grid processes. MLP models perform best in the lower atmosphere, while stochastic models show greater skill in the upper atmosphere, which suggests that different modeling strategies have distinct strengths at different heights.
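For readers less familiar with these metrics, the sketch below computes MAE and R² separately for each output column, which is one way to compare skill level by level. It uses the standard definitions only; the array shapes, function names, and toy values are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np

def mae_per_level(y_true, y_pred):
    """Mean absolute error for each output column (e.g. each vertical level)."""
    return np.mean(np.abs(y_pred - y_true), axis=0)

def r2_per_level(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot, per column.

    Assumes every column of y_true has non-zero variance.
    """
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

# Toy data: 4 samples, 2 output levels (hypothetical values).
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_pred = np.array([[1.0, 12.0], [2.0, 18.0], [3.0, 32.0], [4.0, 38.0]])

mae = mae_per_level(y_true, y_pred)  # [0.0, 2.0]
r2 = r2_per_level(y_true, y_pred)    # [1.0, 0.968]
print(mae, r2)
```

Reporting both metrics per level, rather than as a single average, is what makes it possible to see a pattern like one model family doing better near the surface and another aloft.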

Implications and Future Speculation

The release of an open-access dataset like ClimSim has significant implications for advancing the integration of ML techniques in climate science. The data supports long-term goals of developing hybrid climate simulation models that leverage the computational efficiency of ML to emulate the detailed physics currently resolved by nested simulators. By facilitating engagement across disciplinary boundaries, ClimSim holds potential for improving model accuracy and operational climate predictions.

Furthermore, the paper calls for more work on hybrid testing workflows, in which ML models run inside established physical climate modeling frameworks. It places particular emphasis on stochastic modeling components, which better reflect real-world atmospheric variability and promise more robust emulator designs.

In conclusion, ClimSim stands as a foundational effort to push the frontier of hybrid ML-physics climate modeling. While it exposes challenges, such as the need for operational testing frameworks and multi-climate extensions, it opens several avenues for future exploration, including the potential application of dimensionality reduction techniques for enhanced interpretability and causal pruning for optimizing model input selection. The dataset could significantly shift paradigms in computational climate science, ultimately promising better-informed policy decisions based on high-accuracy simulations.

Authors (47)

Sungduk Yu (16 papers)
Walter Hannah (2 papers)
Liran Peng (6 papers)
Jerry Lin (9 papers)
Mohamed Aziz Bhouri (11 papers)
Ritwik Gupta (23 papers)
Björn Lütjens (20 papers)
Gunnar Behrens (5 papers)
Nora Loose (1 paper)
Tom Beucler (31 papers)
Bryce Harrop (4 papers)
Andrea Jenney (1 paper)
Savannah Ferretti (1 paper)
Nana Liu (54 papers)
Veronika Eyring (16 papers)
Pierre Gentine (51 papers)
Stephan Mandt (100 papers)
Akshay Subramaniam (10 papers)
Rose Yu (84 papers)
Laure Zanna (32 papers)

Citations (10)
