A Practical Probabilistic Benchmark for AI Weather Models (2401.15305v2)
Abstract: Since the weather is chaotic, forecasts aim to predict the distribution of future states rather than make a single prediction. Recently, multiple data-driven weather models have emerged claiming breakthroughs in skill. However, these have mostly been benchmarked using deterministic skill scores, and little is known about their probabilistic skill. Unfortunately, it is hard to fairly compare AI weather models in a probabilistic sense, since variations in the choice of ensemble initialization, definition of state, and noise injection methodology become confounding factors. Moreover, even obtaining ensemble forecast baselines is a substantial engineering challenge given the data volumes involved. We sidestep both problems by applying a decades-old idea, lagged ensembles, whereby an ensemble can be constructed from a moderately sized library of deterministic forecasts. This allows the first parameter-free intercomparison of leading AI weather models' probabilistic skill against an operational baseline. The results reveal that two leading AI weather models, GraphCast and Pangu, are tied on the probabilistic CRPS metric even though the former outperforms the latter in deterministic scoring. We also reveal how multiple time-step loss functions, which many data-driven weather models have employed, are counter-productive: they improve deterministic metrics at the cost of increased dissipation, which deteriorates probabilistic skill. This is confirmed through ablations applied to a spherical Fourier Neural Operator (SFNO) approach to AI weather forecasting. Separate SFNO ablations that modulate effective resolution reveal that it has a useful effect on ensemble dispersion, which is relevant to achieving good ensemble calibration. We hope these and forthcoming insights from lagged ensembles can help guide the development of AI weather forecasts, and we have thus shared the diagnostic code.
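To make the lagged-ensemble idea concrete, the following is a minimal sketch, not the authors' diagnostic code. It assumes a hypothetical array `forecasts[i, l]` holding the deterministic forecast initialized at step `i` with lead `l` (so it is valid at step `i + l`), gathers all forecasts from neighboring initializations that verify at the same time, and scores them with the standard ensemble estimator of CRPS, E|X - y| - 0.5 E|X - X'|. The centered lag window, the exclusion of lead 0, the function names, and the toy data are all illustrative assumptions.

```python
import numpy as np

def lagged_ensemble(forecasts, init_index, lead, max_lag):
    """Collect forecasts from nearby initializations that share one valid time.

    Assumed layout (hypothetical): forecasts[i, l] is the forecast
    initialized at step i with lead l, valid at step i + l.
    """
    valid = init_index + lead
    members = []
    for lag in range(-max_lag, max_lag + 1):
        i = init_index + lag   # shifted initialization time
        l = valid - i          # lead needed to reach the common valid time
        # Skip initializations outside the library and lead 0 (the analysis).
        if 0 <= i < forecasts.shape[0] and 0 < l < forecasts.shape[1]:
            members.append(forecasts[i, l])
    return np.array(members)

def crps_ensemble(members, obs):
    """Ensemble CRPS estimate for a scalar observation: E|X-y| - 0.5 E|X-X'|."""
    members = np.asarray(members, dtype=float)
    skill = np.mean(np.abs(members - obs))
    spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return skill - 0.5 * spread

# Toy library (assumption): truth is a sine wave, and forecast error
# grows linearly with lead time.
rng = np.random.default_rng(0)
n_init, n_lead = 40, 8
truth = np.sin(0.3 * np.arange(n_init + n_lead))
library = np.empty((n_init, n_lead))
for i in range(n_init):
    for l in range(n_lead):
        library[i, l] = truth[i + l] + 0.1 * l * rng.standard_normal()

ens = lagged_ensemble(library, init_index=20, lead=4, max_lag=2)
score = crps_ensemble(ens, truth[20 + 4])
print(f"{ens.size} members, CRPS = {score:.3f}")
```

Because the members come only from forecasts that already exist in the library, no perturbation scheme or noise amplitude has to be chosen, which is the sense in which the comparison is parameter-free.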