- The paper introduces an open benchmark framework that standardizes datasets and evaluation protocols for medium-range weather forecasting.
- It pairs deterministic metrics (RMSE, ACC, SEEPS) with the probabilistic CRPS to assess forecast skill, realism, and uncertainty.
- The paper shows that leading AI models are competitive with traditional models, matching IFS-level skill up to six days of lead time, and uses spectral analysis to expose blurring in AI forecasts.
WeatherBench 2: A Comprehensive Benchmark for Data-Driven Global Weather Models
The paper "WeatherBench 2: A benchmark for the next generation of data-driven global weather models" presents an updated framework for evaluating and accelerating progress in data-driven global medium-range weather forecasting. It builds on its predecessor, WeatherBench 1, and enables like-for-like comparisons between traditional numerical weather prediction (NWP) models and AI-based methods.
Benchmark Scope and Evaluation Protocol
WeatherBench 2 evaluates models over medium-range forecast horizons (1–14 days), the time scale on which high-impact weather events such as cyclones and heat waves develop. The benchmark includes an open-source evaluation framework, standardized metrics, training datasets, and a regularly updated public leaderboard. The evaluation protocol closely follows established practice at leading operational weather centers such as ECMWF, using a suite of deterministic and probabilistic metrics that includes RMSE, ACC, CRPS, and SEEPS.
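As a rough illustration of the deterministic metrics named above, the following sketch computes a latitude-weighted RMSE and ACC on a regular latitude-longitude grid. The function names, the cosine-latitude weighting, and the toy grid are assumptions made for illustration; they are not taken from the WeatherBench 2 codebase, whose exact weighting and averaging conventions may differ.

```python
import numpy as np

def latitude_weights(lat_deg):
    """Cosine-latitude weights normalized to mean 1 (an assumed
    approximation of grid-cell area weighting)."""
    w = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))
    return w / w.mean()

def weighted_rmse(forecast, truth, lat_deg):
    """Latitude-weighted RMSE of a (lat, lon) field."""
    w = latitude_weights(lat_deg)[:, None]  # broadcast over longitude
    return float(np.sqrt(np.mean(w * (forecast - truth) ** 2)))

def weighted_acc(forecast, truth, climatology, lat_deg):
    """Latitude-weighted anomaly correlation coefficient.
    Anomalies are departures from the supplied climatology."""
    w = latitude_weights(lat_deg)[:, None]
    fa = forecast - climatology
    ta = truth - climatology
    num = np.sum(w * fa * ta)
    den = np.sqrt(np.sum(w * fa ** 2) * np.sum(w * ta ** 2))
    return float(num / den)

# Toy example on a hypothetical 8 x 16 grid.
lat = np.linspace(-87.5, 87.5, 8)
rng = np.random.default_rng(0)
climatology = rng.normal(size=(8, 16))
truth = climatology + rng.normal(size=(8, 16))
forecast = truth + 2.0  # constant offset of 2 everywhere

rmse = weighted_rmse(forecast, truth, lat)
acc = weighted_acc(truth, truth, climatology, lat)
```

Because the weights are normalized to mean 1, a constant error of 2 yields a weighted RMSE of 2, and a forecast identical to the truth yields an ACC of 1.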
Design Decisions and Challenges
A notable aspect of WeatherBench 2 is the care taken to design a benchmark that accommodates the high-dimensional nature of weather prediction. This means balancing choices about model formulation, input data, and forecast system design, while emphasizing a probabilistic framework to account for the inherent uncertainty of forecasts.
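The probabilistic emphasis described above is commonly scored with CRPS. For an ensemble, CRPS can be estimated as the mean absolute error of the members minus half the mean absolute difference between member pairs. The sketch below is one such estimator under stated assumptions: the function name and the `fair` flag (which switches between dividing by M(M-1) and by M^2) are illustrative and not drawn from the paper's code.

```python
import numpy as np

def ensemble_crps(members, obs, fair=True):
    """Estimate CRPS from an ensemble.

    members: array of shape (M, ...) holding M ensemble members.
    obs:     array of shape (...) holding the verifying value(s).
    fair:    if True, normalize the spread term by M * (M - 1)
             instead of M * M (finite-ensemble correction).
    """
    members = np.asarray(members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    m = members.shape[0]
    if m < 2:
        raise ValueError("need at least two ensemble members")

    # Skill term: mean absolute error of the members.
    skill = np.mean(np.abs(members - obs), axis=0)

    # Spread term: mean absolute difference over member pairs.
    pairwise = np.abs(members[:, None] - members[None, :])
    denom = m * (m - 1) if fair else m * m
    spread = pairwise.sum(axis=(0, 1)) / denom

    return skill - 0.5 * spread

# Identical members carry no spread, so CRPS reduces to absolute error.
crps_point = ensemble_crps([1.0, 1.0, 1.0], 3.0)

# Two members bracketing the observation.
crps_fair = ensemble_crps([0.0, 2.0], 1.0, fair=True)
crps_biased = ensemble_crps([0.0, 2.0], 1.0, fair=False)
```

Lower CRPS is better. An ensemble whose members all agree is scored like a deterministic forecast, which is why CRPS is often described as a probabilistic generalization of mean absolute error.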
The challenges discussed include the representation of unresolved physical processes in NWP and the fidelity of data-driven models. ERA5 serves as the source for both training data and ground truth, a choice based on its comprehensive temporal and spatial coverage, although its accuracy is limited for certain variables, notably precipitation.
The paper also examines the trade-off between forecast realism and smoothing. AI models trained to minimize a mean error tend to produce blurred forecasts, because under uncertainty the average of the plausible outcomes scores better on such a loss than any single sharp outcome.
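The link between mean-error minimization and smoothing can be seen in a toy setting. The sketch below assumes a one-dimensional field in which a sharp feature is equally likely to appear at one of two locations; the setup, names, and numbers are all hypothetical and serve only to illustrate the mechanism, not to reproduce any result from the paper.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 201)

def bump(center, width=0.03):
    """A narrow Gaussian feature with peak amplitude 1."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

# Two equally likely "true" outcomes: the feature sits at 0.4 or at 0.6.
outcomes = np.stack([bump(0.4), bump(0.6)])

# Candidate forecasts: commit to one sharp outcome, or average both.
sharp_forecast = outcomes[0]
mean_forecast = outcomes.mean(axis=0)

def expected_mse(forecast):
    """Mean squared error averaged over both outcomes and all grid points."""
    return float(np.mean((outcomes - forecast) ** 2))

mse_sharp = expected_mse(sharp_forecast)
mse_mean = expected_mse(mean_forecast)
peak_sharp = float(sharp_forecast.max())
peak_mean = float(mean_forecast.max())
```

In this setup the averaged forecast has a lower expected MSE than the sharp one, yet its peak amplitude is roughly half that of either real outcome. It is a field that never actually occurs, which is the sense in which an MSE-optimal forecast can be unrealistic.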
Results and Observations
The evaluation results show that leading AI models such as GraphCast, Pangu-Weather, and NeuralGCM are competitive with ECMWF's operational IFS models. Notably, the deterministic AI models match the skill of IFS HRES up to 6 days of lead time, and NeuralGCM's ensemble performs well against IFS ENS on probabilistic metrics at longer lead times. However, the results also highlight blurring in AI forecasts, which is evident in both the spectral analysis and the spatial bias evaluations.
Zonal spectral energy analysis shows that AI models and traditional NWP distribute energy differently across spatial scales, reflecting differences in how they represent small-scale variability. The bias assessment further indicates that AI models need to learn the dependencies between variables more accurately, particularly the correlations that affect wind speed predictions.
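A zonal energy spectrum of the kind discussed above can be approximated by taking a Fourier transform along each latitude circle and reporting power per zonal wavenumber. The following is a minimal sketch assuming a regular longitude grid; the function name and normalization are illustrative choices, and the paper's exact procedure (for example, which latitudes are averaged) is not reproduced here.

```python
import numpy as np

def zonal_power_spectrum(field):
    """Power per zonal wavenumber for each latitude row.

    field: array of shape (lat, lon) on a regular longitude grid.
    Returns an array of shape (lat, lon // 2 + 1), normalized so that
    each row sums to the mean of field**2 along longitude.
    """
    field = np.asarray(field, dtype=float)
    n_lon = field.shape[-1]
    coeffs = np.fft.rfft(field, axis=-1) / n_lon
    power = np.abs(coeffs) ** 2
    # Fold in the negative wavenumbers that rfft omits.
    power[..., 1:] *= 2.0
    if n_lon % 2 == 0:
        # The Nyquist wavenumber has no negative counterpart.
        power[..., -1] /= 2.0
    return power

# Toy field: a pure wavenumber-3 wave repeated on 4 latitude rows.
lon = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
field = np.tile(np.cos(3.0 * lon), (4, 1))
spectrum = zonal_power_spectrum(field)

# A simple 5-point running mean along longitude stands in for a blurred forecast.
kernel = np.ones(5) / 5.0
blurred = np.stack([
    np.convolve(np.concatenate([row[-2:], row, row[:2]]), kernel, mode="valid")
    for row in field
])
blurred_spectrum = zonal_power_spectrum(blurred)
```

Comparing `spectrum` with `blurred_spectrum` shows the smoothed field carrying less power at the same wavenumber. A blurred forecast shows up in this kind of diagnostic as an energy deficit that grows toward higher wavenumbers.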
Future Directions and Implications
WeatherBench 2 positions itself as a pivotal step in accelerating the development of global data-driven weather models. Its evolving framework is intended to accommodate new advances in AI and weather prediction methods. The discussion acknowledges the need for better probabilistic forecasting, integration of direct observational data, and calibrated ensemble methods, all of which are critical to capturing weather extremes and improving realism.
This initiative invites further exploration into hybrid models integrating AI with physical dynamics, potentially leading to more accurate and reliable forecasts. Additionally, by providing a standardized basis for comparison, WeatherBench 2 can guide researchers in innovating AI-based approaches that blend improved computational capabilities with the expertise of meteorological sciences.
In conclusion, WeatherBench 2 serves as more than a benchmark: it acts as a catalyst for collaboration and continuous improvement at the intersection of AI and atmospheric sciences, with the aim of advancing both the understanding and the prediction of weather phenomena.