- The paper demonstrates that HRES, a numerical model, consistently outperforms leading AI models in forecasting record-breaking heat, cold, and wind extremes.
- It employs latitude-weighted RMSE, bootstrap resampling, and precision-recall analysis to rigorously evaluate forecasting skill on a benchmark dataset from 2018 and 2020.
- The study highlights AI models' inherent limitations in extrapolating beyond training data, suggesting remedies like data augmentation and hybrid modeling.
Evaluation of AI and Numerical Weather Prediction Models for Record-Breaking Extremes
Introduction
The paper "Numerical models outperform AI weather forecasts of record-breaking extremes" (2508.15724) systematically evaluates state-of-the-art AI weather forecasting models (GraphCast, Pangu-Weather, FuXi, and their operational variants) against HRES, the leading numerical weather prediction (NWP) model from ECMWF. The focus is on each model's ability to forecast record-breaking extremes in temperature and wind, which are critical for early warning systems and disaster management. The paper uses a benchmark dataset of record-breaking events, defined locally per grid cell and calendar month, to assess how well the models extrapolate beyond the training domain.
Benchmark Dataset and Methodology
The benchmark comprises all land-based events in 2018 and 2020 where observed 2m temperature or 10m wind speed exceeded the historical monthly maximum, or fell below the historical monthly minimum, of the ERA5 training period (1979–2017). This yields a substantial sample size (e.g., 162,751 heat records in 2020), enabling robust statistical analysis across seasons and climate zones.
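The record definition above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names, the `(cell, month, value)` tuple layout, and the toy values are all assumptions made for the example.

```python
def monthly_record_thresholds(history):
    """Build per-cell, per-calendar-month extremes from a reference period.

    history: iterable of (cell, month, value) tuples (hypothetical layout).
    Returns {(cell, month): (lowest_value, highest_value)}.
    """
    thresholds = {}
    for cell, month, value in history:
        key = (cell, month)
        if key in thresholds:
            lo, hi = thresholds[key]
            thresholds[key] = (min(lo, value), max(hi, value))
        else:
            thresholds[key] = (value, value)
    return thresholds


def classify_record(thresholds, cell, month, value):
    """Return 'high' or 'low' if value breaks the reference record, else None."""
    lo, hi = thresholds[(cell, month)]
    if value > hi:
        return "high"
    if value < lo:
        return "low"
    return None


# Toy reference period for one grid cell (values are made up).
cell = (0, 0)
history = [
    (cell, 7, 30.1), (cell, 7, 33.4), (cell, 7, 28.9),  # July values
    (cell, 1, -2.0), (cell, 1, 1.5),                    # January values
]
thresholds = monthly_record_thresholds(history)

july_heat = classify_record(thresholds, cell, 7, 34.0)     # above July max
january_cold = classify_record(thresholds, cell, 1, -3.2)  # below January min
july_normal = classify_record(thresholds, cell, 7, 31.0)   # within range
```

Keying the thresholds by both grid cell and calendar month is what makes the definition local: a value that is unremarkable in July can still be a record for January at the same location.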
Figure 1: Spatial and latitudinal distribution of heat records in 2020, and RMSE comparison for all events and record-breaking events across models and lead times.
Forecast skill is quantified using latitude-weighted RMSE and forecast bias, with confidence intervals derived via the central limit theorem and bootstrap resampling. Precision-recall curves and binary correlation metrics are employed to assess the models' ability to predict the occurrence of record-breaking events, accounting for both false positives and false negatives.
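As a rough sketch of these two ingredients, the code below computes a latitude-weighted RMSE and a percentile bootstrap confidence interval. It assumes cosine-of-latitude weights normalized to a mean of one, a common convention in weather benchmarks; the paper's exact weighting and resampling scheme may differ, and all function names are hypothetical.

```python
import math
import random


def lat_weighted_rmse(forecast, observed, lats_deg):
    """RMSE with weights proportional to cos(latitude), normalized to mean 1."""
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_w = sum(weights) / len(weights)
    weighted_sq = [
        (w / mean_w) * (f - o) ** 2
        for w, f, o in zip(weights, forecast, observed)
    ]
    return math.sqrt(sum(weighted_sq) / len(weighted_sq))


def bootstrap_ci(forecast, observed, lats_deg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the weighted RMSE, resampling events."""
    rng = random.Random(seed)
    n = len(forecast)
    stats = []
    for _ in range(n_boot):
        # Draw n events with replacement and recompute the statistic.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(lat_weighted_rmse(
            [forecast[i] for i in idx],
            [observed[i] for i in idx],
            [lats_deg[i] for i in idx],
        ))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data: five events at different latitudes (values are made up).
forecast = [30.0, 31.5, 28.0, 35.0, 33.0]
observed = [31.0, 33.0, 28.5, 38.0, 33.5]
lats_deg = [0.0, 20.0, 40.0, 50.0, 60.0]

rmse = lat_weighted_rmse(forecast, observed, lats_deg)
ci_low, ci_high = bootstrap_ci(forecast, observed, lats_deg, n_boot=200)
```

The weighting keeps the dense grid points near the poles of a regular latitude-longitude grid from dominating the score, since each of those cells covers less area.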
While AI models (except Pangu-Weather) generally outperform HRES in forecasting 2m temperature for all events, and all AI models surpass HRES for 10m wind speed, this advantage reverses for record-breaking extremes. HRES consistently yields lower RMSE for heat, cold, and wind records across nearly all lead times, with the performance gap most pronounced at short lead times (12–24 hours). The superiority of HRES persists across different years (2018, 2020), ENSO phases, seasons, and climate zones.
Figure 2: Regional RMSE for 2m temperature and 10m wind speed, demonstrating HRES's robustness across hemispheres and the tropics for record-breaking events.
Operational variants of GraphCast and Pangu-Weather, evaluated against HRES-fc0 ground truth, confirm the same pattern: HRES outperforms AI models on records, even when controlling for the forecaster's dilemma by conditioning on forecasted rather than observed extremes.
Systematic Biases and Extrapolation Limitations
AI models show a pronounced tendency to underestimate the intensity of heat and wind records and to forecast cold records as too warm, with the magnitude of the forecast bias growing nearly linearly with the margin by which the record is exceeded. This behavior is consistent across models, regions, and seasons, indicating a structural limitation in neural network-based extrapolation.
Figure 3: Forecast bias and RMSE as a function of record exceedance margin, highlighting systematic underprediction by AI models and more balanced errors in HRES.
The bias is not merely a function of model architecture but reflects the inability of purely data-driven models to generalize beyond the range of their training data. In contrast, HRES, grounded in physical principles and PDEs, maintains more stable error and bias profiles even for unprecedented extremes.
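The near-linear growth of bias with exceedance margin described above is what one would expect from a forecast that saturates near the previous record. The toy example below illustrates this with a hypothetical model that never predicts above the prior record; the numbers are invented for illustration and are not results from the paper.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Previous monthly maximum at one grid cell (made-up value).
prior_record = 35.0

# Observed record-breaking temperatures at increasing exceedance margins.
observed = [35.5, 36.0, 37.0, 38.0]

# Hypothetical "capped" forecast: never exceeds the prior record.
capped_forecast = [min(obs, prior_record) for obs in observed]

margins = [obs - prior_record for obs in observed]             # how far the record was exceeded
errors = [f - obs for f, obs in zip(capped_forecast, observed)]  # forecast minus observation

slope, intercept = fit_line(margins, errors)
```

For a fully capped forecast the fitted slope is -1: every additional degree of exceedance adds a degree of underprediction. A slope between -1 and 0 would correspond to a forecast that follows the record only partway.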
Prediction of Record Occurrence
AI models systematically underpredict the frequency of record-breaking events, resulting in high false negative rates and low recall. HRES forecasts a number of records comparable to its ground truth, with precision-recall curves consistently closer to the ideal (precision = 1, recall = 1) across all event types and lead times.
Figure 4: Counts of record-breaking events, precision-recall curves, and binary correlation metrics for GraphCast and HRES, demonstrating superior classification skill of HRES.
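To make the link between underprediction and low recall concrete, the sketch below computes precision and recall from binary record flags. The function name and the toy flags are assumptions for illustration; the paper's curves are traced out over many operating points, whereas this shows a single one.

```python
def precision_recall(forecast_flags, observed_flags):
    """Precision and recall for binary record / no-record forecasts.

    Returns NaN for a quantity whose denominator is zero.
    """
    pairs = list(zip(forecast_flags, observed_flags))
    true_pos = sum(1 for f, o in pairs if f and o)
    false_pos = sum(1 for f, o in pairs if f and not o)
    false_neg = sum(1 for f, o in pairs if not f and o)

    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else float("nan")
    recall = true_pos / actual if actual else float("nan")
    return precision, recall


# Toy case: four observed records, but the forecast flags only two events,
# one of which is a false alarm (flags are made up).
observed_flags = [1, 1, 1, 1, 0, 0, 0, 0]
forecast_flags = [1, 0, 0, 0, 0, 0, 0, 1]

precision, recall = precision_recall(forecast_flags, observed_flags)
```

Here the forecast misses three of the four observed records, so recall is 0.25 even though half of the flagged events are correct (precision 0.5). A model that rarely forecasts records at all is pushed toward this low-recall corner.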
Correlation analysis reveals that AI models tend to make errors on the same events, likely due to shared biases from common training data. HRES correlates more strongly with its own ground truth than any AI model does with ERA5, further substantiating its reliability for extremes.
Theoretical and Practical Implications
The findings underscore a fundamental challenge in neural network-based weather forecasting: out-of-distribution generalization. AI models, lacking explicit physical constraints, interpolate within the training domain and implicitly cap predictions at observed extremes. This limits their utility for high-stakes applications where accurate extrapolation is essential.
Figure 5: Illustration of record and extrapolation definitions, showing how record-breaking events lie outside the convex hull and univariate range of training data.
Potential remedies include:
- Data Augmentation: Incorporating synthetic extremes from climate model simulations or ensemble boosting to enrich the training set.
- Hybrid Modeling: Integrating AI components into physical models to combine efficiency with physical consistency and extrapolation capability.
- Extreme-Oriented Loss Functions: Adapting training objectives using principles from extreme value theory to prioritize accuracy on rare, high-impact events.
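As a very simple stand-in for the last idea, the sketch below upweights squared errors on samples whose target lies above a chosen threshold. This is a plain tail-weighted MSE, not a loss derived from extreme value theory and not a method taken from the paper; the function name, `threshold`, and `tail_weight` are hypothetical choices for illustration.

```python
def tail_weighted_mse(pred, target, threshold, tail_weight=5.0):
    """Weighted mean squared error that emphasizes targets above `threshold`.

    Samples with target > threshold get weight `tail_weight`; all others get 1.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, t in zip(pred, target):
        w = tail_weight if t > threshold else 1.0
        weighted_sum += w * (p - t) ** 2
        weight_total += w
    return weighted_sum / weight_total


# Toy batch: two ordinary days forecast perfectly, one extreme day
# underpredicted by 5 degrees (values are made up).
target = [20.0, 22.0, 40.0]
pred = [20.0, 22.0, 35.0]

plain_loss = tail_weighted_mse(pred, target, threshold=35.0, tail_weight=1.0)
tail_loss = tail_weighted_mse(pred, target, threshold=35.0, tail_weight=5.0)
```

With equal weights the single large miss is averaged over three samples; with the tail weight it dominates the loss, so an optimizer would be pushed harder to reduce errors on the extreme sample. Reweighting alone does not supply values beyond the training range, which is why it is listed alongside data augmentation and hybrid modeling rather than instead of them.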
Despite rapid advances, current AI models should not be relied on alone in early warning and disaster management systems, especially under accelerating climate change.
Conclusion
This paper provides robust evidence that numerical weather prediction models, specifically HRES, outperform leading AI models in forecasting record-breaking temperature and wind extremes. The limitations of AI models are rooted in their inability to extrapolate beyond the training domain, manifesting as systematic underestimation of event intensity and frequency. While AI models offer advantages in speed and efficiency for average conditions, their current generation is not suitable for high-stakes forecasting of unprecedented extremes. Continued parallel development and rigorous evaluation of both NWP and AI models are essential, with future research needed to address structural extrapolation challenges in neural network architectures.