Numerical models outperform AI weather forecasts of record-breaking extremes (2508.15724v1)

Published 21 Aug 2025 in physics.ao-ph, cs.AI, and stat.AP

Abstract: AI-based models are revolutionizing weather forecasting and have surpassed leading numerical weather prediction systems on various benchmark tasks. However, their ability to extrapolate and reliably forecast unprecedented extreme events remains unclear. Here, we show that for record-breaking weather extremes, the numerical model High RESolution forecast (HRES) from the European Centre for Medium-Range Weather Forecasts still consistently outperforms state-of-the-art AI models GraphCast, GraphCast operational, Pangu-Weather, Pangu-Weather operational, and Fuxi. We demonstrate that forecast errors in AI models are consistently larger for record-breaking heat, cold, and wind than in HRES across nearly all lead times. We further find that the examined AI models tend to underestimate both the frequency and intensity of record-breaking events, and they underpredict hot records and overestimate cold records with growing errors for larger record exceedance. Our findings underscore the current limitations of AI weather models in extrapolating beyond their training domain and in forecasting the potentially most impactful record-breaking weather events that are particularly frequent in a rapidly warming climate. Further rigorous verification and model development is needed before these models can be solely relied upon for high-stakes applications such as early warning systems and disaster management.

Summary

  • The paper demonstrates that ECMWF’s HRES consistently outperforms AI models in forecasting record-breaking heat, cold, and wind events.
  • The study employs a comprehensive benchmark using historical data and metrics such as latitude-weighted RMSE, bias, and precision-recall to evaluate model performance.
  • Analysis reveals that AI models exhibit systematic bias and limited extrapolation capability due to training data constraints, underscoring the need for hybrid forecasting approaches.

Evaluation of AI and Numerical Weather Prediction Models for Record-Breaking Extremes

Introduction

The paper "Numerical models outperform AI weather forecasts of record-breaking extremes" (2508.15724) presents a systematic evaluation of state-of-the-art AI weather models (GraphCast, Pangu-Weather, and Fuxi, plus the operational variants of GraphCast and Pangu-Weather) against the leading numerical weather prediction (NWP) system, ECMWF's High RESolution forecast (HRES), with a focus on forecasting record-breaking heat, cold, and wind events. The paper addresses a critical gap in the literature: while AI models have demonstrated competitive or superior performance on standard benchmarks and moderate extremes, their reliability in extrapolating to unprecedented, high-impact events has not been quantified. The authors construct a benchmark dataset of record-breaking events, defined locally per grid cell and calendar month, and compare model skill, bias, and classification metrics across multiple years, regions, and seasons.

Benchmark Dataset and Methodology

The benchmark comprises all land-based observations in 2018 and 2020 that exceed the historical monthly maxima or minima from the training period (1979–2017), yielding large sample sizes (e.g., 162,751 heat records in 2020). This approach ensures that evaluation targets true out-of-distribution events, challenging the models' extrapolation capabilities. Forecast skill is quantified using latitude-weighted RMSE, forecast bias, and precision-recall metrics, with careful attention to the forecaster's dilemma and conditioning on both observations and forecasts.

Figure 1: Spatial and latitudinal distribution of heat records in 2020, and RMSE of HRES, Pangu-Weather, GraphCast, and Fuxi for all events and record-breaking events across lead times.
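The record definition described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names `record_thresholds` and `find_records`, the (time, lat, lon) array layout, and the toy values are all hypothetical, and the land mask used in the paper is omitted.

```python
import numpy as np

def record_thresholds(train, train_months):
    """Per-grid-cell, per-calendar-month maxima and minima of the training data.

    train: array of shape (time, lat, lon)
    train_months: integer array of shape (time,), values 1..12
    Returns (tmax, tmin), each of shape (12, lat, lon).
    Months without training data get +inf / -inf so that no record is flagged.
    """
    train = np.asarray(train, dtype=float)
    train_months = np.asarray(train_months)
    grid = train.shape[1:]
    tmax = np.full((12,) + grid, np.inf)
    tmin = np.full((12,) + grid, -np.inf)
    for m in range(1, 13):
        sel = train[train_months == m]
        if sel.shape[0] > 0:
            tmax[m - 1] = sel.max(axis=0)
            tmin[m - 1] = sel.min(axis=0)
    return tmax, tmin

def find_records(test, test_months, tmax, tmin):
    """Flag test values beyond the training extremes and measure by how much."""
    test = np.asarray(test, dtype=float)
    test_months = np.asarray(test_months)
    hi = tmax[test_months - 1]  # thresholds aligned with test, shape (time, lat, lon)
    lo = tmin[test_months - 1]
    hot = test > hi
    cold = test < lo
    hot_margin = np.where(hot, test - hi, 0.0)    # exceedance above the old maximum
    cold_margin = np.where(cold, lo - test, 0.0)  # exceedance below the old minimum
    return hot, cold, hot_margin, cold_margin

# Toy data: two January fields on a 1 x 2 grid for training and for testing.
train = np.array([[[10.0, 20.0]], [[12.0, 18.0]]])
test = np.array([[[13.0, 19.0]], [[9.0, 21.0]]])
tmax, tmin = record_thresholds(train, [1, 1])
hot, cold, hot_margin, cold_margin = find_records(test, [1, 1], tmax, tmin)
print(int(hot.sum()), int(cold.sum()))
```

Keeping the exceedance margin alongside the boolean mask is convenient because later analyses (such as bias versus record margin) need both.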

Model Performance on Record Intensity

On aggregate metrics, AI models (except Pangu-Weather) outperform HRES in forecasting 2m temperature, and all AI models surpass HRES for 10m wind speed across most lead times. However, when restricted to record-breaking events, HRES consistently yields lower RMSE for heat, cold, and wind records at nearly all lead times, with the performance gap most pronounced at short lead times and persisting across seasons, regions, and both test years (2018 and 2020).

Figure 2: Regional RMSE for 2m temperature and 10m wind speed, demonstrating HRES's superior skill in extratropical and tropical zones for record-breaking events.

The operational variants of GraphCast and Pangu-Weather, evaluated against HRES-fc0 ground truth, confirm the robustness of these findings. Conditioning on forecasted records (rather than observed) yields qualitatively similar results, indicating that the superior extrapolation of HRES is not an artifact of evaluation protocol.
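The contrast between scores over all events and scores restricted to records can be illustrated with a masked, latitude-weighted RMSE. The sketch below assumes cosine-of-latitude weights, a common convention for regular latitude-longitude grids; the paper's exact normalization is not stated in this summary, and the function name `lat_weighted_rmse` and the toy arrays are hypothetical.

```python
import numpy as np

def lat_weighted_rmse(forecast, truth, lats_deg, mask=None):
    """RMSE with cos(latitude) weights, optionally restricted to masked points.

    forecast, truth: arrays of shape (time, lat, lon)
    lats_deg: array of shape (lat,), latitudes in degrees
    mask: optional boolean array of shape (time, lat, lon); True keeps the point
    """
    forecast = np.asarray(forecast, dtype=float)
    truth = np.asarray(truth, dtype=float)
    w = np.cos(np.deg2rad(np.asarray(lats_deg, dtype=float)))[None, :, None]
    w = np.broadcast_to(w, forecast.shape)
    if mask is not None:
        w = np.where(mask, w, 0.0)  # zero weight outside the selected events
    total = w.sum()
    if total == 0:
        return float("nan")  # no events selected
    return float(np.sqrt((w * (forecast - truth) ** 2).sum() / total))

# Toy example: one time step, two latitudes (0 and 60 degrees), one longitude.
lats = np.array([0.0, 60.0])
truth = np.zeros((1, 2, 1))
forecast = np.array([[[1.0], [2.0]]])      # error 1 at the equator, 2 at 60 degrees
record_mask = np.array([[[False], [True]]])  # pretend only the 60-degree point is a record

rmse_all = lat_weighted_rmse(forecast, truth, lats)
rmse_records = lat_weighted_rmse(forecast, truth, lats, mask=record_mask)
print(rmse_all, rmse_records)
```

Passing the record mask through the weights, rather than subsetting the arrays, keeps the latitude weighting intact for whichever points survive the selection.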

Systematic Biases and Extrapolation Limitations

AI models exhibit systematic underprediction of heat and wind record intensities and overprediction of cold records, with forecast bias increasing nearly linearly with the degree of record exceedance. This behavior is consistent across models, regions, and years, and is not observed in HRES, which maintains nearly constant error across record magnitudes.

Figure 3: Forecast bias and RMSE as a function of record exceedance margin, highlighting the growing underestimation of event intensity by AI models.
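A bias-versus-exceedance analysis of this kind can be approximated by binning forecast errors by record margin. The sketch below is an assumption about how such a curve might be computed, not the authors' procedure: `bias_by_margin` is a hypothetical helper, and the toy forecast is deliberately clipped at a made-up training maximum of 30 to show what a hard cap would look like. The values are synthetic and are not results from the paper.

```python
import numpy as np

def bias_by_margin(forecast, truth, margin, bin_edges):
    """Mean bias (forecast minus truth) within bins of record exceedance margin.

    Bin b covers bin_edges[b] <= margin < bin_edges[b + 1]; empty bins give NaN.
    """
    err = np.asarray(forecast, dtype=float) - np.asarray(truth, dtype=float)
    margin = np.asarray(margin, dtype=float)
    idx = np.digitize(margin, bin_edges) - 1
    out = np.full(len(bin_edges) - 1, np.nan)
    for b in range(len(bin_edges) - 1):
        sel = idx == b
        if sel.any():
            out[b] = err[sel].mean()
    return out

# Synthetic heat records above a hypothetical training maximum of 30.
train_max = 30.0
truth = np.array([30.5, 31.5, 32.5, 33.5])
margin = truth - train_max
capped_forecast = np.minimum(truth, train_max)  # toy model that never exceeds train_max

binned_bias = bias_by_margin(capped_forecast, truth, margin, bin_edges=[0, 1, 2, 3, 4])
print(binned_bias)
```

For a forecast that is strictly capped, the negative bias grows one-for-one with the margin; a slope shallower than that would indicate partial rather than total failure to extrapolate.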

The results suggest an implicit cap in AI model predictions, likely reflecting the bounds of the training data distribution. In contrast, the physics-based HRES model, governed by partial differential equations (PDEs) and physical constraints, extrapolates more robustly beyond observed extremes.

Prediction of Record Occurrence

AI models systematically underpredict the frequency of record-breaking events, resulting in high false negative rates and low recall. HRES forecasts a number of records comparable to its ground truth, with only slight overestimation for heat records at short lead times. Precision-recall curves and binary correlation metrics consistently favor HRES over AI models for all record types and lead times.

Figure 4: Counts of record-breaking events, precision-recall curves, and binary correlation for GraphCast and HRES, demonstrating superior classification skill of HRES.

Similar results are observed for Pangu-Weather and Fuxi, with HRES outperforming in both precision and recall across all lead times.
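Treating record occurrence as a binary classification problem, precision and recall follow directly from counts of hits, false alarms, and misses. The snippet below is a minimal sketch for a single threshold; the helper name `record_classification_scores` and the sample values are hypothetical, and the paper's procedure for tracing full precision-recall curves is not reproduced here.

```python
import numpy as np

def record_classification_scores(forecast, observed, threshold):
    """Precision and recall for 'value exceeds the previous record' events.

    threshold may be a scalar or an array broadcastable to the data shape.
    """
    pred = np.asarray(forecast) > threshold   # forecast says: record
    obs = np.asarray(observed) > threshold    # ground truth says: record
    tp = int(np.sum(pred & obs))    # hits
    fp = int(np.sum(pred & ~obs))   # false alarms
    fn = int(np.sum(~pred & obs))   # misses
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall, (tp, fp, fn)

# Toy sample with a previous record of 30: three observed records, two forecast.
observed = np.array([31.0, 32.0, 29.0, 33.0, 28.0])
forecast = np.array([30.5, 29.5, 30.2, 29.9, 27.0])
precision, recall, counts = record_classification_scores(forecast, observed, 30.0)
print(precision, recall, counts)
```

In this toy sample the forecast flags fewer records than were observed and misses two of the three, which is the low-recall pattern the summary attributes to the AI models.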

Theoretical and Practical Implications

The observed extrapolation failure in AI models is consistent with fundamental limitations of neural architectures (transformers, GNNs) in out-of-distribution generalization. The lack of explicit physical constraints and reliance on interpolation within the training domain restricts their ability to forecast unprecedented extremes. Deterministic AI models also tend to smooth fine-scale features, further limiting their utility for high-impact events.

Figure 5: Illustration of record and extrapolation definitions, showing test points outside the training data convex hull and record rectangle.

The findings have direct implications for operational forecasting, early warning systems, and disaster management. Sole reliance on current AI models for high-stakes applications is not warranted, especially in a warming climate where record-breaking extremes are increasingly frequent.

Future Directions

Several strategies are proposed to address AI model limitations:

  • Data Augmentation: Leveraging climate model simulations and ensemble boosting to enrich training data with physically plausible extremes.
  • Hybrid Modeling: Integrating AI components into physical models to combine learning capacity with physical consistency and extrapolation ability.
  • Extreme-Oriented Loss Functions: Adapting principles from statistical learning and extreme value theory (EVT) to improve model skill on rare, high-impact events.

Recent advances in probabilistic AI weather models and hybrid architectures (e.g., NeuralGCM) offer promising avenues, but rigorous evaluation on out-of-distribution extremes remains essential.

Conclusion

The paper provides compelling evidence that current AI weather models, despite their efficiency and skill on average conditions, underperform leading NWP systems in forecasting record-breaking extremes. Systematic underestimation of intensity and frequency, increasing bias with record margin, and limited extrapolation capacity highlight the need for continued development and parallel operation of both AI and numerical models. Future progress will depend on targeted data augmentation, hybrid approaches, and loss function innovation to address the fundamental challenges of out-of-distribution generalization in neural weather forecasting.