
Weather-Augmented Benchmark Overview

Updated 19 July 2025
  • Weather-augmented benchmarks are standardized frameworks that incorporate real or simulated meteorological data into model training and evaluation.
  • They combine diverse data sources such as reanalysis, in-situ observations, and synthetic weather corruptions to simulate adverse conditions.
  • These benchmarks enable fair comparisons across numerical, machine learning, and hybrid methods, enhancing forecasting, perception, and impact analysis.

A weather-augmented benchmark is a standardized dataset or evaluation framework that incorporates real or simulated meteorological phenomena into the data used for training, testing, or comparing machine learning models. The purpose of such benchmarks is to rigorously and reproducibly test model performance for tasks where weather plays a critical role, including forecasting, perception under adverse conditions, or assessing impacts on society and downstream applications.

1. Definition and Scope of Weather-Augmented Benchmarks

A weather-augmented benchmark is designed to evaluate algorithms under the influence of real-world meteorological variability, weather-induced artifacts, or weather impacts across domains. Such benchmarks may include raw observations (e.g., surface station measurements (Zambon et al., 16 Jun 2025, Jin et al., 14 Sep 2024)), post-processed model outputs (Rasp et al., 2020, Rasp et al., 2023), data with weather-simulated corruptions (Mots'oehli et al., 7 Jul 2025, Kuang et al., 16 Mar 2025, Li et al., 3 Feb 2024), or aligned multi-modal records spanning sensor, remote-sensing, and climate event data (Fu et al., 10 Apr 2025). Their use enables fair comparisons between different algorithmic approaches, including numerical weather prediction (NWP), machine learning, and hybrid/ensemble models. Benchmarks often integrate evaluation metrics specific to weather-related requirements, such as probabilistic calibration, skill scores, or task-specific robustness.

2. Core Dataset Modalities and Construction

Weather-augmented benchmarks draw on several forms of meteorological data, each with unique properties and implications for model development:

  • Reanalysis and NWP Products: Datasets such as WeatherBench (Rasp et al., 2020), WeatherBench 2 (Rasp et al., 2023), and ChaosBench (Nathaniel et al., 1 Feb 2024) use global, gridded, assimilated atmospheric data (e.g., ERA5), sometimes downsampled or aligned with operational NWP baselines.
  • In-Situ Observations and Real-World Measurements: WeatherReal (Jin et al., 14 Sep 2024) and PeakWeather (Zambon et al., 16 Jun 2025) offer dense, surface-station observations with high spatiotemporal resolution, capturing local features unresolved by reanalysis.
  • Augmented and Simulated Weather Effects: In domains such as perception, weather-augmented benchmarks apply synthetic weather corruptions (e.g., fog, rain, flare) and refractive distortions to imagery or LiDAR point clouds to evaluate robustness (Mots'oehli et al., 7 Jul 2025, Kuang et al., 16 Mar 2025, Li et al., 3 Feb 2024); a minimal fog-corruption sketch follows this list.
  • Multimodal and Multitask Integration: Datasets like ClimateBench-M (Fu et al., 10 Apr 2025) and WeatherQA (Ma et al., 17 Jun 2024) align weather time series, event/case data, and imagery (including satellite and radar), enabling multitask training: classification, segmentation, anomaly detection, and question answering pertaining to weather events or impacts.
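As a hedged illustration of the synthetic corruption idea above (not the exact pipelines used by the cited benchmarks), the sketch below applies a simple homogeneous-fog scattering model to an RGB image array; the function name, attenuation coefficient, and airlight value are assumptions chosen for demonstration.

```python
import numpy as np

def add_synthetic_fog(image: np.ndarray, depth: np.ndarray,
                      beta: float = 1.5, airlight: float = 0.9) -> np.ndarray:
    """Apply a simple atmospheric-scattering fog model to an RGB image.

    image    : float array in [0, 1], shape (H, W, 3)
    depth    : normalized scene depth in [0, 1], shape (H, W)
    beta     : attenuation coefficient (higher -> denser fog); illustrative value
    airlight : global atmospheric light intensity; illustrative value
    """
    # Transmission map from the standard scattering model: t = exp(-beta * depth)
    transmission = np.exp(-beta * depth)[..., None]          # (H, W, 1)
    fogged = image * transmission + airlight * (1.0 - transmission)
    return np.clip(fogged, 0.0, 1.0)

# Example: corrupt a random placeholder "image" using a synthetic depth ramp
rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
depth = np.linspace(0.0, 1.0, 256)[None, :].repeat(256, axis=0)
foggy = add_synthetic_fog(img, depth)
```

Benchmarks built this way typically sweep the corruption severity (here, beta) to measure how quickly perception performance degrades.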

Benchmarks typically include carefully engineered quality control (QC) processes to ensure accuracy and reliability. For instance, WeatherReal applies a multi-step QC pipeline—including value range checks, Gaussian fitting with median absolute deviation (MAD) scaling, clustering (DBSCAN), and cross-validation with neighboring stations—to retain only physically consistent and locally relevant records (Jin et al., 14 Sep 2024).
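The following is a minimal sketch of the kind of QC steps described above (a value-range check plus MAD-based outlier flagging); the thresholds, variable names, and data layout are assumptions for illustration, not WeatherReal's exact pipeline.

```python
import numpy as np

def qc_temperature(values: np.ndarray,
                   valid_range=(-90.0, 60.0),
                   mad_threshold: float = 5.0) -> np.ndarray:
    """Return a boolean mask of records passing two simple QC checks.

    values        : 1-D array of temperature observations (degC) from one station
    valid_range   : plausible physical bounds; illustrative values
    mad_threshold : how many scaled MADs from the median count as an outlier
    """
    # 1. Value-range check against physically plausible bounds
    in_range = (values >= valid_range[0]) & (values <= valid_range[1])

    # 2. Robust outlier check: median absolute deviation, scaled by 1.4826 to
    #    approximate a Gaussian standard deviation
    median = np.nanmedian(values[in_range])
    mad = np.nanmedian(np.abs(values[in_range] - median))
    sigma = 1.4826 * mad if mad > 0 else np.inf
    not_outlier = np.abs(values - median) <= mad_threshold * sigma

    return in_range & not_outlier

# Example on synthetic observations with two injected bad records
obs = np.array([12.1, 12.4, 11.9, 999.0, 12.2, -45.0, 12.0])
mask = qc_temperature(obs)   # flags 999.0 (range check) and -45.0 (MAD outlier)
```

A production pipeline would add the spatial steps mentioned above, such as DBSCAN clustering and cross-validation against neighboring stations.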

3. Evaluation Metrics and Benchmarking Methodologies

Weather-augmented benchmarks define task-appropriate, domain-specific scoring metrics. Among the most widely adopted:

  • Error Measures: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), often latitude-weighted for geophysical fields (Rasp et al., 2020, Rasp et al., 2023).
  • Probabilistic Metrics: Continuous Ranked Probability Score (CRPS), spread-skill ratio, and prediction interval coverage probability (PICP) evaluate probabilistic forecast sharpness and calibration (Garg et al., 2022, Brenowitz et al., 27 Jan 2024).
  • Skill Score (SS): Defined as $SS = 1 - \mathrm{RMSE}_{\mathrm{ml}} / \mathrm{RMSE}_{\mathrm{nwp}}$, measuring improvement over NWP baselines (Wang et al., 2018).
  • Robustness Under Corruption: For perception tasks under synthetic weather, peak signal-to-noise ratio (PSNR), endpoint error (EPE), and custom “mean stability rate” (mSR) statistics assess preservation of performance in degraded conditions (Mots'oehli et al., 7 Jul 2025, Kuang et al., 16 Mar 2025).
  • Physical/Spectral Consistency: Metrics like spectral divergence (SpecDiv) and spectral residual (SpecRes) compare the power spectra of predictions against references to diagnose loss of physical detail under extended forecasts (Nathaniel et al., 1 Feb 2024).
  • Task-Specific Indices: For perception, visibility estimation is compared against instrumented ground truth following aviation standards (e.g., error within 1.5 miles per ASTM F3673-23 on AIR-VIEW (Mourning et al., 26 Jun 2025)). In text-based impact tasks, row-wise multi-label accuracy provides a stringent measure of comprehensive impact understanding (Yu et al., 26 May 2025).

Benchmarks frequently require training/test splits by time or location, with dedicated held-out periods or sites to avoid leakage and overfitting. Reproducibility is supported by open-source data and code releases with standardized scripts for metric computation (Rasp et al., 2020, Jin et al., 14 Sep 2024).
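As a minimal sketch of the standardized metric scripts mentioned above, the code below computes the latitude-weighted RMSE and the derived skill score, assuming ERA5-style (time, lat, lon) gridded fields; array shapes and variable names are illustrative.

```python
import numpy as np

def lat_weighted_rmse(forecast: np.ndarray, truth: np.ndarray,
                      lats_deg: np.ndarray) -> float:
    """Latitude-weighted RMSE over a (time, lat, lon) field.

    Grid cells are weighted by cos(latitude) so that the shrinking area of
    cells near the poles does not dominate the global average.
    """
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()                 # normalize to mean 1
    sq_err = (forecast - truth) ** 2                   # (time, lat, lon)
    weighted = sq_err * weights[None, :, None]
    return float(np.sqrt(weighted.mean()))

def skill_score(rmse_ml: float, rmse_nwp: float) -> float:
    """SS = 1 - RMSE_ml / RMSE_nwp; positive values mean the ML model beats NWP."""
    return 1.0 - rmse_ml / rmse_nwp

# Example with random placeholder fields on a 32 x 64 global grid
rng = np.random.default_rng(0)
lats = np.linspace(-90, 90, 32)
truth = rng.standard_normal((10, 32, 64))
ml_fc = truth + 0.1 * rng.standard_normal(truth.shape)
nwp_fc = truth + 0.2 * rng.standard_normal(truth.shape)
ss = skill_score(lat_weighted_rmse(ml_fc, truth, lats),
                 lat_weighted_rmse(nwp_fc, truth, lats))
```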

4. Reference Models and Baselines

Weather-augmented benchmarks are typically accompanied by a suite of reference models against which new methods are compared.

Benchmark results may reveal trade-offs between deterministic and probabilistic skill, quantify the effects of ensemble strategies, or explain sources of error and calibration failures (Brenowitz et al., 27 Jan 2024, Garg et al., 2022).

5. Applications and Societal Impact

The development of weather-augmented benchmarks serves several critical application areas:

  • Operational Weather and Climate Prediction: Improved ensemble and uncertainty-aware predictions for temperature, wind, precipitation, and severe events drive better decision support for agriculture, disaster risk reduction, and water management (Wang et al., 2018, Nathaniel et al., 1 Feb 2024).
  • Robust Environmental Perception: Perception benchmarks for computer vision (e.g., AIR-VIEW, ResLPR, weather-augmented dashcam datasets) specifically address the reliability of AI models in safety-critical settings, such as autonomous driving under adverse weather (Mourning et al., 26 Jun 2025, Mots'oehli et al., 7 Jul 2025, Kuang et al., 16 Mar 2025).
  • Extreme Event Detection and Impact Understanding: Benchmarks evaluating both weather extremes and their impacts—ranging from thunderstorm anomaly alerts (Fu et al., 10 Apr 2025) to retrieval and multi-label classification of historical impact narratives (Yu et al., 26 May 2025)—provide necessary infrastructure for supporting climate resilience and adaptation strategies.
  • Multimodal Reasoning: Datasets like WeatherQA (Ma et al., 17 Jun 2024) enable the assessment of multimodal and domain-informed models in tasks that combine image, sensor, and textual evidence for reasoning about severe weather.

6. Challenges and Future Directions

Current weather-augmented benchmarks reveal persistent challenges:

  • Representation of Uncertainty: Despite improvements, many ML models exhibit underdispersed ensembles and insufficient coverage of forecast uncertainty, especially when trained with multi-step loss functions that favor sharp point estimates over calibrated probabilistic outputs (Brenowitz et al., 27 Jan 2024, Garg et al., 2022); a simple spread-skill computation is sketched after this list.
  • Physical Realism and Small-Scale Structures: Evaluations with spectral and physical metrics (e.g., SpecDiv, SpecRes in ChaosBench (Nathaniel et al., 1 Feb 2024)) show that data-driven models may capture mean states but lose important small-scale or extreme patterns over longer lead times.
  • Generalization: Cross-dataset generalization is an open problem for perception benchmarks, as models trained on one dataset often underperform when tested on diverse, real-world images (Mourning et al., 26 Jun 2025).
  • Data Quality and Ground Truth: Many efforts now focus on in-situ observations as the gold standard for high-impact variables (e.g., weather station measurements of temperature, wind, clouds, and precipitation), due to the limitations of gridded reanalysis (Jin et al., 14 Sep 2024).
  • Multi-task and Multimodal Integration: As research advances toward AGI in climate science, there is a movement toward integrating time series, images, and event records (ClimateBench-M (Fu et al., 10 Apr 2025)), as well as domain-specific severe event reasoning (WeatherQA (Ma et al., 17 Jun 2024)).

The field is expected to evolve toward benchmarks that integrate in-situ, reanalysis, and multi-modal data; include physically grounded and impact-driven metrics; and facilitate reproducible comparisons across a spectrum of methodological paradigms.

7. Representative Equations and Table

Key evaluation formulas used widely in weather-augmented benchmarks include:

| Metric | Formula | Description |
|--------|---------|-------------|
| RMSE | $\mathrm{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}(f_i - t_i)^2}$ | Pointwise error (forecast $f$, truth $t$) |
| Skill Score | $SS = 1 - \mathrm{RMSE}_{\mathrm{ml}} / \mathrm{RMSE}_{\mathrm{nwp}}$ | Skill vs. NWP baseline |
| CRPS | $\mathrm{CRPS}(F_{\mu,\sigma}, y) = \sigma\left\{\dfrac{y-\mu}{\sigma}\left[2\Phi\!\left(\dfrac{y-\mu}{\sigma}\right)-1\right] + 2\varphi\!\left(\dfrac{y-\mu}{\sigma}\right) - \dfrac{1}{\sqrt{\pi}}\right\}$ | Probabilistic skill for Gaussian forecasts |
| SEEPS | Scoring matrix based on precipitation category probabilities; see (Rasp et al., 2023) | Precipitation error score (accounts for dry/light/heavy categories) |

These formulations enable researchers to rigorously quantify progress and dissect forecast errors in a manner sensitive to meteorological conventions and societal needs.
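For instance, the Gaussian closed-form CRPS in the table above can be evaluated as follows; this is a minimal sketch assuming a Gaussian predictive distribution and SciPy availability, with illustrative function and variable names.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu: np.ndarray, sigma: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2) and observation y.

    Implements CRPS(F_{mu,sigma}, y) =
        sigma * { z * (2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) },  z = (y - mu)/sigma.
    Lower values indicate sharper, better-calibrated forecasts.
    """
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z)
                    - 1.0 / np.sqrt(np.pi))

# Example: CRPS of a forecast N(15, 2^2) for an observed temperature of 17.5 degC
score = crps_gaussian(np.array(15.0), np.array(2.0), np.array(17.5))
```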


Weather-augmented benchmarks have become the foundation for rapid advances in weather and climate modeling, perception under environmental variability, and disaster impact assessment. Their continued development is central to both methodological innovation in machine learning and the practical realization of robust, reliable, and application-focused weather intelligence.
