IndiaWeatherBench: Regional Weather ML Benchmark
- IndiaWeatherBench is an open-source platform offering a comprehensive, high-resolution dataset and standardized evaluation protocols for ML-based regional weather forecasting in India.
- It uses curated IMDAA reanalysis data with 43 channels, covering 2000–2019 at 6-hour intervals, with fixed training, validation, and test splits for controlled evaluation.
- The benchmark supports diverse ML models—including convolutional, transformer, and graph-based architectures—facilitating reproducible research and operational forecasting improvements.
IndiaWeatherBench is a comprehensive, open-source dataset and benchmarking platform specifically tailored to data-driven regional weather forecasting over the Indian subcontinent. Created to address the lack of standardized, high-resolution, and reproducible infrastructure for regional ML-based weather prediction, IndiaWeatherBench builds upon high-fidelity regional reanalysis data and provides curated data splits, robust evaluation protocols, and a suite of strong neural and graph-based forecasting model baselines. The benchmark is designed with modularity and extensibility in mind, facilitating application, comparison, and advancement of ML methods for regional weather forecasting both in India and analogous domains.
1. Dataset Construction and Structure
IndiaWeatherBench is constructed from the Indian Monsoon Data Assimilation and Analysis (IMDAA) reanalysis product—a regional, high-resolution reanalysis produced through a collaboration between the Indian Ministry of Earth Sciences, the UK Met Office, and the India Meteorological Department. The IMDAA reanalysis includes over 57 meteorological variables on 63 pressure levels, covering 1979–2020 at an hourly frequency with a native spatial resolution of 0.12° (about 12 km).
For the benchmark, a curated spatial domain spanning 6°N–36.72°N and 66.6°E–97.25°E is selected, corresponding to a 256×256 grid at native resolution and encompassing the Indian subcontinent and nearby oceanic regions. The data is temporally subsampled at 6-hour intervals (00, 06, 12, 18 UTC), covering 2000–2019 and split into fixed training (26,500 samples), validation (1,500), and test (1,500) sets.
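The grid and sampling arithmetic above can be checked with a short sketch. The helper names `grid_axis` and `six_hourly_times` are hypothetical and are not part of the benchmark code; the sketch only assumes a regular 0.12° grid starting at the stated southern bound and a 6-hour sampling step.

```python
from datetime import datetime, timedelta

def grid_axis(start_deg, n_points, step_deg):
    """Coordinates of a regular grid axis in degrees (hypothetical helper)."""
    return [round(start_deg + k * step_deg, 6) for k in range(n_points)]

def six_hourly_times(start, end):
    """All timestamps in [start, end) at a 6-hour step (00, 06, 12, 18 UTC)."""
    times, t = [], start
    while t < end:
        times.append(t)
        t += timedelta(hours=6)
    return times

# 256 rows at 0.12 degrees starting from 6 N.
lats = grid_axis(6.0, 256, 0.12)
# 256 cells of 0.12 degrees span 30.72 degrees, so the northern bound is 36.72 N.
north_edge = 6.0 + 256 * 0.12

# One day of samples yields the four synoptic hours.
times = six_hourly_times(datetime(2000, 1, 1), datetime(2000, 1, 2))
hours = [t.hour for t in times]
```

The same two helpers would apply to any other regular regional grid by changing the origin, point count, and step.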
The selected 43 channels include:
- Single-level (surface) variables: 2m temperature, 10m U/V wind, precipitation, mean sea level pressure, and total cloud cover.
- Pressure-level variables: temperature, geopotential height, wind, humidity at 50, 250, 500, 600, 700, 850, and 925 hPa.
- Static fields: terrain height and land cover.
Distribution formats include Zarr (for scalable, cloud-native analysis) and HDF5 (for efficient samplewise loading in deep learning pipelines) (Nguyen et al., 31 Aug 2025).
2. Evaluation Metrics and Scoring Protocols
IndiaWeatherBench establishes a rigorous evaluation suite, supporting both deterministic and probabilistic forecasting skill assessment.
Deterministic Metrics:
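- Root Mean Square Error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} w(i)\,\big(\hat{X}_{i,j} - X_{i,j}\big)^{2}}$$
where $\hat{X}$ is the forecast and $X$ the ground truth on the $H \times W$ grid, with latitude-dependent weighting to compensate for varying grid cell areas ($w(i) = \cos(\phi_i) \big/ \tfrac{1}{H}\sum_{i'=1}^{H}\cos(\phi_{i'})$, where $\phi_i$ is the latitude of grid row $i$).
- Anomaly Correlation Coefficient (ACC):
$$\mathrm{ACC} = \frac{\sum_{i,j} w(i)\,\hat{A}_{i,j}\,A_{i,j}}{\sqrt{\sum_{i,j} w(i)\,\hat{A}_{i,j}^{2}\;\sum_{i,j} w(i)\,A_{i,j}^{2}}}, \qquad \hat{A} = \hat{X} - C,\quad A = X - C$$
where $C$ is climatology; ACC is computed as the cosine of the angle between the forecast and ground-truth anomaly vectors, latitude-weighted.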
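As an illustration of the two deterministic metrics, the following is a minimal sketch for a single 2-D field stored as nested lists. It assumes cosine-latitude weights normalised to mean 1, which is the common convention in weather ML benchmarks; the function names are hypothetical and the benchmark's own implementation may differ in details such as averaging over samples.

```python
import math

def lat_weights(lats_deg):
    """cos(latitude) weights, normalised so that they average to 1 over rows."""
    cos_lat = [math.cos(math.radians(p)) for p in lats_deg]
    mean_cos = sum(cos_lat) / len(cos_lat)
    return [c / mean_cos for c in cos_lat]

def weighted_rmse(pred, truth, lats_deg):
    """Latitude-weighted RMSE over one H x W field."""
    w = lat_weights(lats_deg)
    n_rows, n_cols = len(pred), len(pred[0])
    total = sum(
        w[i] * (pred[i][j] - truth[i][j]) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return math.sqrt(total / (n_rows * n_cols))

def weighted_acc(pred, truth, clim, lats_deg):
    """Latitude-weighted cosine similarity between anomaly fields."""
    w = lat_weights(lats_deg)
    num = den_pred = den_truth = 0.0
    for i in range(len(pred)):
        for j in range(len(pred[0])):
            a_pred = pred[i][j] - clim[i][j]
            a_truth = truth[i][j] - clim[i][j]
            num += w[i] * a_pred * a_truth
            den_pred += w[i] * a_pred ** 2
            den_truth += w[i] * a_truth ** 2
    return num / math.sqrt(den_pred * den_truth)

# Toy 2 x 2 example with rows at 6 N and 36 N.
lats = [6.0, 36.0]
truth = [[1.0, 2.0], [3.0, 4.0]]
clim = [[0.0, 0.0], [0.0, 0.0]]
biased = [[v + 1.0 for v in row] for row in truth]   # constant error of 1
scaled = [[2.0 * v for v in row] for row in truth]   # same anomaly pattern

rmse_biased = weighted_rmse(biased, truth, lats)
acc_scaled = weighted_acc(scaled, truth, clim, lats)
```

Because the weights average to 1, a spatially constant error of 1 gives an RMSE of 1, and a forecast whose anomalies are a positive multiple of the true anomalies gives an ACC of 1.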
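Probabilistic Metrics:
- Continuous Ranked Probability Score (CRPS):
$$\mathrm{CRPS} = \mathbb{E}\,\big|\hat{X}^{(m)} - X\big| - \tfrac{1}{2}\,\mathbb{E}\,\big|\hat{X}^{(m)} - \hat{X}^{(n)}\big|$$
where $\hat{X}^{(m)}$ and $\hat{X}^{(n)}$ are independent members of the forecast ensemble and $X$ is the ground truth (standard ensemble form). CRPS quantifies the calibration and sharpness of probabilistic forecasts; lower is better.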
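- Spread/Skill Ratio (SSR):
$$\mathrm{SSR} = \frac{\mathrm{Spread}}{\mathrm{RMSE}(\bar{X})}$$
Ratio of the ensemble spread (the square root of the mean variance across ensemble members) to the RMSE of the ensemble mean $\bar{X}$. An SSR close to 1 denotes a well-calibrated ensemble; values below 1 indicate under-dispersion and values above 1 indicate over-dispersion.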
These metrics support standardization and fair inter-model comparisons, with detailed splitting protocols provided.
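The probabilistic metrics can be sketched for scalar forecasts as follows. This is an illustrative sketch under stated assumptions, not the benchmark's implementation: it uses the standard ensemble estimator of CRPS, the unbiased (M-1) ensemble variance for the spread, and omits latitude weighting for brevity. Function names are hypothetical.

```python
import math

def ensemble_crps(members, obs):
    """Ensemble CRPS estimate: E|x - y| - 0.5 * E|x - x'|."""
    m = len(members)
    skill = sum(abs(x - obs) for x in members) / m
    pairwise = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return skill - pairwise

def spread_skill_ratio(ensembles, observations):
    """sqrt(mean ensemble variance) divided by RMSE of the ensemble mean.

    ensembles: list of member lists, one per forecast case.
    observations: list of verifying values, one per forecast case.
    """
    var_sum = err_sum = 0.0
    for members, obs in zip(ensembles, observations):
        m = len(members)
        mean = sum(members) / m
        var_sum += sum((x - mean) ** 2 for x in members) / (m - 1)
        err_sum += (mean - obs) ** 2
    n = len(observations)
    return math.sqrt(var_sum / n) / math.sqrt(err_sum / n)

# With a single member, CRPS reduces to the absolute error.
crps_single = ensemble_crps([2.0], 5.0)
# Two members bracketing the observation.
crps_pair = ensemble_crps([1.0, 3.0], 2.0)
# Two forecast cases, two members each.
ssr = spread_skill_ratio([[0.0, 2.0], [1.0, 3.0]], [2.0, 1.0])
```

In the last example the mean ensemble variance is 2 and the mean squared error of the ensemble mean is 1, so the ratio exceeds 1 and the toy ensemble would be read as over-dispersive.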
3. Baseline Model Architectures
IndiaWeatherBench implements and benchmarks a wide range of ML models, reflecting dominant paradigms in contemporary weather prediction:
Convolutional Baseline (UNet):
- Standard encoder/decoder convolutional network with skip connections, supporting high-resolution spatial prediction with efficient feature fusion.
Transformer-based (Stormer):
- A tailored transformer architecture employing weather-specific tokenization and efficient self-attention, optimized for medium-range, high-dimensional meteorological time-series.
Graph-based Models:
- GraphCast: Multiscale graph-based spatial modeling with nodes associated to locations on an icosahedral mesh, capturing long-range and local spatial dependencies.
- Hierarchical Graph NN (Hi): An extension of GraphCast implementing a multi-resolution hierarchy with vertical edges connecting fine and coarse spatial representations.
Each approach is evaluated under identical data splits and metrics. Hi sometimes underperforms GraphCast, illustrating architectural sensitivities in regional contexts.
4. Boundary Conditioning and External Information
Regional models must address the truncated domain, where boundary conditions exert a critical influence. The benchmark explores:
- Boundary Forcing: High-resolution ground-truth values from the grid edges are provided as “wrap-around” input during training and evaluation, preserving field continuity but requiring high-quality boundary observations.
- Coarse-Resolution Conditioning: Coarse global forecasts (e.g., from ERA5), cropped and interpolated to the benchmark grid, are concatenated as auxiliary channels. This approach aligns with operational realities where only low-resolution global drivers may be available for boundary updates.
The benchmark highlights that models differ in their sensitivity to these conditioning modes, with performance varying by architecture (notably for tokenization-based transformers).
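To make the two conditioning modes concrete, here is a minimal sketch on nested-list fields. The source does not specify how boundary values are encoded or which interpolation is used, so the ring-plus-mask encoding, the nearest-neighbour upsampling, and all function names below are assumptions chosen for illustration only.

```python
def boundary_ring(truth, width):
    """Keep ground-truth values on an outer ring of `width` cells, zero elsewhere.

    Returns (ring_values, mask), where mask is 1 on the ring and 0 in the interior.
    """
    h, w = len(truth), len(truth[0])
    ring, mask = [], []
    for i in range(h):
        ring_row, mask_row = [], []
        for j in range(w):
            on_ring = i < width or i >= h - width or j < width or j >= w - width
            ring_row.append(truth[i][j] if on_ring else 0.0)
            mask_row.append(1 if on_ring else 0)
        ring.append(ring_row)
        mask.append(mask_row)
    return ring, mask

def upsample_nearest(coarse, factor):
    """Nearest-neighbour upsampling of a coarse 2-D field by an integer factor."""
    return [
        [coarse[i // factor][j // factor] for j in range(len(coarse[0]) * factor)]
        for i in range(len(coarse) * factor)
    ]

def stack_channels(*fields):
    """Concatenate 2-D fields along a leading channel axis."""
    return list(fields)

# Toy 4 x 4 regional state and its ground truth at the target time.
state = [[float(4 * i + j) for j in range(4)] for i in range(4)]
target_truth = [[v + 1.0 for v in row] for row in state]

# Mode 1: boundary forcing from high-resolution ground truth on the grid edges.
ring, mask = boundary_ring(target_truth, width=1)

# Mode 2: a coarse 2 x 2 global field brought onto the 4 x 4 regional grid.
coarse = [[10.0, 20.0], [30.0, 40.0]]
coarse_on_grid = upsample_nearest(coarse, factor=2)

# Model input: current state plus auxiliary conditioning channels.
model_input = stack_channels(state, ring, mask, coarse_on_grid)
```

Both modes end up as extra input channels in this sketch, which is one reason patch-tokenizing models may react differently to them than convolutional or graph-based models.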
5. Forecasting Objectives and Probabilistic Modeling
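Deterministic models are trained to forecast increments:
$$\Delta X_t = X_{t+\delta t} - X_t,$$
rather than the full next state, improving stability and learning dynamics, with a mean squared error loss between the predicted and true increments.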
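A minimal sketch of increment-based training targets and autoregressive rollout follows, with states flattened to plain lists. The toy model and the helper names are hypothetical; the sketch only assumes that the network output is added back to the current state to obtain the next one.

```python
def increment_target(x_now, x_next):
    """Training target: the change between consecutive states."""
    return [b - a for a, b in zip(x_now, x_next)]

def mse(pred, target):
    """Mean squared error between two flattened fields."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def rollout(model, x0, n_steps):
    """Apply the model autoregressively: x_{t+1} = x_t + model(x_t)."""
    states = [x0]
    for _ in range(n_steps):
        x = states[-1]
        delta = model(x)
        states.append([a + d for a, d in zip(x, delta)])
    return states

# Toy "model" that always predicts an increment of 0.5 per grid cell.
def toy_model(x):
    return [0.5 for _ in x]

target = increment_target([1.0, 2.0], [1.5, 1.0])
loss = mse(toy_model([1.0, 2.0]), target)
trajectory = rollout(toy_model, [0.0, 0.0], n_steps=3)
```

Predicting the increment keeps the target centred near zero when consecutive states are similar, which is the usual motivation for this parameterisation.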
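Probabilistic prediction leverages a diffusion-based framework (Elucidated Diffusion Model, EDM). The forecast increment is corrupted with Gaussian noise, and the model is trained on a denoising score-matching objective:
$$\mathcal{L}(\theta) = \mathbb{E}_{\sigma,\,\epsilon}\Big[\lambda(\sigma)\,\big\| D_\theta(\Delta X_t + \sigma\epsilon;\ \sigma, X_t) - \Delta X_t \big\|_2^2\Big], \qquad \epsilon \sim \mathcal{N}(0, I),$$
where $\Delta X_t = X_{t+\delta t} - X_t$ is the forecast increment, $D_\theta$ is the denoiser conditioned on the noise level $\sigma$ and the current state $X_t$, and $\lambda(\sigma)$ is a noise-dependent loss weight.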
Stochastic inference produces calibrated probabilistic ensemble forecasts, directly supporting CRPS/SSR benchmarking.
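The denoising objective can be sketched for one training sample as below. The log-normal noise distribution, the loss weight, and the constants (`p_mean=-1.2`, `p_std=1.2`, `SIGMA_DATA=0.5`) follow the defaults of the original EDM recipe; whether IndiaWeatherBench uses the same values is not stated in the source, so they should be read as assumptions. The `denoiser` callable stands in for the neural network and is hypothetical.

```python
import math
import random

SIGMA_DATA = 0.5  # assumed data standard deviation (EDM default for normalised data)

def sample_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw a noise level from a log-normal distribution."""
    return math.exp(rng.gauss(p_mean, p_std))

def edm_weight(sigma, sigma_data=SIGMA_DATA):
    """EDM loss weight: (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2."""
    return (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2

def edm_loss(denoiser, increment, state, rng):
    """Weighted denoising loss for one (state, increment) training pair."""
    sigma = sample_sigma(rng)
    # Corrupt the true increment with Gaussian noise of scale sigma.
    noisy = [d + sigma * rng.gauss(0.0, 1.0) for d in increment]
    # The denoiser sees the noisy increment, the noise level, and the current state.
    denoised = denoiser(noisy, sigma, state)
    sq_err = sum((a - b) ** 2 for a, b in zip(denoised, increment)) / len(increment)
    return edm_weight(sigma) * sq_err

state = [1.0, 2.0, 3.0]
increment = [0.1, -0.2, 0.05]

# A denoiser that returns its noisy input unchanged leaves the noise in place.
def identity_denoiser(noisy, sigma, state):
    return noisy

loss_identity = edm_loss(identity_denoiser, increment, state, random.Random(0))
```

At inference time, ensemble members would be obtained by running the reverse diffusion process from different noise draws; that sampling loop is not shown here.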
6. Extensibility and Reproducibility
Design modularity underpins the benchmark’s extensibility:
- Datasets, pipelines, and code support adaptation to arbitrary spatial domains—users can specify new regional contiguous boundaries, apply the same variable selection, splits, and metrics, and source alternative auxiliary boundary data.
- Model code is format-agnostic (Zarr or HDF5), facilitating workflow integration across research environments and cloud or local computational setups.
- All data, code, and evaluation routines are released open source, ensuring reproducibility and community-driven extension.
7. Scientific and Operational Relevance
IndiaWeatherBench bridges a critical gap between global ML weather benchmarks and the unique demands of regional weather forecasting—where boundary influences, high-impact extremes, and localization are paramount. By providing curated, reproducible infrastructure, domain-matched data, and rigorous evaluation tools, it enables controlled comparison of regional ML architectures and conditioning strategies. This foundation supports robust research into domain-sensitive challenges such as monsoon prediction, cyclone track forecasting, extreme rainfall, and boundary-induced error propagation, directly informing the evolution of high-fidelity, operational weather ML systems in the Indian context and beyond (Nguyen et al., 31 Aug 2025).