Weather4Cast 2025 Benchmark

Updated 21 November 2025

The Weather4Cast 2025 Benchmark is a comprehensive evaluation suite for nowcasting and short-term weather prediction, emphasizing precipitation forecasting and probabilistic metrics.
It standardizes assessment through tasks such as cumulative rainfall and event prediction, with benchmarked models including ConvGRU cascades and vision transformers.
The benchmark leverages operational satellite and radar data with metrics like CRPS and CSI to enable robust model comparison and actionable forecasting insights.

The Weather4Cast 2025 Benchmark is an evaluation suite for assessing and comparing state-of-the-art deep learning methods in nowcasting and short-term weather prediction, with particular emphasis on precipitation forecasting. It comprises task formulations, data pipelines, evaluation metrics, and reference baselines, enabling direct comparison across competing methodologies. The benchmark builds on operational satellite and radar data streams and serves as a focal point for probabilistic and event-based evaluation in high-resolution meteorological forecasting (Bhuskute et al., 14 Nov 2025).

1. Benchmark Tasks and Data Structure

Weather4Cast 2025 specifies two primary tasks that probe complementary aspects of predictive skill: cumulative rainfall forecasting and precipitation event prediction.

Cumulative Rainfall Forecasting

Objective: Predict total precipitation accumulated over the next four hours on SEVIRI’s native 252×252 pixel grid (4–12 km per pixel); the forecast is subsequently upsampled to match the OPERA radar grid (1512×1512).
Input: Four consecutive SEVIRI IR 10.8 μm channel observations, spaced at 15-minute intervals, spanning the most recent hour.
Output: A single forecast field representing 4-hour cumulative rainfall, reported both deterministically and as probability threshold summaries.
Evaluation Metrics:
- Primary: Continuous Ranked Probability Score (CRPS)
- Secondary/Diagnostics: Critical Success Index (CSI), F1 score at rainfall thresholds (e.g., 0.5 mm/h, 1.0 mm/h), Root Mean Square Error (RMSE), Structural Similarity Index (SSIM).
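
The thresholded diagnostics listed above can be illustrated with a minimal NumPy sketch. It assumes that an "event" at a grid point means the value meets or exceeds the threshold, which is a common convention but not one confirmed by the benchmark description; the function names are hypothetical.

```python
import numpy as np

def contingency(pred, obs, threshold):
    """Count hits, misses, and false alarms for exceedance of `threshold`.

    Assumption: exceedance is defined as value >= threshold.
    """
    p = np.asarray(pred) >= threshold
    o = np.asarray(obs) >= threshold
    hits = int(np.sum(p & o))
    misses = int(np.sum(~p & o))
    false_alarms = int(np.sum(p & ~o))
    return hits, misses, false_alarms

def csi(pred, obs, threshold):
    """Critical Success Index: hits / (hits + misses + false alarms)."""
    h, m, f = contingency(pred, obs, threshold)
    denom = h + m + f
    return h / denom if denom else float("nan")

def f1(pred, obs, threshold):
    """F1 score: 2 * hits / (2 * hits + misses + false alarms)."""
    h, m, f = contingency(pred, obs, threshold)
    denom = 2 * h + m + f
    return 2 * h / denom if denom else float("nan")

# Toy 2x2 fields: one hit, one miss, one false alarm at threshold 0.5.
pred = np.array([[0.0, 0.6], [1.2, 0.2]])
obs = np.array([[0.0, 0.7], [0.3, 0.9]])
print(csi(pred, obs, 0.5), f1(pred, obs, 0.5))
```

Both scores ignore correct negatives, which is why they are preferred over plain accuracy when rain is rare.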

Event Prediction

Objective: Identify and characterize up to the five strongest precipitation “events” during the 4-hour forecast window, including centroid location in the central frame, bounding-box area, temporal duration, and severity (peak intensity).
Evaluation Metrics: F1 score and CSI for event-level detections, based on spatiotemporal overlap between predicted and ground-truth labeled events.
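
One way to extract such events is connected-component labeling of a thresholded (time, y, x) rain volume; the results section mentions 3D 18-connectivity labeling for the ConvGRU entry. The sketch below is a plain breadth-first implementation under several assumptions: the rain threshold is arbitrary, "central frame" is read as the middle frame of each event, and the function and key names are hypothetical rather than taken from any submission.

```python
from collections import deque
from itertools import product

import numpy as np

# 18-connectivity: neighbours that share a face or an edge (corner-only excluded).
OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if 0 < sum(map(abs, o)) <= 2]

def label_events(mask):
    """Label 18-connected components of a (time, y, x) boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:
            t, y, x = queue.popleft()
            for dt, dy, dx in OFFSETS:
                n = (t + dt, y + dy, x + dx)
                inside = all(0 <= i < s for i, s in zip(n, mask.shape))
                if inside and mask[n] and not labels[n]:
                    labels[n] = count
                    queue.append(n)
    return labels, count

def top_events(rain, threshold, k=5):
    """Return up to k events sorted by peak intensity (strongest first)."""
    rain = np.asarray(rain, dtype=float)
    labels, count = label_events(rain >= threshold)
    events = []
    for lab in range(1, count + 1):
        t, y, x = np.nonzero(labels == lab)
        mid = t == (t.min() + t.max()) // 2  # assumption: event's middle frame
        events.append({
            "peak": float(rain[t, y, x].max()),
            "duration_frames": int(t.max() - t.min() + 1),
            "bbox_area": int((y.max() - y.min() + 1) * (x.max() - x.min() + 1)),
            "centroid_yx": (float(y[mid].mean()), float(x[mid].mean())),
        })
    return sorted(events, key=lambda e: e["peak"], reverse=True)[:k]
```

Because time is one of the labeled axes, a rain cell that drifts by one pixel between consecutive frames stays a single event, while cells that only touch diagonally across all three axes are kept apart.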

Data and Preprocessing

Source: SEVIRI IR_108 channel; data sampled at 15-minute intervals.
Preprocessing Steps:
- Zero-padding to 256×256 for GPU compatibility.
- Normalization of brightness temperatures via $T' = T / 300\,\mathrm{K}$.
- Cloud pixel isolation using Otsu’s adaptive threshold; clear-sky pixels are set to unity to bias the loss toward cloud features (Bhuskute et al., 14 Nov 2025).
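
The preprocessing steps above can be sketched in NumPy as follows. Several details are assumptions rather than facts from the source: cloud pixels are taken to be those colder than the Otsu threshold (cloud tops are typically cold in the 10.8 μm channel), padding is applied symmetrically and after masking so that the zeros do not distort the histogram, and the helper names are hypothetical.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the split that maximizes between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                 # pixel count of the lower class
    w1 = w0[-1] - w0                     # pixel count of the upper class
    s0 = np.cumsum(hist * centers)       # running intensity sum
    mu0 = s0 / np.maximum(w0, 1)
    mu1 = (s0[-1] - s0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    # Upper edge of the last bin that belongs to the lower class.
    return float(edges[1:][np.argmax(between)])

def preprocess(frame_kelvin, size=256):
    """Normalize, mask clear sky to unity, then zero-pad to size x size."""
    t = np.asarray(frame_kelvin, dtype=np.float64) / 300.0   # T' = T / 300 K
    cloud = t < otsu_threshold(t)        # assumption: clouds are the cold class
    t = np.where(cloud, t, 1.0)          # clear-sky pixels set to unity
    pad_h, pad_w = size - t.shape[0], size - t.shape[1]
    pads = ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2))
    return np.pad(t, pads, mode="constant", constant_values=0.0)

# Synthetic 252x252 frame: cold "cloud" on the left, warm clear sky on the right.
frame = np.full((252, 252), 290.0)
frame[:, :126] = 220.0
out = preprocess(frame)
print(out.shape)
```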

2. Model Architectures and Training Approaches

Several model paradigms are actively benchmarked, each reflecting different strategies for spatiotemporal pattern extraction and probabilistic forecasting. The most notable approaches include ConvGRU cascades, vision transformer-based frozen encoder architectures, and geometric deep learning on spherical grids.

ConvGRU Encoder–Decoder ("Staggered Cascade")

Pipeline: Two-stage training.
- Stage 1: ConvGRU encoder–decoder, with stacked temporal units, predicts the sequence of future SEVIRI brightness temperatures using MSE loss.
- Stage 2: Empirical nonlinear calibration transforms predicted temperature fields to OPERA-compatible rainfall rates via $\widehat{R}(x, y, t) = \alpha \left[\max\bigl(0,\, 300 - \widehat{T}(x, y, t)\bigr)\right]^{\beta}$, with $\alpha, \beta$ fit to radar data.
Cascade Structure: Four separate ConvGRU networks, each learning a non-autoregressive prediction at a specific lookahead (1, 2, 3, 4 hours), mitigating error accumulation and supporting parallel inference.
Strengths/Limitations: Efficient for single-pass inference and event detection, but regional microphysical biases and limited calibration of uncertainty beyond post-hoc CDF transformation remain (Bhuskute et al., 14 Nov 2025).
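
The Stage 2 power law can be illustrated with a short sketch. The source does not say how $\alpha$ and $\beta$ are fit; the version below assumes an ordinary least-squares fit in log space, which follows from $\log \widehat{R} = \log \alpha + \beta \log(300 - \widehat{T})$ on pixels where both sides are positive. Function names and the synthetic data are hypothetical.

```python
import numpy as np

def rain_rate(t_pred, alpha, beta, t_ref=300.0):
    """Apply R = alpha * max(0, t_ref - T)^beta to temperatures in kelvin."""
    deficit = np.maximum(0.0, t_ref - np.asarray(t_pred, dtype=float))
    return alpha * deficit ** beta

def fit_alpha_beta(t_obs, r_obs, t_ref=300.0):
    """Fit alpha, beta by linear regression of log R on log(t_ref - T).

    Assumption: a log-space least-squares fit; pixels with no rain or with
    T >= t_ref are excluded because their logarithm is undefined.
    """
    d = t_ref - np.asarray(t_obs, dtype=float).ravel()
    r = np.asarray(r_obs, dtype=float).ravel()
    ok = (d > 0) & (r > 0)
    beta, log_alpha = np.polyfit(np.log(d[ok]), np.log(r[ok]), 1)
    return float(np.exp(log_alpha)), float(beta)

# Synthetic check: data generated with known parameters should be recovered.
temps = np.array([210.0, 230.0, 250.0, 270.0, 290.0])
rates = rain_rate(temps, alpha=1e-3, beta=2.0)
alpha_hat, beta_hat = fit_alpha_beta(temps, rates)
print(alpha_hat, beta_hat)
```

A log-space fit weights relative rather than absolute errors, so it tends to favor light rain; a direct nonlinear fit would be an alternative if heavy rain matters more.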

Frozen Satellite Vision Transformer with Probabilistic Head

Architecture: DINOv3-SAT493M (ViT-L/16 pre-trained on 493M satellite patches) serves as a frozen encoder. A lightweight video-projector, based on the V-JEPA transformer, learns to map encoder outputs to a discrete empirical CDF (eCDF) of 4-hour accumulated rainfall.
Objective Alignment: Training is directly aligned with CRPS (and its discrete analog, RPS), optimizing probabilistic calibration across rainfall bins.
Baselines: Competing fully-trainable 3D-UNET baselines with RPS and Gamma-Hurdle probabilistic heads.
Outcomes: The frozen-transformer approach yields a better CRPS (3.5102; lower is better) and better calibration than convolutional UNETs for global rainfall prediction at high resolution (Filho et al., 14 Nov 2025).
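
To make the discrete objective concrete, the following sketch computes the Ranked Probability Score of a forecast eCDF against a single observation. The bin thresholds and the unnormalized sum are illustrative assumptions; the exact binning and normalization used in the challenge are not given here, and `rps` is a hypothetical name.

```python
import numpy as np

def rps(cdf, obs, thresholds):
    """Ranked Probability Score of a discrete forecast CDF.

    cdf:        (..., K) forecast probabilities P(Y <= thresholds[k])
    obs:        (...)    observed accumulation
    thresholds: (K,)     increasing bin thresholds (assumed, not from the source)
    """
    cdf = np.asarray(cdf, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Observation as a step-function CDF: 1 once the threshold reaches obs.
    step = (obs[..., None] <= np.asarray(thresholds, dtype=float)).astype(float)
    return np.sum((cdf - step) ** 2, axis=-1)

thresholds = np.array([0.0, 1.0, 2.0, 4.0])
sharp = np.array([0.0, 0.0, 1.0, 1.0])    # all mass in the observed bin
hedged = np.array([0.5, 0.5, 1.0, 1.0])   # half the mass placed too low
print(rps(sharp, 1.5, thresholds), rps(hedged, 1.5, thresholds))
```

Because the score compares cumulative rather than per-bin probabilities, mass placed in a bin adjacent to the observation is penalized less than mass placed far away.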

Geometric, Spherical Neural Operators (FourCastNet, FourCastNet 3)

Design: Purely convolutional operators defined on the sphere, using channelwise local and spectral convolutions to respect multi-variate meteorological structure and large-scale geodesics.
Ensemble Forecasting: Probabilistic skill is delivered via end-to-end, CRPS-optimized training; ensemble generation is achieved through sampling a multi-level spherical diffusion process for initial conditions.
Scalability: Training is distributed across more than a thousand GPUs (>40 TB of ERA5 data) using domain decomposition, and inference is fast: a 90-day global forecast at 0.25° with 6-hourly time steps takes under 20 seconds on a single GPU (Bonev et al., 16 Jul 2025).
Baseline Status: FourCastNet and its variants are recommended high-resolution, open-source baselines for Weather4Cast due to spectral realism, calibration, and throughput (Pathak et al., 2022, Bonev et al., 16 Jul 2025).

Comparative Table of Architectures

Model	Core Approach	Probabilistic Output	Notable Metrics
ConvGRU (Cascade) (Bhuskute et al., 14 Nov 2025)	RNN (ConvGRU) + temporal cascade	Empirical CDF	Cumulative rainfall: CRPS 3.37, RMSE 2.48 mm, SSIM 0.747
DINOv3-ViT+Head (Filho et al., 14 Nov 2025)	Frozen ViT + CRPS head	Discrete eCDF	Cumulative rainfall: CRPS 3.5102, ~26% effectiveness gain over 3D-UNET
FourCastNet 3 (FCN3) (Bonev et al., 16 Jul 2025)	Spherical CNN/operator	Ensemble, CRPS	Global variables: z500 CRPS 0.53, t2m CRPS 0.50

3. Probabilistic and Event-based Evaluation

The benchmark prioritizes probabilistic calibration and precise event-detection skill, addressing recognized limitations in pixelwise deterministic evaluation for severe/extreme weather.

CRPS and RPS: Both continuous and discrete analogs are employed, directly optimizing score-aligned model probabilities or eCDFs for rainfall accumulations.
Additional Scores: F1, CSI, Probability of Detection (POD), False Alarm Ratio (FAR), ensemble skill–spread diagnostics, rank histograms, and angular/zonal spectral power diagnostics are used for higher-fidelity model intercomparison.
Event Scoring: Matching of predicted and ground-truth precipitation events is conducted via spatiotemporal overlap metrics, reporting both categorical (F1, CSI) and continuous attributes (centroid, duration, intensity) (Bhuskute et al., 14 Nov 2025).
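
For ensemble forecasts, CRPS is often estimated directly from the members using the identity $\mathrm{CRPS} = \mathbb{E}|X - y| - \tfrac{1}{2}\,\mathbb{E}|X - X'|$, where $X$ and $X'$ are independent draws from the forecast distribution and $y$ is the observation. The sketch below is a generic estimator of that identity, not the scoring code of the benchmark; `ensemble_crps` is a hypothetical name.

```python
import numpy as np

def ensemble_crps(members, obs):
    """Estimate CRPS per grid point from an ensemble.

    members: (M, ...) array of M ensemble members
    obs:     (...)    array of observations (or a scalar)
    """
    members = np.asarray(members, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Mean absolute error of the members against the observation.
    error = np.mean(np.abs(members - obs), axis=0)
    # Mean absolute difference over all member pairs (ensemble spread).
    spread = np.mean(np.abs(members[:, None] - members[None, :]), axis=(0, 1))
    return error - 0.5 * spread

# A two-member ensemble straddling the observation.
print(ensemble_crps(np.array([0.0, 2.0]), 1.0))
```

With a single member the spread term vanishes and the score reduces to the absolute error, which is why CRPS is a natural generalization of MAE to probabilistic forecasts.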

4. Operational and Computational Considerations

Preprocessing Efficiency: Zero-padding and normalization of inputs (SEVIRI $T' = T / 300\,\mathrm{K}$ for ConvGRU) and frame-wise cropping for vision backbones facilitate GPU-based batch processing.
Forecast Parallelism: The staggered, non-autoregressive ConvGRU cascade and batch-projected vision transformer architectures support parallel inference at multiple lead times without recursive sampling.
Large-scale Deployment: FourCastNet 3 demonstrates training on 1,024 GPUs for global domains and fast inference (e.g., a 90-day forecast in under 20 seconds on a single GPU), which is critical for operational deployment and reanalysis scenarios (Bonev et al., 16 Jul 2025).
Foundation Model Reuse: Leveraging frozen satellite foundation models (ViTs) as shared world-model priors enables real-time inference, reduced domain shift, and ease of integration with multiple forecasting heads (Filho et al., 14 Nov 2025).

5. Results and Baseline Recommendations

Cumulative Rainfall Task:
- The ConvGRU-based method achieved CRPS = 3.37 (2nd place), with notable secondary scores (e.g., a threshold-based score of 0.682 at a rainfall threshold given in mm, and SSIM = 0.747).
- The DINOv3-projected eCDF method achieved CRPS = 3.5102, an effectiveness gain of roughly 26% over the best convolutional UNET (Filho et al., 14 Nov 2025).
- FourCastNet 3 demonstrated state-of-the-art probabilistic calibration and sub-seasonal predictive skill on global atmospheric variables, with pointwise CRPS of 0.53 for z500 and 0.50 for 2 m temperature, exceeding IFS-ENS by ~5–7% (Bonev et al., 16 Jul 2025).
Event Prediction Task:
- The ConvGRU-based sequence forecast, followed by 3D 18-connectivity labeling and event extraction, achieved F1 scores close to or matching the baseline and tied for 2nd place. Precise spatiotemporal aggregation and severity assignment were key.
- Strengths across all baselines include rapid inference and calibration; limitations persist in regional error structure and uncertainty quantification for deterministic models.
Reference Baseline Adoption: FourCastNet and its spherical-CNN derivatives are cited as the principal open, scalable baselines for global and regional precipitation nowcasting, owing to their modular extensibility, rapid ensemble generation, and high spectral fidelity (Pathak et al., 2022, Bonev et al., 16 Jul 2025).

6. Methodological Implications and Forward Directions

Weather4Cast 2025 reflects a methodological shift toward three practices: transfer learning from large foundation models (satellite ViTs and spherical CNNs), explicit probabilistic calibration through CRPS-aligned losses, and rigorous event-based spatial verification. A plausible implication is accelerated research on:

- Parameter-efficient adaptation of frozen vision and geometric foundation models to new local domains.
- Direct integration of radar and satellite radiance in multi-task diagnostic heads.
- Physics-informed constraints (mass/energy conservation) implemented via differentiable operator penalties.
- Fine-grained, region-fused models and ensemble assimilation for uncertainty quantification (Pathak et al., 2022, Filho et al., 14 Nov 2025, Bonev et al., 16 Jul 2025).

Ongoing work is expected to focus on robust real-time deployment, expansion to multi-sensor data fusion, and generalization of probabilistic event detection across diverse meteorological phenomena.