Data-Driven Weather Models
- Data-driven weather models are forecasting systems that leverage machine learning to learn mappings from past to future atmospheric states for high-accuracy predictions.
- They integrate advanced architectures like 3D transformers, graph neural networks, and autoencoders to efficiently capture spatial-temporal dynamics.
- These models reduce computational expenses and support modular extensions for diagnostic variable prediction and hybrid integration with traditional NWP systems.
Data-driven weather models are a class of forecasting systems that use large-scale machine learning, most commonly deep learning, to predict atmospheric evolution by learning mappings from past atmospheric states to future states. Unlike traditional numerical weather prediction (NWP), which relies on explicit discretization and integration of the governing partial differential equations of fluid dynamics and thermodynamics, data-driven approaches seek to implicitly learn the predictive structure directly from historical reanalyses or observational records. Current data-driven models have demonstrated competitive, frequently state-of-the-art, accuracy in global and regional medium-range forecasting, drastically reducing computational costs and enabling new operational and scientific paradigms.
1. Foundations of Data-Driven Weather Prediction
Data-driven weather prediction models (DDWPs) learn parametric mappings from a high-dimensional representation of the prognostic state, typically including gridded geopotential height, temperature, wind components, and humidity, to future values of the same or related fields. Most workflows train to minimize a mean squared error or related loss between predicted and observed or reanalyzed fields, using large archives such as ECMWF’s ERA5. Models are distinguished by their treatment of spatial structure, temporal evolution, and the diversity of output variables.
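The training objective described above is usually an area-weighted squared error over the grid, so that densely packed polar cells do not dominate the score. A minimal sketch of such a latitude-weighted RMSE (assuming a regular lat–lon grid; the weighting convention follows common benchmarking practice, not any single cited model):

```python
import numpy as np

def lat_weighted_rmse(pred, target, lats_deg):
    """RMSE over a lat-lon grid, weighting each row by cos(latitude)
    so over-represented polar cells do not dominate the score.

    pred, target: arrays of shape (n_lat, n_lon)
    lats_deg:     array of shape (n_lat,) with grid latitudes in degrees
    """
    w = np.cos(np.deg2rad(lats_deg))   # per-latitude weights
    w = w / w.mean()                   # normalize to mean 1
    sq_err = (pred - target) ** 2      # pointwise squared error
    return float(np.sqrt((w[:, None] * sq_err).mean()))

# Tiny example on a 3x4 grid: identical fields give zero error.
lats = np.array([-60.0, 0.0, 60.0])
field = np.arange(12.0).reshape(3, 4)
print(lat_weighted_rmse(field, field, lats))  # → 0.0
```

Because the weights are normalized to mean 1, a uniform error of one unit everywhere yields an RMSE of exactly 1, which makes scores comparable across grids.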
Traditional DDWPs focus on a restricted set of core prognostic variables, limiting their direct utility for sectors requiring diagnostic (derived) variables, such as cloud cover or solar irradiance. Two strategies have historically addressed this gap: (a) training bespoke models for each diagnostic field, or (b) augmenting the backbone model by retraining with additional output heads. Both scale poorly, prompting the need for modular extension schemes (Mitra et al., 2023).
Methodologically, DDWPs leverage neural architectures suitable for global or regional prediction domains:
- Transformers (with or without hierarchical structure): e.g., Pangu-Weather’s 3D Earth-Specific Transformer (Bi et al., 2022).
- Graph Neural Networks: e.g., GraphCast’s multi-mesh design, and models encoding multi-resolution spherical and stretched-grid geometry (Nordhagen et al., 28 Nov 2025, Nipen et al., 2024).
- Autoencoders: learning task-agnostic latent embeddings for efficient variable-specific downstream regression (Mitra et al., 2023).
- Hybrid Statistical/ML Models: residual-learning strategies combining SARIMA models for climate-scale structure and LSTMs for nonlinear weather anomalies (Rajeev et al., 12 Jan 2026).
- Direct Observation-Space Models: transformers ingesting unordered token sequences of raw observations, bypassing data assimilation and gridding (McNally et al., 2024).
2. Core Architectures and Training Schemes
State-of-the-art DDWPs encode both the vertical and horizontal structure of the atmosphere using a variety of architectural elements:
- 3D Transformer Architectures: Pangu-Weather organizes data as cubic tensors (height × lat × lon) and applies windowed self-attention with Earth-specific positional biases, supporting autoregressive and hierarchical temporal aggregation to trade off accuracy and efficiency across lead times (Bi et al., 2022).
- GraphCast and Multi-Mesh GNNs: Encode atmospheric fields on hierarchical icosahedral meshes, connecting multiple spatial scales and enabling fast information propagation and regional refinement. Stochastic latent injection is adopted for ensemble probabilistic prediction (Nordhagen et al., 28 Nov 2025, Nipen et al., 2024).
- Autoencoding for Latent State Extraction: In the two-stage framework, an AFNO-based autoencoder is trained on the full space of prognostic fields, with the encoder frozen and downstream regressors trained per-diagnostic. This modularity allows new diagnostics to be added at minimal additional computational cost and with accuracy nearly matching bespoke models (Mitra et al., 2023).
- Hybrid Time-Series Predictors: A residual decomposition assigns SARIMA to the seasonal, climate-scale component and an LSTM to the residual short-term fluctuations. Recursive multi-step prediction applies a decay term to stabilize forecasts beyond the short range (Rajeev et al., 12 Jan 2026).
- Pure Observation-Space Transformers: Raw multi-modal instrument tokens are embedded, attended, and decoded directly to arbitrary query points, providing forecasts without any dependence on reanalysis or gridded data assimilation (McNally et al., 2024).
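Common to most of these architectures is autoregressive inference: a model trained for one fixed lead time (e.g., 6 h) is applied repeatedly to its own output to reach longer ranges. A toy sketch of this loop, with a damped linear operator standing in for the trained network (Pangu-Weather’s hierarchical aggregation additionally mixes models of several lead times to reduce the number of steps):

```python
import numpy as np

def rollout(step_fn, state0, n_steps):
    """Autoregressive forecasting: feed each prediction back as the
    next input, producing a trajectory of n_steps future states."""
    states, state = [], state0
    for _ in range(n_steps):
        state = step_fn(state)   # one fixed lead-time step (e.g. 6 h)
        states.append(state)
    return states

# Toy stand-in for a trained network: a damped linear operator.
A = 0.9 * np.eye(2)
traj = rollout(lambda x: A @ x, np.array([1.0, 2.0]), 3)
print(traj[-1])  # state after three steps: 0.9**3 * [1, 2]
```

The loop also illustrates why small per-step errors matter: they compound at every application of `step_fn`, which is why rollout-aware fine-tuning is common.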
Loss functions and optimization strategies are closely tied to forecast fidelity:
- Standard MSE Loss: Widely used for gridded field regression, but suffers from over-smoothing due to the “double penalty” (see Section 4).
- Spectral and Probabilistic Losses: Spectral CRPS and modified spherical harmonic losses correct the bias inherent in pointwise MSE by restoring variance and phase coherence at high wavenumbers, critical for extremes (Subich et al., 31 Jan 2025, Nordhagen et al., 28 Nov 2025).
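The CRPS mentioned above has a simple ensemble estimator in its energy form, CRPS = E|X − y| − ½·E|X − X′|; a minimal sketch for a single grid point (the spectral-space variants cited above apply the same score to spherical-harmonic coefficients):

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS for an ensemble at one point, energy form:
    CRPS = E|X - y| - 0.5 * E|X - X'|.
    members: array of shape (m,); obs: scalar observation."""
    term1 = np.abs(members - obs).mean()
    term2 = np.abs(members[:, None] - members[None, :]).mean()
    return float(term1 - 0.5 * term2)

# A point ensemble sitting exactly on the observation scores 0.
print(crps_ensemble(np.array([1.0, 1.0, 1.0]), 1.0))  # → 0.0
```

For a deterministic forecast (ensemble of one member) the score reduces to the absolute error, which is why CRPS is a natural generalization of MAE to probabilistic output.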
3. Modularity for Diagnostic Variable Prediction
A central challenge is extending a DDWP trained on a fixed set of prognostic fields to predict new diagnostic quantities without wholesale retraining. The two-stage embedding approach achieves this by:
- Stage One (Embedding): Training a deep autoencoder that learns a dense, task-agnostic latent representation of the prognostic weather state. No regularization beyond architecture is required, and choices such as patch size, AFNO layers, and channel width are tuned for expressiveness and parameter efficiency. The encoder, once trained, is frozen.
- Stage Two (Downstream Regression): For each diagnostic field, a compact network is trained to regress it from the frozen encoder’s latent codes. This downstream head is much smaller than a full DDWP, and multiple diagnostic regressors can be trained in parallel or added indefinitely without interacting.
- Resource Efficiency: Stage-2 models require only a fraction of the parameters of bespoke end-to-end retraining, and reduce backward-pass cost by nearly half (Mitra et al., 2023).
Quantitative evaluation for total cloud cover (tcc) and soil temperature (stl1) shows that differences in RMSE and SSIM between bespoke and downstream approaches are typically below 1%, with the downstream tcc regressor sometimes achieving superior structural similarity.
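The two-stage scheme can be sketched with stand-ins: a fixed random projection plays the frozen encoder (the paper trains an AFNO autoencoder instead), and a ridge-regression head plays the compact downstream network. All data here is synthetic; only the structure, a frozen stage-1 map feeding a cheap stage-2 fit, mirrors the method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stand-in: a frozen encoder mapping a 20-dim prognostic
# state to an 8-dim latent code (a fixed random projection here,
# not the trained AFNO encoder of the cited work).
W_enc = rng.normal(size=(8, 20))

def encode(X):
    return X @ W_enc.T          # frozen: never updated below

# Synthetic "diagnostic" target that is a function of the state.
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20)

# Stage 2: fit a small ridge-regression head on latent codes only.
Z = encode(X)
lam = 1e-3
head = np.linalg.solve(Z.T @ Z + lam * np.eye(8), Z.T @ y)

pred = encode(X) @ head
print(np.corrcoef(pred, y)[0, 1])  # latent codes explain part of y
```

Adding a second diagnostic means fitting another small head on the same frozen codes; nothing about the encoder or the first head changes, which is the source of the scheme’s scalability.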
4. Addressing Forecast Smoothness, Resolution, and Probabilistic Skill
A prevailing drawback of data-driven forecasting is the systematic under-representation of fine-scale structure, particularly as forecast lead time increases. This "double penalty" arises because minimizing MSE dampens spectral energy at scales with low predictability:
- Mathematical Origin: Spectral expansion of the MSE shows that, at each wavenumber, the loss rewards variance reduction (amplitude shrinking) and penalizes misplaced phase (decorrelation). For scales where phase is unpredictable, the optimal solution is simply to damp the forecast (Subich et al., 31 Jan 2025).
- Modified Loss Functions: Adjusted MSE (AMSE), operating in spherical harmonic space, decouples amplitude and decorrelation penalties, ensuring that forecast energy at each scale matches observations regardless of phase coherence, thus preserving physically realistic sharpness and effective resolution down to kilometre scales (Subich et al., 31 Jan 2025).
- Probabilistic Forecasting: Stochastic architectures (e.g., diffusion models, latent variable injection) and specific probabilistic losses (CRPS in both real and spectral space) are required to calibrate ensemble spread and avoid under-dispersive, over-confident forecasts (Nordhagen et al., 28 Nov 2025, Brenowitz et al., 2024).
- Benchmarking and Post-processing: Lagged ensembles constructed from deterministic hindcasts provide a parameter-free baseline for CRPS skill evaluation. Post-processing methods such as Bernstein Quantile Networks are effective for bias correction and uncertainty quantification, with up to 50% CRPS improvements for global models (Bremnes et al., 2023).
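The amplitude-damping argument above can be written out per wavenumber. Let the forecast carry amplitude a, the observations amplitude σ, and let ρ denote their phase correlation at that scale; the notation here is generic, chosen for illustration rather than taken from any one cited paper:

```latex
% Expected per-wavenumber MSE for forecast amplitude a,
% observed amplitude \sigma, and phase correlation \rho:
E(a) = a^2 + \sigma^2 - 2\rho\, a\, \sigma,
\qquad
\frac{dE}{da} = 2a - 2\rho\sigma = 0
\;\Rightarrow\;
a^\star = \rho\,\sigma .
```

Whenever ρ < 1 the MSE-optimal forecast shrinks its amplitude below the observed σ, and at fully unpredictable scales (ρ → 0) it damps that wavenumber to zero, which is exactly the over-smoothing that AMSE-style losses are designed to remove.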
5. Integration with Operational Systems and Hybrid Approaches
While standalone DDWPs deliver global and regional forecasts with competitive deterministic skill and massive computational gains, several approaches enhance or operationalize these models:
- 4D-Var Data Assimilation with DDWPs: Data-driven models can be natively incorporated into 4D-Var schemes, using autodiff in deep learning frameworks (e.g., PyTorch) to compute all variational gradients. The computational graph treats the deep model as the forecast operator, enabling assimilation of sparse, noisy, or irregular observations without requirement for adjoint models (Xiao et al., 2023).
- Cyclic and Hybrid Data Assimilation: Real-data assimilation loops, e.g., with ensemble Kalman filters or variational methods, initialized from and coupled to DDWPs, enable operational forecasts to leverage both rapid AI-based forecasting and observation-driven corrections (Wang et al., 2024).
- Hybrid NWP–AI Fusion: Spectral nudging—blending large-scale AI predictions into high-resolution NWP simulations—retains the full suite of physical and diagnostic variables, leveraging ML skill where it is strongest (synoptic scales, longer leads), while entrusting small-scale, extreme, and physical consistency to traditional models. Empirically, this strategy cuts RMSE by up to 8% and increases ACC by 0.04, with particular gains in winter and in cyclone-track prediction (Husain et al., 2024).
- Direct Observation Space Forecasting: Eliminating reanalysis data dependency entirely, some recent models are trained and initialized solely from sequences of raw instrument readings, using transformers to predict future point observations or interpolate to arbitrary geolocations, sidestepping the complexities of classical data assimilation entirely (McNally et al., 2024).
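The 4D-Var coupling above hinges on differentiating a variational cost through the forecast operator. A toy sketch with a linear stand-in model, where the gradient is written by hand (the cited work instead obtains it via autodiff through the neural forecast model); the matrices and observations are invented for illustration:

```python
import numpy as np

# Toy 4D-Var: linear forecast operator M, direct observation (H = I).
M = np.array([[0.95, 0.05], [0.0, 0.9]])            # stand-in model
xb = np.array([1.0, 0.0])                           # background state
obs = [np.array([1.1, 0.1]), np.array([1.0, 0.2])]  # y_1, y_2

def cost_grad(x0):
    """Gradient of J(x0) = 0.5|x0-xb|^2 + 0.5*sum_t |M^t x0 - y_t|^2.
    Written analytically here; a DDWP would supply it via autodiff."""
    g = x0 - xb
    Mt = np.eye(2)
    for y in obs:
        Mt = M @ Mt                     # M^t for lead time t
        g += Mt.T @ (Mt @ x0 - y)       # adjoint of the rollout
    return g

# Simple gradient descent on the initial condition.
x0 = xb.copy()
for _ in range(200):
    x0 -= 0.1 * cost_grad(x0)
print(x0)  # analysis pulls x0 toward the observed trajectory
```

With a neural forecast operator, `cost_grad` is replaced by backpropagation through the rollout, which is why no hand-built adjoint model is needed.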
6. Verification, Performance, and Benchmarking
The performance of data-driven weather models is rigorously assessed through established synoptic metrics and dedicated benchmarks:
- Headline Deterministic Scores: RMSE and ACC for upper- and surface-level fields up to 10-day leads, routinely matching or exceeding IFS HRES for large-scale variables, but displaying trailing skill for intense precipitation and local extremes (Bi et al., 2022, Rasp et al., 2023).
- Extreme Events: Tracks and intensities of tropical cyclones are predicted with mean 3–5 day errors below 200 km in Pangu-Weather, and intensity bias is minimized with spectral-loss fine-tuning (Bi et al., 2022, Subich et al., 31 Jan 2025).
- Probabilistic Metrics: CRPS, ensemble spread–skill ratios, and Brier scores are emphasized. Proper spread–skill calibration (a spread–skill ratio near 1) is obtained by combining deep-learning ensembles with bias correction; under-dispersion remains a challenge in raw DDWP output (Rasp et al., 2023, Brenowitz et al., 2024).
- Spectral Energy Metrics: Verification of energy spectra demonstrates that MSE-trained models show artificially steep spectral decay and under-resolved small-scale energy; AMSE and similar loss functions restore a more physical power-law spectrum (Subich et al., 31 Jan 2025).
- Benchmark Datasets: WeatherBench 2 provides open data, evaluation code, and continuous leaderboards for standardized multi-model and multi-metric comparison, sustaining rapid progress and community transparency (Rasp et al., 2023).
7. Challenges, Scalability, and Future Directions
Despite their rapid advance and operational penetration, data-driven weather models confront key open problems:
- Representation of Extremes and Non-Gaussian Statistics: Approaches such as diffusion models and generative adversarial learning are being researched to recover realistic precipitation tails and intensity distributions, especially at mesoscale (Hirabayashi et al., 25 Mar 2025).
- Upscaling and Regionalization: Stretched-grid architectures and hierarchical regional models bring km-scale resolution only to selected areas, maintaining computational efficiency while enabling local forecasting skill (Nordhagen et al., 28 Nov 2025, Nipen et al., 2024).
- Physical Consistency and Chemical/Earth-System Coupling: Models are being extended to forecast coupled Earth-system components (ocean, land, composition) and to enforce explicit conservation laws via physics-informed losses or hybrid ML–dynamical cores (Han et al., 2024, McNally et al., 2024).
- Interpretability and Scientific Discovery: Sparse regression techniques (e.g., WSINDy) remain of interest for learning interpretable governing equations and identifying explicit dynamical balances, complementing black-box neural surrogates (Minor et al., 1 Jan 2025).
- Operationalization: Emphasis is shifting to fast, robust, and interpretable inference systems, enabling probabilistic and tailored downstream products (diagnostics, energy, hydrology), and integration into hybrid physics–AI workflows (Vaughan et al., 2024, Weyn et al., 2024).
Scaling these architectures further—either through increased spatial resolution, longer lead times, or full Earth system coupling—will require continued innovation in network design, loss function engineering, training data curation, and performance evaluation.
References:
- Bi et al., 2022
- Bremnes et al., 2023
- Brenowitz et al., 2024
- Han et al., 2024
- Hirabayashi et al., 25 Mar 2025
- Husain et al., 2024
- McNally et al., 2024
- Minor et al., 1 Jan 2025
- Mitra et al., 2023
- Nipen et al., 2024
- Nordhagen et al., 28 Nov 2025
- Rajeev et al., 12 Jan 2026
- Rasp et al., 2023
- Subich et al., 31 Jan 2025
- Vaughan et al., 2024
- Wang et al., 2024
- Weyn et al., 2024
- Xiao et al., 2023