Mixture of Weather Experts (MoWE) Approach
- MoWE is an ensemble approach that dynamically combines expert weather forecast models using spatially and temporally adaptive weighting.
- It leverages a Vision Transformer to generate softmax-normalized weight maps, achieving up to 10% lower RMSE at longer lead times.
- Its lightweight design enables efficient integration of precomputed forecasts while correcting systematic biases and scaling with new experts.
A Mixture of Weather Experts (MoWE) is a data-driven ensemble methodology that synthesizes the predictions of multiple state-of-the-art weather forecast models using a learned, dynamic weighting mechanism. Rather than developing a new forecasting model ab initio, MoWE is designed to combine the outputs of pre-existing expert models in a spatiotemporally adaptive manner, with the aim of surpassing the predictive accuracy of any individual expert. The approach leverages a transformer-based gating network to learn which models are most reliable for a given forecast grid point and lead time, directly optimizing for reduced forecast Root Mean Squared Error (RMSE). MoWE demonstrates substantial improvements—up to 10% lower RMSE than the strongest standalone expert at a 2-day horizon—while remaining computationally lightweight relative to training or operating new monolithic models (Chakraborty et al., 10 Sep 2025).
1. MoWE: Paradigm and Differentiation from Traditional Forecasting
MoWE does not introduce a new numerical weather prediction or end-to-end deep learning model; instead, it forms a meta-model operating exclusively at the output level of several high-performing expert models. Each expert is itself a sophisticated data-driven forecaster (such as Pangu, Aurora, or FCN3), and MoWE's function is to combine their predictions optimally. Key differentiators from traditional forecasting approaches include:
- Ensemble via Conditional Weighting: The method learns a spatially and temporally varying set of weights over the experts, unlike fixed-weight averaging or classic machine learning ensembling that does not adapt at inference time.
- Minimal Retraining Cost: Since MoWE reuses existing expert outputs and only requires training a relatively compact gating network, it is dramatically more efficient in terms of resource requirements than retraining or fine-tuning full weather models.
- Deterministic Synthesis: The output at each forecast location is not simply a mean or median but is constructed as a weighted sum of the expert predictions, with the weights inferred by the gating mechanism conditioned on model inputs such as grid location and forecast lead time.
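A minimal sketch of this deterministic synthesis step, assuming PyTorch and illustrative tensor shapes (the expert fields, gating logits, and bias map below are random stand-ins rather than outputs of the actual gating network):

```python
import torch

# Illustrative shapes: E experts on an H x W lat-lon grid (values are placeholders).
E, H, W = 4, 32, 64

expert_forecasts = torch.randn(E, H, W)  # stand-ins for the experts' forecast fields
gate_logits = torch.randn(E, H, W)       # would come from the learned gating network
bias_map = torch.zeros(H, W)             # learned additive correction term

# Softmax along the expert dimension yields a convex combination at every grid point.
weights = torch.softmax(gate_logits, dim=0)

# Deterministic synthesis: weighted sum of expert predictions plus the bias map.
synthesized = (weights * expert_forecasts).sum(dim=0) + bias_map

# The weights sum to one at each grid point.
assert torch.allclose(weights.sum(dim=0), torch.ones(H, W))
```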
2. Model Architecture and Vision Transformer-based Gating
The computational core of MoWE is a Vision Transformer (ViT)-based gating network. The architectural workflow is as follows:
- Input Formation: Expert model forecasts are stacked into a multi-channel image over the forecast grid, where each channel corresponds to one expert's output field.
- ViT Processing: This image tensor is split into spatial patches, linearly projected, and augmented with information such as lead time and (optionally) a noise vector for probabilistic scenarios. These patches are then passed through several transformer layers to capture local spatial correlations as well as long-range dependencies.
- Dynamic Weight Maps: The ViT outputs a set of “weight maps” (one map per expert, shape matching the forecast grid) and an additional bias map b. The weights are normalized via a softmax operation along the expert dimension to enforce convexity at each grid point (i.e., the sum of weights is unity for each spatial or spatiotemporal location).
- Final Prediction: For each variable and grid point, the final forecast is computed as $\hat{y} = \sum_i w_i \odot x_i + b$, where $x_i$ is expert $i$'s forecast field, $w_i$ is its weight map, and $\odot$ denotes elementwise multiplication. The bias map $b$ adds a global adjustment.
This vision-attentive gating mechanism imbues MoWE with the ability to discern complex spatial and temporal regimes in which specific experts excel or underperform, and to synthesize this information dynamically at inference time.
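The sketch below illustrates a gating network of this kind in PyTorch. It is a reconstruction under stated assumptions rather than the authors' exact architecture: the class name `MoWEGate`, the patch size, the embedding width, and the lead-time conditioning via a small MLP are all illustrative choices.

```python
import torch
import torch.nn as nn

class MoWEGate(nn.Module):
    """Illustrative ViT-style gating network (not the authors' exact architecture).

    Input: stacked expert forecasts (B, E, H, W) plus a scalar lead time per sample.
    Output: softmax-normalized weight maps (B, E, H, W) and a bias map (B, H, W).
    """
    def __init__(self, n_experts: int, patch: int = 8, dim: int = 128,
                 depth: int = 4, heads: int = 4):
        super().__init__()
        self.n_experts, self.patch = n_experts, patch
        # Patch embedding: each patch of the multi-channel "forecast image" becomes a token.
        self.embed = nn.Conv2d(n_experts, dim, kernel_size=patch, stride=patch)
        # Lead-time conditioning via a small MLP added to every token.
        self.lead_mlp = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Each token is projected back to (n_experts + 1) values per pixel of its patch:
        # n_experts weight logits plus one bias channel.
        self.head = nn.Linear(dim, (n_experts + 1) * patch * patch)

    def forward(self, experts: torch.Tensor, lead_time: torch.Tensor):
        B, E, H, W = experts.shape
        tokens = self.embed(experts)                    # (B, dim, H/p, W/p)
        h, w = tokens.shape[-2:]
        tokens = tokens.flatten(2).transpose(1, 2)      # (B, N, dim) with N = h * w
        tokens = tokens + self.lead_mlp(lead_time.view(B, 1))[:, None, :]
        tokens = self.encoder(tokens)
        out = self.head(tokens)                         # (B, N, (E + 1) * p * p)
        out = out.view(B, h, w, E + 1, self.patch, self.patch)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(B, E + 1, H, W)
        weights = torch.softmax(out[:, :E], dim=1)      # convex weights per grid point
        bias = out[:, E]                                # additive bias map
        return weights, bias

# Usage: synthesize a forecast from four hypothetical experts at a 48 h lead time.
gate = MoWEGate(n_experts=4)
experts = torch.randn(2, 4, 64, 128)                    # (batch, experts, lat, lon)
weights, bias = gate(experts, lead_time=torch.tensor([48.0, 48.0]))
forecast = (weights * experts).sum(dim=1) + bias        # (2, 64, 128)
```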
3. Empirical Performance: Evaluation and RMSE Metrics
MoWE is designed to minimize the mean squared error (MSE), and RMSE is used as the central evaluation measure:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{y}_j - y_j\right)^2},$$
where $\hat{y}_j$ is the MoWE prediction and $y_j$ is the ground truth at the $j$-th grid point. In comparative studies across multiple lead times and variables (see the Results section of (Chakraborty et al., 10 Sep 2025)):
- At short lead times: Some individual experts (e.g., Pangu) may outperform the simple mean ensemble.
- At longer lead times: Simple model averaging closes the gap with the best expert, but MoWE still maintains a consistent improvement in RMSE, by a margin of up to 10% at the 2-day horizon.
- Spatial Adaptivity: The dynamic weights assigned by the gating network shift across geographic regions—demonstrating that MoWE learns, for instance, to rely more heavily on experts whose performance is superior in particular climatic regimes or at certain times.
These improvements are realized for deterministic predictions, with all calculations conducted grid pointwise for each forecast variable and lead time.
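As a concrete illustration of this evaluation, a short sketch (assuming PyTorch; the fields below are synthetic stand-ins rather than real forecast data, with noise levels chosen only to mimic the reported gap) of the grid-pointwise RMSE computation for a single variable and lead time:

```python
import torch

def rmse(prediction: torch.Tensor, truth: torch.Tensor) -> torch.Tensor:
    """RMSE over all grid points of a single variable at a single lead time."""
    return torch.sqrt(torch.mean((prediction - truth) ** 2))

# Synthetic stand-ins: a ground-truth field and two forecasts with different error levels.
truth = torch.randn(64, 128)
expert_forecast = truth + 0.30 * torch.randn(64, 128)
mowe_forecast = truth + 0.27 * torch.randn(64, 128)   # noise chosen to mimic a ~10% RMSE gain

print(f"expert RMSE: {rmse(expert_forecast, truth).item():.3f}")
print(f"MoWE   RMSE: {rmse(mowe_forecast, truth).item():.3f}")
```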
4. Computational Efficiency and Scaling
The MoWE model presents a significant reduction in resource consumption compared to conventional re-training of full expert models:
- Low Additional Overhead: Training and inference costs of the MoWE gating network are orders of magnitude smaller than those associated with each individual expert model.
- Parallelism: MoWE can operate on precomputed expert forecasts, enabling "post-processing" ensemble correction atop existing supercomputer outputs or deep-learning models—facilitating fast deployment and rapid iteration.
- Scalability: The ViT-based gating approach is inherently parallelizable and can accommodate an arbitrary (fixed) number of experts, making the method amenable to future expansion as new high-quality expert models become available.
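A hedged sketch of this post-processing workflow, reusing the illustrative `MoWEGate` class from the earlier sketch (in practice the expert fields would be read from precomputed forecast archives and the gating weights loaded from a trained checkpoint; random tensors and illustrative expert names stand in here):

```python
import torch

# Precomputed expert fields for one variable and lead time (random stand-ins;
# the expert names are illustrative, matching models cited in the text).
precomputed = {
    "pangu": torch.randn(64, 128),
    "aurora": torch.randn(64, 128),
    "fcn3": torch.randn(64, 128),
}

with torch.no_grad():                                   # inference only; experts are never re-run
    fields = torch.stack(list(precomputed.values()), dim=0).unsqueeze(0)  # (1, E, H, W)
    gate = MoWEGate(n_experts=len(precomputed))         # in practice, load trained weights here
    w, b = gate(fields, lead_time=torch.tensor([48.0]))
    corrected = (w * fields).sum(dim=1) + b             # ensemble-corrected deterministic field
```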
5. Methodological Implications and Impact on Weather Forecasting
The MoWE strategy represents a departure from traditional competitive benchmarking of weather models toward a collaborative, “multi-headed” paradigm. Implications include:
- Expert Exploitation: Heterogeneity in expert model skill across variables, regions, and scenarios is systematically leveraged, rather than suppressed by uniform model averaging.
- Bias Correction: The adaptive weighting provides a mechanism for robustly correcting for recurring systematic biases in component models, without introducing additional tuning parameters or complex post-processing pipelines.
- Operational Attractiveness: Because training and deployment are lightweight, MoWE can feasibly be integrated into forecasting operations without prohibitive computational expense and without disrupting the production of existing models.
A plausible implication is that as further data-driven models emerge—each excelling under different conditions—MoWE or similar ensemble strategies could become the de facto standard for combined deterministic and probabilistic weather forecasting.
6. Limitations and Future Research Directions
While the MoWE approach as defined in (Chakraborty et al., 10 Sep 2025) demonstrates substantial empirical gains and computational benefits, several future directions are evident:
- Probabilistic Outputs: There is potential to extend the gating network and synthesis function to the combination of full probabilistic forecast distributions, rather than pointwise deterministic fields.
- Integration with Physics-based Experts: As more “hybrid” or physically informed data-driven models emerge, future MoWE frameworks may treat not only deep learning models but also high-resolution numerical simulations as experts—expanding the diversity and reliability of the expert pool.
- Robustness to Expert Failure: Handling cases where certain experts become unavailable or suffer from catastrophic prediction failures remains an area for further methodological development.
- Adaptive Expert Pool Expansion: Dynamic addition (or retirement) of expert models and temporal adaptation of the gating function, potentially with meta-learning approaches, could provide further accuracy and resilience improvements.
7. Summary Table: Core Elements of MoWE for Weather Forecasting
Element | Description | Mathematical Formulation |
---|---|---|
Synthesized Prediction | Weighted sum of expert forecasts using softmax-normalized weights | $\hat{y} = \sum_i w_i \odot x_i + b$ |
Gating Network | Vision Transformer applied to stacked forecast fields (plus metadata) | N/A (ViT-based, patched, multi-layer) |
Training Objective | Minimize MSE between synthesized and ground truth fields | $\frac{1}{N}\sum_{j=1}^{N}\left(\hat{y}_j - y_j\right)^2$ |
Weight Normalization | Spatial softmax constraint along expert dimension | $\sum_i w_i(p) = 1$ for each grid point $p$ |
Efficiency | Lightweight: no backprop to expert models; gating model is compact | N/A |
This organizational approach enables a modular, adaptive, and computationally attractive methodology for pushing the state of the art in operational weather forecasting by maximally exploiting the strengths and diverse skill sets of contemporary data-driven models (Chakraborty et al., 10 Sep 2025).