Aurora Foundation Model Overview

Updated 21 November 2025
  • Aurora Foundation Model is a family of large-scale transformer architectures for Earth system modeling, weather forecasting, and multimodal time series analysis.
  • It employs hierarchical, patch-based transformer backbones with modality-specific encoders and innovative adaptation strategies to ensure cross-domain generalization.
  • It achieves superior computational efficiency and predictive accuracy through lightweight decoder training and full fine-tuning, validated against operational forecasting benchmarks.

The Aurora Foundation Model encompasses a family of large-scale transformer architectures designed as general-purpose foundation models for applications in Earth system modeling, weather and hydrological forecasting, multimodal time series analysis, and parameter-efficient multimodal learning. Distinct variants have been developed by independent research groups, unified by a focus on scalable pretraining, strong cross-domain generalization, extensibility to new targets, and computational efficiency through architecture and adaptation design.

1. Model Architectures and Core Design Principles

The Aurora family is grounded in hierarchical, patch-based transformer backbones employing modality-specific encoders and sophisticated adaptation strategies.

1.1. Atmospheric and Earth System Foundation Model

Aurora for weather and Earth system prediction uses an encoder–processor–decoder pipeline (Bodnar et al., 2024, Lehmann et al., 23 Jun 2025); a shape-level sketch follows the list below:

  • Inputs: Two consecutive global "images" $X^{t-1}, X^t$ of size $T \times H \times W$ (typically $T=2$, $H=720$, $W=1440$).
  • Patch Embedding: Images are tiled into patches ($T \times P \times P$, $P=4$), then linearly projected into $E=512$-dimensional embeddings.
  • Encoder: A 3D Perceiver condenses $C=13$ atmospheric pressure levels to $C_L=3$ "latent levels." Surface and static variables, such as land–sea mask and soil type, are concatenated as additional channels.
  • Processor: 3D Swin-Transformer U-Net backbone for spatiotemporal encoding. Outputs a latent tensor of shape $(HW/P^2) \times 4 \times 2E$ (surface + latent levels).
  • Decoder: Variable-specific linear layers project $2E$-dim embeddings to patch reconstructions.
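
To make the tensor shapes concrete, the following is a minimal shape-only sketch using the constants above; it is illustrative Python, not the released Aurora implementation.

```python
# Shape-level sketch of the encoder–processor–decoder pipeline described above.
# Constants follow the text; this is illustrative, not the released Aurora code.
T, H, W = 2, 720, 1440          # two consecutive global snapshots on a 0.25° grid
P, E = 4, 512                   # patch size and embedding dimension
C, C_L = 13, 3                  # pressure levels condensed to latent levels

num_patches = (H // P) * (W // P)   # 180 * 360 = 64,800 tokens per level
levels = 1 + C_L                    # surface "level" plus the three latent levels
latent_shape = (num_patches, levels, 2 * E)
print(latent_shape)                 # (64800, 4, 1024), i.e. (HW/P^2) x 4 x 2E
```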

1.2. Multimodal and Parameter-Efficient Aurora Variants

Aurora has also been instantiated as:

  • A multimodal time series forecasting backbone with separate encoders for text (BERT), images (ViT), and time series, using modality-guided multi-head self-attention and prototype-guided flow-matching generative decoders (Wu et al., 26 Sep 2025).
  • A parameter-efficient cross-modal prompt-tuning variant leveraging mode-approximation with only ~0.1M tunable parameters for efficient transfer (Wang et al., 2023).

2. Pretraining Datasets, Objectives, and Representation Learning

2.1. Pretraining Corpus and Coverage

Weather and Earth System Aurora:

Aurora is pretrained on >1 million hours of heterogeneous weather and climate datasets (~1.2 PB), incorporating:

  • Reanalyses (ERA5, MERRA-2, CAMSRA)
  • Operational forecasts (GFS, IFS-HRES), ensemble means, climate simulations (CMCC-CM2-VHR4)
  • Spatiotemporal resolutions: $0.25^\circ$–$0.75^\circ$ in space, $\Delta t = 6$ h steps
  • Variables: 4 surface variables (2 m temperature, 10 m wind components, mean sea-level pressure), 5 atmospheric variables on up to 13 pressure levels, plus static maps (Bodnar et al., 2024)

Multimodal Time Series Aurora:

>1 billion labeled samples drawn from 30+ open datasets (ERA5, Monash, UEA/UCR, IoT), with textual descriptions generated at sample level (Wu et al., 26 Sep 2025).

2.2. Learning Objectives

Latitude-Weighted Mean Absolute Error (MAE):

$$\mathcal{L}_{\mathrm{pretrain}} = \sum_{\ell \in \text{levels}} \sum_{c \in \text{channels}} \frac{1}{HW} \sum_{i,j} w_i \left| \hat{X}_{i,j,c,\ell} - X_{i,j,c,\ell} \right|, \qquad w_i = \cos(\text{latitude}_i)$$

to account for grid cell area heterogeneity (Lehmann et al., 23 Jun 2025).
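
A minimal NumPy sketch of this objective, assuming inputs of shape (levels, channels, H, W); the operational training code may differ in masking and normalization details:

```python
import numpy as np

def latitude_weighted_mae(pred, target, lat_deg):
    """Latitude-weighted MAE matching the pretraining objective above.

    pred, target: arrays of shape (levels, channels, H, W)
    lat_deg:      array of shape (H,) with grid-cell latitudes in degrees
    """
    w = np.cos(np.deg2rad(lat_deg))[None, None, :, None]    # w_i = cos(latitude_i)
    H, W = pred.shape[-2:]
    per_field = (w * np.abs(pred - target)).sum(axis=(-2, -1)) / (H * W)
    return per_field.sum()    # sum over levels and channels, as in the formula
```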

Multimodal and Domain Guidance:

For time series variants, objective functions include flow-matching for generative probabilistic forecasting and cross-modal attention terms that explicitly inject distilled knowledge from text/image modalities (Wu et al., 26 Sep 2025).
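
As a concrete reference for the flow-matching term, the sketch below implements a generic conditional flow-matching loss with a linear interpolation path; `v_theta`, `x0`, `x1`, and `cond` are placeholder names, and the paper's exact objective and conditioning may differ.

```python
import torch

def flow_matching_loss(v_theta, x0, x1, cond):
    """Generic conditional flow-matching loss over a linear interpolation path.

    v_theta: placeholder velocity network taking (x_t, t, cond)
    x0:      noise samples; x1: observed series windows, both (batch, dim)
    """
    t = torch.rand(x1.shape[0], 1)             # sample time t ~ U[0, 1]
    xt = (1 - t) * x0 + t * x1                 # point on the straight-line path
    target_velocity = x1 - x0                  # constant velocity of that path
    return ((v_theta(xt, t, cond) - target_velocity) ** 2).mean()
```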

3. Adaptation Strategies: Lightweight Decoders and Full Model Tuning

3.1. Lightweight Decoder Approach

Instead of updating all model weights for new variables (e.g., hydrology), a compact three-layer MLP ($E \to E/2 \to E/2 \to 1$; ReLU activations, $\approx$300k parameters per head) is trained atop the frozen Aurora latent tensor (Lehmann et al., 23 Jun 2025); a minimal head definition is sketched after the list below. Training minimizes a latitude-weighted MAE on relevant land pixels, with no weight decay or dropout. This approach:

  • Reduces wall-clock time by ~50%
  • Decreases memory usage by ~35%
  • Inherits autoregressive stability from the backbone
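
A minimal PyTorch sketch of such a head, following the E → E/2 → E/2 → 1 description; the input width (taken here as 2E = 1024) and other training details are assumptions:

```python
import torch.nn as nn

def make_decoder_head(embed_dim: int) -> nn.Sequential:
    """Per-variable decoder head following the E -> E/2 -> E/2 -> 1 description."""
    return nn.Sequential(
        nn.Linear(embed_dim, embed_dim // 2), nn.ReLU(),
        nn.Linear(embed_dim // 2, embed_dim // 2), nn.ReLU(),
        nn.Linear(embed_dim // 2, 1),    # one scalar target per latent token
    )

head = make_decoder_head(embed_dim=1024)   # 2E = 1024 if E = 512, per Section 1.1
```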

3.2. Full Fine-Tuning Baseline (Aurora⁺)

Aurora⁺ unfreezes all 1.3B parameters, adds new variables as input/output channels, and optimizes the original loss jointly over prior and new targets. This method achieves lower error on new variables but with substantially higher resource requirements (peak GPU memory ~99 GB, ~20× more FLOPs) (Lehmann et al., 23 Jun 2025).

3.3. Relation to Parameter-Efficient Multimodal Tuning

Aurora mode-approximation (for vision–language transfer) learns a compact difference tensor over pretrained attention weights via a CP decomposition, $\Delta\mathcal{W} \approx \sum_{r=1}^{R} \lambda_r (u_r \circ v_r \circ p_r)$, yielding tunable updates that amount to just 0.04% of the base model size (Wang et al., 2023).
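
In code, the update amounts to a few lines; the rank and tensor dimensions below are illustrative assumptions, not the paper's configuration:

```python
import torch

def cp_delta(lmbda, U, V, P_):
    """lmbda: (R,), U: (R, d1), V: (R, d2), P_: (R, d3) -> (d1, d2, d3) update."""
    return torch.einsum('r,ri,rj,rk->ijk', lmbda, U, V, P_)

R, d1, d2, d3 = 8, 768, 768, 12                     # illustrative rank and dims
lmbda = torch.randn(R, requires_grad=True)          # only these small factors
U = torch.randn(R, d1, requires_grad=True)          # are trainable; the base
V = torch.randn(R, d2, requires_grad=True)          # attention weights stay frozen
P_ = torch.randn(R, d3, requires_grad=True)
delta_W = cp_delta(lmbda, U, V, P_)                 # added to the frozen weights
print(delta_W.shape)                                # torch.Size([768, 768, 12])
```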

4. Empirical Performance and Evaluation

4.1. Atmosphere, Hydrology, and Cross-Variable Generalization

Aurora’s latent representation enables accurate prediction of hydrological and radiative variables never seen during pretraining via decoder heads. Performance metrics (6 h lead time, 2020 evaluation period) (Lehmann et al., 23 Jun 2025):

| Variable | Metric | Decoder | Aurora⁺ |
|---|---|---|---|
| Potential evaporation | PCC | 0.958 | 0.992 |
| Runoff | PCC | 0.420 | 0.559 |
| Soil moisture | PCC | 0.969 | 0.999 |
| Precipitation | MAE (mm) | 0.32 | 0.22 |
| Precipitation | PCC | 0.71 | 0.86 |

High Pearson correlations for evaporation and soil moisture confirm that Aurora’s latent space encodes multivariate physical dependencies, while moderate performance for runoff suggests limits governed by pretraining variable correlations.
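
For reference, the PCC and MAE values in the table above can be estimated with a minimal sketch like the following (any area weighting or land masking used in the actual evaluation is omitted):

```python
import numpy as np

def pcc(pred, target):
    """Pearson correlation coefficient over flattened fields."""
    return np.corrcoef(pred.ravel(), target.ravel())[0, 1]

def mae(pred, target):
    """Mean absolute error, in the variable's units (e.g. mm for precipitation)."""
    return np.abs(pred - target).mean()
```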

Aurora also demonstrates state-of-the-art results against operational and neural weather prediction systems on global high-resolution weather, air quality, and extreme event forecasts (Bodnar et al., 2024).

4.2. Computational Efficiency

  • Lightweight decoder training: $\sim 4 \times 10^{11}$ FLOPs, 0.34 samples/s, 65 GB peak GPU memory
  • Full fine-tuning: $\sim 3.1 \times 10^{13}$ FLOPs, 0.16 samples/s, 99 GB peak GPU memory

Inference on a single GPU is orders of magnitude faster (up to 5000×) than classical NWP (Bodnar et al., 2024).

4.3. Stability and Rollout

Decoder heads maintain stable multi-step (up to 384h lead) autoregressive rollouts without error escalation, indicating that freezing the backbone preserves the temporal consistency learned during pretraining (Lehmann et al., 23 Jun 2025).
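
A hedged sketch of such a rollout; `backbone` and `head` are placeholders, and the assumption that the backbone returns both a latent tensor and the next atmospheric state is an interface simplification, not the actual Aurora API:

```python
import torch

@torch.no_grad()
def rollout(backbone, head, state, n_steps=64):
    """Autoregressive rollout with a frozen backbone and a task-specific head.

    64 steps at a 6 h increment give the 384 h lead time mentioned above.
    `backbone` is assumed to return (latent_tensor, next_state); this interface
    is a simplification, not the actual Aurora API.
    """
    preds = []
    for _ in range(n_steps):
        latent, next_state = backbone(state)   # frozen pretrained forward pass
        preds.append(head(latent))             # decode the new target variable
        state = next_state                     # feed the forecast back in
    return torch.stack(preds)
```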

4.4. Multimodal Time Series Zero-Shot and Probabilistic Forecasting

Aurora’s multimodal time series variant outperforms prior approaches (e.g., Sundial, VisionTS) across TimeMMD, TSFM-Bench, and ProbTS, with up to 31% MSE and 38% CRPS reduction in zero-shot cross-domain settings (Wu et al., 26 Sep 2025). Performance spans deterministic, probabilistic, unimodal, and multimodal scenarios.
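
For completeness, the CRPS cited above can be estimated from forecast samples with the standard energy-form estimator; this sketch scores a single scalar target and omits fair-estimator corrections:

```python
import numpy as np

def crps_ensemble(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.

    samples: (n,) forecast draws for a single scalar target y.
    """
    term1 = np.abs(samples - y).mean()
    term2 = 0.5 * np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - term2
```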

5. Interpretability, Latent Space, and Extensibility

Aurora’s latent representations, though trained only on a finite set of atmospheric variables, are empirically shown to encode physical relationships with unobserved targets. Decoder prediction skill for new variables correlates strongly with their known physical coupling to pretraining variables (e.g., rainfall with moisture-flux convergence). This suggests that an important metric for foundation models in the Earth sciences is extensibility: the capacity to generalize via probing or light adaptation to variables and processes outside the pretraining set (Lehmann et al., 23 Jun 2025).

6. Implications, Best Practices, and Limitations

6.1. Best Practices for Resource-Constrained Use

  • Freeze the backbone to minimize recomputation
  • Attach compact task-specific MLP heads per variable
  • Apply domain-informed loss masking (latitude, land/sea)
  • Employ warmup + cosine decay learning rate schedules (see the sketch after this list)
  • Prefer partial adaptation to full fine-tuning for compute-constrained clusters
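
A minimal sketch of the warmup + cosine decay schedule using PyTorch schedulers; the head, learning rate, and step counts are illustrative assumptions:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

head = torch.nn.Linear(1024, 1)                        # stand-in for a decoder head
opt = torch.optim.AdamW(head.parameters(), lr=3e-4)    # learning rate is an assumption

warmup_steps, total_steps = 1_000, 100_000             # illustrative step counts
scheduler = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=0.01, total_iters=warmup_steps),
        CosineAnnealingLR(opt, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
# In the training loop: call opt.step() then scheduler.step() once per batch.
```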

6.2. Model Limitations and Future Directions

  • Deterministic forecasts only—probabilistic ensembles and improved uncertainty quantification are open research problems (Bodnar et al., 2024, Wu et al., 26 Sep 2025)
  • Global, not regional, optimization—enhancement with regional high-resolution data remains unexplored
  • Input modalities—In multimodal Aurora, current textual descriptions are LLM-generated; performance on real user-provided exogenous metadata has not been evaluated
  • High pretraining cost—parameter-efficient distillation and cross-modal adaptation remain open directions for practical deployment
  • Model extensions—Earth system coupling (land, ocean, ice, air), additional modalities (e.g., audio, radar), and continuous-time decoders are potential future avenues

Aurora offers a rigorous blueprint for scalable, extensible, and computationally tractable foundation modeling in the Earth sciences. Its development marks a convergence of multi-scale transformer architectures, robust cross-domain adaptation, and practical strategies for enabling widespread, resource-aware application across atmospheric and hydrological forecasting domains (Bodnar et al., 2024, Lehmann et al., 23 Jun 2025, Wu et al., 26 Sep 2025, Wang et al., 2023).
