3D Earth-Specific Transformer (3DEST)
- 3DEST is a transformer architecture designed for 3D Earth system datasets, incorporating spherical geometry and Earth-specific positional encoding.
- It employs localized self-attention and multimodal fusion to efficiently capture spatial and temporal dependencies in weather and geophysical modeling.
- Applications include state-of-the-art weather forecasting, subsurface reconstruction, and 3D asset synthesis with measurable improvements over traditional models.
A 3D Earth-Specific Transformer (3DEST) is a transformer-based deep learning architecture designed for spatially and physically structured Earth system datasets, typically encompassing multi-modal, multi-scale, and three-dimensional properties. It advances the processing and modeling of volumetric, geospatial, and environmental data by explicitly incorporating spherical geometry, domain-specific positional encoding, and flexible input representations adapted to the complexity of Earth science applications.
1. Architectural Principles of 3DEST
The 3DEST design is exemplified in models such as Pangu-Weather (Bi et al., 2022) and Transparent Earth (Mazumder et al., 2 Sep 2025). In these systems, input data are not restricted to planar 2D grids but are volumetrically structured, with additional axes capturing atmospheric height/pressure layers or subsurface properties. The canonical input structure for atmospheric modeling, for example, consists of upper-air variables over 13 pressure levels and surface variables, arranged into data cubes (e.g., [13 × 1440 × 721 × 5] for upper air). Patch embedding, dimensionality reduction, and latent cube formation convert these cubes into a representation suitable for attention mechanisms.
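The cube-to-token pipeline can be sketched as follows; the patch size, toy grid resolution, and latent dimension here are illustrative assumptions, not Pangu-Weather's exact configuration:

```python
import numpy as np

def patch_embed(cube, patch=(2, 4, 4), dim=32, rng=None):
    """Split a [levels, lon, lat, vars] cube into non-overlapping 3D patches
    and linearly project each patch to a `dim`-sized token (random weights
    stand in for a learned projection)."""
    rng = rng or np.random.default_rng(0)
    L, X, Y, V = cube.shape
    pl, px, py = patch
    # pad so each spatial axis divides evenly by the patch size
    pad = [(0, (-s) % p) for s, p in zip((L, X, Y), patch)] + [(0, 0)]
    cube = np.pad(cube, pad)
    L, X, Y, _ = cube.shape
    # carve into patches, then flatten each patch into one vector
    patches = cube.reshape(L // pl, pl, X // px, px, Y // py, py, V)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, pl * px * py * V)
    W = rng.standard_normal((patches.shape[1], dim)) / np.sqrt(patches.shape[1])
    return patches @ W  # [num_tokens, dim]: the flattened latent cube

# toy upper-air cube: 13 pressure levels, coarse 16x9 grid, 5 variables
tokens = patch_embed(np.ones((13, 16, 9, 5)))
```

With padding, the 13 levels become 7 patch rows along the vertical axis, giving 7 × 4 × 3 = 84 tokens for this toy grid.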
Crucially, self-attention is performed within local windows or patches (shifted-window in 3DEST, sometimes cuboid attention in related architectures such as Earthformer (Gao et al., 2022)). This localization is designed to efficiently capture both inter-level and spatial dependencies, and is enhanced by a learnable Earth-specific positional bias (B_ESP) that reflects the nonuniform grid and spherical structure of Earth coordinates.
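A minimal sketch of self-attention inside one local window with an additive positional bias standing in for B_ESP; random projections replace learned weights, and the bias table would be learned per (level, latitude) offset in practice:

```python
import numpy as np

def window_attention(x, bias, rng=None):
    """Self-attention restricted to one local window, with an additive
    Earth-specific positional bias on the attention logits.
    x: [tokens_in_window, dim]; bias: [tokens, tokens] learnable table."""
    rng = rng or np.random.default_rng(0)
    n, d = x.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / np.sqrt(d) + bias   # bias reflects the nonuniform grid
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

win = np.random.default_rng(1).standard_normal((8, 16))  # one 8-token window
b_esp = np.zeros((8, 8))                                 # learnable in practice
out = window_attention(win, b_esp)
```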
For multimodal and subsurface modeling, tokens are augmented with both modality and positional encodings. In the Transparent Earth approach, each input token comprises the raw feature vector, a positional encoding, and a modality embedding, before being projected into the shared latent space.
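A hedged sketch of token construction; whether the three components are concatenated or summed before projection, and the vector sizes, are assumptions here:

```python
import numpy as np

def make_token(x, pos_enc, mod_emb, W):
    """One input token: raw features, positional encoding, and modality
    embedding combined, then projected into the shared latent space."""
    return np.concatenate([x, pos_enc, mod_emb]) @ W

rng = np.random.default_rng(0)
x, pe, me = rng.random(4), rng.random(6), rng.random(3)  # toy sizes
W = rng.standard_normal((13, 32)) / np.sqrt(13)          # 4+6+3 -> latent dim 32
token = make_token(x, pe, me, W)
```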
2. Geospatial and Physical Encoding Strategies
A core technical advance is the use of Earth-specific positional encoding, moving beyond conventional flat 2D or sequential 1D schemes. For atmospheric data, the positional bias is a learnable function of latitude, vertical position (pressure level), and longitude, with cyclic handling of Earth's meridians.
For geospatial tokens, “geotokens” (Unlu, 23 Mar 2024) extend the RoPE (Rotary Position Embedding) method to spherical coordinates, applying an Euler-inspired rotation matrix parametrized by latitude and longitude.
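The idea can be sketched as a position-dependent rotation applied block-wise to an embedding, in the spirit of RoPE's pairwise rotations; the exact Euler convention and block size used by geotokens are assumptions here:

```python
import numpy as np

def geo_rotation(lat, lon):
    """3D rotation parametrized by latitude/longitude in radians; the
    specific Euler convention is an illustrative assumption."""
    cl, sl = np.cos(lat), np.sin(lat)
    co, so = np.cos(lon), np.sin(lon)
    Ry = np.array([[cl, 0, sl], [0, 1, 0], [-sl, 0, cl]])  # latitude tilt
    Rz = np.array([[co, -so, 0], [so, co, 0], [0, 0, 1]])  # longitude spin
    return Rz @ Ry

def rotate_embedding(x, lat, lon):
    """Apply the rotation to each 3-dim block of the embedding, so relative
    spatial relationships survive as relative rotations."""
    R = geo_rotation(lat, lon)
    return (x.reshape(-1, 3) @ R.T).reshape(-1)

x = np.arange(12, dtype=float)        # embedding dim must be a multiple of 3
y = rotate_embedding(x, lat=0.7, lon=1.2)
```

Because the blocks are rotated by an orthogonal matrix, the embedding norm is preserved and inner products between tokens depend only on their relative position on the sphere.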
This encoding ensures that relative spatial relationships are properly modeled in the embedding space, enabling accurate representation of proximity and spatial orientation.
For multimodal fusion (as in Transparent Earth), positional encodings are computed using sinusoidal transforms augmented with depth for subsurface modalities, e.g.:

PE(φ, λ, d) = [sin(f_k φ), cos(f_k φ), sin(f_k λ), cos(f_k λ), sin(f_k d), cos(f_k d)]_{k=1,…,K}

where φ is latitude, λ is longitude, f_k are fixed-frequency bands, and d is the depth coordinate.
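A minimal sketch of such an encoding; the frequency bands chosen here are illustrative, not the paper's values:

```python
import numpy as np

def sincos_pe(lat, lon, depth, freqs=(1.0, 2.0, 4.0, 8.0)):
    """Sinusoidal positional encoding over latitude, longitude, and depth:
    each coordinate is expanded over fixed-frequency bands and passed
    through sin and cos, then flattened into one vector."""
    coords = np.array([lat, lon, depth])[:, None] * np.array(freqs)[None, :]
    return np.concatenate([np.sin(coords), np.cos(coords)], axis=-1).ravel()

pe = sincos_pe(lat=0.61, lon=-1.92, depth=0.3)  # radians and normalized depth
```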
3. Temporal and Multi-View Aggregation Algorithms
Medium-range Earth system forecasting is challenged by cumulative error in iterative prediction. Pangu-Weather implements a hierarchical temporal aggregation strategy: separate models are trained for differing lead times (1h, 3h, 6h, 24h), and a greedy scheduling algorithm chooses the optimal combination to minimize iterative error for arbitrary forecast horizons. For a 7-day forecast, a succession of 24-hour models reduces error accumulation compared to chaining 28 separate 6-hour models (Bi et al., 2022).
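The greedy scheduling step can be sketched as follows; the available lead times come from the source, and the largest-fits-first rule minimizes the number of chained model calls:

```python
def schedule(horizon_h, leads=(24, 6, 3, 1)):
    """Greedily cover a forecast horizon (in hours) with the fewest model
    calls, always picking the largest lead time that still fits."""
    steps = []
    remaining = horizon_h
    for lead in leads:
        while remaining >= lead:
            steps.append(lead)
            remaining -= lead
    return steps

week = schedule(168)   # 7-day forecast: seven 24 h steps, not 28 x 6 h
odd = schedule(31)     # mixed horizon: one 24 h, one 6 h, one 1 h step
```

Fewer iterations means fewer opportunities for error to compound, which is exactly the rationale given for preferring seven 24-hour steps over twenty-eight 6-hour steps.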
In 3D representation learning and diffusion-based generative models (see DiffTF++ (Cao et al., 13 May 2024)), multi-view constraints are imposed. The training loss combines DDPM denoising with multi-view reconstruction, projecting generated 3D features onto orthogonal planes and computing MSE against ground-truth colors:

L_total = L_DDPM + λ_mv · Σ_v ‖Î_v − I_v‖²

where Î_v is the rendering on view plane v and I_v the corresponding ground truth.
This ensures cross-view consistency, a necessity for geospatial synthesis and rendering.
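A toy sketch of such a combined objective; the weighting factor, tensor shapes, and the identity "renderer" are illustrative assumptions:

```python
import numpy as np

def combined_loss(eps_pred, eps_true, renders, gts, lam=0.5):
    """Total training loss: DDPM denoising MSE plus a multi-view term
    averaging per-plane MSE between renderings and ground-truth colors."""
    ddpm = np.mean((eps_pred - eps_true) ** 2)
    mv = np.mean([np.mean((r - g) ** 2) for r, g in zip(renders, gts)])
    return ddpm + lam * mv

rng = np.random.default_rng(0)
eps_p, eps_t = rng.random(16), rng.random(16)
views = [rng.random((4, 4, 3)) for _ in range(3)]  # three orthogonal planes
loss = combined_loss(eps_p, eps_t, views, views)   # identical views: mv term is 0
```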
4. Multimodal Integration and Scalability
3DEST is architected for extensible fusion of heterogeneous datasets using attention and flexible encoding. The Transparent Earth model integrates eight modalities ranging from stress and strain angles to sediment thickness, mantle temperature, and categorical tectonic, fault, or basin types (Mazumder et al., 2 Sep 2025). Each modality is embedded via pretrained text models projecting descriptive strings to compact vectors; this allows dynamic expansion of the model's vocabulary.
The decoder attends to input representations using queries formed by concatenating the positional encoding with the text-derived task embedding.
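A minimal sketch of query construction, with hypothetical encoding sizes:

```python
import numpy as np

def build_query(pos_enc, task_emb):
    """Decoder query: positional encoding of the target location
    concatenated with the text-derived task/modality embedding."""
    return np.concatenate([pos_enc, task_emb])

pe = np.random.default_rng(0).random(24)    # e.g. sinusoidal lat/lon/depth code
task = np.random.default_rng(1).random(32)  # projected text-description vector
q = build_query(pe, task)
```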
Extensive scalability tests show monotonic improvement in regression and classification metrics as the model is expanded from 3M to 243M parameters.
5. Applications Across Earth Systems
Deterministic weather forecasting is a central application. With the 3DEST architecture, Pangu-Weather reports state-of-the-art RMSE reductions (e.g., RMSE for Z500 reduced from 333.7 [IFS] to 296.7 [Pangu-Weather] for 5-day forecasts) and achieves inference more than 10,000× faster than conventional numerical weather prediction (Bi et al., 2022). It supports high-resolution deterministic forecasts, ensemble modeling, and extreme weather event tracking, including tropical cyclone localization.
Subsurface field reconstruction in Transparent Earth demonstrates significant MAE reductions in stress angle estimation, from ~33° with no inputs to 9°–13° with only sparse multimodal observations (Mazumder et al., 2 Sep 2025). This multimodal foundation model strategy is poised to support resource exploration, hazard assessment, and continental/regional geophysical inference.
Diffusion-based variants (DiT1D (Gabrielidis et al., 24 Apr 2025), DiffTF++ (Cao et al., 13 May 2024)) extend applicability to earthquake time-history generation and 3D asset synthesis. DiT1D can “super-resolve” low-frequency physics-based simulated accelerograms into broadband time histories suitable for structural design and digital twins, outperforming UNet and NU-Wave2 baselines in both metrics and realism.
For spherical vision tasks (HEAL-SWIN (Carlsson et al., 2023)), the use of HEALPix-based hierarchical pixelation within transformer architectures achieves superior segmentation and depth estimation for omnidirectional sensors, with accurate spherical classification (99.20% accuracy on nonrotated MNIST).
6. Innovations and Comparative Evaluation
Key technical strengths of 3DEST include:
- Explicit 3D spatial embedding and windowed attention adapted to physical geometry (spherical coordinates for weather/climate, 3D triplane for assets, cube representation for subsurface).
- Learnable Earth-specific positional biases that improve attention computation in physically structured grids.
- Early fusion and multimodal encoding that allows seamless expansion to new data types.
- Hierarchical temporal strategies and multi-view constraints that enhance both forecasting and generative fidelity.
- Efficient and scalable transformer backbone compatible with high-parameter implementations and extensive data augmentation.
Comparative evaluations indicate significant improvements over traditional NWP, convolutional baselines, and earlier transformer designs across Earth system forecasting, asset generation, object detection, and geospatial learning. For example, the Interactive SSM paradigm (Wang et al., 18 Mar 2025) demonstrates superior 3D object detection accuracy through dynamic updates of both query and scene features, outperforming DETR-based baselines by +5.3 AP50 on ScanNetV2.
7. Challenges and Future Directions
Operationalization of 3DEST architectures presents challenges, including:
- Computational overhead in handling high-resolution, multiscale 3D data, motivating ongoing refinement of kernel precomputation, memory reuse, and efficient inversion algorithms.
- Extension to full 4D (space-time) models for comprehensive dynamic forecasting.
- Improved physical grounding via incorporation of meteorological, geological, and geophysical priors, potentially via hybrid stochastic-deterministic architectures.
- Zero-shot generalization and modality expansion via natural language embedding.
- Integration with digital twin frameworks for real-time simulation and hazard assessment.
Advances in positional encoding, attention mechanisms, and multimodal representation are likely to further enhance the fidelity and scope of Earth-specific transformers, supporting increasingly holistic approaches to environmental and planetary modeling.
A 3D Earth-Specific Transformer is thus an extensible, physically grounded, and scalable architecture that encodes spatial, temporal, and modal structure of Earth system observations. Its innovations in encoding, aggregation, and fusion have propelled advances in weather forecasting, geospatial asset synthesis, subsurface reconstruction, and spherical vision, with documented improvements in accuracy, efficiency, and adaptability across a range of scientific domains.