Spatio-Temporal Vision Transformer (ViT-g)
- Spatio-Temporal Vision Transformer (ViT-g) is a model that employs factorized temporal and spatial attention to capture dynamic visual data.
- It leverages fine-grained patch tokenization and date-specific positional encodings to handle irregular satellite image time series.
- The architecture achieves state-of-the-art results on semantic segmentation and yield-forecasting benchmarks, aided by isolated per-class tokens.
Spatio-Temporal Vision Transformer (ViT-g) architectures generalize the Vision Transformer framework by integrating temporal and spatial attention mechanisms for modeling dynamic visual processes, particularly in satellite image time series (SITS) and spatio-temporal remote sensing. Core technical advances center on factorized temporo-spatial Transformers, fine-grained patch tokenization, date-specific positional encodings, per-class tokens, and cross-modal integration for multivariate prediction tasks. State-of-the-art results have been demonstrated across semantic segmentation, classification, and crop-yield forecasting benchmarks, establishing the ViT-g design as a blueprint for spatio-temporal modeling with Transformer-based architectures (Tarasiou et al., 2023, Lin et al., 2023).
1. Foundational Architecture and Tokenization
ViT-g, as formalized in the Temporo-Spatial Vision Transformer (TSViT), processes an input tensor $X \in \mathbb{R}^{T \times H \times W \times C}$, with $T$ temporal steps (acquisition dates), an $H \times W$ spatial grid, and $C$ spectral bands. Tokenization divides $X$ into non-overlapping patches of size $t \times h \times w$ along both space and (optionally) time. Typically $t = 1$ (one date per token), with small spatial patches $h \times w$ to retain boundary fidelity.
Each patch $x_i$ is flattened and linearly projected to a $d$-dimensional token:
$$z_i = E\,\mathrm{vec}(x_i), \qquad E \in \mathbb{R}^{d \times (t\,h\,w\,C)}.$$
For per-frame tokenization ($t = 1$), the patch grid is reshaped to preserve temporal order, $Z \in \mathbb{R}^{T \times N \times d}$, with $N = \tfrac{H}{h}\cdot\tfrac{W}{w}$ spatial positions per frame (Tarasiou et al., 2023).
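As an illustration, the PyTorch sketch below performs this per-frame tokenization with $t = 1$; class names, shapes, and hyperparameters are our own choices for the example, not taken from the reference code.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Per-frame tokenization of a satellite image time series (t = 1).

    Splits each (H, W) frame into non-overlapping h x w patches and linearly
    projects the flattened patches to d-dimensional tokens.
    """
    def __init__(self, bands: int, patch: int, dim: int):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * bands, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -> tokens: (B, T, N, d) with N = (H/h) * (W/w)
        B, T, C, H, W = x.shape
        p = self.patch
        x = x.reshape(B, T, C, H // p, p, W // p, p)
        x = x.permute(0, 1, 3, 5, 2, 4, 6)            # (B, T, H/p, W/p, C, p, p)
        x = x.reshape(B, T, (H // p) * (W // p), C * p * p)
        return self.proj(x)                           # (B, T, N, d)

tokens = PatchTokenizer(bands=10, patch=2, dim=128)(torch.randn(1, 16, 10, 24, 24))
print(tokens.shape)  # torch.Size([1, 16, 144, 128])
```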
2. Factorized Temporo-Spatial Attention Mechanism
The core innovation of ViT-g architectures is the factorization of the attention mechanism into two sequential modules:
- Temporal encoder: For each spatial location, temporal sequences are processed using Transformer blocks augmented with learnable class tokens and acquisition-date-specific positional encodings. The temporal transformer outputs class embeddings for each spatial location.
- Spatial encoder: The output is transposed; for each class $k$, the spatial sequence of class-$k$ embeddings (including the class token) is processed with another stack of Transformer layers, using spatial positional encodings. Cross-class attention is typically restricted to avoid destructive interference.
The update equations for each Transformer block layer $\ell$ use multi-head self-attention (MSA), LayerNorm (LN), and an MLP, each with skip connections:
$$y^{\ell} = \mathrm{MSA}\big(\mathrm{LN}(z^{\ell-1})\big) + z^{\ell-1}, \qquad z^{\ell} = \mathrm{MLP}\big(\mathrm{LN}(y^{\ell})\big) + y^{\ell}.$$
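A minimal PyTorch rendering of this pre-norm block (a generic sketch that matches the equations above, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """y = MSA(LN(z)) + z ; z' = MLP(LN(y)) + y  (pre-norm Transformer block)."""
    def __init__(self, dim: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, L, d)
        h = self.norm1(z)
        y = self.attn(h, h, h, need_weights=False)[0] + z  # MSA + skip connection
        return self.mlp(self.norm2(y)) + y                 # MLP + skip connection
```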
Ablation studies confirm that "Temporal→Spatial" factorization is superior to "Spatial→Temporal" (e.g., 78.5% vs 48.8% mIoU on the Germany semantic segmentation dataset (Tarasiou et al., 2023)), especially when temporal variation encodes critical class separation signals.
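The sketch below illustrates this Temporal→Spatial ordering over a $(B, T, N, d)$ token grid. It is a deliberately simplified rendition that omits class tokens and positional encodings and uses `nn.TransformerEncoder` stacks as stand-ins for the paper's encoders:

```python
import torch
import torch.nn as nn

def factorized_attention(tokens: torch.Tensor,
                         temporal_enc: nn.Module,
                         spatial_enc: nn.Module) -> torch.Tensor:
    """Temporal-then-spatial factorization over a (B, T, N, d) token grid.

    Attention first runs over time independently at each spatial position,
    then over space independently at each time step (class tokens and
    positional encodings omitted for brevity).
    """
    B, T, N, d = tokens.shape
    x = tokens.permute(0, 2, 1, 3).reshape(B * N, T, d)   # one temporal sequence per position
    x = temporal_enc(x)
    x = x.reshape(B, N, T, d).permute(0, 2, 1, 3).reshape(B * T, N, d)  # one spatial sequence per date
    x = spatial_enc(x)
    return x.reshape(B, T, N, d)

layer = lambda: nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True, norm_first=True)
out = factorized_attention(torch.randn(2, 16, 144, 128),
                           nn.TransformerEncoder(layer(), num_layers=4),
                           nn.TransformerEncoder(layer(), num_layers=2))
print(out.shape)  # torch.Size([2, 16, 144, 128])
```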
3. Positional Encoding and Class Token Strategies
ViT-g relies on two unique architectural elements:
- Acquisition-time-specific temporal positional encodings: Rather than fixed encodings, ViT-g uses a learned lookup table of temporal position embeddings, indexed by the actual acquisition date of each observation. This keeps the model robust to the irregular revisit intervals that are common in real satellite missions.
- Multiple learnable class tokens: $K$ distinct class tokens (one for each target class) are trained end-to-end and participate in both the temporal and spatial attention stages:
- Temporal: the $K$ class tokens are prepended to each spatial location's temporal token sequence, and the encoder's outputs at those positions yield $K$ class embeddings per location.
- Spatial: for each class $k$, its class embeddings across all spatial positions (together with the class token) form the input sequence to the spatial encoder.
Maintaining class-wise isolation in spatial attention (i.e., forbidding cross-class interactions) further improves class separation and feature clarity. Empirical results show that the combination of multiple class tokens and date-specific encodings delivers the highest reported mIoU scores (83.6% on the Germany dataset) (Tarasiou et al., 2023).
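A compact sketch of the temporal stage combining a date-indexed embedding table with $K$ prepended class tokens follows; the table size (366 calendar days), module name, and shapes are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class TemporalStage(nn.Module):
    """Temporal encoder with date-indexed positional encodings and K class tokens.

    Illustrative sketch: the lookup-table size (366 calendar days) and all names
    are assumptions, not taken from the reference implementation.
    """
    def __init__(self, dim: int, heads: int, depth: int, num_classes: int, num_dates: int = 366):
        super().__init__()
        self.date_pos = nn.Embedding(num_dates, dim)                  # P_T[day-of-year]
        self.cls_tokens = nn.Parameter(torch.zeros(num_classes, dim)) # K class tokens
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens: torch.Tensor, days: torch.Tensor) -> torch.Tensor:
        # tokens: (B*N, T, d) temporal sequences; days: (B*N, T) integer day-of-year indices
        x = tokens + self.date_pos(days)                              # date-specific positional encoding
        cls = self.cls_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                                # prepend K class tokens
        K = self.cls_tokens.size(0)
        return self.encoder(x)[:, :K]                                 # (B*N, K, d): one embedding per class

out = TemporalStage(dim=128, heads=4, depth=4, num_classes=5)(
    torch.randn(288, 16, 128), torch.randint(0, 366, (288, 16)))
print(out.shape)  # torch.Size([288, 5, 128])
```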
4. Advanced Multi-Modal and Multi-Stage Variants
A distinct but related approach is the Multi-Modal Spatial-Temporal Vision Transformer (MMST-ViT) for crop-yield regression, which processes satellite imagery and both short- and long-term meteorological data (Lin et al., 2023). Its architecture comprises:
- Multi-Modal Transformer: Jointly encodes visual patch embeddings and per-grid local meteorology using a Pyramid Vision Transformer (PVT) backbone, where visual tokens attend to weather keys/values via multi-modal multi-head attention (MHA); see the sketch after this list.
- Spatial Transformer: Aggregates grid-level temporal representations to a compact sequence using spatial MHA.
- Temporal Transformer: Models long-range dependency over time steps, with an additional learned bias to encode long-term climate effects.
- Self-supervised pre-training: Employs a SimCLR-style contrastive loss on random augmentations of input pairs to mitigate overfitting, addressing the limited labeled data in agricultural yield prediction.
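As referenced above, the following sketch shows the cross-attention pattern in which visual patch tokens act as queries and meteorological tokens supply keys and values. The module name, normalization placement, and dimensions are illustrative assumptions, not MMST-ViT's exact design.

```python
import torch
import torch.nn as nn

class MultiModalAttention(nn.Module):
    """Cross-attention in which visual patch tokens attend to weather tokens.

    Generic sketch of the pattern described above (queries from imagery,
    keys/values from meteorology); not the exact MMST-ViT module.
    """
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_v = nn.LayerNorm(dim)
        self.norm_w = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual: torch.Tensor, weather: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, d) patch tokens; weather: (B, Nw, d) meteorological tokens
        q = self.norm_v(visual)
        kv = self.norm_w(weather)
        fused, _ = self.attn(q, kv, kv)   # visual queries, weather keys/values
        return visual + fused             # residual connection

fused = MultiModalAttention(dim=128, heads=4)(torch.randn(2, 49, 128), torch.randn(2, 12, 128))
print(fused.shape)  # torch.Size([2, 49, 128])
```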
This broader formulation illustrates how ViT-g-style models extend to integrate complementary spatio-temporal and non-imaging modalities.
5. Empirical Performance and Benchmarks
ViT-g and MMST-ViT achieve state-of-the-art results in several real-world SITS and crop-yield benchmarks:
| Dataset | Task (metric) | ViT-g | Previous best | MMST-ViT (Pearson corr.) | Baseline (Pearson corr.) |
|---|---|---|---|---|---|
| Germany | Segmentation (mIoU, %) | 84.8 | 77.1 | — | — |
| Germany | Classification (mAcc, %) | 88.1 | 82.2 | — | — |
| PASTIS | Segmentation (mIoU, %) | 65.1 | 63.1 | — | — |
| T31TFM | Segmentation (mIoU, %) | 63.1 | 60.7 | — | — |
| Corn | Yield regression (corr.) | — | — | 0.900 | ≤ 0.854 |
| Soybean | Yield regression (corr.) | — | — | 0.918 | ≤ 0.865 |
Reported metrics include overall accuracy, mean intersection over union (mIoU) for segmentation, mean accuracy (mAcc) for classification, and Pearson correlation for yield regression (Tarasiou et al., 2023, Lin et al., 2023).
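For reference, the segmentation metric averages intersection over union across the $K$ classes:
$$\mathrm{mIoU} = \frac{1}{K}\sum_{k=1}^{K} \frac{|P_k \cap G_k|}{|P_k \cup G_k|} = \frac{1}{K}\sum_{k=1}^{K} \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k},$$
where $P_k$ and $G_k$ denote the predicted and ground-truth pixel sets for class $k$.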
6. Design Principles and Transferable Insights
Empirical studies and ablation analyses establish several transferable principles for generalized ViT-g architectures:
- Temporal-first factorization should be preferred when temporally-varying signals dominate at fixed spatial locations.
- Small spatial patches are critical for preserving fine structure, particularly in segmentation tasks.
- Acquisition-date-specific temporal encoding is essential for robustness to calendar irregularity.
- Multiple class tokens prevent destructive class mixing and increase per-class model capacity.
- Isolation of class tokens in the spatial attention module sharpens class-specific predictions (Tarasiou et al., 2023).
These principles underpin the current best practices for constructing ViT-g models for dynamic spatial data contexts.
7. Computational Complexity and Scalability
The dominant computational costs in ViT-g and MMST-ViT stem from the quadratic scaling of multi-head attention in temporal/spatial sequence lengths. However, careful choice of spatial reduction ratios (in PVT backbones), limited patch/grid counts, and shallow encoder stacks ensure practical scalability:
- Attention over patches/grids at each stage: the temporal pass costs $\mathcal{O}(T^2 d)$ per spatial position and the spatial pass costs $\mathcal{O}(N^2 d)$ per class, so the factorized design scales roughly as $\mathcal{O}(N T^2 + K N^2)$ rather than the $\mathcal{O}((N T)^2)$ of joint spatio-temporal attention (see the arithmetic sketch after this list).
- PVT and similar efficient visual backbones further control the memory and runtime cost, enabling deployment on standard GPUs.
- Pre-training with large batches (e.g., 512 pairs) enables robust representation learning for small-to-moderate data regimes (Lin et al., 2023).
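As a back-of-the-envelope comparison of the two attention layouts (the sequence lengths below are illustrative assumptions, not values from either paper):

```python
# Illustrative attention-pair counts for factorized vs. joint spatio-temporal attention.
# The sequence lengths below are assumptions chosen for illustration only.
T, N, K = 16, 144, 20      # dates, spatial patches per frame, target classes

joint = (N * T) ** 2                   # single attention over all N*T tokens
factorized = N * T ** 2 + K * N ** 2   # temporal pass per position + spatial pass per class

print(f"joint:      {joint:>12,d} token pairs")       # 5,308,416
print(f"factorized: {factorized:>12,d} token pairs")  # 451,584
print(f"ratio:      {joint / factorized:.1f}x fewer pairs with factorization")
```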
A plausible implication is that further optimizations in attention sparsity or hierarchical designs could extend ViT-g to even larger spatio-temporal domains.