ViViT: Video Vision Transformer
- ViViT is an advanced deep learning architecture that extends the Vision Transformer to capture both spatial and temporal information using tubelet-based 3D convolution embeddings.
- It integrates MSI and SAR data early in the pipeline, yielding significant reductions in reconstruction error while preserving local spatio-temporal coherence.
- The model employs multi-head self-attention over tubelet tokens, ensuring precise feature extraction and robust video representation for tasks like cloud-robust remote sensing.
The Video Vision Transformer (ViViT) is an advanced deep learning architecture designed to perform spatio-temporal fusion for video-based tasks by extending the standard Vision Transformer (ViT) framework to explicitly integrate both spatial and temporal dimensions. ViViT leverages a tubelet-based 3D convolutional embedding to efficiently encode local spatio-temporal information, and incorporates self-attention mechanisms to model both intra- and inter-tubelet dependencies, enabling robust video representation learning. In the context of multispectral image (MSI) time series reconstruction under cloud coverage, ViViT serves as the core of a hybrid MSI-SAR fusion strategy, directly addressing limitations of prior ViT methods that rely on coarse temporal embeddings and suffer from cross-temporal information blurring (Wang et al., 10 Dec 2025).
1. Tubelet Extraction and Spatio-Temporal Embedding
ViViT decomposes an input video or time-series volumetric stack $X \in \mathbb{R}^{T \times C \times H \times W}$ (where $T$ is the number of temporal frames, $C$ the number of channels, and $H \times W$ the spatial dimensions) into a sequence of non-overlapping “tubelets.” Each tubelet spans a fixed temporal window ($t$ frames) and a fixed spatial patch ($p \times p$ pixels). The extraction is accomplished via a 3D convolutional layer with kernel size $(t, p, p)$ and stride $(t, p, p)$, resulting in tubelets indexed by block $(i, j, k)$:

$$x_{i,j,k} = X[\, it : (i{+}1)t,\ :,\ jp : (j{+}1)p,\ kp : (k{+}1)p \,].$$

This operation produces $N = \tfrac{T}{t} \cdot \tfrac{H}{p} \cdot \tfrac{W}{p}$ tubelets, each with local spatio-temporal coherence and minimal cross-period mixing. Tubelets are projected to $d$-dimensional feature vectors via a linear layer, optionally followed by additional normalization and feature projection.
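The following is a minimal PyTorch sketch of the tubelet embedding, assuming a tubelet size of $t \times p \times p$ with non-overlapping strides as described above; the class name, embedding dimension, and default hyperparameters are illustrative rather than the authors' released configuration:

```python
import torch
import torch.nn as nn


class TubeletEmbedding(nn.Module):
    """Extract non-overlapping t x p x p tubelets with a 3D convolution and project them to d-dim tokens."""

    def __init__(self, in_channels: int, embed_dim: int, t: int = 2, p: int = 16):
        super().__init__()
        # kernel size == stride, so tubelets do not overlap and cross-period mixing is avoided
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=(t, p, p), stride=(t, p, p))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video or time-series stack
        x = self.proj(x)                  # (B, d, T/t, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N, d) with N = (T/t)*(H/p)*(W/p)
        return x
```

Because the strided convolution both extracts and linearly projects each tubelet, a separate projection layer is only needed when additional normalization or re-projection is applied afterwards.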
2. Joint Multimodal Embedding for MSI–SAR Fusion
ViViT supports both single-modal (MSI-only) and multimodal (MSI + SAR) fusion scenarios by stacking all available channels along the channel dimension of the input volume. The 3D convolution tubelet extraction fuses MSI and SAR information at the earliest stage, so the resulting tubelet embeddings inherently encode joint spectral, spatial, and temporal cues without requiring explicit cross-modal attention layers. Each tubelet feature, after linear projection, forms part of the input sequence to the transformer.
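As a sketch of this early-fusion step (reusing the `TubeletEmbedding` class from above; the band counts and tensor sizes below are illustrative assumptions, not the dataset's actual dimensions):

```python
import torch

# Early fusion: concatenate MSI and SAR stacks along the channel axis before
# tubelet extraction, so every tubelet token already mixes both modalities.
msi = torch.randn(4, 10, 12, 128, 128)   # (B, C_msi, T, H, W): e.g. 10 spectral bands
sar = torch.randn(4, 2, 12, 128, 128)    # (B, C_sar, T, H, W): e.g. 2 polarizations

fused = torch.cat([msi, sar], dim=1)     # (B, 12, T, H, W) joint channel stack
embed = TubeletEmbedding(in_channels=12, embed_dim=256, t=2, p=16)
tokens = embed(fused)                    # (B, N, 256) joint MSI-SAR tubelet tokens
```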
3. Positional Encoding and Token Sequence Construction
Each tubelet embedding is enriched with a joint positional encoding that encodes its temporal block index and spatial patch position. For $d$-dimensional tubelet embeddings, positional encodings $p_i \in \mathbb{R}^{d}$ ensure the transformer is sensitive to both spatial and temporal context:

$$z_i = e_i + p_i, \qquad i = 1, \dots, N,$$

where $\{e_i\}_{i=1}^{N}$ is the set of tubelet embeddings. The transformer processes this sequence as a set of “tokens” representing the full video volume in a compact form.
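A minimal sketch of this step, assuming a learned positional table (whether learned or fixed sinusoidal encodings are used is an assumption here, not stated by the source):

```python
import torch
import torch.nn as nn


class AddPositionalEncoding(nn.Module):
    """Add a learned positional embedding to each tubelet token.

    Each token index corresponds to one (temporal block, spatial patch) pair,
    so a single embedding table jointly encodes temporal and spatial position.
    """

    def __init__(self, num_tokens: int, embed_dim: int):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))
        nn.init.trunc_normal_(self.pos, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, d) -> same shape, now carrying joint spatio-temporal position
        return tokens + self.pos
```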
4. Multi-Head Self-Attention Transformer Architecture
ViViT applies $L$ layers of multi-head self-attention (MHSA) to the sequence of tubelet embeddings. In each layer, queries, keys, and values are computed via learned projections $Q = ZW_Q$, $K = ZW_K$, $V = ZW_V$, and attention is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$

Multiple attention heads enable the model to aggregate information from diverse spatial and temporal positions, while the stacking of $L$ blocks captures complex interdependencies across tubelets. This pipeline maintains local temporal coherence and robust spatial representation, preventing the information loss prevalent in coarse temporal aggregation strategies.
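A compact encoder sketch built from standard PyTorch transformer blocks; the depth, number of heads, and MLP ratio are illustrative defaults rather than the paper's reported settings:

```python
import torch.nn as nn


class ViViTEncoder(nn.Module):
    """Stack of pre-norm transformer encoder blocks over the tubelet token sequence."""

    def __init__(self, embed_dim: int = 256, depth: int = 8,
                 num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=int(embed_dim * mlp_ratio),
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):
        # tokens: (B, N, d) -> attended tokens of the same shape; every head can
        # attend across all temporal blocks and spatial patches simultaneously
        return self.blocks(tokens)
```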
5. End-to-End MSI–SAR Video Reconstruction Pipeline
After transformer processing, the sequence of attended tubelet representations is reshaped to reconstruct the predicted MSI time series. The output can be supervised with mean-squared error losses against ground-truth multispectral sequences. In fusion settings, the model is trained and evaluated for spectral reconstruction quality, particularly under conditions of cloud cover that cause missing or corrupted MSI frames.
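To make the reconstruction step concrete, the sketch below maps each attended token back to its tubelet's pixel block with a linear head and supervises with MSE; this per-tubelet linear decoder is an assumption for illustration, not necessarily the paper's actual reconstruction head, and it continues the shapes and classes from the earlier sketches:

```python
import torch
import torch.nn as nn


class ReconstructionHead(nn.Module):
    """Map each attended tubelet token back to its t x p x p pixel block."""

    def __init__(self, embed_dim: int, out_channels: int, t: int = 2, p: int = 16):
        super().__init__()
        self.t, self.p, self.c = t, p, out_channels
        self.proj = nn.Linear(embed_dim, out_channels * t * p * p)

    def forward(self, tokens: torch.Tensor, T: int, H: int, W: int) -> torch.Tensor:
        B, N, _ = tokens.shape
        x = self.proj(tokens)                                 # (B, N, C*t*p*p)
        x = x.view(B, T // self.t, H // self.p, W // self.p,
                   self.c, self.t, self.p, self.p)
        x = x.permute(0, 4, 1, 5, 2, 6, 3, 7)                 # (B, C, T/t, t, H/p, p, W/p, p)
        return x.reshape(B, self.c, T, H, W)                  # reassembled MSI time series


# Continuing the earlier sketches: encode the fused tokens and supervise against
# the cloud-free MSI ground truth with a mean-squared error loss.
pos = AddPositionalEncoding(num_tokens=tokens.shape[1], embed_dim=256)
encoder = ViViTEncoder(embed_dim=256)
head = ReconstructionHead(embed_dim=256, out_channels=10)     # reconstruct the MSI bands only

pred = head(encoder(pos(tokens)), T=12, H=128, W=128)         # (B, 10, 12, 128, 128)
loss = nn.MSELoss()(pred, msi)                                # spectral reconstruction error
loss.backward()
```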
Empirical Results Table
| Model | Modalities | Temporal Embedding | MSE Reduction vs ViT Baseline |
|---|---|---|---|
| MTS-ViT | MSI only | Global (ViT) | – |
| MTS-ViViT | MSI only | Tubelet (ViViT) | 2.23% |
| SMTS-ViT | MSI + SAR | Global (ViT) | – |
| SMTS-ViViT | MSI + SAR | Tubelet (ViViT) | 10.33% |
ViViT outperforms the corresponding ViT-based architectures (MTS-ViT and SMTS-ViT) by significant margins, with a 2.23% MSE reduction in the MSI-only case and a 10.33% improvement in the MSI+SAR fusion setting (Wang et al., 10 Dec 2025).
6. Design Rationale and Comparative Advantages
Unlike prior ViT-based methods such as SMTS-ViT, which aggregate temporal information across entire sequences and induce cross-day blurring, ViViT preserves local temporal structure by constraining each tubelet to two time steps. This design limits information blending across dates, leading to more precise spectral reconstruction and preservation of sharp temporal transitions. Additionally, the 3D tubelet extraction enables robust early fusion of multi-modal inputs within a unified representation space, in contrast to image-only or sequential fusion paradigms.
7. Impact and Application Scope
The tubelet-based ViViT transformer is specifically advantageous for video and temporal image reconstruction tasks where both spatial detail and local temporal correlations are essential. Its tubelet strategy improves the reconstruction of cloud-obscured MSI for agricultural monitoring, but the method is applicable wherever high-fidelity spatio-temporal modeling from volumetric or video input, potentially across sensing modalities, is required. The reduction in reconstruction error and improved temporal smoothness in MTS-ViViT and SMTS-ViViT empirically validate the utility of this approach for cloud-robust remote sensing and more broadly for multimodal video understanding and restoration (Wang et al., 10 Dec 2025).