ViViTs: Video Vision Transformers

Updated 7 April 2026

Video Vision Transformers (ViViTs) are deep learning models that extend Vision Transformers to video by processing spatio-temporal tokens extracted from video clips.
ViViTs leverage spatio-temporal tokenization and various attention mechanisms, enabling efficient handling of tasks like action recognition, fake video detection, and low-label learning.
Variants such as ResidualViT and DualPath implement parameter-efficient adaptations that reduce computational cost while preserving high accuracy across diverse video applications.

Video Vision Transformers (ViViTs) are deep learning architectures that generalize the Vision Transformer (ViT) paradigm from image to video, operating directly on spatio-temporal token sequences derived from video clips. By leveraging various forms of spatio-temporal tokenization and transformer-based attention, ViViT models achieve state-of-the-art results on video understanding tasks such as action recognition, video classification, fake video detection, and temporally dense encoding. The ViViT approach encompasses design variants based on attention factorization, patch embedding, and regularization, as well as adaptations for computational efficiency and parameter efficiency across supervised, low-label, and adversarial settings.

1. Spatio-Temporal Tokenization and Model Architectures

ViViTs operate on video clips represented as tensors $V \in \mathbb{R}^{T \times H \times W \times C}$, where $T$ is the number of frames, $H \times W$ is the spatial resolution, and $C$ is the number of channels (typically 3 for RGB) (Arnab et al., 2021). Key to ViViT is the partitioning of the video into spatio-temporal units:

Tubelet embedding: Non-overlapping cubes (tubelets) of size $t \times h \times w$ are flattened and projected via a learned linear map $E_{3D}$, yielding tokens $z_i = E_{3D} \cdot \text{vec}(\text{tubelet}_i)$. The total number of tokens is $N = n_t \cdot n_h \cdot n_w$, with $n_t = \lfloor T / t \rfloor$, and analogously for $n_h, n_w$ (Arnab et al., 2021, Singh et al., 2022).
Token sequence and positional encoding: A learnable classification token is prepended, and positional embeddings (2D, 3D, or factorized) are added to the token vectors: $z^{0} = [z_{\text{cls}}; z_1; \dots; z_N] + p$, where $p$ denotes the positional embeddings.
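The token bookkeeping above can be made concrete with a small sketch. The clip size (32 frames at 224 x 224) and tubelet size (2 x 16 x 16) are illustrative assumptions, and the raster ordering of tokens in `token_index` is a convention chosen for this example rather than something the cited papers prescribe.

```python
def tubelet_grid(T, H, W, t, h, w):
    """Return (n_t, n_h, n_w, N) for non-overlapping t x h x w tubelets."""
    n_t, n_h, n_w = T // t, H // h, W // w
    return n_t, n_h, n_w, n_t * n_h * n_w


def token_index(frame, y, x, grid, tubelet):
    """Map a voxel (frame, y, x) to the index of the tubelet token containing it.

    Tokens are numbered in (time, height, width) raster order (an assumption).
    Voxels in the remainder left uncovered by floor division return None.
    """
    n_t, n_h, n_w, _ = grid
    t, h, w = tubelet
    i, j, k = frame // t, y // h, x // w
    if i >= n_t or j >= n_h or k >= n_w:
        return None
    return (i * n_h + j) * n_w + k


# Hypothetical clip: 32 RGB frames of 224 x 224, tubelets of 2 x 16 x 16.
tubelet = (2, 16, 16)
grid = tubelet_grid(32, 224, 224, *tubelet)
n_t, n_h, n_w, N = grid

tubelet_dim = 2 * 16 * 16 * 3   # length of vec(tubelet_i) before projection by E_3D
sequence_length = N + 1         # N tubelet tokens plus the classification token
```

With these assumed sizes the grid is 16 x 14 x 14, so a single clip already produces several thousand tokens, which is why the choice of attention pattern matters.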

Four principal ViViT encoder variants have been established (Arnab et al., 2021):

Joint Spatio-Temporal Attention: Applies global self-attention to all $N = n_t \cdot n_h \cdot n_w$ tokens per layer.
Factorized Encoder: Applies independent spatial attention blocks per frame, followed by temporal attention over frame representations.
Divided Self-Attention: Each layer applies spatial attention across patch tokens within frames, then temporal attention across the same spatial location over time.
Split-Head Attention: Attention heads are divided between spatial-only and temporal-only neighborhoods within each layer.
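A rough way to compare these variants is to count query-key pairs per attention pass. The sketch below is a simplification under stated assumptions: it ignores the classification token, the number of heads, the embedding width, and the layer counts of each design, so it indicates scaling behaviour only and is not a FLOP estimate from the cited papers.

```python
def joint_pairs(n_t, n_h, n_w):
    """Every token attends to every other token in the clip."""
    n = n_t * n_h * n_w
    return n * n


def factorized_encoder_pairs(n_t, n_h, n_w):
    """Spatial attention inside each frame, then temporal attention
    over one representation per temporal index."""
    s = n_h * n_w
    return n_t * s * s + n_t * n_t


def divided_pairs(n_t, n_h, n_w):
    """Each token attends within its own frame, then along its
    temporal column (same spatial location across time)."""
    n = n_t * n_h * n_w
    return n * (n_h * n_w) + n * n_t


# Hypothetical token grid of 16 x 14 x 14.
shape = (16, 14, 14)
costs = {
    "joint": joint_pairs(*shape),
    "factorized_encoder": factorized_encoder_pairs(*shape),
    "divided": divided_pairs(*shape),
}
ratio = costs["joint"] / costs["divided"]
```

For this assumed grid the joint pattern evaluates roughly fifteen times as many pairs as the divided pattern, and the gap widens as the number of temporal indices grows.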

Empirically, factorized and divided attention designs offer more favorable compute/parameter trade-offs on long video sequences. Recent work has further developed one-stream spatio-temporal transformer blocks, lightweight parameterization (e.g., DualPath adapters), and interleaved efficient inference schemes (e.g., ResidualViT) (Singh et al., 2022, Park et al., 2023, Soldan et al., 16 Sep 2025).

2. Regularization, Initialization, and Optimization Strategies

To ensure competitive generalization, especially on modest-sized video datasets, ViViT architectures depend on strong regularization, augmentation, and transfer from large-scale image pretraining (Arnab et al., 2021, Huang et al., 2021, Singh et al., 2022):

Data augmentation: Includes random cropping, horizontal/vertical flipping, color jitter, Mixup, CutMix, random rotation, Gaussian blur, and temporal jitter. For small datasets (e.g., violence detection), aggressive frame-level augmentation is necessary to compensate for weaker inductive bias (Singh et al., 2022).
Image-to-video weight transfer: Tubelet embeddings are initialized via inflation or central frame copying from ImageNet/Vision Transformer weights; positional embeddings are tiled or interpolated to match the video token count (Arnab et al., 2021, Huang et al., 2021).
Regularization: Utilizes stochastic depth, RandAugment, label smoothing, and dropout.
Optimization: Follows AdamW or synchronous SGD with cosine learning-rate schedules, often with linear warm-up phases. Typical batch sizes are $T$ 2 and training durations range between $T$ 3 epochs depending on task and dataset.
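The learning-rate schedule named in the optimization item (linear warm-up followed by cosine decay) can be sketched as below. The function name `lr_at` and all numeric settings are assumptions for illustration; the source does not specify warm-up length or base learning rate.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warm-up, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps, 10 of them warm-up, base learning rate 1.0.
schedule = [lr_at(s, 100, 10, 1.0) for s in range(101)]
```

In practice the same shape is usually obtained from a framework scheduler; the explicit formula is shown only to make the two phases visible.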

Careful pretraining and strong regularization are critical: standard ViT training without these measures overfits and transfers poorly to video (Arnab et al., 2021, Rahman et al., 2022).
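The two image-to-video initializations mentioned in this section, filter inflation and central-frame copying, can be illustrated on a single output channel. This is a minimal sketch with plain lists: `e2d` stands for the flattened 2D patch-embedding weights of one channel, and the helper names are hypothetical.

```python
def inflate(e2d, t):
    """Filter inflation: replicate the 2D weights over t frames, scaled by 1/t."""
    return [[wgt / t for wgt in e2d] for _ in range(t)]


def central_frame(e2d, t):
    """Central-frame init: zeros except the middle temporal slice, which copies e2d."""
    return [list(e2d) if i == t // 2 else [0.0] * len(e2d) for i in range(t)]


def embed_tubelet(e3d, tubelet):
    """Dot product of one channel's 3D weights with a tubelet of t flattened patches."""
    return sum(w * x for ws, xs in zip(e3d, tubelet) for w, x in zip(ws, xs))


e2d = [0.5, -1.0, 2.0, 0.25]            # toy 2D weights for one output channel
patch = [1.0, 2.0, 3.0, 4.0]            # toy flattened image patch
image_out = sum(w * x for w, x in zip(e2d, patch))

static_clip = [patch, patch]            # two identical frames
inflated_out = embed_tubelet(inflate(e2d, 2), static_clip)

moving_clip = [[9.0, 9.0, 9.0, 9.0], patch]   # middle slice (index 1) holds the patch
central_out = embed_tubelet(central_frame(e2d, 2), moving_clip)
```

Both initializations reproduce the image model's output at the start of training: inflation does so for a static clip, while central-frame copying initially ignores every frame except the middle one.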

3. Specialized ViViT Variants: Efficiency and Adaptation

Several derivative architectures have been proposed to address limitations of vanilla ViViT in efficiency, data efficiency, and domain adaptation:

ResidualViT (Soldan et al., 16 Sep 2025): Employs an interleaved encoding scheme, where every $T$ 4-th frame is fully encoded by a frozen ViT backbone ( $T$ 5), and intermediate frames are processed with a lightweight encoder sharing transformer weights but with only a residual token (derived from previous I-frame) and a subset of spatial tokens (retained via random, uniform, center, or motion-based selection). ResidualViT reduces computational cost by $T$ 6– $T$ 7\% and achieves comparable accuracy to full ViT across temporally dense video tasks.
DualPath adaptation (Park et al., 2023): Introduces separate spatial and temporal lightweight adapters into each transformer block of a frozen backbone. Spatial path adapters are parallel (after attention, MLP), while temporal path adapters process an artificial "grid" of downscaled consecutive frames for improved temporal modeling. DualPath achieves strong results on Kinetics-400 and other benchmarks, requiring only $T$ 8– $T$ 9M trainable parameters (out of total $H \times W$ 0– $H \times W$ 1M).
Parameter-efficient adaptation: Most DualPath parameters are dedicated to bottleneck adapters ( $H \times W$ 2), with $H \times W$ 3 (e.g., $H \times W$ 4 for $H \times W$ 5).
Empirical speed and accuracy: ResidualViT achieves $H \times W$ 6– $H \times W$ 7 speedups at inference with $H \times W$ 8 drop in accuracy on natural language video grounding, activity localization, and audio description (Soldan et al., 16 Sep 2025). DualPath matches or exceeds full fine-tuning methods at $H \times W$ 9\% parameter cost on action recognition and temporally dynamic benchmarks (Park et al., 2023).
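The interleaved encoding idea behind ResidualViT can be sketched as a frame schedule plus a token count. This is not the authors' implementation: the interval, the number of retained tokens, and the helper names are assumptions, the "P" label for intermediate frames is used here only as shorthand, and token counts are a proxy rather than a measurement of compute.

```python
def frame_schedule(num_frames, interval):
    """Label frames: 'I' for fully encoded frames, 'P' for intermediate ones."""
    return ["I" if i % interval == 0 else "P" for i in range(num_frames)]


def uniform_token_subset(tokens_per_frame, keep):
    """Evenly spaced token indices, one simple form of uniform selection."""
    step = tokens_per_frame / keep
    return [int(i * step) for i in range(keep)]


def tokens_processed(num_frames, interval, tokens_per_frame, keep):
    """Total tokens fed to the encoder under the interleaved scheme.

    Intermediate frames use `keep` spatial tokens plus one residual token.
    """
    per_frame = {"I": tokens_per_frame, "P": keep + 1}
    return sum(per_frame[kind] for kind in frame_schedule(num_frames, interval))


# Hypothetical setting: 6 frames, every 3rd fully encoded, 196 tokens per frame,
# 49 spatial tokens retained on intermediate frames.
labels = frame_schedule(6, 3)
subset = uniform_token_subset(196, 49)
interleaved = tokens_processed(6, 3, 196, 49)
dense = 6 * 196
```

Because self-attention cost grows faster than linearly in the number of tokens, the reduction in compute for intermediate frames would be larger than the reduction in token count alone suggests.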

4. Application Domains

ViViTs have demonstrated state-of-the-art or highly competitive results on diverse video tasks:

Action Recognition: On Kinetics-400, ViViT-L/16 (factorized FE) achieves $C$ 0 top-1 (IN21k pretrain), exceeding 3D CNNs such as SlowFast and X3D-XXL. On EpicKitchens-100, fine-tuned ViViTs achieve $C$ 1 action, $C$ 2 verb, and $C$ 3 noun accuracy (Arnab et al., 2021).
Violence Detection: A single-stream, spatio-temporal augmented ViViT (8 transformer layers, $C$ 4) achieves $C$ 5 accuracy on Hockey Fight and $C$ 6 on Violent Crowd, outperforming specialized CNN+LSTM, SVM, and Fisher Vector systems (Singh et al., 2022).
Fake Video Detection: A framework leveraging a frozen video ViT backbone with a lightweight Q-Former head/globally pooled prototype scoring achieves robust detection under heavy compression and generalizes across diverse generative video pipelines, maintaining AUC $C$ 7 under CRF $C$ 8 (Battocchio et al., 29 Apr 2025).
Low-Labeled Regimes: ViViT models (with or without attention localization) outperform semi-supervised CNNs leveraging unlabeled data, e.g., achieving $C$ 9 (ViViT-FE) vs $t \times h \times w$ 0 (C2D+NL-R50) on Kinetics-400 with only $t \times h \times w$ 1 labeled data (Rahman et al., 2022).
Temporally Dense Encoding: ResidualViT secures up to $t \times h \times w$ 2 computational savings on frame-level temporal video reasoning at minimal accuracy loss (Soldan et al., 16 Sep 2025).

5. Analysis, Ablation Studies, and Empirical Findings

Empirical results highlight the following core observations:

Inductive biases and pretraining: While ViViT lacks the strong locality bias of 3D CNNs, large-scale image pretraining (e.g., ImageNet-1K/21k) compensates, especially in low-label and transfer settings (Rahman et al., 2022).
Attention structure trade-offs: Factorized attention (spatial then temporal) lowers attention compute without significantly degrading performance versus joint global attention, especially for longer or higher-resolution inputs (Arnab et al., 2021).
Data augmentation: Heavy augmentation is essential on small datasets for both generalization and preventing representation collapse. Aggressive strategies (color jitter, rotation, Gaussian blur, temporal jitter) are key for specialized domains like violence detection (Singh et al., 2022).
Parameter and FLOPs efficiency: Adapter-based methods (DualPath, ST-Adapter) and token-reduction schemes (ResidualViT) substantially cut training and inference costs (e.g., DualPath on ViT-L/14 requires only $t \times h \times w$ 3M trainable out of $t \times h \times w$ 4M total parameters, with $t \times h \times w$ 5 K400 top-1) (Park et al., 2023).
Ensembling: For action recognition, fusion with advanced convnets (e.g., SlowFast, ir-CSN-152) yields the best test-time metrics, compensating for ViViT's modest verb prediction weakness with convolutional temporal features (Huang et al., 2021).
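The parameter-efficiency point above can be made concrete by counting the weights of a generic bottleneck adapter (a down-projection followed by an up-projection, each with a bias). The width, bottleneck size, block count, and adapters per block below are hypothetical and are not DualPath's published configuration.

```python
def bottleneck_adapter_params(d_model, bottleneck):
    """Weights and biases of a down-projection (d -> r) and an up-projection (r -> d)."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


def total_adapter_params(d_model, bottleneck, num_blocks, adapters_per_block):
    """Trainable adapter parameters when the backbone itself stays frozen."""
    return num_blocks * adapters_per_block * bottleneck_adapter_params(d_model, bottleneck)


# Hypothetical configuration: width 768, bottleneck 128, 12 blocks, 2 adapters each.
per_adapter = bottleneck_adapter_params(768, 128)
trainable = total_adapter_params(768, 128, 12, 2)
```

The count scales linearly with the bottleneck size, so shrinking the bottleneck is the most direct lever for reducing trainable parameters in this kind of design.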

Key ablations confirm (1) DualPath's spatial and temporal adapters are strongly complementary; (2) increased input resolution and stronger augmentations benefit "noun" accuracy more than "verb" (i.e., entity/object vs motion), and (3) mid-size tubelets (e.g., $t \times h \times w$ 6 for violence detection) balance accuracy and compute (Park et al., 2023, Singh et al., 2022).

6. Limitations, Open Problems, and Future Directions

Despite their success, current ViViT designs exhibit several limitations:

Compute and memory: Global attention over very long token sequences ( $t \times h \times w$ 7) remains resource intensive. Factorized and local-windows variants mitigate but do not fully resolve this for high-fps video or long clips (Arnab et al., 2021, Soldan et al., 16 Sep 2025).
Temporal granularity: Frame stacking or grid "frameset" methods (as in DualPath) trade spatial resolution for temporal coverage, and extreme grid sizes may degrade fine-grained temporal reasoning (Park et al., 2023).
Verb accuracy/motion modeling: On action datasets requiring subtle motion discrimination, even advanced ViViTs tend to lag behind state-of-the-art CNNs on "verb" prediction, while excelling at "noun" (entity) identification (Huang et al., 2021).
Domain adaptation: Performance on synthetic/real video mixtures or new generative methods is sensitive to the backbone architecture and pretraining data. Ongoing work explores more robust backbone/fusion and domain-adaptive strategies for evolving video generation distributions (Battocchio et al., 29 Apr 2025).
Scalability: While parameter-efficient (adapter-based) fine-tuning and token selection help scale ViViTs to resource-constrained settings, further work is needed on sparse attention and cross-modal video-language adaptation.
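The temporal-granularity trade-off noted in this list can be quantified with a small sketch. It assumes a fixed square canvas onto which a g x g grid of downscaled frames is tiled, and a square patch size; the function name and all sizes are illustrative and do not reproduce any specific published configuration.

```python
def grid_tradeoff(canvas, grid, patch):
    """Per-frame resolution and whole patches per frame for a grid x grid frameset."""
    frame_side = canvas // grid
    patches_per_side = frame_side // patch
    return {
        "frames_covered": grid * grid,
        "frame_resolution": (frame_side, frame_side),
        "patches_per_frame": patches_per_side * patches_per_side,
    }


# Hypothetical 224 x 224 canvas with 14 x 14 patches.
coarse = grid_tradeoff(224, 2, 14)   # fewer frames, more spatial detail per frame
fine = grid_tradeoff(224, 4, 14)     # more frames, less spatial detail per frame
```

Doubling the grid size quadruples temporal coverage while cutting the patches available to each frame by a factor of four, which is consistent with the observation that extreme grid sizes may hurt fine-grained reasoning.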

7. Conclusion and Impact

ViViT and derived Video Vision Transformers have established a new paradigm in video understanding by extending the scalability, generalization, and non-local receptive field of pure transformers to spatio-temporal domains. Through flexible tokenization, attention factorization, and careful regularization, ViViT matches or surpasses state-of-the-art results on standard, low-label, adversarial, and temporally dense encoding tasks. Later developments, including DualPath and ResidualViT, illustrate how adaptation layers and token-efficient inference support broad practical applicability, while empirical studies clarify both strengths (low-label learning, compression robustness, semantic generalization) and ongoing challenges (temporal modeling, compute cost, high-fidelity domain shifts). This body of work continues to influence research across surveillance, activity recognition, video grounding, synthetic media detection, and foundation model adaptation (Arnab et al., 2021, Huang et al., 2021, Singh et al., 2022, Rahman et al., 2022, Park et al., 2023, Battocchio et al., 29 Apr 2025, Soldan et al., 16 Sep 2025).