Video Swin Transformer (VST)

Updated 11 May 2026
  • The paper introduces Video Swin Transformer (VST), a pure Transformer-based video backbone that leverages hierarchical spatio-temporal shifted window attention for enhanced action recognition.
  • It reports state-of-the-art accuracy on Kinetics-400, Kinetics-600, and Something-Something v2 at substantially lower computational cost than global attention models; the 3D shifted-window scheme alone contributes a +0.7% top-1 gain on Kinetics-400.
  • Transfer learning is streamlined by inflating pretrained Swin image weights to 3D, enabling efficient adaptation to video tasks with reduced memory and faster convergence.

The Video Swin Transformer (VST) is a pure Transformer-based video backbone that introduces hierarchical spatio-temporal windowed self-attention to efficiently and accurately model video data. VST adapts the Swin Transformer design from the image domain to handle video inputs, leveraging 3D shifted windows for local self-attention and hierarchical token aggregation. This architecture achieves state-of-the-art results on major video recognition benchmarks with significant improvements in computational efficiency relative to global self-attention models. Its success is driven by a locality-inductive bias, strategic transfer of image-domain pretrained weights, and empirical design choices validated through ablation and transfer-learning studies (Liu et al., 2021, Oliveira et al., 2022).

1. Spatio-Temporal Hierarchical Architecture

VST processes a tensor input of shape $T \times H \times W \times 3$, where $T$ is the number of frames and $H, W$ are the spatial dimensions (Liu et al., 2021). The input is divided into non-overlapping 3D patches of size $2 \times 4 \times 4$ (temporal × height × width), and each patch is flattened and linearly projected to a $C$-dimensional embedding. After this operation, the sequence comprises $\frac{T}{2} \times \frac{H}{4} \times \frac{W}{4}$ tokens.
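
The token count described above can be checked with a few lines of arithmetic. This is a minimal sketch, not the authors' code: the function name `token_grid` is hypothetical, and it assumes the input dimensions divide evenly by the patch size (a real implementation may pad instead).

```python
def token_grid(frames, height, width, patch=(2, 4, 4)):
    """Return the (time, height, width) token grid after 3D patch partitioning."""
    pt, ph, pw = patch
    # Assumption for this sketch: no padding, so sizes must divide evenly.
    if frames % pt or height % ph or width % pw:
        raise ValueError("input dimensions must be divisible by the patch size")
    return frames // pt, height // ph, width // pw


# A 32-frame clip at 224x224, the input size quoted later in this article.
grid = token_grid(32, 224, 224)
num_tokens = grid[0] * grid[1] * grid[2]
print(grid, num_tokens)
```

For a 32-frame, 224 × 224 clip this gives a 16 × 56 × 56 grid, which is the sequence the first stage operates on.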

The architecture is organized into four hierarchical stages, each comprising multiple blocks. At each transition between stages, a patch merging layer concatenates the features of each $2 \times 2$ spatial neighborhood (across height and width) and linearly projects the result, which doubles the channel dimension and halves the spatial resolution while leaving the temporal dimension unchanged. In Swin-T, for instance, the first stage has $C = 96$ channels, and the four stages use a (2, 2, 6, 2) block arrangement (Oliveira et al., 2022).
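
To make the stage hierarchy concrete, the sketch below tracks the feature-map shape through the four stages. It assumes, as in the image Swin Transformer, that each patch merging step doubles the channel count, and it starts from a 16 × 56 × 56 token grid (a 32-frame, 224 × 224 clip); `stage_shapes` is a hypothetical helper name.

```python
def stage_shapes(grid, base_channels=96, num_stages=4):
    """List (time, height, width, channels) at the output of each stage."""
    t, h, w = grid
    channels = base_channels
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging: halve height and width, keep time, double channels
            # (the doubling is assumed from the image Swin design).
            h, w = h // 2, w // 2
            channels *= 2
        shapes.append((t, h, w, channels))
    return shapes


for shape in stage_shapes((16, 56, 56)):
    print(shape)
```

Under these assumptions the temporal extent stays at 16 throughout, while the spatial grid shrinks from 56 × 56 to 7 × 7 and the channels grow from 96 to 768.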

2. 3D Shifted-Window Self-Attention Mechanism

VST employs a block that alternates between regular and shifted 3D windowed multi-head self-attention (MSA). Each block retains the "Pre-Norm" Transformer layout, but the MSA computes attention only within local 3D windows of size $P \times M \times M$ (time × height × width):

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d}} + B\right) V$$

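where $d$ is the query/key feature dimension and the relative position bias $B \in \mathbb{R}^{PM^2 \times PM^2}$ is gathered from a learnable table $\hat{B} \in \mathbb{R}^{(2P-1) \times (2M-1) \times (2M-1)}$.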
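
The lookup from a token pair to its bias entry can be sketched as follows. This assumes the usual Swin-style parameterization, in which the table holds one entry per relative offset, so each axis of a window of size $P \times M \times M$ needs $2P-1$ or $2M-1$ entries. The function names and the 8 × 7 × 7 window are illustrative assumptions.

```python
def bias_table_shape(window):
    """Number of distinct relative offsets along (time, height, width)."""
    return tuple(2 * size - 1 for size in window)


def bias_index(query_pos, key_pos, window):
    """Map a (query, key) pair inside one window to table coordinates.

    Offsets range from -(size - 1) to +(size - 1) per axis, so adding
    (size - 1) shifts them into valid non-negative indices.
    """
    return tuple(q - k + size - 1 for q, k, size in zip(query_pos, key_pos, window))


window = (8, 7, 7)  # illustrative window size (assumption)
print(bias_table_shape(window))
print(bias_index((0, 0, 0), (0, 0, 0), window))
print(bias_index((7, 6, 6), (0, 0, 0), window))
```

Because the index depends only on the offset between query and key, all token pairs with the same relative displacement share one learned bias value.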

The regular partition divides the input into non-overlapping windows, while the shifted partition offsets the grid by $\left(\frac{P}{2}, \frac{M}{2}, \frac{M}{2}\right)$ tokens, introducing cross-window communication at minimal computational cost. This shifted windowing mechanism, alternated layer-wise, enables information flow between neighboring spatio-temporal regions and empirically yields a +0.7% top-1 accuracy gain on Kinetics-400 (Liu et al., 2021).
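
The effect of shifting can be illustrated by computing which window each token falls into. The sketch below assumes a half-window shift realized as a cyclic roll of the token grid, which is how Swin-style implementations commonly do it; `window_id`, the 16 × 56 × 56 grid, and the 8 × 7 × 7 window are illustrative assumptions.

```python
def window_id(pos, window, shift=(0, 0, 0), grid=(16, 56, 56)):
    """Window index of the token at `pos` after cyclically shifting the grid."""
    return tuple(((p - s) % g) // w for p, s, g, w in zip(pos, shift, grid, window))


window = (8, 7, 7)
shift = tuple(w // 2 for w in window)  # half-window shift: (4, 3, 3)

# Two neighboring tokens that sit on opposite sides of a window boundary.
a, b = (7, 6, 6), (8, 7, 7)

print(window_id(a, window), window_id(b, window))                # different windows
print(window_id(a, window, shift), window_id(b, window, shift))  # same window
```

Tokens separated by a window boundary in one layer share a window in the next, which is the cross-window connection the text describes. Implementations that use a cyclic roll also typically mask attention between tokens that become neighbors only through the wrap-around; that masking is omitted here.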

Computational Complexity

Global attention across the entire video input incurs $O((T'N)^2)$ complexity, with $N$ spatial tokens and $T'$ temporal tokens after patch embedding. By contrast, windowed attention restricts each token to its own $P \times M \times M$ window and reduces the cost to $O(T'N \cdot PM^2)$. As $P \le T'$ and $M^2 \ll N$, this results in substantial reductions in FLOPs compared to models utilizing global space-time attention (e.g., TimeSformer, ViViT) at comparable accuracy.
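
A rough count of query-key pairs shows the size of the saving. This is a back-of-the-envelope sketch that ignores channel dimensions, projections, and later stages; the grid (16 × 56 × 56) and window (8 × 7 × 7) are illustrative assumptions, and `attention_pair_counts` is a hypothetical helper.

```python
def attention_pair_counts(grid, window):
    """Return (global, windowed) numbers of query-key pairs for one layer."""
    tokens = grid[0] * grid[1] * grid[2]
    window_tokens = window[0] * window[1] * window[2]
    global_pairs = tokens ** 2               # every token attends to every token
    windowed_pairs = tokens * window_tokens  # every token attends within its window
    return global_pairs, windowed_pairs


global_pairs, windowed_pairs = attention_pair_counts((16, 56, 56), (8, 7, 7))
print(global_pairs, windowed_pairs, global_pairs // windowed_pairs)
```

The ratio equals the number of tokens divided by the window volume, so under these assumptions the attention map at the first stage is 128 times smaller than with global attention.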

3. Model Initialization and Pretraining

Due to structural alignment with the Swin-image backbone, VST inherits almost all weights from image-domain models by inflating dimensions as follows:

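  • The initial linear embedding expands from $48 \times C$ for 2D patches to $96 \times C$ for 3D patches by duplicating the pretrained weights along the temporal axis and scaling them by 0.5.
  • The 2D relative-bias table of shape $(2M-1) \times (2M-1)$ is expanded to the 3D shape $(2P-1) \times (2M-1) \times (2M-1)$ by duplicating it $2P-1$ times along the time axis.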
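
The two inflation steps above can be sketched with plain lists. This is an illustration under stated assumptions rather than the reference implementation: weights are stored as rows of an (input dimension × C) matrix, the scale factor is taken as one over the temporal patch size, and all function names are hypothetical.

```python
def inflate_embedding(w2d, temporal_patch=2):
    """Stack `temporal_patch` copies of a 2D embedding matrix, scaled to keep outputs equal."""
    scale = 1.0 / temporal_patch
    return [[v * scale for v in row] for _ in range(temporal_patch) for row in w2d]


def inflate_bias_table(table2d, temporal_window):
    """Repeat a 2D relative-bias table along a new leading time-offset axis."""
    return [[row[:] for row in table2d] for _ in range(2 * temporal_window - 1)]


def project(x, w):
    """Linear projection of a flattened patch x with weight matrix w."""
    return [sum(x[i] * w[i][c] for i in range(len(x))) for c in range(len(w[0]))]


# Toy 2D embedding: 3 input values per patch, C = 2 output channels.
w2d = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
patch = [0.5, -1.0, 2.0]

w3d = inflate_embedding(w2d)
# A 3D patch made of two identical frames gives the same embedding as the 2D patch.
print(project(patch, w2d), project(patch + patch, w3d))
print(len(inflate_bias_table([[0.0] * 13 for _ in range(13)], temporal_window=8)))
```

The scaling matters: without it, a static clip would produce embeddings twice as large as those the pretrained image model saw, which would disturb the statistics the later layers expect.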

VST backbones are initialized from Swin image models pretrained on ImageNet-1K or ImageNet-21K, facilitating data- and compute-efficient transfer to the video domain. Training on video uses AdamW, cosine learning rate decay, a 2.5-epoch warm-up, and 32-frame input clips sampled with temporal stride 2 at a spatial size of $224 \times 224$. For action datasets such as Kinetics-400, 30 epochs are typical; for datasets that demand more temporal modeling, e.g., Something-Something v2, schedules run for 60 epochs with additional augmentation and regularization (RandAugment, label smoothing, random erasing, and a stochastic depth rate of 0.4) (Liu et al., 2021).
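
The learning-rate schedule described above can be written as a small function. This sketch assumes a linear warm-up (the text only states its length) and decay to zero; the base learning rate of 1e-3 is a placeholder value, not a figure taken from this article.

```python
import math


def learning_rate(epoch, base_lr=1e-3, warmup_epochs=2.5, total_epochs=30):
    """Linear warm-up followed by cosine decay; `epoch` may be fractional."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 0 to base_lr.
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 1.25, 2.5, 16.25, 30):
    print(epoch, learning_rate(epoch))
```

Passing `total_epochs=60` gives the longer schedule mentioned for Something-Something v2.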

4. Empirical Performance and Ablation Analyses

VST achieves state-of-the-art accuracy on several benchmarks, with significant gains in efficiency and model compactness:

| Dataset | Model / Setup | Top-1 Accuracy | Params / FLOPs | Data / Efficiency |
|---|---|---|---|---|
| Kinetics-400 | Swin-L (384↑), 21K pretrain | 84.9% | 200M, 2.1T FLOPs | 20× less pretraining data, 3× smaller than ViViT |
| Kinetics-600 | Swin-L (384↑), 21K pretrain | 86.1% | 200M, 2.1T FLOPs | |
| Something-Something v2 | Swin-B, K400 finetune | 69.6% | 88.8M, 321G FLOPs | +0.9% over MViT-B-24 |

Ablation studies demonstrate:

  • Joint spatio-temporal attention reaches 78.8% top-1 on K400 (88 GFLOPs), outperforming split or factorized alternatives.
  • Reducing the temporal window size trades a small accuracy drop (79.1% to 78.6%) for substantial computational savings (106 to 79 GFLOPs).
  • Shifted windows consistently outperform spatial-only or no-shift configurations.

5. Transfer Learning Across Domains

A comprehensive transfer-learning study evaluates VST's ability to generalize to datasets outside its pretraining regime (Oliveira et al., 2022). Using a Swin-T VST variant pretrained on Kinetics-400:

  • FCVID (object-centric, avg. 167s videos): Achieves 85% top-1 accuracy, matching state-of-the-art AdaFocus V2, with only the final layer retrained.
  • Something-Something v1 (fine-grained actions): Achieves 21%, far below the 69.6% reported above for a fully trained model on Something-Something v2.

Transfer learning involves freezing the backbone and retraining only the classification head, yielding a ~4× reduction in GPU memory and convergence in tens of minutes on a single GPU, compared to days for full-model training. Inference cost remains unchanged. This approach generalizes well when the target task is object-centric and classes align with those from pretraining but fails on action-centric, relationally complex datasets.
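
The freeze-and-retrain recipe can be illustrated on a toy problem. The sketch below is not VST: `frozen_backbone` is a hypothetical stand-in for a pretrained feature extractor, the data are made up, and the head is a plain logistic regression trained by gradient descent. The point is only that features are computed once and nothing but the head is updated.

```python
import math


def frozen_backbone(x):
    """Hypothetical stand-in for a pretrained backbone: a fixed feature map."""
    return [x[0] + x[1], x[0] - x[1]]


def train_head(features, labels, lr=0.5, steps=100):
    """Fit a logistic-regression head with full-batch gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for f, y in zip(features, labels):
            logit = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for i, fi in enumerate(f):
                grad_w[i] += (p - y) * fi / n
            grad_b += (p - y) / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(f, w, b):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


inputs = [(1, 1), (2, 0), (-1, -1), (0, -2)]  # made-up samples
labels = [1, 1, 0, 0]
features = [frozen_backbone(x) for x in inputs]  # computed once, never updated
w, b = train_head(features, labels)
print([predict(f, w, b) for f in features])
```

Since the backbone output never changes, features can be cached and no gradients need to be stored for the backbone, which is consistent with the memory and training-time savings reported above.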

Performance degrades as input video duration increases (Spearman’s ρ = -0.848 for FCVID), attributed to the fixed 32-frame input, which cannot capture longer temporal context. Resolution has negligible effect (ρ=+0.044), while label frequency weakly correlates with accuracy (ρ=+0.334).
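
Spearman's ρ, the statistic quoted above, is the Pearson correlation computed on ranks. A minimal pure-Python sketch follows; the duration and accuracy values are invented for illustration and are not data from the study.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


durations = [10, 60, 120, 300]    # seconds, hypothetical
accuracy = [0.9, 0.8, 0.6, 0.3]   # hypothetical per-bin accuracy
print(spearman_rho(durations, accuracy))
```

A value near -1, as in this invented example, means accuracy falls almost monotonically as duration grows; the reported ρ = -0.848 indicates a strong but not perfect version of that trend.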

6. Limitations and Design Implications

Several limitations and failure modes are identified:

  • Temporal Sampling Constraint: Fixed short clip length can result in loss of essential context for long or complex videos. Adaptive sampling or multi-clip ensembling is recommended for such settings.
  • Domain Transfer Gap: VST pretrained on object-centric data does not generalize to action-centric datasets, likely due to self-attention features capturing primarily appearance cues rather than fine-grained spatio-temporal relations. Proposed remedies include explicit modeling of spatial relations or relative motion.
  • Class Hierarchy Confusion: In multi-label, hierarchically-structured data (e.g., FCVID), confusion arises within subclasses with imbalanced representation (e.g., amateur vs. professional “soccer”). Incorporating class hierarchy into loss functions or model structure may address this issue (Oliveira et al., 2022).
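
Multi-clip ensembling, one of the remedies suggested above for long videos, can be sketched as evenly spaced clip sampling followed by score averaging. This is one simple scheme among several possible ones; the function names and the choice of four clips are assumptions.

```python
def clip_starts(num_frames, clip_len=32, stride=2, num_clips=4):
    """Start frames of `num_clips` clips spread evenly across the video."""
    span = (clip_len - 1) * stride + 1      # frames covered by one clip
    max_start = max(num_frames - span, 0)   # last valid start position
    if num_clips == 1:
        return [max_start // 2]             # single centered clip
    return [round(i * max_start / (num_clips - 1)) for i in range(num_clips)]


def average_scores(per_clip_scores):
    """Average class scores over clips to obtain one video-level prediction."""
    n = len(per_clip_scores)
    return [sum(column) / n for column in zip(*per_clip_scores)]


print(clip_starts(300))
print(average_scores([[1.0, 3.0], [3.0, 5.0]]))
```

Each clip still has the fixed 32-frame length the model expects, so the backbone is unchanged; only the number of forward passes per video grows.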

7. Implementation and Reproducibility Considerations

Reproducible implementation is supported by clear specification of core hyperparameters:

| Hyperparameter | Typical Value | Note |
|---|---|---|
| Backbone | Swin-T, Swin-B, Swin-L | C=96, (2,2,6,2) layers (Swin-T example) |
| Patch Size | $2 \times 4 \times 4$ | Time × Height × Width |
| Window Size | e.g., $8 \times 7 \times 7$ | 3D windows |
| Batch Size | 64 | |
| Input Frames | 32 | Temporal stride: 2 |
| Spatial Crop | $224 \times 224$ | |
| Optimizer | AdamW, weight decay ≈0.05 | |
| LR Schedule | Cosine decay, warm-up 2.5 epochs | |
| Augmentations | RandAugment, label smoothing, random erasing, stochastic depth | For SSv2 |
| Training Epochs | 30 (K400), 60 (SSv2) | |

The forward pass partitions tokens into 3D windows, applies (optionally shifted) windowed attention followed by an MLP, and alternates regular and shifted windows from one layer to the next (Liu et al., 2021).
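
The layer-wise alternation can be laid out as a schedule of shift sizes per block. The sketch assumes that every second block within a stage uses a half-window shift and takes the Swin-T depths (2, 2, 6, 2) with an illustrative 8 × 7 × 7 window; `block_schedule` is a hypothetical helper, not part of any library.

```python
def block_schedule(depths=(2, 2, 6, 2), window=(8, 7, 7)):
    """List (stage, block, shift) for every block in the backbone."""
    half_window = tuple(w // 2 for w in window)
    schedule = []
    for stage, depth in enumerate(depths, start=1):
        for block in range(depth):
            # Even blocks use the regular partition, odd blocks the shifted one.
            shift = half_window if block % 2 == 1 else (0, 0, 0)
            schedule.append((stage, block, shift))
    return schedule


for entry in block_schedule():
    print(entry)
```

With even depths in every stage, each regular block is paired with a shifted one, so half of the twelve Swin-T blocks use the shifted partition.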

In summary, the Video Swin Transformer combines a hierarchical vision Transformer design with efficient local spatio-temporal self-attention, delivering leading accuracy and efficiency on standard video recognition benchmarks. The transfer-learning results also point to open problems, notably adapting to longer temporal context and learning representations that generalize across domains (Liu et al., 2021; Oliveira et al., 2022).
