Video Swin Transformer (VST)

Updated 11 May 2026

The paper introduces Video Swin Transformer (VST), a pure Transformer-based video backbone that leverages hierarchical spatio-temporal shifted window attention for enhanced action recognition.
It achieves state-of-the-art accuracy on video recognition benchmarks; the shifted-window mechanism alone contributes a +0.7% top-1 gain on Kinetics-400, and local windowed attention significantly reduces computational cost compared to global attention models.
Transfer learning is streamlined by inflating pretrained Swin image weights to 3D, enabling efficient adaptation to video tasks with reduced memory and faster convergence.

The Video Swin Transformer (VST) is a pure Transformer-based video backbone that introduces hierarchical spatio-temporal windowed self-attention to efficiently and accurately model video data. VST adapts the Swin Transformer design from the image domain to handle video inputs, leveraging 3D shifted windows for local self-attention and hierarchical token aggregation. This architecture achieves state-of-the-art results on major video recognition benchmarks with significant improvements in computational efficiency relative to global self-attention models. Its success is driven by a locality-inductive bias, strategic transfer of image-domain pretrained weights, and empirical design choices validated through ablation and transfer-learning studies (Liu et al., 2021, Oliveira et al., 2022).

1. Spatio-Temporal Hierarchical Architecture

VST processes a tensor input of shape $T \times H \times W \times 3$, where $T$ is the number of frames and $H, W$ are the spatial dimensions (Liu et al., 2021). The input is divided into non-overlapping 3D patches of size $2 \times 4 \times 4$ (time × height × width), and each patch is flattened and linearly projected to a $C$-dimensional embedding. After this operation, the sequence comprises $\frac{T}{2} \times \frac{H}{4} \times \frac{W}{4}$ tokens.
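The token count follows directly from the patch size. The sketch below works through the arithmetic for a 32-frame, 224 × 224 clip; the function name `token_grid` is illustrative, and rejecting indivisible inputs is a simplifying assumption (an implementation could pad instead).

```python
def token_grid(frames, height, width, patch=(2, 4, 4)):
    """Return the (T', H', W') token grid after non-overlapping 3D patch embedding."""
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        # Assumption for this sketch: reject inputs that do not tile evenly.
        raise ValueError("input dimensions must be divisible by the patch size")
    return frames // pt, height // ph, width // pw


grid = token_grid(32, 224, 224)
num_tokens = grid[0] * grid[1] * grid[2]
patch_dim = 2 * 4 * 4 * 3  # raw values per patch before projection to C dimensions
print(grid, num_tokens, patch_dim)  # (16, 56, 56) 50176 96
```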

The architecture is organized into four hierarchical stages, with each stage comprising multiple blocks. At each transition between stages, a patch merging layer concatenates the features of each $2 \times 2$ spatial neighborhood (across height and width) and projects them linearly, which halves the spatial resolution and doubles the channel dimensionality while leaving the temporal dimension unchanged. In Swin-T, for instance, the first stage uses $C = 96$ channels, the width doubles at each subsequent stage, and the stages employ a (2, 2, 6, 2) block arrangement (Oliveira et al., 2022).
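The stage-by-stage shapes can be tabulated with a few lines of arithmetic. This is a sketch under the assumption that, as in the image Swin Transformer, each patch merging step halves height and width and doubles the channel width; `stage_shapes` is an illustrative name.

```python
def stage_shapes(grid, base_channels=96, depths=(2, 2, 6, 2)):
    """List token grid and channel width for each stage of the hierarchy."""
    t, h, w = grid
    c = base_channels
    shapes = []
    for i, depth in enumerate(depths):
        if i > 0:
            # Patch merging: 2x2 spatial neighbours are concatenated (4C)
            # and projected to 2C; the temporal size t is left untouched.
            h, w, c = h // 2, w // 2, c * 2
        shapes.append({"stage": i + 1, "blocks": depth, "tokens": (t, h, w), "channels": c})
    return shapes


for row in stage_shapes((16, 56, 56)):
    print(row)
```

For the 16 × 56 × 56 token grid of a 32-frame, 224 × 224 clip, the last stage ends at 16 × 7 × 7 tokens with 768 channels.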

2. 3D Shifted-Window Self-Attention Mechanism

VST employs a specialized block that alternates between regular and shifted 3D windowed multi-head self-attention (MSA). Within each block, the "Pre-Norm" Transformer sequence is retained, but the MSA calculates attention only within local 3D windows of size $P \times M \times M$ (time × height × width):

$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt d} + B\right) V$

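where the relative position bias $B$ is derived from a learnable table $\hat{B} \in \mathbb{R}^{(2P-1) \times (2M-1) \times (2M-1)}$, indexed by the relative offset between each query-key pair along the time, height, and width axes.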
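A minimal NumPy sketch of attention inside one window may help make the bias lookup concrete. It assumes a single head and a flat bias table with $(2P-1)(2M-1)^2$ entries, one per relative offset; the random table and the names `relative_position_index` and `window_attention` are stand-ins for illustration, not the reference implementation.

```python
import numpy as np


def relative_position_index(P, M):
    """Map every (query, key) pair in a P x M x M window to a bias-table entry."""
    coords = np.stack(np.meshgrid(np.arange(P), np.arange(M), np.arange(M), indexing="ij"))
    flat = coords.reshape(3, -1)                        # (3, N) with N = P*M*M
    rel = flat[:, :, None] - flat[:, None, :]           # (3, N, N) signed offsets
    rel = rel + np.array([P - 1, M - 1, M - 1])[:, None, None]  # make offsets non-negative
    return rel[0] * (2 * M - 1) ** 2 + rel[1] * (2 * M - 1) + rel[2]


def window_attention(q, k, v, bias):
    """Single-head attention in one window: softmax(q k^T / sqrt(d) + bias) v."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


P, M, d = 2, 3, 8                                       # small sizes for illustration
rng = np.random.default_rng(0)
n = P * M * M
table = rng.normal(size=(2 * P - 1) * (2 * M - 1) ** 2)  # stand-in for the learnable table
index = relative_position_index(P, M)
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
out = window_attention(q, k, v, table[index])
print(index.shape, out.shape)  # (18, 18) (18, 8)
```

Because the bias depends only on the relative offset, all diagonal entries of `index` point at the same table entry (the zero offset).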

The regular partition divides the input into non-overlapping windows, while the shifted partition offsets the grid by $\left(\frac{P}{2}, \frac{M}{2}, \frac{M}{2}\right)$ tokens, introducing cross-window communication at minimal computational cost. This shifted windowing mechanism, alternated layer-wise, enables information flow between neighboring spatio-temporal regions and empirically yields a +0.7% top-1 accuracy gain on Kinetics-400 (Liu et al., 2021).
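The two partitions can be sketched with reshapes and a cyclic roll. This example assumes a shift of half the window size along each axis and a token grid that the window tiles evenly; it omits the attention mask that a full implementation needs so that tokens wrapped around by the roll do not attend to each other.

```python
import numpy as np


def window_partition(x, window):
    """Split a (T, H, W, C) token grid into (num_windows, P*Mh*Mw, C) groups."""
    T, H, W, C = x.shape
    P, Mh, Mw = window
    x = x.reshape(T // P, P, H // Mh, Mh, W // Mw, Mw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # bring the three window-count axes to the front
    return x.reshape(-1, P * Mh * Mw, C)


def shifted_window_partition(x, window):
    """Cyclically shift the grid by half a window per axis, then partition."""
    P, Mh, Mw = window
    shifted = np.roll(x, shift=(-(P // 2), -(Mh // 2), -(Mw // 2)), axis=(0, 1, 2))
    return window_partition(shifted, window)


# Toy grid: 8 x 8 x 8 tokens, each holding its own flat index as a 1-dim feature.
tokens = np.arange(8 * 8 * 8).reshape(8, 8, 8, 1)
regular = window_partition(tokens, (4, 4, 4))
shifted = shifted_window_partition(tokens, (4, 4, 4))
print(regular.shape, shifted.shape)  # (8, 64, 1) (8, 64, 1)
```

In this toy grid, the first shifted window gathers tokens from all eight regular windows, which is how alternating the two partitions lets information cross window boundaries.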

Computational Complexity

Global attention across the entire video input incurs $O\left((N_s N_t)^2\right)$ complexity with spatial tokens $N_s = \frac{H}{4} \cdot \frac{W}{4}$ and $N_t = \frac{T}{2}$ temporal tokens. By contrast, windowed attention partitions tokens and reduces the cost to $O(N_s N_t \cdot P M^2)$. As $P \ll N_t$ and $M^2 \ll N_s$, this results in substantial reductions in FLOPs compared to models utilizing global space-time attention (e.g., TimeSformer, ViViT) at comparable accuracy.
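A rough count of query-key pairs illustrates the gap. The sketch counts only attention-matrix entries per head and layer at the first-stage token grid, ignoring projections and MLPs, and assumes an 8 × 7 × 7 window for illustration.

```python
def attention_pairs(grid, window=None):
    """Count query-key pairs per head and layer, a proxy for attention cost."""
    t, h, w = grid
    n = t * h * w
    if window is None:
        return n * n                 # global: every token attends to every token
    p, mh, mw = window
    return n * (p * mh * mw)         # windowed: every token attends within its window


grid = (16, 56, 56)                  # tokens for a 32-frame, 224 x 224 clip
global_pairs = attention_pairs(grid)
local_pairs = attention_pairs(grid, window=(8, 7, 7))
print(global_pairs // local_pairs)   # 128
```

The ratio equals the number of windows in the grid, so the saving grows with clip length and resolution while the per-window cost stays fixed.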

3. Model Initialization and Pretraining

Due to structural alignment with the Swin-image backbone, VST inherits almost all weights from image-domain models by inflating dimensions as follows:

The initial linear embedding expands from $48 \times C$ for 2D to $96 \times C$ for 3D patches by duplicating the pretrained weights along the temporal axis and scaling them by 0.5.
The 2D relative-bias table of shape $(2M-1) \times (2M-1)$ is expanded to the 3D shape $(2P-1) \times (2M-1) \times (2M-1)$ by duplication across the time axis.
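The two inflation steps can be sketched in NumPy. The array layouts are assumptions made for illustration: a convolution-style embedding kernel of shape (C, 3, 4, 4) and a single-head bias table indexed by (height offset, width offset); real checkpoints may store these tensors differently.

```python
import numpy as np


def inflate_patch_embedding(w2d, temporal_patch=2):
    """(C, 3, 4, 4) image kernel -> (C, 3, t, 4, 4) video kernel, scaled by 1/t."""
    w3d = np.repeat(w2d[:, :, None, :, :], temporal_patch, axis=2)
    return w3d / temporal_patch


def inflate_bias_table(table2d, P):
    """(2M-1, 2M-1) table -> (2P-1, 2M-1, 2M-1) by copying along the time-offset axis."""
    return np.repeat(table2d[None, :, :], 2 * P - 1, axis=0)


rng = np.random.default_rng(0)
w2d = rng.normal(size=(96, 3, 4, 4))        # stand-in for pretrained image weights
w3d = inflate_patch_embedding(w2d)

# A static clip (the same frame twice) yields the same embedding as the image model.
frame = rng.normal(size=(3, 4, 4))
clip = np.repeat(frame[:, None], 2, axis=1)  # (3, 2, 4, 4)
image_response = np.tensordot(w2d, frame, axes=3)
video_response = np.tensordot(w3d, clip, axes=4)
print(np.allclose(image_response, video_response))  # True

bias3d = inflate_bias_table(rng.normal(size=(13, 13)), P=8)
print(bias3d.shape)  # (15, 13, 13)
```

Scaling by the inverse of the temporal patch size is what keeps the inflated embedding's output on a static clip equal to the image model's output on a single frame.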

VST backbones are initialized from Swin image models pretrained on ImageNet-1K or ImageNet-21K, facilitating data- and compute-efficient transfer to the video domain. Video training uses AdamW, cosine learning rate decay, a 2.5-epoch warm-up, and 32-frame input clips sampled with temporal stride 2 at spatial size $224 \times 224$. For Kinetics-400, 30 epochs are typical; for datasets that demand stronger temporal modeling, such as Something-Something v2, schedules run 60 epochs with additional regularization and augmentation (RandAugment, label smoothing, random erasing, and a stochastic depth rate of 0.4) (Liu et al., 2021).
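The schedule and clip sampling can be written out as plain arithmetic. This is a sketch under stated assumptions: the warm-up is taken to be linear, the cosine is taken to decay to zero, and `base_lr` is a hypothetical value rather than one taken from the text.

```python
import math


def learning_rate(epoch, base_lr, total_epochs=30.0, warmup_epochs=2.5):
    """Linear warm-up followed by cosine decay to zero; epoch may be fractional."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def clip_frame_indices(start, num_frames=32, stride=2):
    """Frame indices of one clip: num_frames frames taken every `stride` frames."""
    return [start + i * stride for i in range(num_frames)]


base_lr = 1e-3  # hypothetical peak learning rate
for epoch in (0.0, 1.25, 2.5, 16.25, 30.0):
    print(epoch, learning_rate(epoch, base_lr))

indices = clip_frame_indices(start=0)
print(indices[0], indices[-1], len(indices))  # 0 62 32
```

A 32-frame clip at stride 2 therefore spans 63 consecutive source frames, which is the temporal context a single forward pass sees.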

4. Empirical Performance and Ablation Analyses

VST achieves state-of-the-art accuracy on several benchmarks, with significant gains in efficiency and model compactness:

Dataset	Model / Setup	Top-1 Accuracy	Params / FLOPs	Notes
Kinetics-400	Swin-L (384↑), ImageNet-21K pretrain	84.9%	200M params, 2.1 TFLOPs	About 20× less pretraining data and 3× smaller than ViViT
Kinetics-600	Swin-L (384↑), ImageNet-21K pretrain	86.1%	200M params, 2.1 TFLOPs	-
Something-Something v2	Swin-B, fine-tuned from Kinetics-400 weights	69.6%	88.8M params, 321 GFLOPs	+0.9% over MViT-B-24

Ablation studies demonstrate:

Joint spatio-temporal attention reaches 78.8% on Kinetics-400 at 88 GFLOPs, outperforming split or factorized alternatives.
Reducing the temporal window size trades a small accuracy loss (79.1% to 78.6%) for significant computational savings (106 to 79 GFLOPs).
3D shifted windows consistently outperform spatial-only shifting and no shifting.

5. Transfer Learning Across Domains

A comprehensive transfer-learning study evaluates VST's ability to generalize to datasets outside its pretraining regime (Oliveira et al., 2022). Using a Swin-T VST variant pretrained on Kinetics-400:

FCVID (object-centric, avg. 167s videos): Achieves 85% top-1 accuracy, matching state-of-the-art AdaFocus V2, with only the final layer retrained.
Something-Something v1 (fine-grained actions): Achieves only 21%, far below the 69.6% listed above for a fully fine-tuned Swin-B on Something-Something v2.

Transfer learning involves freezing the backbone and retraining only the classification head, yielding a ~4× reduction in GPU memory and convergence in tens of minutes on a single GPU, compared to days for full-model training. Inference cost remains unchanged. This approach generalizes well when the target task is object-centric and classes align with those from pretraining but fails on action-centric, relationally complex datasets.
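The head-only setup amounts to fitting a linear classifier on fixed backbone features. The sketch below illustrates that idea in NumPy with plain gradient descent; the synthetic features are a stand-in for frozen backbone outputs, and the optimizer, learning rate, and data are assumptions for illustration rather than the study's configuration.

```python
import numpy as np


def train_linear_head(features, labels, num_classes, lr=0.1, steps=300):
    """Fit a softmax classifier on fixed features with plain gradient descent."""
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # d(mean cross-entropy)/d(logits)
        weights -= lr * features.T @ grad             # only the head is updated
        bias -= lr * grad.sum(axis=0)
    return weights, bias


# Synthetic stand-in for features produced once by a frozen backbone.
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 8, 50
labels = np.repeat(np.arange(num_classes), per_class)
centers = 4.0 * np.eye(num_classes, dim)
features = centers[labels] + rng.normal(size=(labels.size, dim))

weights, bias = train_linear_head(features, labels, num_classes)
accuracy = np.mean((features @ weights + bias).argmax(axis=1) == labels)
print(weights.shape, accuracy)
```

Since the backbone never receives gradients, its features can be computed once and cached, which is consistent with the memory and training-time savings described above.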

Performance degrades as input video duration increases (Spearman’s ρ = -0.848 for FCVID), attributed to the fixed 32-frame input, which cannot capture longer temporal context. Resolution has negligible effect (ρ=+0.044), while label frequency weakly correlates with accuracy (ρ=+0.334).
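Spearman's ρ is the Pearson correlation of the rank-transformed values. A dependency-free sketch follows; the duration and accuracy numbers in it are made up purely to exercise the function and are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers: video duration in seconds vs. per-bucket accuracy.
durations = [30, 60, 120, 240, 480]
accuracies = [0.95, 0.90, 0.80, 0.85, 0.60]
print(spearman_rho(durations, accuracies))
```

A value near -1 means accuracy falls almost monotonically as duration grows, which is the pattern the reported ρ = -0.848 describes for FCVID.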

6. Limitations and Design Implications

Several limitations and failure modes are identified:

Temporal Sampling Constraint: Fixed short clip length can result in loss of essential context for long or complex videos. Adaptive sampling or multi-clip ensembling is recommended for such settings.
Domain Transfer Gap: VST pretrained on object-centric data does not generalize to action-centric datasets, likely due to self-attention features capturing primarily appearance cues rather than fine-grained spatio-temporal relations. Proposed remedies include explicit modeling of spatial relations or relative motion.
Class Hierarchy Confusion: In multi-label, hierarchically-structured data (e.g., FCVID), confusion arises within subclasses with imbalanced representation (e.g., amateur vs. professional “soccer”). Incorporating class hierarchy into loss functions or model structure may address this issue (Oliveira et al., 2022).
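As one way to picture the multi-clip ensembling suggested for long videos, the sketch below spreads clip start positions evenly over a video and averages per-clip class scores. The helper names, the 30 fps frame count, and the score values are hypothetical; the actual model call is left out.

```python
def clip_starts(total_frames, num_clips, num_frames=32, stride=2):
    """Evenly spaced start indices so the clips cover the whole video."""
    span = (num_frames - 1) * stride + 1          # source frames covered by one clip
    last_start = max(total_frames - span, 0)
    if num_clips == 1:
        return [last_start // 2]                  # single clip: take the centre
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def average_scores(per_clip_scores):
    """Average class scores over clips (one list of class scores per clip)."""
    n = len(per_clip_scores)
    return [sum(column) / n for column in zip(*per_clip_scores)]


# Hypothetical 167-second video at 30 frames per second.
starts = clip_starts(total_frames=5010, num_clips=4)
print(starts)  # [0, 1649, 3298, 4947]

# Made-up per-clip class probabilities standing in for model outputs.
video_scores = average_scores([[0.2, 0.8], [0.6, 0.4]])
print(video_scores)
```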

7. Implementation and Reproducibility Considerations

Reproducible implementation is supported by clear specification of core hyperparameters:

Hyperparameter	Typical Value	Note
Backbone	Swin-T, Swin-B, Swin-L	C=96, (2,2,6,2) layers (Swin-T example)
Patch Size	$2 \times 4 \times 4$	Time × Height × Width
Window Size	e.g., $8 \times 7 \times 7$	3D windows (time × height × width)
Batch Size	64
Input Frames	32	Temporal stride: 2
Spatial Crop	$224 \times 224$
Optimizer	AdamW, weight decay ≈0.05
LR Schedule	Cosine decay, 2.5-epoch warm-up
Augmentations	RandAugment, label smoothing, random erasing, stochastic depth (for SSv2)
Training Epochs	30 (K400), 60 (SSv2)

Forward pass implementation uses partitioning, (optionally shifted) 3D windowed attention, and MLPs, alternating regular and shifted windows layer-wise (Liu et al., 2021).
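The hyperparameters and the layer-wise alternation can be captured in a small configuration object. This is an illustrative sketch: the class and field names are hypothetical, the 8 × 7 × 7 window is an assumed example value, and the other defaults mirror the Swin-T settings listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoSwinConfig:
    embed_dim: int = 96                  # C for Swin-T
    depths: tuple = (2, 2, 6, 2)         # blocks per stage
    patch_size: tuple = (2, 4, 4)        # time, height, width
    window_size: tuple = (8, 7, 7)       # time, height, width (assumed example)
    num_frames: int = 32
    frame_stride: int = 2
    crop_size: int = 224
    batch_size: int = 64
    weight_decay: float = 0.05
    warmup_epochs: float = 2.5
    epochs: int = 30                     # 60 for Something-Something v2


def block_shifts(config):
    """Per-stage shift sizes: even blocks use regular windows, odd blocks shifted ones."""
    half = tuple(size // 2 for size in config.window_size)
    none = (0, 0, 0)
    return [[none if i % 2 == 0 else half for i in range(depth)]
            for depth in config.depths]


config = VideoSwinConfig()
for stage, shifts in enumerate(block_shifts(config), start=1):
    print(stage, shifts)
```

Every stage depth in the (2, 2, 6, 2) arrangement is even, so each stage contains the same number of regular and shifted blocks.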

In summary, the Video Swin Transformer combines a hierarchical vision Transformer design with efficient local spatio-temporal self-attention, delivering leading accuracy and efficiency on the video benchmarks above. The transfer-learning results point to two areas for future research: adapting to longer temporal context and learning representations that generalize across domains (Liu et al., 2021, Oliveira et al., 2022).


References (2)

Liu et al. (2021). Video Swin Transformer.

Oliveira et al. (2022). Transfer-learning for video classification: Video Swin Transformer on multiple domains.
