ViViT: A Video Vision Transformer
In the paper "ViViT: A Video Vision Transformer," authored by Anurag Arnab and colleagues at Google Research, the researchers investigate transformer-based architectures for video classification, leveraging the recent success of these models in image classification domains.
Model Architecture and Methodologies
The paper introduces several pure-transformer models that extract spatio-temporal tokens from video inputs. These tokens are then processed through multiple transformer layers. Given the computational challenges associated with long sequences of tokens, the authors propose several efficient variants of their models, which factorize the spatial and temporal dimensions of the input.
The four proposed architectures are:
- Model 1: Spatio-temporal Attention
- This model uses the standard self-attention mechanism to process all spatio-temporal tokens simultaneously. While this approach can model long-range dependencies effectively from the first layer, it is computationally expensive due to its quadratic complexity with respect to the number of tokens.
- Model 2: Factorized Encoder
- This architecture splits the computation into two stages: a spatial encoder that models interactions between tokens extracted from the same temporal index, followed by a temporal encoder that processes the spatial encoder's per-frame outputs. This factorization reduces computational complexity and has been shown to model temporal interactions effectively (a minimal sketch follows this list).
- Model 3: Factorized Self-attention
- Rather than factorizing entire encoders, this variant factorizes the self-attention operation within each transformer layer: attention is first computed spatially, over tokens from the same temporal index, and then temporally, over tokens from the same spatial location. This matches the reduced complexity of Model 2 while still mixing information across space and time in every layer (see the sketch after this list).
- Model 4: Factorized Dot-product Attention
- This model splits the attention heads so that half compute dot-product attention over the spatial dimension and half over the temporal dimension. It keeps the same number of parameters as the unfactorized Model 1 while matching the computational complexity of the factorized variants, making long video token sequences more tractable (a sketch follows this list).
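To make the factorized encoder (Model 2) concrete, the minimal PyTorch sketch below runs a spatial transformer over the patch tokens of each frame, takes a per-frame class token as that frame's summary, and passes the sequence of summaries to a temporal transformer. The module layout, depths, and the 400-class head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FactorizedEncoder(nn.Module):
    """Sketch of ViViT Model 2: a spatial encoder per frame, followed by a
    temporal encoder over the per-frame summaries (illustrative only)."""

    def __init__(self, dim=768, spatial_depth=12, temporal_depth=4,
                 heads=12, num_classes=400):
        super().__init__()

        def layer():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                batch_first=True, norm_first=True)

        self.spatial = nn.TransformerEncoder(layer(), num_layers=spatial_depth)
        self.temporal = nn.TransformerEncoder(layer(), num_layers=temporal_depth)
        self.cls_spatial = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_temporal = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        # tokens: (B, T, S, dim) -- T temporal indices, S spatial tokens each.
        b, t, s, d = tokens.shape
        x = tokens.reshape(b * t, s, d)
        cls = self.cls_spatial.expand(b * t, -1, -1)
        x = self.spatial(torch.cat([cls, x], dim=1))[:, 0]       # (B*T, dim)
        x = x.reshape(b, t, d)                                   # frame summaries
        cls_t = self.cls_temporal.expand(b, -1, -1)
        x = self.temporal(torch.cat([cls_t, x], dim=1))[:, 0]    # (B, dim)
        return self.head(x)                                      # class logits
```

The spatial encoder here plays the role of a ViT backbone; Model 1's joint spatio-temporal attention would instead run a single encoder over all T·S tokens of a clip at once.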
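A single factorized self-attention layer (Model 3) can be sketched in the same spirit: tokens are regrouped by frame for the spatial attention step and by spatial location for the temporal attention step. The pre-norm layout and MLP sizing below are assumptions made for brevity.

```python
import torch
import torch.nn as nn


class FactorizedSelfAttentionBlock(nn.Module):
    """Sketch of one ViViT Model 3 layer: spatial self-attention over the
    tokens of each frame, then temporal self-attention over the tokens at
    each spatial location (simplified; residuals and MLP kept minimal)."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn_space = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.attn_time = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, t, s):
        # x: (B, t*s, dim) -- all spatio-temporal tokens of one clip.
        b, n, d = x.shape
        # Spatial attention: group tokens by frame -> (B*t, s, dim).
        xs = x.reshape(b, t, s, d).reshape(b * t, s, d)
        h = self.norm1(xs)
        xs = xs + self.attn_space(h, h, h, need_weights=False)[0]
        # Temporal attention: group tokens by location -> (B*s, t, dim).
        xt = xs.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        h = self.norm2(xt)
        xt = xt + self.attn_time(h, h, h, need_weights=False)[0]
        # MLP, then restore the original (B, t*s, dim) layout.
        xt = xt + self.mlp(self.norm3(xt))
        return xt.reshape(b, s, t, d).transpose(1, 2).reshape(b, n, d)
```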
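Finally, a sketch of factorized dot-product attention (Model 4), in which half of the heads attend over the spatial axis and the other half over the temporal axis. It assumes PyTorch 2.x for `scaled_dot_product_attention`; the fused QKV projection and shape conventions are illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedDotProductAttention(nn.Module):
    """Sketch of ViViT Model 4: half the heads attend over the spatial axis,
    half over the temporal axis; head outputs are concatenated and projected."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.h = num_heads
        self.hd = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, t, s):
        # x: (B, t*s, dim) -- t temporal and s spatial token counts.
        b, n, dim = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.h, self.hd).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]               # each (B, H, N, hd)

        half = self.h // 2
        # Spatial heads: attention runs over the s tokens of each frame.
        qs, ks, vs = (z[:, :half].reshape(b, half, t, s, self.hd)
                      for z in (q, k, v))
        out_s = F.scaled_dot_product_attention(qs, ks, vs)   # (B, half, t, s, hd)

        # Temporal heads: attention runs over the t tokens at each location.
        qt, kt, vt = (z[:, half:].reshape(b, half, t, s, self.hd).transpose(2, 3)
                      for z in (q, k, v))
        out_t = F.scaled_dot_product_attention(qt, kt, vt)   # (B, half, s, t, hd)
        out_t = out_t.transpose(2, 3)                        # (B, half, t, s, hd)

        # Concatenate spatial and temporal heads, restore (B, N, dim).
        out = torch.cat([out_s, out_t], dim=1).reshape(b, self.h, n, self.hd)
        out = out.permute(0, 2, 1, 3).reshape(b, n, dim)
        return self.proj(out)
```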
Token Embedding Techniques
The researchers examined two main approaches for token extraction:
- Uniform Frame Sampling: A fixed number of frames is sampled uniformly from the video, each frame is divided into 2D patches and embedded independently, as in ViT, and the resulting tokens are concatenated.
- Tubelet Embedding: Non-overlapping spatio-temporal "tubes" are extracted from the video and linearly projected, so that each token fuses spatial and temporal information already at the embedding stage (see the sketch below).
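Tubelet embedding can be realized as a single 3D convolution whose kernel size and stride equal the tubelet dimensions, so that each non-overlapping tube is linearly projected to one token. The tubelet size of 2×16×16 and the embedding dimension of 768 below are illustrative assumptions, not prescribed values; uniform frame sampling corresponds to the special case of a temporal extent of one frame.

```python
import torch
import torch.nn as nn


class TubeletEmbedding(nn.Module):
    """Tubelet embedding as a 3D convolution: each non-overlapping
    (t, h, w) tube of the clip becomes one token (illustrative sketch)."""

    def __init__(self, dim=768, tubelet=(2, 16, 16), in_channels=3):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        # video: (B, 3, T, H, W) -> tokens: (B, T', H'*W', dim)
        x = self.proj(video)                    # (B, dim, T/2, H/16, W/16)
        return x.flatten(3).permute(0, 2, 3, 1)


# Usage: a 32-frame RGB clip at 224x224 yields 16 temporal x 196 spatial tokens.
video = torch.randn(2, 3, 32, 224, 224)
tokens = TubeletEmbedding()(video)              # shape (2, 16, 196, 768)
```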
Regularization and Pretraining
The paper acknowledges that transformer-based models typically require large-scale datasets for effective training because they lack the inductive biases built into convolutional networks. To address this, the authors initialize from pre-trained image models and apply strong regularization, including stochastic depth, random augmentation, label smoothing, and mixup, to mitigate overfitting on smaller datasets.
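As an illustration of two of these regularizers, the sketch below applies mixup to a batch of clips and evaluates a label-smoothed cross-entropy on both mixed targets. The α value and smoothing factor are placeholder assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F


def mixup(clips, labels, alpha=0.3):
    """Mixup for video clips: blend each example with a randomly chosen
    partner and return both label sets plus the mixing weight (sketch)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    mixed = lam * clips + (1.0 - lam) * clips[perm]
    return mixed, labels, labels[perm], lam


def mixup_loss(logits, y_a, y_b, lam, smoothing=0.1):
    """Label-smoothed cross-entropy applied to both mixup targets."""
    def ce(y):
        return F.cross_entropy(logits, y, label_smoothing=smoothing)
    return lam * ce(y_a) + (1.0 - lam) * ce(y_b)


# Inside a training step (model and optimizer assumed to exist):
#   mixed, y_a, y_b, lam = mixup(clips, labels)   # clips: (B, 3, T, H, W)
#   loss = mixup_loss(model(mixed), y_a, y_b, lam)
```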
Evaluation and Results
The ViViT models were evaluated on several benchmark video classification datasets:
- Kinetics 400 and 600: The models achieved top-1 accuracies of 84.9% and 85.8%, respectively, setting a new state of the art; these best results rely on large-scale image pretraining on the JFT dataset.
- Epic Kitchens 100: The ViViT variants significantly outperformed prior methods, with particularly strong top-1 accuracy on noun classification.
- Moments in Time: The ViViT models also set new records, substantially improving the top-1 accuracy compared to previous methods.
- Something-Something v2: The models achieved state-of-the-art top-1 accuracy, underscoring their ability to capture the fine-grained motion cues this dataset demands.
Implications and Future Directions
The introduction of pure-transformer models for video classification marks a significant shift from traditional convolutional approaches. The paper shows that factorizing the model along the spatial and temporal dimensions strikes a balance between efficiency and accuracy.
As future directions, the authors suggest pretraining strategies tailored specifically to video and extending the architecture to video-understanding tasks beyond classification, such as action detection and video captioning. The flexibility of transformer architectures could also be leveraged to integrate multi-modal information, combining visual, auditory, and textual signals for more comprehensive video analysis.
In summary, "ViViT: A Video Vision Transformer" presents a comprehensive study of adapting transformer-based models to video classification, demonstrating their potential to surpass conventional CNN-based methods through careful architectural factorization, strong regularization, and large-scale pretraining.