
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text (2104.11178v3)

Published 22 Apr 2021 in cs.CV, cs.AI, cs.LG, cs.MM, and eess.IV

Abstract: We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks. Especially, VATT's vision Transformer achieves the top-1 accuracy of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time, new records while avoiding supervised pre-training. Transferring to image classification leads to 78.7% top-1 accuracy on ImageNet compared to 64.7% by training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition by achieving the mAP of 39.4% on AudioSet without any supervised pre-training. VATT's source code is publicly available.

Review of "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"

The paper "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text" presents an innovative framework for extracting rich, multimodal representations using Transformer architectures without relying on convolutional layers. The approach is particularly focused on learning from video, audio, and text modalities directly from raw inputs, which is an emerging area in machine learning research.

Methodology

The proposed framework, the Video-Audio-Text Transformer (VATT), employs a convolution-free architecture for multimodal learning. The model is trained in a self-supervised fashion with multimodal contrastive losses: Noise Contrastive Estimation (NCE) aligns video-audio pairs, while Multiple Instance Learning NCE (MIL-NCE) aligns video-text pairs. Both objectives operate directly on unlabeled data.
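As a rough illustration of how these pairwise objectives operate, the sketch below implements a batch-level NCE loss for video-audio pairs and a MIL-NCE variant in which each video is matched against a small bag of candidate text clips. Tensor shapes, the temperature value, and function names are illustrative assumptions and are not taken from the VATT codebase.

```python
import torch
import torch.nn.functional as F

def nce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric NCE over a batch of video-audio pairs; positives lie on the diagonal."""
    v = F.normalize(video_emb, dim=-1)                      # (B, D)
    a = F.normalize(audio_emb, dim=-1)                      # (B, D)
    logits = v @ a.t() / temperature                        # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)      # matching index = positive
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mil_nce_loss(video_emb, text_embs, temperature=0.07):
    """MIL-NCE: each video is paired with a bag of K candidate text clips.

    video_emb: (B, D); text_embs: (B, K, D). Summing positive scores over the
    bag relaxes the need for exact temporal alignment between video and text.
    """
    v = F.normalize(video_emb, dim=-1)                      # (B, D)
    t = F.normalize(text_embs, dim=-1)                      # (B, K, D)
    sim = torch.einsum('bd,nkd->bnk', v, t) / temperature   # (B, B, K)
    B, K = text_embs.shape[:2]
    sim = sim.reshape(B, B * K)                             # flatten all bags
    owner = torch.arange(B, device=v.device).repeat_interleave(K)           # (B*K,)
    pos_mask = owner.unsqueeze(0) == torch.arange(B, device=v.device).unsqueeze(1)
    pos = torch.logsumexp(sim.masked_fill(~pos_mask, float('-inf')), dim=1)  # positives only
    return (torch.logsumexp(sim, dim=1) - pos).mean()        # -log(pos / all)
```

In practice the two losses would be summed and backpropagated through the projection heads that map each modality into the common space described in the paper.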

A key novelty in the architecture is a modality-agnostic Transformer in which weights are shared among the three modalities, challenging conventional modality-specific designs. The paper also introduces DropToken, which randomly drops a fraction of input tokens during training to significantly reduce computational complexity without sacrificing performance.
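The DropToken idea can be sketched in a few lines: during training, a random subset of the input tokens is discarded before being fed to the Transformer encoder. The drop rate and tensor shapes below are illustrative assumptions rather than values from the paper's implementation.

```python
import torch

def drop_token(tokens, drop_rate=0.5, training=True):
    """tokens: (B, N, D) patch/waveform tokens. Returns a random subset of tokens.

    During training, a random (1 - drop_rate) fraction of each example's tokens
    is kept; at inference time all tokens pass through unchanged.
    """
    if not training or drop_rate <= 0.0:
        return tokens
    B, N, D = tokens.shape
    n_keep = max(1, int(round(N * (1.0 - drop_rate))))
    scores = torch.rand(B, N, device=tokens.device)          # independent per example
    keep_idx = scores.argsort(dim=1)[:, :n_keep]             # (B, n_keep) random indices
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)      # (B, n_keep, D)
    return tokens.gather(dim=1, index=keep_idx)
```

Because self-attention cost grows quadratically with sequence length, keeping half of the tokens roughly quarters the cost of each attention layer, which is what makes the technique attractive for long raw video and audio sequences.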

Results

The evaluation of VATT is comprehensive and includes a variety of downstream tasks such as video action recognition, audio event classification, image classification, and zero-shot text-to-video retrieval.

  • Video Action Recognition: On Kinetics-400, Kinetics-600, Kinetics-700, and Moments in Time, VATT's vision Transformer achieves top-1 accuracies of 82.1%, 83.6%, 72.7%, and 41.1%, respectively, setting new benchmarks without the need for supervised pre-training.
  • Audio Event Recognition: On AudioSet, VATT achieves a new mAP record of 39.4%, further demonstrating its efficacy over existing ConvNet-based solutions.
  • Image Classification: Despite the domain gap, VATT achieves a 78.7% top-1 accuracy on ImageNet, showcasing impressive transfer learning capabilities.
  • Zero-Shot Text-to-Video Retrieval: The framework illustrates competitive performance in retrieval tasks, with considerable improvements seen under larger batch sizes and extended training epochs.

The results indicate VATT's robust ability to learn and generalize across different visual and auditory tasks, even when using a single shared Transformer model.

Implications and Future Directions

The implications of VATT are substantial for multimodal research. The demonstrated success of the shared backbone Transformer model suggests potential in developing unified models capable of handling diverse modalities. The approach of avoiding convolutions and focusing on attention mechanisms resonates with trends in NLP and suggests similar gains could be made in the vision and audio domains.

From a practical perspective, the ability to handle raw data inputs directly and perform competitive self-supervised learning could reduce the dependency on labeled data, thus making model training more accessible. Moreover, techniques such as DropToken could lead to more computationally feasible deployments of Transformer models in real-world applications.

Future Developments: The research opens up several avenues for further exploration, including improvements in data augmentation techniques tailored for multimodal data and enhancements in scalability and efficiency, particularly when broadening the input domain to include other data sources beyond video, audio, and text.

In conclusion, VATT signifies a progressive step in multimodal learning, with results underscoring its adaptability and strength. The convergence of self-supervised learning with Transformer architectures holds promise for future innovations in artificial intelligence research and applications.

Authors (7)
  1. Hassan Akbari (8 papers)
  2. Liangzhe Yuan (19 papers)
  3. Rui Qian (50 papers)
  4. Wei-Hong Chuang (1 paper)
  5. Shih-Fu Chang (131 papers)
  6. Yin Cui (45 papers)
  7. Boqing Gong (100 papers)
Citations (534)