Self-Supervised MultiModal Versatile Networks (2006.16228v2)

Published 29 Jun 2020 in cs.CV

Abstract: Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.

Authors (9)
  1. Jean-Baptiste Alayrac (38 papers)
  2. Adrià Recasens (19 papers)
  3. Rosalia Schneider (5 papers)
  4. Jason Ramapuram (23 papers)
  5. Jeffrey De Fauw (6 papers)
  6. Lucas Smaira (9 papers)
  7. Sander Dieleman (29 papers)
  8. Andrew Zisserman (248 papers)
  9. Relja Arandjelović (18 papers)
Citations (359)

Summary

Self-Supervised MultiModal Versatile Networks

The paper "Self-Supervised MultiModal Versatile Networks" presents an advanced approach to multimodal representation learning by leveraging the diverse and naturally available modalities in videos—namely, visual, audio, and language streams. This effort stands as a notable contribution to the domain of self-supervised learning, aiming to exploit the rich multimodal supervision inherent in unlabelled video datasets.

The primary objective is to develop a "multimodal versatile network": a network that can ingest any of the three modalities and whose representations serve downstream tasks in multiple modalities. The central challenge the authors highlight is preserving fine-grained audio and visual representations while also integrating text into a common embedding. To meet this objective, they propose an architectural design and training strategy that respect the specific properties of each modality while enabling cross-modal understanding.

Key Findings and Methodology

The paper compares several ways of arranging joint embedding spaces across the three modalities. Among these, the proposed "Fine and Coarse" (FAC) design emerges as the favorable strategy: it maintains two distinct embedding spaces, a fine-grained space shared by the visual and audio modalities and a coarse-grained space connecting all three modalities, which accommodates the different granularities of audio-visual and textual data. A sketch of this routing is given below.
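
The following is a minimal, hypothetical sketch of FAC-style projection heads in PyTorch; the layer types, dimensions, and names are illustrative assumptions rather than the paper's exact architecture. Its only purpose is to show the routing: video and audio meet in the fine-grained space, and all modalities meet in the coarse space, which video and audio reach through a further projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FACHeads(nn.Module):
    """Illustrative Fine-And-Coarse (FAC) projection heads.

    Video and audio backbone features are projected into a shared
    fine-grained visual-audio space; a coarser space connecting all
    three modalities is obtained by projecting the fine embeddings
    further and projecting text directly. All dimensions below are
    placeholders, not the paper's settings.
    """

    def __init__(self, d_video=1024, d_audio=512, d_text=300,
                 d_fine=512, d_coarse=256):
        super().__init__()
        self.video_to_fine = nn.Linear(d_video, d_fine)    # video -> fine space
        self.audio_to_fine = nn.Linear(d_audio, d_fine)    # audio -> fine space
        self.fine_to_coarse = nn.Linear(d_fine, d_coarse)  # fine  -> coarse space
        self.text_to_coarse = nn.Linear(d_text, d_coarse)  # text  -> coarse space

    def forward(self, video_feat, audio_feat, text_feat):
        vf = self.video_to_fine(video_feat)   # video embedding in the fine space
        af = self.audio_to_fine(audio_feat)   # audio embedding in the fine space
        v_fine, a_fine = F.normalize(vf, dim=-1), F.normalize(af, dim=-1)
        # All three modalities meet in the coarse space; video and audio
        # reach it by projecting their fine-grained embeddings further.
        v_coarse = F.normalize(self.fine_to_coarse(vf), dim=-1)
        a_coarse = F.normalize(self.fine_to_coarse(af), dim=-1)
        t_coarse = F.normalize(self.text_to_coarse(text_feat), dim=-1)
        return v_fine, a_fine, v_coarse, a_coarse, t_coarse
```

The underlying design choice is that the lossy projection needed to meet the coarser text modality happens after, not before, the fine-grained visual-audio comparison, so audio-visual detail is not sacrificed.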

Training relies on multimodal contrastive learning: streams that co-occur in the same video (for example, a clip and its soundtrack, or a clip and its narration) are treated as positive pairs, while samples from other videos serve as negatives, so no manual annotation is required. A contrastive loss over these pairs aligns the modality embeddings and yields representations that support cross-modal retrieval and classification.
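
As a simplified illustration of such an objective (the paper's full loss combines separate contrastive terms for the video-audio and video-text pairings, and the exact formulation and temperature below are assumptions), a batch-wise NCE-style loss treating temporally aligned clips as positives could look like this:

```python
import torch
import torch.nn.functional as F

def contrastive_nce_loss(x_emb, y_emb, temperature=0.07):
    """Symmetric NCE-style contrastive loss between two modalities.

    x_emb, y_emb: (batch, dim) L2-normalised embeddings where row i of
    each tensor comes from the same video clip, giving a natural positive
    pair; every other row in the batch acts as a negative. The temperature
    is a placeholder value, not taken from the paper.
    """
    logits = x_emb @ y_emb.t() / temperature                 # pairwise similarities
    targets = torch.arange(x_emb.size(0), device=x_emb.device)
    loss_x_to_y = F.cross_entropy(logits, targets)           # match x_i with y_i
    loss_y_to_x = F.cross_entropy(logits.t(), targets)       # and y_i with x_i
    return 0.5 * (loss_x_to_y + loss_y_to_x)
```

In this framing, a video-audio term computed on the fine-grained embeddings and a video-text term computed on the coarse embeddings would be summed, possibly with weights, to form the overall training objective.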

Results and Performance

In evaluation, the multimodal versatile networks achieve state-of-the-art performance among self-supervised methods across several challenging benchmarks, including action classification on UCF101 and HMDB51 and audio classification on ESC-50. Extensive empirical validation supports these claims, and training on large combined datasets such as HowTo100M and AudioSet underscores the model's capacity to harness unlabelled data at scale.

Additionally, the paper introduces a deflation process that converts the video network into one that can be applied efficiently to still images. This matters for settings that require both video and image analysis and reflects the adaptability of the proposed framework; the core idea is sketched below.
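
As a loose sketch of that idea, assuming a 3D-convolutional video backbone: a 3D convolution can be collapsed into a 2D one by summing its kernel over the temporal axis. This is only the naive first step; the full deflation procedure additionally calibrates the deflated network so that its output on a still image approximates the video network's output on that image treated as a static clip.

```python
import torch
import torch.nn as nn

def deflate_conv3d(conv3d: nn.Conv3d) -> nn.Conv2d:
    """Collapse a (t, h, w) 3D convolution into an (h, w) 2D convolution
    by summing its kernel over the temporal dimension. Naive first step
    of deflation only; further calibration on images is needed for the
    deflated network to mimic the video network."""
    conv2d = nn.Conv2d(
        in_channels=conv3d.in_channels,
        out_channels=conv3d.out_channels,
        kernel_size=conv3d.kernel_size[1:],
        stride=conv3d.stride[1:],
        padding=conv3d.padding[1:],
        bias=conv3d.bias is not None,
    )
    with torch.no_grad():
        # weight: (out, in, t, h, w) -> sum over t gives (out, in, h, w)
        conv2d.weight.copy_(conv3d.weight.sum(dim=2))
        if conv3d.bias is not None:
            conv2d.bias.copy_(conv3d.bias)
    return conv2d
```

Applied module by module, such a transformation yields an image counterpart of the video backbone, which the calibration step mentioned above then aligns with the original network.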

Implications and Future Directions

By demonstrating a self-supervised model that integrates multimodal data effectively, this research opens promising avenues for future developments in artificial intelligence systems. The approach aligns with growing trends toward scalable machine learning paradigms that minimize reliance on labeled datasets. Potential applications range from improving content search and retrieval systems to developing more nuanced artificial perception systems.

The insights from this paper could inspire further refinements in model architectures and training methodologies, broadening their applicability across datasets and tasks. As the field progresses, exploring additional modalities and improving the integration strategy may further unlock the potential of multimodal learning within AI systems.