Overview of "Audiovisual Masked Autoencoders"
The paper "Audiovisual Masked Autoencoders" proposes a novel approach for self-supervised representation learning using the audiovisual information intrinsic to video data. The authors explore the effectiveness of masked autoencoding, a technique that has demonstrated considerable success in NLP and visual representation tasks, to jointly model the audio and visual modalities. This joint modeling aims to improve the quality of learned representations, enabling superior performance across various downstream tasks, including unimodal and multimodal classification, without the need for labeled datasets.
Key Contributions
- Masked Autoencoding for Audiovisual Data: The core contribution lies in extending the masked autoencoding framework to model both audio and visual content concurrently. This involves creating multiple pretraining architectures that can encode and reconstruct audiovisual inputs, thus capturing intricate interactions between modalities.
- Pretraining Architectures and Objectives: The paper investigates several architectural configurations and pretraining objectives, such as early fusion, shared encoder weights, and modality inpainting (an early-fusion variant is sketched after this list). These configurations are evaluated through ablation studies to identify the best design choices.
- Transferability: The paper demonstrates that the learned audiovisual representations are effective not only on the data they were pretrained on but also transfer well across datasets and tasks, achieving state-of-the-art results on benchmarks such as VGGSound, AudioSet, and Epic Kitchens.
- Release of Code and Models: To facilitate further research, the authors have made the models and code accessible, promoting reproducibility and allowing other researchers to build upon their work.
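The early-fusion configuration mentioned above can be illustrated as follows: audio-spectrogram patches and video patches are embedded separately, tagged with learned modality embeddings, concatenated into one joint token sequence, and processed by a single shared transformer encoder before being split back for per-modality reconstruction. This is a sketch under the same assumptions as the previous example; the class name EarlyFusionEncoder, the patch dimensions, and the layer sizes are illustrative, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn


class EarlyFusionEncoder(nn.Module):
    def __init__(self, audio_patch_dim=256, video_patch_dim=1536, dim=512, depth=4):
        super().__init__()
        self.audio_embed = nn.Linear(audio_patch_dim, dim)
        self.video_embed = nn.Linear(video_patch_dim, dim)
        # Learned embeddings telling the encoder which modality a token comes from.
        self.audio_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.video_type = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, audio_patches, video_patches):
        # Shapes: (B, Na, audio_patch_dim) and (B, Nv, video_patch_dim).
        # Random masking of each modality would happen before this step,
        # as in the previous sketch.
        a = self.audio_embed(audio_patches) + self.audio_type
        v = self.video_embed(video_patches) + self.video_type
        fused = torch.cat([a, v], dim=1)     # early fusion: one joint sequence
        out = self.encoder(fused)
        # Split the joint sequence back so each modality's decoder can
        # reconstruct its own masked patches.
        return out[:, :a.size(1)], out[:, a.size(1):]


# Toy usage: 128 spectrogram patches (16x16 each) and 392 video tube patches
# (2 frames x 16x16x3 pixels each).
audio_out, video_out = EarlyFusionEncoder()(
    torch.randn(2, 128, 256),
    torch.randn(2, 392, 1536))
print(audio_out.shape, video_out.shape)
```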
Numerical Results and Claims
The proposed method surpasses prior state-of-the-art results on several tasks. For instance, it achieves substantial gains on the VGGSound and AudioSet benchmarks, in most cases without requiring labeled data for pretraining. In audiovisual classification in particular, the model significantly outperforms baselines that rely on unimodal pretraining strategies.
Implications
The paper's findings have practical implications for fields that rely on multimodal data processing, such as video content analysis, multimedia retrieval, and human-computer interaction. Theoretically, they underscore the benefits of exploiting multimodal synergies through self-supervised learning. By effectively capturing correlations between audio and visual data, the approach could provide a foundation for more nuanced, perception-oriented AI systems.
Future Directions
Future research may focus on enhancing the capacity and efficiency of multimodal transformers deployed in this framework. Exploring larger backbones and integrating novel architectural improvements could elevate performance further. Additionally, addressing modality inpainting challenges and optimizing cross-modal objectives could pave the way for more robust audiovisual models.
In conclusion, this paper presents a comprehensive and effective strategy for harnessing audiovisual information in self-supervised learning, marking a significant advance in the development of versatile and transferable AI models.