
MAViL: Masked Audio-Video Learners (2212.08071v2)

Published 15 Dec 2022 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks.
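The first two objectives named in the abstract, masked reconstruction and contrastive learning across modalities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.8 mask ratio, and the 0.1 temperature are illustrative assumptions, embeddings are plain Python lists rather than encoder outputs, and the third objective (self-training on contextualized features) is not shown.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into (kept, masked) for one sample."""
    num_masked = round(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def masked_mse(pred, target, masked):
    """Mean squared error computed only over the masked patches."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss: queries[i] and keys[i] form the positive pair,
    and every other key in the batch acts as a negative."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        loss += log_denominator - logits[i]
    return loss / len(q)


# Hypothetical toy data: 10 patches per clip, 2 clips in the batch.
rng = random.Random(0)
kept, masked = random_mask(num_patches=10, mask_ratio=0.8, rng=rng)

audio_emb = [[1.0, 0.0], [0.0, 1.0]]
video_emb = [[1.0, 0.0], [0.0, 1.0]]
inter_modal_loss = info_nce(audio_emb, video_emb)
```

In this toy setup the inter-modal loss is small when each audio embedding matches the video embedding of the same clip and grows when the pairs are misaligned. An intra-modal variant would pass two differently masked views of the same modality to `info_nce` instead.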

Authors (10)
  1. Po-Yao Huang
  2. Vasu Sharma
  3. Hu Xu
  4. Chaitanya Ryali
  5. Haoqi Fan
  6. Yanghao Li
  7. Shang-Wen Li
  8. Gargi Ghosh
  9. Jitendra Malik
  10. Christoph Feichtenhofer
Citations (42)