Audiovisual Masked Autoencoders (2212.05922v3)

Published 9 Dec 2022 in cs.CV and cs.SD

Abstract: Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.

Overview of "Audiovisual Masked Autoencoders"

The paper "Audiovisual Masked Autoencoders" proposes a novel approach for self-supervised representation learning using the audiovisual information intrinsic to video data. The authors explore the effectiveness of masked autoencoding, a technique that has demonstrated considerable success in NLP and visual representation tasks, to jointly model the audio and visual modalities. This joint modeling aims to improve the quality of learned representations, enabling superior performance across various downstream tasks, including unimodal and multimodal classification, without the need for labeled datasets.

Key Contributions

  1. Masked Autoencoding for Audiovisual Data: The core contribution lies in extending the masked autoencoding framework to model both audio and visual content concurrently. This involves creating multiple pretraining architectures that can encode and reconstruct audiovisual inputs, thus capturing intricate interactions between modalities.
  2. Pretraining Architectures and Objectives: The paper investigates several architectural configurations and objectives for pretraining, such as early fusion, shared weights, and modality inpainting, and evaluates them through ablation studies to select the best design choices (a toy contrast of two of these configurations is sketched after this list).
  3. Transferability: The paper demonstrates that the learned audiovisual representations are not only effective for the specific tasks they were pretrained on but also exhibit excellent transferability across different datasets and tasks, achieving state-of-the-art results on datasets such as VGGSound, AudioSet, and Epic Kitchens.
  4. Release of Code and Models: To facilitate further research, the authors have made the models and code accessible, promoting reproducibility and allowing other researchers to build upon their work.
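
As a rough illustration of the design space explored, the sketch below contrasts an early-fusion encoder with a shared-weights configuration on toy inputs. The layer sizes and the final concatenation step are assumptions made for illustration, not the paper's exact design, and the modality-inpainting objective is not shown.

```python
import torch
import torch.nn as nn

dim = 256
video = torch.randn(2, 196, dim)  # toy embedded video tokens
audio = torch.randn(2, 64, dim)   # toy embedded audio tokens

def make_encoder(num_layers=4):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
        num_layers=num_layers)

# (a) Early fusion: a single encoder attends over the concatenated audiovisual
#     sequence, so every layer can exchange information across modalities.
early_fusion = make_encoder()
fused = early_fusion(torch.cat([video, audio], dim=1))

# (b) Shared weights: the same encoder parameters process each modality
#     separately; cross-modal interaction is deferred to a later fusion step
#     (here simply concatenation of the two outputs).
shared = make_encoder()
separate = torch.cat([shared(video), shared(audio)], dim=1)

print(fused.shape, separate.shape)  # both torch.Size([2, 260, 256])
```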

Numerical Results and Claims

The proposed method surpasses prior state-of-the-art results on several tasks. For example, it achieves clear gains on the VGGSound and AudioSet benchmarks without relying on labels during pretraining. The model is particularly strong on audiovisual classification, where it substantially outperforms baselines that use unimodal pretraining strategies.

Implications

The paper's findings have practical implications in fields reliant on multimodal data processing, such as video content analysis, multimedia retrieval, and human-computer interaction. Theoretically, it underscores the potential and benefits of leveraging multimodal synergies through self-supervised approaches. By effectively capturing correlations between audio and visual data, the approach could provide a foundation for more nuanced, perception-oriented AI systems.

Future Directions

Future research may focus on enhancing the capacity and efficiency of multimodal transformers deployed in this framework. Exploring larger backbones and integrating novel architectural improvements could elevate performance further. Additionally, addressing modality inpainting challenges and optimizing cross-modal objectives could pave the way for more robust audiovisual models.

In conclusion, this paper presents a comprehensive and effective strategy for harnessing audiovisual information in self-supervised learning, marking a significant advance in the development of versatile and transferable AI models.

Authors (6)
  1. Mariana-Iuliana Georgescu (27 papers)
  2. Eduardo Fonseca (21 papers)
  3. Radu Tudor Ionescu (103 papers)
  4. Cordelia Schmid (206 papers)
  5. Anurag Arnab (56 papers)
  6. Mario Lucic (42 papers)