Audiovisual Masked Autoencoders
Abstract: Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
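The masked-autoencoding pretraining the abstract refers to can be sketched in a few lines: patches from both modalities are tokenized, most tokens are randomly masked, and a reconstruction loss is computed only on the masked positions. The following is a minimal toy sketch of that objective, not the paper's implementation; the token shapes, the shared embedding dimension, and the zero stand-in for the decoder output are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_tokens, mask_ratio, rng):
    # Choose which token indices stay visible to the encoder;
    # the rest are masked and must be reconstructed.
    num_visible = int(num_tokens * (1 - mask_ratio))
    perm = rng.permutation(num_tokens)
    return perm[:num_visible], perm[num_visible:]

def masked_reconstruction_loss(tokens, reconstruction, masked_idx):
    # MAE-style objective: mean squared error on masked tokens only.
    diff = tokens[masked_idx] - reconstruction[masked_idx]
    return float(np.mean(diff ** 2))

# Toy audiovisual input: video patch tokens and audio-spectrogram patch
# tokens, assumed already embedded to a shared dimension (hypothetical sizes).
video_tokens = rng.normal(size=(196, 32))  # e.g. 14x14 frame patches
audio_tokens = rng.normal(size=(48, 32))   # e.g. spectrogram patches
tokens = np.concatenate([video_tokens, audio_tokens], axis=0)

visible_idx, masked_idx = random_mask(len(tokens), mask_ratio=0.75, rng=rng)
# A real model would encode only the visible tokens and decode all positions;
# here an all-zeros "reconstruction" stands in for the decoder output.
reconstruction = np.zeros_like(tokens)
loss = masked_reconstruction_loss(tokens, reconstruction, masked_idx)
```

With a 75% mask ratio over the 244 concatenated tokens, only 61 tokens reach the encoder, which is what makes this style of pretraining computationally attractive.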