Siamese Vision Transformers are Scalable Audio-visual Learners (2403.19638v1)
Abstract: Traditional audio-visual methods rely on independent audio and visual backbones, which is costly and not scalable. In this work, we investigate using an audio-visual siamese network (AVSiam) for efficient and scalable audio-visual pretraining. Our framework uses a single shared vision transformer backbone to process both audio and visual inputs, improving parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes. We pretrain our model using a contrastive audio-visual matching objective with a multi-ratio random masking scheme, which enables our model to process larger audio-visual instance batches, which is helpful for contrastive learning. Unlike prior audio-visual methods, our method can robustly handle audio, visual, and audio-visual inputs with a single shared ViT backbone. Furthermore, despite using a shared backbone for both modalities, AVSiam achieves competitive or even better results than prior methods on AudioSet and VGGSound for audio-visual classification and retrieval. Our code is available at https://github.com/GenjiB/AVSiam
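To make the abstract's core ideas concrete, below is a minimal PyTorch sketch of a shared-backbone ("siamese") audio-visual contrastive setup with multi-ratio random masking. Everything here is an illustrative assumption rather than the authors' implementation: the module names, the 224x224 input sizes for both the RGB frame and the 1-channel spectrogram (so the two modalities share positional embeddings), the candidate masking ratios, and the temperature are all hypothetical; the real code is at https://github.com/GenjiB/AVSiam.

```python
# Hypothetical sketch of AVSiam-style pretraining: one ViT-like encoder shared
# across modalities, random token masking at a ratio sampled per step, and a
# symmetric contrastive audio-visual matching loss over the batch.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class AVSiamSketch(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, num_patches=196):
        super().__init__()
        # One transformer encoder shared by both modalities (the "siamese" part).
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        # Modality-specific patch embeddings: RGB frames vs. 1-channel spectrograms.
        # Both inputs are assumed to be 224x224 so they yield the same token count.
        self.embed_v = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.embed_a = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.proj = nn.Linear(dim, dim)  # projection head for the contrastive loss

    def encode(self, tokens, mask_ratio):
        # Random masking: keep a random subset of tokens. Fewer tokens per
        # instance means larger batches fit in GPU memory, which helps the
        # contrastive objective.
        B, N, D = tokens.shape
        keep = max(1, int(N * (1 - mask_ratio)))
        idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :keep]
        tokens = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
        return self.proj(self.backbone(tokens).mean(dim=1))

    def forward(self, video, audio_spec, ratios=(0.3, 0.5, 0.7)):
        # "Multi-ratio" masking: sample one ratio per step from a candidate set
        # (these specific ratios are an assumption for this sketch).
        r = random.choice(ratios)
        v = self.embed_v(video).flatten(2).transpose(1, 2) + self.pos
        a = self.embed_a(audio_spec).flatten(2).transpose(1, 2) + self.pos
        zv, za = self.encode(v, r), self.encode(a, r)
        # Symmetric InfoNCE: matched audio-visual pairs along the diagonal are
        # positives; all other pairs in the batch are negatives.
        logits = F.normalize(zv, dim=-1) @ F.normalize(za, dim=-1).t() / 0.07
        target = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, target) +
                F.cross_entropy(logits.t(), target)) / 2


# Usage: one pretraining step on a toy batch of frame/spectrogram pairs.
model = AVSiamSketch()
loss = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224))
loss.backward()
```

Because both modalities pass through the same weights, the encoder must also accept audio-only, visual-only, or fused inputs at inference time; the sketch above handles the first two cases by construction, since each modality is encoded independently.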