
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning (2403.09502v2)

Published 14 Mar 2024 in cs.LG, cs.AI, and cs.MM

Abstract: Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks. The code is available at https://github.com/JongSuk1/EquiAV.
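The abstract describes an equivariant objective: rather than forcing augmented views to map to identical embeddings (invariance), a transformation predictor, conditioned on the augmentation parameters, predicts how the embedding should change, and a contrastive loss aligns the prediction with the actual augmented view. The sketch below illustrates this idea only in miniature; it assumes a linear transformation predictor and a symmetric InfoNCE loss, whereas the paper itself uses an attention-based predictor. All names here are hypothetical, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: row i of z_a is the positive for row i of z_b."""
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature           # (N, N) cosine-similarity logits
    targets = np.arange(len(z_a))

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

def equivariant_loss(z, z_aug, t_params, W):
    """Toy transformation predictor: a linear map of the augmentation
    parameters predicts the shift an augmentation causes in embedding
    space; InfoNCE then aligns the prediction with the real augmented view."""
    pred = z + t_params @ W                      # hypothetical linear predictor
    return info_nce(pred, z_aug)

rng = np.random.default_rng(0)
N, D, P = 8, 16, 4                               # batch, embed dim, augment params
z = rng.normal(size=(N, D))                      # clean-view embeddings
t = rng.normal(size=(N, P))                      # augmentation parameters
W = 0.1 * rng.normal(size=(P, D))
z_aug = z + t @ W                                # synthetic "augmented" embeddings
loss = equivariant_loss(z, z_aug, t, W)
print(loss)
```

Because the toy predictor here exactly matches the synthetic augmentation model, the loss is near zero; in training, the predictor's parameters would be learned jointly with the encoders so that the loss drives the embedding space to represent augmentations consistently.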

Authors (5)
  1. Jongsuk Kim
  2. Hyeongkeun Lee
  3. Kyeongha Rho
  4. Junmo Kim
  5. Joon Son Chung

