Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos (2307.04760v4)
Abstract: We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
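To make the masked auto-encoding idea concrete, below is a minimal sketch of how binaural spectrogram patches could be masked and then reconstructed from the remaining audio tokens plus visual tokens. This is an illustrative reimplementation of the general idea only, not the authors' model: the module names, dimensions, masking ratio, transformer depths, and the omission of positional embeddings are all assumptions made here for brevity.

```python
# Sketch of masked audio-visual autoencoding over binaural spectrogram patches.
# All names, shapes, and hyperparameters are hypothetical; positional embeddings
# and other details of the actual method are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedBinauralAVSketch(nn.Module):
    def __init__(self, patch_dim=2 * 16 * 16, vis_dim=768, d_model=512,
                 n_heads=8, n_enc=6, n_dec=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.audio_embed = nn.Linear(patch_dim, d_model)   # binaural (2-channel) spectrogram patches
        self.video_embed = nn.Linear(vis_dim, d_model)     # precomputed visual tokens from the clip
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_enc)
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_dec)
        self.head = nn.Linear(d_model, patch_dim)          # reconstruct the masked audio patches

    def forward(self, audio_patches, video_tokens):
        # audio_patches: (B, Na, patch_dim) flattened patches from both binaural channels
        # video_tokens:  (B, Nv, vis_dim) features of the synchronized video frames
        B, Na, _ = audio_patches.shape
        n_keep = int(Na * (1 - self.mask_ratio))
        idx = torch.rand(B, Na, device=audio_patches.device).argsort(dim=1)
        keep, masked = idx[:, :n_keep], idx[:, n_keep:]

        # Encode only the visible audio patches, jointly with the visual tokens.
        vis_audio = torch.gather(
            self.audio_embed(audio_patches), 1,
            keep.unsqueeze(-1).expand(-1, -1, self.audio_embed.out_features))
        enc = self.encoder(torch.cat([vis_audio, self.video_embed(video_tokens)], dim=1))

        # Decode: visible audio tokens + mask tokens, with visual tokens as context.
        mask_tok = self.mask_token.expand(B, Na - n_keep, -1)
        dec = self.decoder(torch.cat([enc[:, :n_keep], mask_tok, enc[:, n_keep:]], dim=1))
        pred = self.head(dec[:, n_keep:Na])

        # Reconstruction loss on the masked binaural patches only.
        target = torch.gather(
            audio_patches, 1,
            masked.unsqueeze(-1).expand(-1, -1, audio_patches.shape[-1]))
        return F.mse_loss(pred, target)
```

In this sketch the spatial signal comes from asking the network to recover the masked channel content (e.g., inter-channel level and phase cues) using what it can see, which is the intuition behind using the pretrained encoder for spatially grounded downstream tasks such as active speaker detection and spatial audio denoising.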
Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman