SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos (2404.05206v1)
Published 8 Apr 2024 in cs.CV, cs.MM, cs.SD, and eess.AS
Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video, outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks.
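To make the abstract's core idea concrete, the sketch below shows one plausible reading of a contrastive-consensus objective in PyTorch: each modality pair (audio-video, audio-text, video-text) gets a standard symmetric InfoNCE term, and each term is gated per example by how well the *other two* pairs agree. The function names, the cosine-agreement consensus signal, and the temperature are illustrative assumptions; the paper's actual MC3 formulation may differ.

```python
# A minimal sketch of a contrastive-consensus objective in the spirit of MC3.
# Assumptions (mine, not the paper's): InfoNCE as the pairwise loss, cosine
# similarity as the consensus signal, and a shared temperature tau.
import torch
import torch.nn.functional as F

def weighted_info_nce(x, y, w, tau=0.07):
    """Symmetric InfoNCE between batches x, y of shape (B, D),
    weighted per example by w of shape (B,)."""
    logits = (x @ y.t()) / tau                        # (B, B) similarity matrix
    labels = torch.arange(x.size(0), device=x.device) # positives on the diagonal
    per_ex = 0.5 * (F.cross_entropy(logits, labels, reduction="none")
                    + F.cross_entropy(logits.t(), labels, reduction="none"))
    return (w * per_ex).mean()

def mc3_style_loss(audio, video, text, tau=0.07):
    """Gate each pairwise contrastive term by the agreement of the other two pairs."""
    a, v, t = (F.normalize(z, dim=-1) for z in (audio, video, text))
    # Per-example cosine agreement for each modality pair, rescaled to [0, 1].
    s_av = ((a * v).sum(-1) + 1) / 2
    s_at = ((a * t).sum(-1) + 1) / 2
    s_vt = ((v * t).sum(-1) + 1) / 2
    # Consensus weight for one pair = product of the other two pairs' agreements,
    # detached so it gates gradients rather than being maximized directly.
    w_av = (s_at * s_vt).detach()
    w_at = (s_av * s_vt).detach()
    w_vt = (s_av * s_at).detach()
    return (weighted_info_nce(a, v, w_av, tau)
            + weighted_info_nce(a, t, w_at, tau)
            + weighted_info_nce(v, t, w_vt, tau))

# Usage with dummy encoder outputs:
B, D = 8, 256
audio = torch.randn(B, D, requires_grad=True)
video = torch.randn(B, D, requires_grad=True)
text = torch.randn(B, D, requires_grad=True)
mc3_style_loss(audio, video, text).backward()
```

Detaching the consensus weights is a deliberate choice in this sketch: it lets disagreeing pairs down-weight a term's gradient (the "diminishing" behavior the abstract describes) without letting the model inflate the weights themselves as a shortcut.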
Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman