T-VSL: Text-Guided Visual Sound Source Localization in Mixtures (2404.01751v2)
Abstract: Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://github.com/enyac-group/T-VSL/tree/main
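The core idea the abstract describes, using a per-class text embedding as the query that picks out each sounding object's spatial region, can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for AudioCLIP's visual-patch and text embeddings (the actual encoders, the audio branch, and the class-prediction step are omitted), and the shapes and temperature value are assumptions.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Unit-normalize vectors along the last axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def localize(patch_feats, text_feats, tau=0.07):
    """Per-class localization maps from patch-to-text cosine similarity.

    patch_feats: (H*W, D) visual patch embeddings
    text_feats:  (C, D) text embeddings, one per predicted sounding class
    Returns (H*W, C); each column is a softmax heatmap over spatial positions.
    """
    sim = l2norm(patch_feats) @ l2norm(text_feats).T  # (H*W, C) cosine scores
    e = np.exp(sim / tau)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
H = W = 7; D = 64; C = 2                 # toy patch grid, embedding dim, two classes
patches = rng.normal(size=(H * W, D))
texts = rng.normal(size=(C, D))
patches[10] += 5.0 * texts[0]            # plant class 0's text direction at patch 10
maps = localize(patches, texts)          # (49, 2); class 0's map should peak at 10
print(maps[:, 0].argmax())
```

The key property this sketch shows is that each class gets its own heatmap from the shared tri-modal embedding space, so the number of sources is flexible: adding a class is just adding a row to `text_feats`.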
- Self-supervised learning of audio-visual objects from video. In Proceedings of European Conference on Computer Vision (ECCV), pages 208–224, 2020.
- CLIP-Decoder: Zero-shot multi-label classification using multimodal CLIP-aligned representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4675–4679, 2023.
- Multimodal video retrieval with CLIP: a user study. Information Retrieval Journal, 26(1-2):6, 2023.
- Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
- Localized vision-language matching for open-vocabulary object detection. In DAGM German Conference on Pattern Recognition, pages 393–408. Springer, 2022.
- VGGSound: A large-scale audio-visual dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.
- Localizing visual sounds the hard way. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16867–16876, 2021.
- Prompt switch: Efficient CLIP adaptation for text-video retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15648–15658, 2023.
- Open-vocabulary panoptic segmentation with MaskCLIP. arXiv preprint arXiv:2208.08984, 2022.
- Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022.
- Zero-shot out-of-distribution detection based on the pre-trained model CLIP. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6568–6576, 2022.
- Learning joint statistical models for audio-visual fusion and segregation. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2000.
- Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022.
- Finetune like you pretrain: Improved finetuning of zero-shot vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023.
- CALIP: Zero-shot enhancement of CLIP with parameter-free attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 746–754, 2023.
- ESResNe(X)t-fbsp: Learning robust time-frequency transformation of audio. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2021.
- AudioCLIP: Extending CLIP to image, text and audio. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE, 2022.
- Audio vision: Using audio-visual synchrony to locate sounds. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 1999.
- Deep multimodal clustering for unsupervised audiovisual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9248–9257, 2019.
- Discriminative sounding objects localization via self-supervised audiovisual matching. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), pages 10077–10087, 2020.
- Mix and localize: Localizing sound sources in mixtures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10483–10492, 2022.
- WinCLIP: Zero-/few-shot anomaly classification and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19606–19616, 2023.
- Pixels that sound. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022.
- SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In International Conference on Machine Learning, pages 23033–23044. PMLR, 2023.
- Simple open-vocabulary object detection. In European Conference on Computer Vision, pages 728–755. Springer, 2022.
- Localizing visual sounds the easy way. In Proceedings of European Conference on Computer Vision (ECCV), pages 218–234, 2022.
- A closer look at weakly-supervised audio-visual source localization. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Audio-visual grouping network for sound localization from mixtures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10565–10574, 2023.
- Open vocabulary semantic segmentation with patch aligned contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19413–19423, 2023.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Can CLIP help sound source localization? arXiv preprint arXiv:2311.04066, 2023.
- MarginNCE: Robust sound localization with a negative margin. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
- Multiple sound sources localization from coarse to fine. In Proceedings of European Conference on Computer Vision (ECCV), pages 292–308, 2020.
- FreeSeg: Unified, universal and open-vocabulary image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19446–19455, 2023.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4358–4366, 2018.
- Learning sound localization better from semantically similar samples. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.
- Sound source localization is all about cross-modal alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7777–7787, 2023.
- Learning audio-visual source localization via false negative aware contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6420–6429, 2023.
- CLIPN for zero-shot OOD detection: Teaching CLIP to say no. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1802–1812, 2023.
- Learning to detect and segment for open vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7051–7060, 2023.
- Wav2CLIP: Learning robust audio representations from CLIP. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4563–4567. IEEE, 2022.
- Learning open-vocabulary semantic segmentation models from natural language supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2935–2944, 2023.
- A simple framework for open-vocabulary segmentation and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1020–1031, 2023.
- The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.
- Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
- Tanvir Mahmud
- Yapeng Tian
- Diana Marculescu