
T-VSL: Text-Guided Visual Sound Source Localization in Mixtures (2404.01751v2)

Published 2 Apr 2024 in cs.CV, cs.SD, and eess.AS

Abstract: Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization. To address this limitation, in this paper, we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g., AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by predicting the class of sounding entities in mixtures. Subsequently, the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://github.com/enyac-group/T-VSL/tree/main
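The core idea — using each predicted class's text embedding as an intermediate query in a shared tri-modal embedding space to localize that source on the visual patch grid — can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual implementation: it assumes embeddings have already been produced by a tri-modal model such as AudioCLIP, and the function name, the simple text/audio fusion rule, and the softmax readout are all illustrative choices.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Unit-normalize along `axis` (epsilon guards zero vectors)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def text_guided_localization(patch_feats, text_feats, audio_feat, temperature=0.07):
    """Localize each predicted sounding class on the visual patch grid.

    patch_feats: (H, W, D) visual patch embeddings in the shared space
    text_feats:  (K, D) one text embedding per predicted sounding class
    audio_feat:  (D,) embedding of the audio mixture
    Returns (K, H, W) maps, each a softmax distribution over locations.
    """
    H, W, D = patch_feats.shape
    p = l2norm(patch_feats.reshape(-1, D))                         # (HW, D)
    # Each class's text embedding acts as a proxy query for the isolated
    # source audio, which is unavailable in mixtures; here it is simply
    # averaged with the mixture embedding (an illustrative fusion rule).
    q = l2norm(l2norm(text_feats) + l2norm(audio_feat)[None, :])   # (K, D)
    sims = (q @ p.T) / temperature                                 # (K, HW)
    sims -= sims.max(axis=1, keepdims=True)                        # stability
    maps = np.exp(sims)
    maps /= maps.sum(axis=1, keepdims=True)                        # softmax
    return maps.reshape(-1, H, W)
```

With K predicted classes, the output gives one spatial attention map per source, so the number of handled sources is flexible, mirroring the property the abstract claims for T-VSL.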

Authors (3)
  1. Tanvir Mahmud
  2. Yapeng Tian
  3. Diana Marculescu
Citations (3)