
Weakly-Supervised Audio-Visual Segmentation (2311.15080v1)

Published 25 Nov 2023 in cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work relied on complex, manually designed architectures supervised by large numbers of pixel-accurate masks. However, these pixel-level masks are expensive to obtain and are not available in all cases. In this work, we aim to reduce the supervision to instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that learns multi-scale audio-visual alignment through multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVSBench demonstrate the effectiveness of WS-AVS for weakly-supervised audio-visual segmentation in both single-source and multi-source scenarios.
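
The abstract names multi-scale multiple-instance contrastive learning but gives no formula, so the following is only a minimal sketch of what such an objective could look like, not the WS-AVS implementation. It assumes an InfoNCE-style loss in which each video frame is treated as a "bag" of visual location features, the audio-to-bag similarity is the maximum cosine similarity over locations, and the per-scale losses are simply summed. All function names, the temperature `tau`, and the max-pooling choice are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def bag_similarity(audio, locations):
    """Multiple-instance step (assumed): score a bag of visual location
    features by its best-matching location for this audio embedding."""
    return max(cosine(audio, loc) for loc in locations)

def mil_contrastive_loss(audios, bags, tau=0.07):
    """InfoNCE-style loss at one scale (assumed form).
    bags[i] is the visual bag paired with audios[i]; all other bags
    in the batch act as negatives."""
    total = 0.0
    for i, audio in enumerate(audios):
        logits = [bag_similarity(audio, bag) / tau for bag in bags]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(audios)

def multi_scale_loss(audios, bags_per_scale, tau=0.07):
    """Sum the single-scale loss over feature-map scales (assumed aggregation)."""
    return sum(mil_contrastive_loss(audios, bags, tau) for bags in bags_per_scale)

# Toy data: two audio embeddings and, at each of two scales, one bag per sample.
audios = [[1.0, 0.0], [0.0, 1.0]]
aligned_bags = [
    [[1.0, 0.0], [0.0, -1.0]],   # contains a location matching audio 0
    [[0.0, 1.0], [-1.0, 0.0]],   # contains a location matching audio 1
]
aligned_loss = multi_scale_loss(audios, [aligned_bags, aligned_bags])
swapped_loss = multi_scale_loss(audios, [aligned_bags[::-1], aligned_bags[::-1]])
print(aligned_loss, swapped_loss)
```

In this toy setup the loss is small when each audio embedding is paired with the bag that contains a matching location and large when the pairing is swapped, which is the behaviour a contrastive alignment objective is meant to have. Only clip-level pairing is used as supervision here; no pixel-level mask enters the loss.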
