Weakly-Supervised Audio-Visual Segmentation (2311.15080v1)
Abstract: Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work relied on elaborate, manually designed architectures supervised with large numbers of pixel-accurate masks. However, such pixel-level masks are expensive to annotate and are not available in all cases. In this work, we simplify the supervision to instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, WS-AVS, that learns multi-scale audio-visual alignment through multi-scale multiple-instance contrastive learning. Extensive experiments on AVSBench demonstrate the effectiveness of WS-AVS for weakly-supervised audio-visual segmentation in both single-source and multi-source scenarios.
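The abstract does not spell out the loss, so the following is only a rough sketch of what multi-scale multiple-instance contrastive learning could look like, not the WS-AVS implementation. It assumes each video frame yields a "bag" of visual location vectors at every scale, that an audio clip is matched to a bag through its best-matching location (max cosine similarity), and that an InfoNCE-style loss is averaged over scales. The function names, the temperature value, and the averaging over scales are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def mil_similarity(audio, locations):
    """Multiple-instance score: max cosine similarity between one audio
    vector and any vector in a bag of visual locations."""
    a = l2_normalize(audio)
    return max(
        sum(x * y for x, y in zip(a, l2_normalize(loc))) for loc in locations
    )


def mil_contrastive_loss(audios, bags, temperature=0.07):
    """InfoNCE over a batch: audios[i] should score higher with bags[i]
    than with any other bags[j]. The temperature is an assumed value."""
    n = len(audios)
    total = 0.0
    for i in range(n):
        logits = [mil_similarity(audios[i], bags[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


def multi_scale_loss(audios, bags_per_scale, temperature=0.07):
    """Average the multiple-instance contrastive loss over all scales
    (averaging rather than summing is an assumption)."""
    losses = [mil_contrastive_loss(audios, bags, temperature) for bags in bags_per_scale]
    return sum(losses) / len(losses)


# Toy batch: two audio clips, two scales, a few 2-D location vectors per frame.
audios = [[1.0, 0.0], [0.0, 1.0]]
scale_coarse = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
scale_fine = [[[0.9, 0.1], [0.2, 0.2], [0.5, 0.5]], [[0.1, 0.9], [0.3, 0.3], [0.5, 0.5]]]

aligned = multi_scale_loss(audios, [scale_coarse, scale_fine])
swapped = multi_scale_loss(audios, [scale_coarse[::-1], scale_fine[::-1]])
print(f"aligned loss: {aligned:.4f}, swapped loss: {swapped:.4f}")
```

In this toy batch, pairing each audio clip with the frame that contains a matching location gives a lower loss than swapping the frames, which is the behaviour a contrastive alignment objective is meant to reward. Only instance-level pairing (which audio goes with which frame) is used; no pixel-level mask enters the loss.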