Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation (2312.06462v2)
Abstract: Recently, the audio-visual segmentation (AVS) task has been introduced, aiming to group the pixels corresponding to sounding objects within a given video. This task requires a first-of-its-kind audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. For pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge from a foundation model to generate more precise visual features. For modality entanglement, we design a Bilateral-Fusion Module (BFM) that enables COMBO to align corresponding visual and auditory signals bi-directionally. For temporal entanglement, we introduce an adaptive inter-frame consistency loss based on the inherent temporal coherence of the video. Comprehensive experiments and ablation studies on the AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/.
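The abstract names the Bilateral-Fusion Module (BFM) but does not spell out its internals. As a rough illustration of what bi-directional audio-visual alignment can look like, below is a minimal PyTorch sketch using standard cross-attention in both directions; the class name `BilateralFusionSketch`, the token shapes, and the residual/LayerNorm choices are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of bi-directional audio-visual fusion via cross-attention.
# This is an illustrative assumption, not the paper's Bilateral-Fusion Module.
import torch
import torch.nn as nn


class BilateralFusionSketch(nn.Module):
    """Fuses audio and visual tokens in both directions with cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Visual tokens attend to audio tokens (audio -> visual direction).
        self.v_from_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Audio tokens attend to visual tokens (visual -> audio direction).
        self.a_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual: (B, N_v, C) flattened spatial tokens; audio: (B, N_a, C) audio tokens.
        v_upd, _ = self.v_from_a(query=visual, key=audio, value=audio)
        a_upd, _ = self.a_from_v(query=audio, key=visual, value=visual)
        # Residual connections preserve the original uni-modal information.
        return self.norm_v(visual + v_upd), self.norm_a(audio + a_upd)


if __name__ == "__main__":
    fusion = BilateralFusionSketch()
    vis = torch.randn(2, 1024, 256)  # e.g., a 32x32 visual feature map, flattened
    aud = torch.randn(2, 1, 256)     # e.g., one audio embedding per frame
    v_out, a_out = fusion(vis, aud)
    print(v_out.shape, a_out.shape)  # (2, 1024, 256) and (2, 1, 256)
```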
Authors: Qi Yang, Xing Nie, Tong Li, Pengfei Gao, Ying Guo, Cheng Zhen, Pengfei Yan, Shiming Xiang