Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation (2312.06462v2)

Published 11 Dec 2023 in cs.CV, cs.AI, cs.SD, and eess.AS

Abstract: Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundation model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss that follows the inherent temporal coherence of video. Comprehensive experiments and ablation studies on the AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/.
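The abstract only names the modules, so the following is a minimal, illustrative PyTorch-style sketch of the two ideas it describes: aligning audio and visual features in both directions, and penalizing mask changes between neighbouring frames with adaptive weights. The cross-attention mechanism, the names `BilateralFusionSketch` and `interframe_consistency_loss`, and the idea of deriving the weights from audio similarity are assumptions for illustration only; they are not the paper's actual BFM or loss definition.

```python
import torch
import torch.nn as nn


class BilateralFusionSketch(nn.Module):
    """Illustrative bidirectional audio-visual fusion (an assumption, not the paper's exact BFM)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Two cross-attention blocks: one lets visual tokens attend to audio,
        # the other lets audio tokens attend to visual features.
        self.vis_from_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vis = nn.LayerNorm(dim)
        self.norm_aud = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (B, N_pix, C) flattened visual tokens; aud: (B, N_aud, C) audio tokens.
        vis_upd, _ = self.vis_from_aud(query=vis, key=aud, value=aud)  # audio -> visual
        aud_upd, _ = self.aud_from_vis(query=aud, key=vis, value=vis)  # visual -> audio
        return self.norm_vis(vis + vis_upd), self.norm_aud(aud + aud_upd)


def interframe_consistency_loss(masks: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Illustrative adaptive inter-frame consistency term.

    masks:   (B, T, H, W) per-frame mask probabilities.
    weights: (B, T-1) adaptive weights for neighbouring frame pairs, e.g. derived from
             audio similarity between frames (an assumption, not the paper's rule).
    """
    frame_diff = (masks[:, 1:] - masks[:, :-1]).abs().mean(dim=(-2, -1))  # (B, T-1)
    return (weights * frame_diff).mean()


if __name__ == "__main__":
    # Toy shapes only, to show how the pieces fit together.
    fusion = BilateralFusionSketch(dim=256)
    vis = torch.randn(2, 1024, 256)                    # 2 clips, 32x32 visual tokens, 256-d
    aud = torch.randn(2, 5, 256)                       # 2 clips, 5 audio tokens, 256-d
    vis_out, aud_out = fusion(vis, aud)
    masks = torch.sigmoid(torch.randn(2, 5, 64, 64))   # 5 frames of 64x64 mask probabilities
    loss = interframe_consistency_loss(masks, torch.ones(2, 4))
```

The residual, two-way structure is what makes the "modality entanglement" idea concrete in this sketch: each modality both queries and is queried by the other, rather than audio acting only as a one-way prompt to the visual branch.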

Authors (8)
  1. Qi Yang (112 papers)
  2. Xing Nie (5 papers)
  3. Tong Li (197 papers)
  4. Pengfei Gao (24 papers)
  5. Ying Guo (61 papers)
  6. Cheng Zhen (9 papers)
  7. Pengfei Yan (15 papers)
  8. Shiming Xiang (54 papers)
Citations (7)
