Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation (2312.06462v2)

Published 11 Dec 2023 in cs.CV, cs.AI, cs.SD, and eess.AS

Abstract: Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundation model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss that follows the inherent temporal coherence of video. Comprehensive experiments and ablation studies on the AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Code and more results will be publicly available at https://yannqi.github.io/AVS-COMBO/.
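The abstract only names the modules, so the following is a minimal, illustrative PyTorch-style sketch of the two ideas it describes: aligning audio and visual features in both directions, and penalizing mask changes between neighbouring frames with adaptive weights. The cross-attention mechanism, the names `BilateralFusionSketch` and `interframe_consistency_loss`, and the idea of deriving the weights from audio similarity are assumptions for illustration only; they are not the paper's actual BFM or loss definition.

```python
import torch
import torch.nn as nn


class BilateralFusionSketch(nn.Module):
    """Illustrative bidirectional audio-visual fusion (an assumption, not the paper's exact BFM)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Two cross-attention blocks: one lets visual tokens attend to audio,
        # the other lets audio tokens attend to visual features.
        self.vis_from_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_vis = nn.LayerNorm(dim)
        self.norm_aud = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (B, N_pix, C) flattened visual tokens; aud: (B, N_aud, C) audio tokens.
        vis_upd, _ = self.vis_from_aud(query=vis, key=aud, value=aud)  # audio -> visual
        aud_upd, _ = self.aud_from_vis(query=aud, key=vis, value=vis)  # visual -> audio
        return self.norm_vis(vis + vis_upd), self.norm_aud(aud + aud_upd)


def interframe_consistency_loss(masks: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Illustrative adaptive inter-frame consistency term.

    masks:   (B, T, H, W) per-frame mask probabilities.
    weights: (B, T-1) adaptive weights for neighbouring frame pairs, e.g. derived from
             audio similarity between frames (an assumption, not the paper's rule).
    """
    frame_diff = (masks[:, 1:] - masks[:, :-1]).abs().mean(dim=(-2, -1))  # (B, T-1)
    return (weights * frame_diff).mean()


if __name__ == "__main__":
    # Toy shapes only, to show how the pieces fit together.
    fusion = BilateralFusionSketch(dim=256)
    vis = torch.randn(2, 1024, 256)                    # 2 clips, 32x32 visual tokens, 256-d
    aud = torch.randn(2, 5, 256)                       # 2 clips, 5 audio tokens, 256-d
    vis_out, aud_out = fusion(vis, aud)
    masks = torch.sigmoid(torch.randn(2, 5, 64, 64))   # 5 frames of 64x64 mask probabilities
    loss = interframe_consistency_loss(masks, torch.ones(2, 4))
```

The residual, two-way structure is what makes the "modality entanglement" idea concrete in this sketch: each modality both queries and is queried by the other, rather than audio acting only as a one-way prompt to the visual branch.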

Authors (8)
  1. Qi Yang (112 papers)
  2. Xing Nie (5 papers)
  3. Tong Li (197 papers)
  4. Pengfei Gao (24 papers)
  5. Ying Guo (61 papers)
  6. Cheng Zhen (9 papers)
  7. Pengfei Yan (15 papers)
  8. Shiming Xiang (54 papers)
Citations (7)
