Audio-Visual Segmentation
The paper "Audio-Visual Segmentation" presents a novel problem within the field of audio-visual data integration, specifically focusing on generating pixel-level segmentation maps for objects that produce sound within a given video frame. Recognizing the intricate interplay between audio and visual cues in dynamic environments, the research addresses a gap in existing methodologies by introducing a fine-grained approach that transcends mere sound source localization.
Core Contribution and Methods
Central to the paper's contribution is AVSBench, a benchmark curated specifically for audio-visual segmentation. It comprises two settings (a sketch of how their supervision differs follows the list):
- Semi-supervised Single Sound Source Segmentation (S4): videos contain a single sounding object; only partial annotations (in the paper, the first sampled frame of each video) are available for training, while segmentation must be produced for every frame at test time.
- Fully-supervised Multiple Sound Source Segmentation (MS3): videos feature multiple sound sources, and full per-frame annotations are provided for training.
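As a rough illustration (not the authors' code), the difference in training supervision between the two settings can be sketched as follows; the function name and the five-frame sampling are assumptions for the example:

```python
def supervised_frame_indices(setting: str, num_frames: int = 5) -> list[int]:
    """Return indices of frames whose ground-truth masks are used in training."""
    if setting == "S4":
        # Semi-supervised: only the first sampled frame of each video is annotated.
        return [0]
    if setting == "MS3":
        # Fully-supervised: every sampled frame carries a ground-truth mask.
        return list(range(num_frames))
    raise ValueError(f"unknown setting: {setting}")

print(supervised_frame_indices("S4"))   # [0]
print(supervised_frame_indices("MS3"))  # [0, 1, 2, 3, 4]
```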
The authors propose a method built around a temporal pixel-wise audio-visual interaction (TPAVI) module, which fuses the audio and visual modalities at the pixel level across frames. In addition, a regularization loss encourages the predicted masks to remain consistent with the corresponding audio signal.
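The paper's TPAVI is more involved, but a minimal, hypothetical PyTorch sketch of pixel-wise audio-visual cross-attention conveys the core idea: per-frame audio embeddings serve as keys and values that every visual pixel, across all frames, attends to. Module and variable names are illustrative and not the authors' implementation:

```python
import torch
import torch.nn as nn

class PixelWiseAudioVisualFusion(nn.Module):
    """Illustrative non-local-style cross-attention between audio and pixels."""

    def __init__(self, vis_dim: int, aud_dim: int, hid_dim: int = 256):
        super().__init__()
        self.q = nn.Conv3d(vis_dim, hid_dim, kernel_size=1)  # queries from pixels
        self.k = nn.Linear(aud_dim, hid_dim)                  # keys from audio
        self.v = nn.Linear(aud_dim, hid_dim)                  # values from audio
        self.out = nn.Conv3d(hid_dim, vis_dim, kernel_size=1)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, T, H, W) visual features; aud: (B, T, A) per-frame audio features
        B, C, T, H, W = vis.shape
        q = self.q(vis).flatten(2).transpose(1, 2)            # (B, T*H*W, hid)
        k = self.k(aud)                                       # (B, T, hid)
        v = self.v(aud)                                       # (B, T, hid)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        fused = (attn @ v).transpose(1, 2).reshape(B, -1, T, H, W)
        return vis + self.out(fused)                          # residual fusion
```

The fused features would then be passed to a segmentation decoder; the residual connection keeps the visual pathway intact when the audio cue is uninformative.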
Evaluation and Results
AVSBench serves as the basis for evaluating the proposed method against approaches from related fields, including sound source localization (SSL), video object segmentation (VOS), and salient object detection (SOD). Results are reported with the Jaccard index and the F-score, showing clear gains in both the semi-supervised (S4) and fully-supervised (MS3) settings.
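For reference, the two reported metrics can be computed for binary masks as sketched below; this is a standard formulation (with an assumed beta-squared weighting of 0.3 for the F-score), not code taken from the paper:

```python
import numpy as np

def jaccard_and_fscore(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3):
    """Jaccard index (IoU) and weighted F-score for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    jaccard = inter / union if union > 0 else 1.0

    precision = inter / pred.sum() if pred.sum() > 0 else 0.0
    recall = inter / gt.sum() if gt.sum() > 0 else 0.0
    denom = beta2 * precision + recall
    fscore = (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
    return jaccard, fscore
```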
Quantitatively, the proposed method outperforms these baselines, with the largest margins on scenes containing multiple, dynamic sound sources. Incorporating TPAVI markedly improves segmentation quality over prior systems that offer only patch-level or heat-map localization.
Implications and Future Directions
The implications of this work are twofold:
- Practical Applications: The nuanced method offers significant potential for applications in multimedia processing, video editing, and augmented reality, where precise synchronization of audio-visual data is crucial.
- Theoretical Insights: The integration of temporal interactions in pixel-wise segmentation opens avenues for further exploration of multi-modal learning paradigms in complex environments.
Looking ahead, enhancing model performance on unseen categories and expanding the dataset to include a richer variety of sound-producing objects may be worthwhile pursuits. Additionally, integrating more complex temporal dynamics could further refine segmentation accuracy.
In summary, this work marks a methodological advance in audio-visual processing and provides a robust framework for future exploration of multimodal learning and its applications.