- The paper introduces a novel AVS task that segments sound-producing objects at the pixel level using integrated audio and visual cues.
- It presents the AVSBench dataset with distinct subsets for binary and semantic segmentation in both single- and multi-source scenarios.
- The baseline method features an encoder-decoder architecture and a TPAVI module with a regularization loss that strengthens audio-visual correspondence.
Audio-Visual Segmentation with Semantics: A Comprehensive Overview
The paper "Audio-Visual Segmentation with Semantics" introduces a novel task within the field of multi-modal perception, namely Audio-Visual Segmentation (AVS). This work centers on the challenge of generating pixel-level segmentation for objects producing sound, leveraging both visual and auditory cues. Below is a detailed examination of the paper, considering its methodologies, contributions, and implications.
Problem Formulation and Dataset
The authors address the AVS task by proposing a new dataset, AVSBench, designed specifically to support research on this problem. AVSBench is organized into three subsets:
- AVSBench-object (Single-source subset, Multi-sources subset): These two subsets provide binary segmentation masks in single-source and multi-source settings, testing a model's ability to delineate the sounding object(s) in a scene.
- AVSBench-semantic: This subset extends the task to semantic segmentation, requiring the model to assign a category label to every pixel belonging to a sound-producing object.
The dataset marks a significant advance for audio-visual research: by providing pixel-level annotations in audible videos, it enables a fine-grained study of how sound and vision correlate.
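To make the difference between the binary and semantic annotations concrete, the sketch below shows how one AVSBench-style sample could be represented in code. The field names, shapes, and the per-frame VGGish-sized audio embedding are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class AVSSample:
    """Illustrative layout of one AVSBench-style training sample (assumed, not official)."""
    frames: np.ndarray                            # (T, H, W, 3) sampled video frames
    audio: np.ndarray                             # (T, 128) per-frame audio embeddings (assumed VGGish-sized)
    masks: np.ndarray                             # (T, H, W) binary masks of sounding objects (AVSBench-object)
    semantic_masks: Optional[np.ndarray] = None   # (T, H, W) per-pixel class ids (AVSBench-semantic)
```

A binary-segmentation model only consumes `masks`, while the semantic setting additionally supervises `semantic_masks` with object-category labels.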
Methodology
The authors propose a baseline method for the AVS task. The baseline is built around a temporal pixel-wise audio-visual interaction (TPAVI) module that injects audio cues into the visual segmentation pipeline. The method has three main components (a code sketch of the fusion module and the loss follows the list):
- Encoder-Decoder Architecture: The framework leverages standard convolutional or transformer-based encoders to process the visual data, alongside VGGish for audio feature extraction.
- TPAVI Module: This component performs temporal, pixel-wise interaction between the audio and visual modalities, letting every spatial location in every frame attend to the audio signal and thereby capture intricate audio-visual correlations.
- Regularization Loss: A dedicated loss term encourages an effective audio-visual mapping, pushing the features of the segmented (sounding) regions to agree with the corresponding audio features.
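The sketch below illustrates the flavor of these two ingredients in PyTorch: a cross-modal, non-local-style attention block in the spirit of TPAVI, followed by a masked-pooling regularizer that nudges the features of the predicted sounding region toward the audio embedding. The module layout, dimensions, and the KL-divergence form of the loss are assumptions made for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Non-local-style audio-visual attention in the spirit of TPAVI (sketch).

    vis: (B, T, C, H, W) visual feature maps; aud: (B, T, Ca) one audio
    embedding per frame. Every temporal-spatial position queries the
    per-frame audio embeddings and receives an audio-conditioned update.
    """

    def __init__(self, vis_dim=256, aud_dim=128, key_dim=64):
        super().__init__()
        self.query = nn.Conv1d(vis_dim, key_dim, kernel_size=1)  # visual -> query
        self.key = nn.Linear(aud_dim, key_dim)                   # audio  -> key
        self.value = nn.Linear(aud_dim, vis_dim)                 # audio  -> value
        self.out = nn.Conv1d(vis_dim, vis_dim, kernel_size=1)

    def forward(self, vis, aud):
        B, T, C, H, W = vis.shape
        v = vis.permute(0, 2, 1, 3, 4).reshape(B, C, T * H * W)  # flatten positions
        q = self.query(v).transpose(1, 2)                        # (B, T*H*W, key_dim)
        k, val = self.key(aud), self.value(aud)                  # (B, T, key_dim) / (B, T, vis_dim)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        fused = self.out((attn @ val).transpose(1, 2)) + v       # residual connection
        return fused.reshape(B, C, T, H, W).permute(0, 2, 1, 3, 4)

def avm_regularizer(fused_vis, audio_emb, pred_mask):
    """Assumed audio-visual mapping regularizer: pool fused visual features
    inside the predicted mask and match them to the (projected) audio features
    via a KL term over softmax-normalized channels."""
    masked = (fused_vis * pred_mask).sum(dim=(-1, -2))           # (B, T, C)
    area = pred_mask.sum(dim=(-1, -2)).clamp(min=1e-6)           # (B, T, 1)
    pooled = masked / area
    return F.kl_div(F.log_softmax(pooled, dim=-1),
                    F.softmax(audio_emb, dim=-1), reduction="batchmean")

# Toy usage: 5 frames, 256-d visual features at 28x28, 128-d (VGGish-sized) audio.
vis = torch.randn(2, 5, 256, 28, 28)
aud = torch.randn(2, 5, 128)
fused = CrossModalAttention()(vis, aud)                          # (2, 5, 256, 28, 28)
mask = torch.sigmoid(torch.randn(2, 5, 1, 28, 28))
loss = avm_regularizer(fused, nn.Linear(128, 256)(aud), mask)
```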
Experimental Results
The paper undertakes a thorough evaluation on the AVSBench dataset, comparing the proposed baseline against methods adapted from related fields such as Sound Source Localization (SSL) and Video Object Segmentation (VOS), with segmentation quality measured by mean IoU (Jaccard index) and F-score. The proposed method, aided by the TPAVI module, achieves superior performance, particularly in the more challenging multi-source scenarios.
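For reference, these two scores can be computed per frame from binary masks as in the minimal sketch below; the β² = 0.3 weighting in the F-measure is a common convention in the segmentation literature, assumed here rather than taken from the paper.

```python
import numpy as np

def binary_iou(pred, gt, eps=1e-6):
    """Jaccard index (IoU) of two boolean masks; averaging over frames gives mIoU."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def f_score(pred, gt, beta2=0.3, eps=1e-6):
    """F-measure combining mask precision and recall (beta^2 = 0.3 assumed)."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)

# Example: a predicted mask versus ground truth on a 224x224 frame.
pred = np.zeros((224, 224), dtype=bool); pred[50:150, 50:150] = True
gt = np.zeros((224, 224), dtype=bool);   gt[60:160, 60:160] = True
print(binary_iou(pred, gt), f_score(pred, gt))
```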
Implications and Future Directions
The implications of this research extend into both theoretical and practical dimensions:
- Theoretical Implications: The task of AVS prompts a reconsideration of multi-modal learning paradigms, emphasizing the nuanced interplay between audio and visual signals. The integration of semantic segmentation further advances the complexity and applicability of these models.
- Practical Applications: This research facilitates enhancements in fields like autonomous systems, surveillance, and media editing, where discerning sound sources at a granular level can provide substantial benefits.
The authors speculate that future developments could focus on scaling the dataset and exploring synthetic data for pre-training, thus broadening the potential for robust audio-visual learning.
In summary, "Audio-Visual Segmentation with Semantics" provides a foundational framework for addressing intricate multi-modal segmentation tasks. Through the creation of AVSBench and the introduction of a capable baseline methodology, this paper lays the groundwork for advances in understanding and leveraging the symbiosis of auditory and visual modalities in machine perception tasks.