Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching (2010.05466v1)

Published 12 Oct 2020 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: Discriminatively localizing sounding objects in cocktail-party, i.e., mixed sound scenes, is commonplace for humans, but still challenging for machines. In this paper, we propose a two-stage learning framework to perform self-supervised class-aware sounding object localization. First, we propose to learn robust object representations by aggregating the candidate sound localization results in the single source scenes. Then, class-aware object localization maps are generated in the cocktail-party scenarios by referring the pre-learned object knowledge, and the sounding objects are accordingly selected by matching audio and visual object category distributions, where the audiovisual consistency is viewed as the self-supervised signal. Experimental results in both realistic and synthesized cocktail-party videos demonstrate that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes. Code is available at https://github.com/DTaoo/Discriminative-Sounding-Objects-Localization.

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

The paper "Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching" addresses the challenge of identifying and locating sounding objects within complex auditory and visual environments, akin to the "cocktail-party" effect which humans adeptly navigate. The research introduces a sophisticated two-stage framework aimed at class-aware localization of sounding objects using self-supervised learning techniques.

Key Contributions

  1. Two-stage Learning Framework: The framework is split into two learning stages: it first derives robust object representations by aggregating candidate sound localization results in constrained, single-source scenarios. These representations then inform class-aware localization maps in cocktail-party settings, leveraging the pre-learned object knowledge.
  2. Audiovisual Consistency as Self-supervision: The method uses audiovisual consistency as its supervisory signal. In the absence of semantic annotations, matching the category distributions of the visual and audio inputs refines the localization of sounding objects across classes (see the sketch after this list).
  3. Experimental Validation: Through extensive testing on both realistic and synthesized cocktail-party videos, the approach is shown to surpass existing methods in filtering out silent objects and accurately pinpointing the locations of sounding objects of different classes.
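
The following is a minimal, illustrative sketch (not the authors' exact implementation) of how audiovisual category-distribution matching can serve as a self-supervised signal. The pooled visual localization maps, the `audio_classifier` head, and the KL objective are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def distribution_matching_loss(localization_maps, audio_embedding, audio_classifier):
    """
    localization_maps: (B, K, H, W) class-aware localization maps for K categories.
    audio_embedding:   (B, D) pooled audio features for the same clips.
    audio_classifier:  module mapping (B, D) -> (B, K) category logits (assumed).
    """
    # Visual category distribution: spatially pool each class map, then normalize.
    visual_logits = localization_maps.flatten(2).mean(dim=2)   # (B, K)
    p_visual = F.softmax(visual_logits, dim=1)

    # Audio category distribution predicted from the sound alone (log-probabilities).
    log_p_audio = F.log_softmax(audio_classifier(audio_embedding), dim=1)

    # Penalize disagreement between the two distributions; their consistency
    # acts as the supervisory signal in place of semantic labels.
    return F.kl_div(log_p_audio, p_visual, reduction="batchmean")
```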

Methodological Insights

  • Self-supervised Audiovisual Matching: The approach circumvents the need for manual annotations, which are typically essential in similar tasks. By exploiting audiovisual matching, the framework achieves localization through the inherent correspondence between audio and visual inputs, echoing human perceptual capabilities.
  • Dictionary Learning for Object Representation: The framework builds an object representation dictionary via clustering over the single-source audiovisual pairs from the first stage, enabling class-aware object detection in more complex scenes (see the sketch after this list).
  • Performance on Diverse Datasets: Evaluations on the MUSIC and AudioSet-instrument datasets substantiate the approach's efficacy, showing notable improvements in class-aware IoU and NSA metrics over state-of-the-art methods.
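
Below is a hedged sketch of the dictionary-building step, assuming candidate object features have already been pooled from single-source clips (e.g., visual features at the peak of each coarse localization map). The use of k-means and the helper name `build_object_dictionary` are illustrative choices, not the paper's specified procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_object_dictionary(candidate_features: np.ndarray, num_categories: int) -> np.ndarray:
    """
    candidate_features: (N, D) array, one pooled feature per single-source clip.
    num_categories:     assumed number of potential object classes K.
    Returns a (K, D) dictionary whose rows act as class-level object keys.
    """
    kmeans = KMeans(n_clusters=num_categories, n_init=10, random_state=0)
    kmeans.fit(candidate_features)
    return kmeans.cluster_centers_

# At inference on cocktail-party frames, similarity between spatial visual features
# and each dictionary entry yields one class-aware localization map per category.
```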

Implications and Future Directions

The implications of this research span both practical applications and theoretical advances in AI. Practically, the method is a stepping stone toward systems, such as autonomous vehicles or assistive devices for the hearing impaired, that must interact with environments containing many concurrent audio and visual signals. Theoretically, it underscores the potential of self-supervised learning frameworks to handle complex multimodal sensory data without relying on extensive labeled datasets.

Future research could extend this work by exploring adaptive learning mechanisms that further improve localization accuracy in entirely novel environments, or by moving beyond a single-modality dictionary toward a more integrated multimodal representation. Additionally, coupling the approach with real-time processing could broaden its applicability to interactive, dynamic real-world systems.

The work establishes a robust foundation for further exploration into human-like machine perception, contributing significantly to the ongoing development of artificial intelligence in visually and acoustically complex settings.

Authors (8)
  1. Di Hu (88 papers)
  2. Rui Qian (50 papers)
  3. Minyue Jiang (6 papers)
  4. Xiao Tan (75 papers)
  5. Shilei Wen (42 papers)
  6. Errui Ding (156 papers)
  7. Weiyao Lin (87 papers)
  8. Dejing Dou (112 papers)
Citations (125)