Self-Supervised Predictive Learning for Sound Source Localization
The paper "Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes" presents a novel approach to the task of sound source localization within visual environments, which involves identifying the objects in a scene that are responsible for producing sound. Traditional methods often employ contrastive learning frameworks that rely on both positive and negative pairs for alignment, yet the random sampling of negative pairs can lead to mismatches in audio-visual features and ambiguity in localization. This paper proposes a new methodology termed Self-Supervised Predictive Learning (SSPL), a strategy that pursues sound localization by emphasizing explicit positive mining.
The SSPL approach diverges from conventional methods by eliminating negative sampling altogether, thereby avoiding false negatives: sampled sounds that are treated as negatives but actually belong to the same category as the positive, which causes semantic misalignment in the feature space. Instead, SSPL optimizes positive pairs only, using a three-stream network that aligns audio cues with two augmented views of the corresponding video frame.
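The positive-only objective follows the spirit of SimSiam-style learning with a predictor head and a stop-gradient. The sketch below is a minimal interpretation under that assumption; the module names, dimensions, and symmetrized loss are illustrative, and in SSPL the two inputs would correspond to audio-guided features of the two visual views.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def negative_free_loss(p, z):
    """Negative cosine similarity with stop-gradient on the target.

    p: prediction from one stream, z: representation from the other.
    Detaching z prevents the trivial collapse that negatives usually
    guard against, so no negative pairs are needed.
    """
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

class PositivePairHead(nn.Module):
    """Illustrative SimSiam-style head over two views (shapes assumed)."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.predictor = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, feat_view1, feat_view2):
        z1, z2 = self.projector(feat_view1), self.projector(feat_view2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Symmetrized loss over the two augmented views.
        return 0.5 * (negative_free_loss(p1, z2) + negative_free_loss(p2, z1))
```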
A distinctive component of this architecture is a novel predictive coding module (PCM), which aligns audio and visual features progressively over several iterative refinement steps, easing the difficulty typically associated with positive-only learning. Through this mechanism, SSPL couples the visual and auditory modalities in a manner inspired by predictive coding theories from cognitive neuroscience.
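The paper's PCM is more elaborate than a few lines, so the following is only a toy predictive-coding loop under common assumptions from that literature: a feedback layer predicts the audio feature from a latent representation, and the feedforward prediction error drives the iterative update. The layer names, step count, and update rate are hypothetical.

```python
import torch
import torch.nn as nn

class PredictiveCodingAlign(nn.Module):
    """Toy predictive-coding update loop (an interpretation, not the
    paper's exact PCM).

    A latent representation r, initialized from the visual feature, is
    refined over several steps so that its top-down prediction matches
    the audio feature.
    """
    def __init__(self, dim=512, steps=4, lr=0.1):
        super().__init__()
        self.feedback = nn.Linear(dim, dim)     # prediction: r -> audio space
        self.feedforward = nn.Linear(dim, dim)  # error -> representation space
        self.steps, self.lr = steps, lr

    def forward(self, visual_feat, audio_feat):
        r = visual_feat
        for _ in range(self.steps):
            pred = self.feedback(r)                     # top-down prediction of audio
            error = audio_feat - pred                   # prediction error
            r = r + self.lr * self.feedforward(error)   # error-driven refinement
        return r  # audio-aligned visual representation
```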
SSPL is benchmarked against state-of-the-art methods on two datasets, SoundNet-Flickr and VGG-Sound Source. The results show that SSPL outperforms existing approaches by a clear margin, with improvements of up to 8.6% in consensus Intersection over Union (cIoU) and 3.4% in Area Under the Curve (AUC) on SoundNet-Flickr. These gains underscore the efficacy of a negative-free approach for sound source localization.
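For readers unfamiliar with the metrics, the sketch below computes a simplified, unweighted version of cIoU (the published metric weights the ground truth by annotator consensus) together with the AUC of the success-rate curve; the thresholds are assumptions.

```python
import numpy as np

def ciou(pred_map, gt_consensus, pred_thresh=0.5):
    """Simplified consensus IoU between a predicted localization map
    and a consensus ground-truth map, both with values in [0, 1]."""
    pred = pred_map >= pred_thresh          # binarize the prediction
    gt = gt_consensus > 0                   # any annotator agreement counts
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def auc(cious, thresholds=np.linspace(0.0, 1.0, 21)):
    """Area under the success-rate curve: the fraction of samples whose
    cIoU exceeds each threshold, averaged over the thresholds."""
    success = [(np.asarray(cious) >= t).mean() for t in thresholds]
    return float(np.mean(success))
```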
The implications of these findings are twofold. Practically, SSPL's framework can improve applications that depend on accurate sound source localization, from multimedia interfaces to robotics. Theoretically, it prompts a re-evaluation of self-supervised learning strategies for multi-modal tasks, arguing for less dependence on negative pairs, which have historically been central to contrastive learning.
Looking forward, the methodology opens avenues for extending self-supervised learning frameworks to other audio-visual tasks and to complex environments with multiple sound sources. SSPL shows promising results for localizing a single sound source, but further research is needed to handle scenes where multiple auditory events occur simultaneously. Integrating SSPL with additional self-supervised modalities, such as text or tactile data, could also broaden its applicability in multi-sensory processing architectures.