- The paper introduces a self-supervised method that fuses audio and visual streams using temporal misalignment detection to learn robust multisensory features.
- The paper leverages a 3D CNN for early fusion, achieving 82.1% accuracy on UCF-101 and generating effective attention maps for sound source localization.
- The paper demonstrates on/off-screen audio source separation that outperforms both audio-only and prior audio-visual methods on SDR, SIR, and SAR.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
This paper by Andrew Owens and Alexei A. Efros proposes a method for joint modeling of audio and visual components of video signals using a fused multisensory representation. The presented model is trained in a self-supervised manner to detect temporal alignment between the audio and visual streams of a video, thereby learning multisensory features that are useful for several downstream tasks.
Methodology
The model is trained on raw audio and video from unlabeled videos: in some training examples the two streams are temporally aligned, while in others the audio is synthetically shifted by a few seconds. The network's task is to distinguish aligned from misaligned examples. This pretext task forces the network to integrate visual and audio information, leading to the discovery of useful audiovisual features.
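The alignment pretext task reduces to a binary classification objective. The PyTorch sketch below shows one way to build aligned and shifted training examples and score them with binary cross-entropy; the shift range, the circular shift used to misalign the audio, and the `model(video, audio)` interface are assumptions of this sketch rather than details taken from the paper.

```python
import random

import torch
import torch.nn.functional as F


def make_example(video, audio, audio_rate, min_shift_s=0.5, max_shift_s=4.0):
    """Return (video, audio, label): label 1.0 for aligned streams, 0.0 for shifted.

    `audio` is the waveform matching `video`; the shift range and the circular
    shift below are assumptions of this sketch, not the paper's exact scheme.
    """
    if random.random() < 0.5:
        return video, audio, 1.0  # keep the natural alignment
    # Misaligned example: offset the audio by a few seconds in either direction.
    shift_s = random.uniform(min_shift_s, max_shift_s) * random.choice([-1.0, 1.0])
    shifted = torch.roll(audio, shifts=int(shift_s * audio_rate), dims=-1)
    return video, shifted, 0.0


def alignment_loss(model, video, audio, label):
    """Binary cross-entropy on the model's scalar alignment logit."""
    logit = model(video, audio)  # fused audio-visual network -> (B,) logits
    target = torch.full_like(logit, label)
    return F.binary_cross_entropy_with_logits(logit, target)
```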
The architecture is a 3D convolutional neural network (CNN) that fuses the video and audio streams early in the network, so that actions producing signals in both modalities can be modeled jointly. Each stream is first processed by its own subnetwork, which reduces its temporal sampling rate; once the two streams reach a common rate, their features are fused and processed together.
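To make the fusion step concrete, here is a minimal early-fusion network in the same spirit: a 3D convolutional video branch, a 1D convolutional branch over the raw waveform, and a fused trunk operating on concatenated features. The layer counts and sizes are placeholders, and adaptive pooling stands in for the strided downsampling that matches the audio rate to the video feature rate; the paper's actual network is deeper and structured differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyFusionNet(nn.Module):
    """Minimal early-fusion audio-visual network (a sketch, not the paper's exact layers)."""

    def __init__(self, vid_ch=64, aud_ch=64, fused_ch=128):
        super().__init__()
        # Video branch: 3D convolutions that also stride over time.
        self.video = nn.Sequential(
            nn.Conv3d(3, vid_ch, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(vid_ch), nn.ReLU(inplace=True),
        )
        # Audio branch: 1D convolutions over the raw waveform.
        self.audio = nn.Sequential(
            nn.Conv1d(1, aud_ch, kernel_size=65, stride=4, padding=32),
            nn.BatchNorm1d(aud_ch), nn.ReLU(inplace=True),
            nn.Conv1d(aud_ch, aud_ch, kernel_size=15, stride=4, padding=7),
            nn.BatchNorm1d(aud_ch), nn.ReLU(inplace=True),
        )
        # Fused trunk and alignment head.
        self.fused = nn.Sequential(
            nn.Conv3d(vid_ch + aud_ch, fused_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(fused_ch), nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(fused_ch, 1)

    def forward(self, video, audio):
        # video: (B, 3, T, H, W) clip; audio: (B, 1, L) raw waveform.
        v = self.video(video)                     # (B, Cv, T', H', W')
        a = self.audio(audio)                     # (B, Ca, L')
        # Match the audio temporal rate to the video feature rate before fusing.
        a = F.adaptive_avg_pool1d(a, v.shape[2])  # (B, Ca, T')
        a = a[..., None, None].expand(-1, -1, -1, v.shape[3], v.shape[4])
        x = self.fused(torch.cat([v, a], dim=1))  # early fusion by concatenation
        logit = self.head(x.mean(dim=(2, 3, 4)))  # global average pool -> alignment logit
        return logit.squeeze(1)
```

With inputs of shape (B, 3, T, H, W) for the clip and (B, 1, L) for the waveform, the forward pass returns one alignment logit per clip, which plugs directly into the loss sketched earlier.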
Applications and Results
- Sound Source Localization:
- The network's learned attention maps can visualize the sources of sound in a video.
- The strongest class activation map (CAM) responses concentrate on faces and mouths, which are highly correlated with speech production.
- The paper presents qualitative visualizations showing where the network attends for a variety of audiovisual stimuli.
- Audio-Visual Action Recognition:
- The self-supervised representation is fine-tuned for action recognition on the UCF-101 dataset.
- The multisensory model achieves 82.1% accuracy, significantly outperforming other self-supervised pretraining methods.
- On/Off-Screen Audio Source Separation:
- The model separates on-screen and off-screen audio using a U-Net encoder-decoder conditioned on features from the multisensory network (a minimal sketch of this design follows the list below).
- Results show effective separation on both synthetic mixtures and real-world videos, including simultaneous translations and interviews.
- The model outperforms both audio-only and existing audio-visual separation methods on standard metrics, including signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR).
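As a rough illustration of the separation setup referenced above, the sketch below conditions a tiny U-Net over the mixture spectrogram on a pooled audio-visual feature vector and predicts two masks, one for on-screen and one for off-screen sound. The layer sizes, the single pooled conditioning vector, and the sigmoid masking are assumptions chosen to keep the example short; the paper's network is larger and injects the multisensory features differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )


class SeparationUNet(nn.Module):
    """Tiny U-Net over the mixture spectrogram, conditioned on audio-visual features.

    Predicts on-screen and off-screen masks; all sizes are illustrative only.
    """

    def __init__(self, feat_dim=128, base=32):
        super().__init__()
        self.enc1 = conv_block(1, base)
        self.enc2 = conv_block(base, base * 2)
        self.bottleneck = conv_block(base * 2 + feat_dim, base * 2)
        self.dec2 = conv_block(base * 4, base)
        self.dec1 = conv_block(base * 2, base)
        self.out = nn.Conv2d(base, 2, kernel_size=1)  # one mask per source

    def forward(self, spec, av_feat):
        # spec: (B, 1, F, T) mixture spectrogram; av_feat: (B, feat_dim) pooled
        # multisensory features (a simplification of how the paper conditions).
        e1 = self.enc1(spec)
        e2 = self.enc2(F.max_pool2d(e1, 2))
        h = F.max_pool2d(e2, 2)
        cond = av_feat[:, :, None, None].expand(-1, -1, h.shape[2], h.shape[3])
        h = self.bottleneck(torch.cat([h, cond], dim=1))
        h = F.interpolate(h, size=e2.shape[2:], mode="nearest")
        h = self.dec2(torch.cat([h, e2], dim=1))
        h = F.interpolate(h, size=e1.shape[2:], mode="nearest")
        h = self.dec1(torch.cat([h, e1], dim=1))
        masks = torch.sigmoid(self.out(h))   # (B, 2, F, T), values in [0, 1]
        on_screen = masks[:, 0:1] * spec     # estimated on-screen spectrogram
        off_screen = masks[:, 1:2] * spec    # estimated off-screen spectrogram
        return on_screen, off_screen
```

Predicting multiplicative masks rather than raw spectrograms keeps the estimates tied to the observed mixture, a common design choice in source separation.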
Key Insights and Implications
The findings highlight the potential of self-supervised methods in learning robust multisensory representations without manually labeled data. The use of a temporal misalignment detection task provides a challenging and effective signal for the network to learn intricate correlations between audio and visual streams. The proposed fused network architecture demonstrates the importance of early fusion for effectively modeling audiovisual information.
Future Directions
The research opens up several avenues for further exploration:
- Developing alternative self-supervised tasks that build upon or complement the temporal alignment objective.
- Extending the applicability of the learned multisensory representation to other audio-visual tasks beyond those discussed.
- Improving the model's ability to handle more complex and dynamic environments, potentially incorporating additional sensory modalities.
Given its favorable results in sound localization, action recognition, and source separation, this approach could inspire broader adoption of self-supervised learning techniques in multisensory data analysis, ultimately enhancing the robustness and versatility of AI systems in understanding and interacting with our multisensory world.