Visually Indicated Sounds (1512.08512v2)

Published 28 Dec 2015 in cs.CV, cs.LG, and cs.SD

Abstract: Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. In this paper, we propose the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene. We present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We show that the sounds predicted by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that they convey significant information about material properties and physical interactions.

Citations (366)

Summary

  • The paper introduces a novel RNN-based method to synthesize impact sounds from silent videos.
  • It learns from unlabeled audiovisual data, matching predicted sound features to training exemplars for enhanced realism.
  • Validation through human studies and auditory metrics confirms the model’s ability to capture key sound characteristics.

Visually Indicated Sounds: Synthesizing Sounds from Silent Videos

The paper by Owens et al. explores the intriguing problem of synthesizing impact sounds from silent videos, a task that implicitly involves understanding material properties and physical dynamics within a visual scene. The authors propose a computational model that predicts the sounds generated when an object is struck, leveraging visual input to synthesize corresponding audio outputs. This research presents an insightful intersection of computer vision and audio analysis.

Methodology and Approach

The core methodology involves a recurrent neural network (RNN) that predicts sound features from video frames. Given silent footage of people hitting and scratching objects with a drumstick, the model predicts audio features over time and then converts these features into a waveform through either parametric or example-based synthesis; in the example-based case, the predicted features are matched to exemplars in the training set, whose recorded sounds are then transferred to the new video, facilitating realistic sound generation.
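To make the pipeline concrete, the sketch below shows one plausible way to structure the feature predictor and the example-based synthesis step. The framework (PyTorch), layer sizes, feature dimensions, and the assumption that exemplars share the prediction's length are illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch of a video-to-sound-features pipeline, under assumptions
# stated in the text above (not the authors' exact architecture).
import torch
import torch.nn as nn

class SoundFeaturePredictor(nn.Module):
    def __init__(self, frame_feat_dim=4096, hidden_dim=256, sound_feat_dim=42):
        super().__init__()
        # An LSTM consumes per-frame visual features (e.g., from a pretrained CNN)
        self.rnn = nn.LSTM(frame_feat_dim, hidden_dim, batch_first=True)
        # A linear head maps each hidden state to a sound-feature vector
        self.head = nn.Linear(hidden_dim, sound_feat_dim)

    def forward(self, frame_features):        # (batch, time, frame_feat_dim)
        hidden, _ = self.rnn(frame_features)  # (batch, time, hidden_dim)
        return self.head(hidden)              # (batch, time, sound_feat_dim)

def example_based_synthesis(predicted, exemplar_feats, exemplar_waveforms):
    """Return the training waveform whose sound features best match the prediction.

    predicted: (time, sound_feat_dim)
    exemplar_feats: (num_exemplars, time, sound_feat_dim), same length as predicted
    exemplar_waveforms: list of audio arrays aligned with exemplar_feats
    """
    dists = ((exemplar_feats - predicted.unsqueeze(0)) ** 2).sum(dim=(1, 2))
    best = torch.argmin(dists).item()
    return exemplar_waveforms[best]
```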

Importantly, the model does not rely on manual annotation of material properties for training, instead discovering statistical correlations within the audiovisual data alone. This aligns with broader trends in representation learning, where unsupervised methods derive interpretable features without direct supervision.

Experimental Validation

The effectiveness of the sound synthesis was assessed through a combination of human psychophysical studies and automated metrics. The model's output sounds were sufficiently realistic to deceive human participants in a "real or fake" experiment, demonstrating its capability to generate credible audio cues. The model's performance on auditory attributes, such as loudness and spectral centroid, further corroborates its ability to capture essential sound characteristics.
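For reference, the two auditory attributes mentioned above can be computed roughly as follows; using RMS energy as a loudness proxy is an assumption here, and the paper's exact evaluation definitions may differ.

```python
# Illustrative computation of simple auditory attributes (assumed definitions).
import numpy as np

def rms_loudness(waveform):
    """Root-mean-square energy as a simple loudness proxy."""
    return np.sqrt(np.mean(waveform ** 2))

def spectral_centroid(waveform, sample_rate):
    """Magnitude-weighted mean frequency of the spectrum."""
    spectrum = np.abs(np.fft.rfft(waveform))
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
```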

Moreover, the research introduces the "Greatest Hits: Volume 1" dataset, a substantial compilation of videos specifically curated for studying the synthesis of visually indicated sounds. Spanning a variety of materials and environments, this dataset enables comprehensive analysis and testing of the proposed approach.

Implications and Future Directions

This approach has implications both for improving sound-related tasks in computer vision and for inspiring new multisensory learning models. Practically, the research could enhance applications in automated media generation and virtual reality by providing synchronized audio-visual experiences. Theoretically, it suggests a novel proxy task for training feature representations that capture implicit physical properties of objects.

Speculatively, future developments could extend the approach to more complex scenarios involving dynamic interactions or different auditory environments. Further exploration could involve integrating more sophisticated physics simulations or expanding the model's capability to generalize across varied sound-producing actions. Additionally, examining cross-modal learning architectures that jointly optimize for both audio and visual information could yield even richer models of physical interaction.

Conclusion

Owens et al. present a compelling paper on generating impact sounds from silent videos, a task that not only requires understanding of the visual scene but also translation into the auditory domain. Through careful design and validation, the research outlines a model with promising applications and lays the groundwork for further studies bridging vision and sound. This work represents a significant step towards integrating sensory modalities in AI, with broad implications for both technical development and theoretical understanding.
