Visual to Sound: Generating Natural Sound for Videos in the Wild

Published 4 Dec 2017 in cs.CV (arXiv:1712.01393v2)

Abstract: As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.

Citations (194)

Summary

  • The paper proposes a deep learning approach that pairs a video encoder with a sound generator to synthesize natural audio for videos in the wild, and introduces the VEGAS dataset for training.
  • Three model variants are presented; the flow-based variant, which incorporates optical flow features, achieves the best temporal alignment and audio accuracy.
  • Quantitative metrics and human judgments confirm that the model learns meaningful audio-visual correspondences, demonstrating clear potential for virtual reality and accessibility applications.

The paper, "Visual to Sound: Generating Natural Sound for Videos in the Wild," addresses the problem of generating sound from visual input, expanding the capabilities of audiovisual processing models with applications in virtual reality and accessibility. The study presents a methodology for producing realistic audio synchronously aligned with video content from a diverse dataset of scenes containing natural sounds. Such advancements could automate audio generation for immersive virtual settings and enhance experiences for individuals with visual impairments by translating visual cues into audio.

Methodology

The authors propose three variations of a deep learning model utilizing a video encoder and sound generator architecture:

  1. Frame-to-Frame Method: This approach encodes each video frame with a VGG19 network and concatenates the resulting features with the inputs to the coarsest tier of SampleRNN, a hierarchical recurrent neural network (RNN) for audio generation. Because the audio sequence is much longer than the frame sequence, the frame representations are repeated to match the audio rate.
  2. Sequence-to-Sequence Method: This model employs an RNN to encode the sequence of video frame features into a compact representation, which initializes the hidden states of the sound generator's coarsest RNN tier. This method aims to learn the complex alignment between audio and visual modalities implicitly.
  3. Flow-Based Method: To capitalize on motion cues, this variation adds optical-flow-based features that capture the subtle movements critical for sound synchronization. The flow features are combined with the VGG19 frame encodings and processed by an RNN before conditioning the generator; a minimal sketch of this conditioning scheme follows the list. This method yielded the best temporal alignment of generated sounds.
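
To make the conditioning path concrete, the following is a minimal PyTorch sketch, not the authors' released code: the feature dimensions, the GRU encoder, and the per-frame repetition factor are assumptions chosen for illustration.

```python
# Minimal sketch of the flow-based conditioning path (assumed dimensions, not
# the authors' implementation): per-frame VGG19 appearance features and optical
# flow features are concatenated, encoded by an RNN over time, and each frame's
# encoding is repeated so it lines up with the audio samples it conditions.
import torch
import torch.nn as nn


class FlowBasedVideoEncoder(nn.Module):
    def __init__(self, appearance_dim=4096, flow_dim=1024, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRU(appearance_dim + flow_dim, hidden_dim, batch_first=True)

    def forward(self, vgg_feats, flow_feats, samples_per_frame):
        # vgg_feats:  (batch, num_frames, appearance_dim)
        # flow_feats: (batch, num_frames, flow_dim)
        x = torch.cat([vgg_feats, flow_feats], dim=-1)
        h, _ = self.rnn(x)                          # (batch, num_frames, hidden_dim)
        # Repeat each frame encoding to match the much longer audio sequence,
        # mirroring the frame-repetition trick of the frame-to-frame method.
        return h.repeat_interleave(samples_per_frame, dim=1)


# Illustrative shapes only: 21 frames of features, ~760 audio samples per frame at 16 kHz.
encoder = FlowBasedVideoEncoder()
cond = encoder(torch.randn(2, 21, 4096), torch.randn(2, 21, 1024), samples_per_frame=760)
print(cond.shape)  # torch.Size([2, 15960, 512])
```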

The sound generation is powered by SampleRNN, chosen for its ability to handle long sequences via its hierarchical structure. This generator synthesizes audio at a high sampling rate (16 kHz), producing raw waveform samples directly, which translates to a more authentic and synchronized soundscape.
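
On the generator side, the sketch below is a deliberately simplified, single-tier stand-in for the conditioned SampleRNN (the real model stacks several tiers running at different clock rates). It shows only how per-sample visual conditioning and a cross-entropy objective over quantized samples fit together; all sizes are assumptions.

```python
# Simplified, single-tier stand-in for the conditioned generator (SampleRNN
# itself is multi-tier); it predicts the next quantized audio sample given the
# previous sample and a per-sample visual conditioning vector.
import torch
import torch.nn as nn


class ConditionedSampleRNN(nn.Module):
    def __init__(self, cond_dim=512, embed_dim=256, hidden_dim=1024, quant_levels=256):
        super().__init__()
        self.embed = nn.Embedding(quant_levels, embed_dim)   # quantized (8-bit) samples
        self.rnn = nn.GRU(embed_dim + cond_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, quant_levels)

    def forward(self, prev_samples, cond):
        # prev_samples: (batch, T) integer sample values in [0, quant_levels)
        # cond:         (batch, T, cond_dim) upsampled visual conditioning
        x = torch.cat([self.embed(prev_samples), cond], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)                                   # (batch, T, quant_levels) logits


# Teacher-forced training step on random data, just to show the objective.
model = ConditionedSampleRNN()
audio = torch.randint(0, 256, (2, 1601))                     # quantized waveform
cond = torch.randn(2, 1600, 512)
logits = model(audio[:, :-1], cond)                          # predict each next sample
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 256), audio[:, 1:].reshape(-1))
```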

Dataset

The study constructs the Visually Engaged and Grounded AudioSet (VEGAS), a dataset curated from AudioSet and filtered for tasks requiring precise video-to-audio correspondence. It comprises 28,109 short video clips across 10 categories, with human verification ensuring that the labeled sound is actually present and corresponds to the visible content in each clip.
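
As a rough illustration of how a VEGAS-style clip could be prepared for such a model, the sketch below decodes a video, resamples its audio to 16 kHz, and quantizes the waveform with 8-bit mu-law so it can serve as a cross-entropy target. This is a plausible preprocessing path assumed for illustration, not the dataset's released tooling.

```python
# Plausible preprocessing for one clip (assumed, not the VEGAS release scripts):
# decode frames and audio, mix audio to mono, resample to 16 kHz, and quantize
# with 8-bit mu-law for the cross-entropy objective.
import torchaudio
from torchvision.io import read_video


def load_clip(path, target_sr=16000, quant_levels=256):
    frames, audio, info = read_video(path, pts_unit="sec")   # frames: (T, H, W, C) uint8
    audio = audio.float().mean(dim=0)                        # (channels, samples) -> mono
    audio = torchaudio.functional.resample(audio, int(info["audio_fps"]), target_sr)
    audio = audio / audio.abs().max().clamp(min=1e-8)        # normalize to [-1, 1]
    quantized = torchaudio.functional.mu_law_encoding(audio, quant_levels)
    return frames, quantized                                 # inputs for the encoder / generator
```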

Evaluation

The study evaluates the models using a combination of numerical assessments and human judgment:

  • Loss Metrics: Cross-entropy loss benchmarks reveal the flow-based method's competitive performance in generating coherent audio.
  • Retrieval Task: For further assessment, a retrieval experiment was designed in which visual features were used to identify the correct audio clip from a pool of candidates; one possible scoring scheme is sketched after this list. The flow-based model aligned audio with visual input most accurately.
  • Human Evaluation: Through subjective assessments on video-audio correctness and synchronization, the flow-based method consistently outperformed others, indicating its superior capability in generating temporally coherent sounds.
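
Assuming the generator exposes a per-sample cross-entropy as in the earlier sketches, one natural way to implement the retrieval check is to score every candidate clip against the conditioning from a single video and ask whether the true clip receives the lowest loss. The sketch below follows that idea; it is not the paper's exact protocol, and `model` and `cond` refer to the hypothetical objects defined above.

```python
# Hedged sketch of a retrieval-style check (not the paper's exact protocol):
# rank candidate audio clips by the video-conditioned cross-entropy and report
# whether the true clip comes first (recall@1 for a single query).
import torch
import torch.nn.functional as F


def retrieval_hit_at_1(model, cond, candidate_audios, true_index):
    # cond:             (1, T, cond_dim) conditioning derived from one video
    # candidate_audios: list of (1, T) quantized clips, one of which truly matches
    losses = []
    with torch.no_grad():
        for audio in candidate_audios:
            logits = model(audio[:, :-1], cond[:, 1:])        # next-sample predictions
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   audio[:, 1:].reshape(-1))
            losses.append(loss.item())
    best = min(range(len(losses)), key=losses.__getitem__)
    return int(best == true_index)                            # 1 if the true clip ranks first
```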

Implications and Future Directions

The results demonstrate that neural networks can learn the intricate connection between video and audio, supporting applications across multiple domains. For virtual reality, the implications are substantial, since sound for immersive scenes could be generated automatically rather than authored by hand. Moreover, the ability to convert visuals into auditory information could greatly improve quality of life and communication for people with visual impairments.

Future research could explore integrating higher-level object recognition and reasoning, thereby enriching sound generation by adding contextual awareness that spans beyond immediate visual cues. Additionally, extending these methods to work on multisensory integration tasks could improve the model's ability to operate in more complex, dynamic environments.

This paper contributes significantly to the field of audiovisual processing, providing a foundation and framework for subsequent studies aimed at further uniting the senses through computational means.
