- The paper introduces a self-supervised deep learning method to generate spatial audio from monaural audio and 360-degree video.
- The approach employs a neural network architecture for audio-visual analysis, sound source separation, localization, and first-order ambisonics synthesis.
- Experiments show the method's effectiveness and superiority over baselines, using novel datasets and paving the way for enhanced immersive audio applications.
Self-Supervised Generation of Spatial Audio for 360 Video
The paper introduces a novel approach for converting monaural audio recorded with 360-degree video into spatial audio. This task is pivotal for enhancing the immersive experience of 360-degree video consumption, where spatial audio plays a critical role in rendering a realistic auditory environment. The proposed method addresses the scarcity of consumer-grade spatial audio recording hardware by leveraging deep learning to generate spatial audio from commonly available mono recordings, using the accompanying 360-degree video as guidance.
Overview
The authors emphasize the importance of spatial audio in 360-degree video applications, citing its ability to significantly enhance the viewer's immersive experience. Despite the growing popularity of 360-degree video, the authors point out the scarcity of spatial audio content due to technical challenges and costs associated with capturing spatial audio. This paper presents a self-supervised learning approach to bridge this gap.
The supervision signal comes from existing recordings that already contain spatial audio: during training, the spatial track is downmixed to mono and fed to the network together with the corresponding video frames, while the original spatial recording serves as the target. No human annotation is required, which is what makes the approach self-supervised. The task decomposes into two main challenges: separating the individual sound sources and localizing each source on the viewing sphere.
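To make the training setup concrete, the sketch below shows how a self-supervised (input, target) pair could be derived from a recording that already carries first-order ambisonics. The ambiX channel ordering (W, Y, Z, X), the use of the W channel as the mono input, and the function name make_training_pair are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_training_pair(foa: np.ndarray):
    """Build a self-supervised (input, target) pair from a first-order
    ambisonics recording.

    foa: array of shape (4, T), assumed to be in ambiX order (W, Y, Z, X).

    The omnidirectional W channel is used as the mono input here; the full
    FOA recording is the regression target. (Illustrative choice; the
    paper's exact downmix may differ.)
    """
    mono_input = foa[0]   # W channel ~ omnidirectional mixture of all sources
    target = foa          # original spatial recording supervises training
    return mono_input, target
```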
Methodology
The method relies on a neural network architecture designed to handle the spatialization task end to end. The architecture encompasses four main components:
- Audio-Visual Analysis: Extracting relevant audio and visual features using CNN architectures to create a robust understanding of the video content and audio signals.
- Sound Source Separation: Isolating individual sound sources from a mixed audio track using a U-Net-style decoder guided by the previously extracted features.
- Localization: Computing weights to appropriately localize each separated audio source within the viewing sphere.
- Spatial Audio Generation: Combining the separated and localized sources through first-order ambisonics (FOA) synthesis to produce the spatial audio output (a sketch of this step follows the list).
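The sketch below illustrates the final synthesis step in simplified form: each separated mono source is panned into first-order ambisonics from an explicit direction, and the encoded sources are summed. The paper's network predicts localization weights rather than explicit angles, so the azimuth/elevation parameterization, the ambiX/SN3D convention, and the function names here are assumptions made for clarity.

```python
import numpy as np

def encode_foa(source: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Pan a mono source into first-order ambisonics (ambiX order W, Y, Z, X, SN3D).

    source: mono waveform of shape (T,); azimuth/elevation in radians.
    """
    w = source                                    # omnidirectional component
    y = source * np.sin(azimuth) * np.cos(elevation)
    z = source * np.sin(elevation)
    x = source * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])                 # shape (4, T)

def spatialize(sources: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Combine K separated sources into a single FOA track.

    sources: (K, T) separated waveforms; directions: (K, 2) azimuth/elevation per source.
    """
    return sum(encode_foa(s, az, el) for s, (az, el) in zip(sources, directions))
```

In practice the source directions (or the learned localization weights) vary over time, so the encoding would be applied per time-frequency bin rather than once per whole waveform as in this simplified version.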
Experimental Setup
The authors conduct thorough experiments using two newly introduced datasets, covering both controlled and in-the-wild footage and audio recordings:
- Rec-Street: An in-house recorded dataset with uniform capture conditions.
- YT-All: A diverse collection of YouTube 360 videos curated to cover a wide array of environments, with subsets such as YT-Music and YT-Clean for more targeted evaluation.
Evaluation goes beyond standard audio-reconstruction metrics to spatial-audio-specific measures, including envelope distance and earth mover's distance, to assess how well the generated audio localizes sound sources; a sketch of the envelope metric follows.
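As a rough illustration of the envelope metric, the snippet below compares Hilbert-transform envelopes of predicted and ground-truth channels using an RMS distance averaged over channels. The paper defines its metric precisely; this version is only an approximation of the underlying idea, and the function name is hypothetical.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_distance(pred: np.ndarray, target: np.ndarray) -> float:
    """Approximate envelope distance between predicted and reference ambisonic audio.

    pred, target: arrays of shape (channels, T). Each channel's envelope is
    taken as the magnitude of its analytic (Hilbert) signal; per-channel RMS
    differences are averaged.
    """
    dist = 0.0
    for p, t in zip(pred, target):
        env_p = np.abs(hilbert(p))   # envelope of the predicted channel
        env_t = np.abs(hilbert(t))   # envelope of the reference channel
        dist += np.sqrt(np.mean((env_p - env_t) ** 2))
    return dist / len(pred)
```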
Key Findings
Experiments demonstrate that the proposed approach outperforms baseline architectures, such as plain U-Nets, on this task. The paper not only establishes the viability of converting mono tracks to spatial audio, but also shows that the visual stream meaningfully improves the audio predictions.
Implications and Future Directions
This work sets a foundation for future exploration of self-supervised audio spatialization and offers a model that other researchers can use as a baseline or building block. Beyond the immediate application, the implications extend to enhancing legacy content, providing realistic audio for virtual and augmented reality, and potentially synthesizing higher-order ambisonics with additional network capacity, which would help future-proof emerging auditory technologies.
By providing code and datasets to the community, the authors ensure that their contributions can act as a catalyst for further advancements in understanding and manipulating spatial audio, ultimately pushing the boundaries of what is achievable through self-supervision and deep learning in audio-visual integration.