- The paper introduces a two-stage conditional diffusion model that decomposes binaural audio into a common part shared by both channels and channel-specific parts, enabling effective binaural synthesis from mono input.
- The method significantly improves performance, reducing the Wave L2 score from 0.157 to 0.128 and increasing the MOS from 3.61 to 3.80 compared to baseline systems.
- The framework paves the way for enhanced immersive audio applications in VR, gaming, and augmented media through high-fidelity, perceptually accurate soundscapes.
BinauralGrad: A Two-Stage Conditional Diffusion Model for Binaural Audio
The paper "BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis" introduces an innovative framework aimed at synthesizing binaural audio from mono audio inputs. The motivation behind this work is the cost and complexity associated with capturing binaural audio directly in real-world environments, thereby necessitating computational approaches for synthesis. The inherent challenges of the task stem from factors such as room reverberations and unique head-related acoustical characteristics that traditional DSP systems struggle to model accurately.
Framework Overview
The proposed BinauralGrad framework uses a two-stage process built on diffusion models to synthesize binaural audio. The authors decompose binaural audio into two components: a common part shared by both channels and a specific part that captures the differences attributed to spatial and physiological effects. This decomposition mirrors human listening, where small interaural differences between the two ears underpin sound spatialization.
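As a concrete illustration of this decomposition, the common part can be taken as the per-sample average of the two channels, with the specific parts as the per-channel residuals. The sketch below is a minimal illustration in this spirit (assuming NumPy arrays of equal length), not the authors' code.

```python
import numpy as np

def decompose_binaural(left: np.ndarray, right: np.ndarray):
    """Split a binaural signal into a shared (common) part and per-channel
    residuals. Here the common part is the per-sample mean of the two
    channels -- an illustrative choice consistent with the paper's
    description, not its exact code."""
    common = 0.5 * (left + right)      # content shared by both ears
    residual_left = left - common      # channel-specific differences
    residual_right = right - common
    return common, residual_left, residual_right

def recompose(common, residual_left, residual_right):
    """Exact inverse of the decomposition."""
    return common + residual_left, common + residual_right
```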
Stage One: This stage deploys a single-channel diffusion model conditioned on the mono audio to generate the common component of binaural audio. The diffusion model here focuses on capturing essential features shared between both ears' audio signals.
Stage Two: Building on the first stage, a two-channel diffusion model is conditioned on the output of Stage One. This model synthesizes the final binaural audio, capturing the nuanced differences between the left and right channels that drive spatial localization (a simplified sketch of how the two stages compose at inference time follows).
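Viewed at inference time, the two stages form a chain of conditional denoising diffusion samplers: stage one samples the single-channel common part from the mono condition, and stage two samples the two-channel output conditioned on that common part. The sketch below uses a standard DDPM ancestral sampler and hypothetical model and tensor names; it illustrates how the stages compose rather than reproducing the authors' schedule or architecture.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, cond, shape, betas):
    """Generic DDPM ancestral sampler: iteratively denoise Gaussian noise
    conditioned on `cond`. `model(x_t, t, cond)` predicts the noise term.
    (Standard DDPM update; the paper's exact noise schedule may differ.)"""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(x, torch.tensor([t]), cond)              # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

@torch.no_grad()
def binaural_synthesis(stage1_model, stage2_model, mono, betas):
    """Two-stage generation: stage one produces the single-channel common
    part from the mono condition; stage two produces the two-channel
    binaural signal conditioned on the common part (and the mono input).
    `mono` is assumed to have shape (1, 1, num_samples)."""
    n = mono.shape[-1]
    common = ddpm_sample(stage1_model, cond=mono, shape=(1, 1, n), betas=betas)
    cond2 = torch.cat([mono, common], dim=1)                 # stack conditions
    binaural = ddpm_sample(stage2_model, cond=cond2, shape=(1, 2, n), betas=betas)
    return binaural                                          # left/right channels
```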
Performance and Results
Experimental evaluations on a benchmark binaural speech dataset show that BinauralGrad surpasses existing methods by a notable margin. The Wave L2 error improves from 0.157 to 0.128 and the Mean Opinion Score (MOS) rises from 3.61 to 3.80, indicating both objective gains and subjective perceptual improvements over baselines such as WarpNet and conventional DSP systems.
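For context, a waveform L2 metric of this kind is typically the mean squared error between the generated and reference waveforms; the snippet below shows that common formulation (an assumption about the exact definition, since the paper's evaluation script may apply additional scaling or normalization).

```python
import numpy as np

def wave_l2(generated: np.ndarray, reference: np.ndarray) -> float:
    """Mean squared error between two waveforms of equal length.
    (A common definition of a 'Wave L2' metric; the paper's exact
    scaling is assumed here, not verified.)"""
    assert generated.shape == reference.shape
    return float(np.mean((generated - reference) ** 2))
```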
Implications and Future Work
The introduction of a two-stage conditional diffusion process represents a meaningful step forward in modeling complex audio transformations. The use of diffusion models, which have shown efficacy in high-fidelity synthesis tasks, underscores a shift towards leveraging deep generative models in audio processing.
In practical terms, the framework’s ability to produce high-fidelity binaural audio could significantly enhance experiences in virtual and augmented reality by generating realistic soundscapes without complex environmental recording setups. The notable improvements in perceptual quality also point to potential applications in gaming, simulation training, and immersive media.
Looking forward, potential areas for development include improved training regimes that close the gap between training and inference conditions, as well as optimizing the sampling process for real-time use without compromising audio quality. Incorporating listener-specific head-related transfer functions (HRTFs) could further refine the spatial accuracy of the synthesized audio, enabling more personalized immersive experiences.
Overall, BinauralGrad offers a concrete advance in binaural audio synthesis, pairing a principled decomposition of the problem with a practical two-stage generative implementation. The paper lays fertile ground for further exploration of generative models for immersive and personalized audio rendering.