Zero-Shot Mono-to-Binaural Speech Synthesis (2412.08356v2)

Published 11 Dec 2024 in cs.SD, cs.LG, and eess.AS

Abstract: We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to get an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on-par with the performance of supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis.

Summary

The paper presents ZeroBAS, a zero-shot framework that converts mono speech to binaural speech using geometric time warping, amplitude scaling, and a pretrained denoising vocoder, without training on any binaural data.
It overcomes limitations of supervised methods by eliminating reliance on annotated binaural datasets and adapting to diverse acoustic environments.
Empirical evaluations report lower Wave ℓ2 errors and higher perceptual quality than supervised baselines on out-of-distribution data, indicating robust generalization.

Overview of Zero-Shot Mono-to-Binaural Speech Synthesis

This paper addresses the problem of converting monophonic audio recordings into binaural audio without training on any binaural data. The proposed method, ZeroBAS (zero-shot binaural audio synthesis), combines geometric time warping (GTW), amplitude scaling (AS), and a pretrained monaural denoising vocoder. In the authors' experiments it performs on par with existing supervised methods on the standard benchmark and better than them on out-of-distribution data.

Challenges in Mono-to-Binaural Synthesis

The traditional approach to binaural synthesis utilizes supervised learning models, which rely heavily on position-annotated datasets. This reliance introduces several limitations, including the scarcity of such datasets, the necessity of specialized recording equipment, and the models’ susceptibility to overfitting to specific acoustic environments and speaker characteristics. To overcome these barriers, the proposed zero-shot framework circumvents the need for annotated binaural data, which is both cost-prohibitive and limited in diversity.

Methodology and Architectural Contributions

ZeroBAS employs a systematic architecture composed of three main components:

Geometric Time Warping (GTW): This step involves parameter-free computation to achieve effective channel separation by estimating interaural time delay based on the spatial source and listener configuration.
Amplitude Scaling (AS): This focuses on accurate amplitude manipulation by applying the inverse square law to enhance spatial perception, a feat achieved without incorporating HRTF or RIR data, making it broadly applicable across varied environments.
Denoising Vocoder: Utilizing the pre-trained WaveFit vocoder, this component iteratively refines the signal to remove artifacts and improve perceptual quality.
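
The first two components can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the listener sits at the origin with ears offset along the y-axis (`ear_offset` is a hypothetical parameter), the speed of sound is taken as 343 m/s, the warp uses linear interpolation, and the gain follows the inverse square law named above relative to a hypothetical reference distance `ref_dist`.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def warp_and_scale(mono, source_pos, ear_pos, sample_rate, ref_dist=1.0):
    """Delay and attenuate a mono signal for a single ear.

    mono:       (T,) waveform
    source_pos: (3,) or (T, 3) source position in metres
    ear_pos:    (3,) ear position in metres
    """
    mono = np.asarray(mono, dtype=float)
    offset = np.asarray(source_pos, dtype=float) - np.asarray(ear_pos, dtype=float)
    dist = np.linalg.norm(offset, axis=-1)  # scalar or (T,)

    # Geometric time warping: the sample heard at output index t left the
    # source dist / c seconds earlier, so read the input at that earlier index.
    t = np.arange(len(mono), dtype=float)
    read_idx = t - dist / SPEED_OF_SOUND * sample_rate
    warped = np.interp(read_idx, t, mono, left=0.0, right=0.0)

    # Amplitude scaling: inverse square law relative to ref_dist (assumption).
    # The distance is clamped to avoid division by zero.
    gain = (ref_dist / np.maximum(dist, 1e-3)) ** 2
    return gain * warped


def initial_binaural(mono, source_pos, sample_rate, ear_offset=0.09):
    """Return a (2, T) array: row 0 is the left ear, row 1 the right ear."""
    left_ear = np.array([0.0, ear_offset, 0.0])
    right_ear = np.array([0.0, -ear_offset, 0.0])
    left = warp_and_scale(mono, source_pos, left_ear, sample_rate)
    right = warp_and_scale(mono, source_pos, right_ear, sample_rate)
    return np.stack([left, right])
```

For a source on the listener's left, the left channel arrives earlier and louder than the right one, which yields the interaural time and level differences. In ZeroBAS this initial estimate would then be passed to the denoising vocoder; that step is not sketched here because the vocoder's interface is not described in this summary.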

What distinguishes this approach is that it does not model specific room acoustics or head-related transfer functions (HRTFs), factors that many supervised models may implicitly depend on. Relying only on source geometry and a general-purpose vocoder is intended to let it generalize across environments with diverse acoustic properties, which the authors measure with the newly introduced TUT Mono-to-Binaural dataset.

Empirical Evaluation

The authors evaluate ZeroBAS on the Binaural Speech dataset and the newly introduced TUT Mono-to-Binaural dataset. In objective evaluations, ZeroBAS is competitive with state-of-the-art supervised methods (WarpNet, BinauralGrad, and NFS), especially on out-of-distribution data. On the TUT dataset, where the supervised models degrade noticeably, ZeroBAS attains significantly lower Wave ℓ2 errors, which indicates stronger generalization.
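
As a rough illustration of the objective metric, the sketch below assumes that Wave ℓ2 denotes the mean squared error between the synthesized and reference binaural waveforms. The function name and the `scale` parameter are hypothetical, and the exact normalization used in the paper is not stated in this summary.

```python
import numpy as np


def wave_l2(pred, target, scale=1.0):
    """Mean squared error between two waveforms of identical shape.

    `scale` is a hypothetical multiplier for reporting conventions;
    it defaults to 1.0, which leaves the plain MSE unchanged.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return scale * float(np.mean((pred - target) ** 2))
```

Because the error is computed sample by sample, a small timing offset between prediction and reference can raise this metric even when the two signals sound alike, which is one reason it is reported alongside listening tests.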

Subjective evaluations using MOS (Mean Opinion Score) and MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) further support the perceptual naturalness of ZeroBAS-rendered audio, which frequently surpasses supervised methods in listener studies, particularly under varying acoustic conditions.

Theoretical Insights and Implications

The paper's theoretical analysis derives and validates the expected phase and amplitude errors, with a focus on the predictable error patterns that appear in high-error regimes. This analysis helps characterize the limits of model performance across operating regimes.

Future Directions and Limitations

Despite significant advancements, ZeroBAS does have limitations, notably in handling precise phase information due to the vocoder's general-purpose design, which lacks explicit spatial conditioning. The research posits broader implications for real-world applications, especially in AR and VR domains, where adaptable binaural audio generation can enhance immersive experiences. Future exploration could involve integrating more nuanced room and head shape conditioning into the preprocessing stage or the vocoder itself, potentially elevating the spatial realism and extending generalizability.

In conclusion, the ZeroBAS method offers a robust alternative to supervised binaural synthesis approaches, capable of scaling across diverse acoustic spaces without extensive data dependencies, suggesting transformative potential for portable spatial audio applications.
