Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis (2306.00814v3)

Published 1 Jun 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

Summary

The paper presents a novel GAN-based Fourier vocoder that directly models spectral coefficients to enhance audio quality and computational efficiency.
It leverages ConvNeXt blocks and maintains constant feature resolution to reduce artifacts and accelerate synthesis compared to time-domain methods.
Evaluations using UTMOS, PESQ, and VISQOL demonstrate competitive performance and robust audio quality across diverse acoustic scenarios.

Vocos: Advancements in Fourier-Based Neural Vocoding

The paper "Vocos: Closing the Gap between Time-Domain and Fourier-Based Neural Vocoders for High-Quality Audio Synthesis" by Hubert Siuzdak presents a novel approach to neural vocoding by leveraging the advantages of Fourier-based time-frequency representations. This research aims to enhance audio synthesis quality and efficiency by directly generating Fourier spectral coefficients.

Background and Challenges

Recent neural vocoders are mostly Generative Adversarial Networks (GANs) that operate in the time domain. These models work well, but they forgo the inductive bias of time-frequency representations and rely on redundant, computationally intensive upsampling layers. Fourier-based representations align more closely with human auditory perception and can be computed with well-established fast algorithms, but directly reconstructing complex-valued spectrograms has historically been difficult because of phase recovery. The phase spectrum is periodic: it wraps around at ±π, so two values that are numerically far apart can describe nearly the same signal, which makes phase a poor target for ordinary regression. The paper addresses this by modeling the Fourier spectral coefficients directly.
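To make the periodicity problem concrete, the following minimal sketch compares two phase values that sit on either side of the wrap point. The specific angles and the helper name `wrapped_difference` are illustrative assumptions, not taken from the paper. It shows that a naive numeric difference is large, while the difference measured on the unit circle is small.

```python
import math

def wrapped_difference(a, b):
    """Smallest signed angular difference between a and b, in radians."""
    return math.atan2(math.sin(a - b), math.cos(a - b))

# Two hypothetical phases on either side of the +/-pi wrap point.
phase_a = math.pi - 0.01
phase_b = -math.pi + 0.01

# Treating phase as an ordinary number suggests the values are far apart.
naive_gap = abs(phase_a - phase_b)

# Respecting periodicity shows they are almost identical.
circular_gap = abs(wrapped_difference(phase_a, phase_b))

# The same phases as points on the unit circle are also close together.
point_a = (math.cos(phase_a), math.sin(phase_a))
point_b = (math.cos(phase_b), math.sin(phase_b))
chord = math.dist(point_a, point_b)

print(f"naive gap:    {naive_gap:.4f}")
print(f"circular gap: {circular_gap:.4f}")
print(f"chord length: {chord:.4f}")
```

A loss computed on the raw angle would penalize a prediction of `phase_b` heavily when the target is `phase_a`, even though the two are nearly the same point on the circle. This is one reason to represent phase through its cosine and sine.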

Vocos and its Contributions

Vocos is a GAN-based model whose generator departs from the usual time-domain design. Instead of upsampling with transposed convolutions, it keeps the feature resolution constant throughout the network and uses the inverse Fourier transform to reconstruct the waveform. This isotropic architecture significantly reduces computational overhead, increasing synthesis speed by an order of magnitude compared to time-domain methods.
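The idea of keeping a constant frame-level resolution and letting the inverse Fourier transform produce the waveform can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the released implementation: the function name `istft_head`, the split of features into log-magnitude and phase halves, the exponential magnitude, the cosine/sine phase, the Hann window, and the tiny `n_fft` and `hop` values are all chosen here for the example. Consult the paper and repository for the actual design.

```python
import numpy as np

def istft_head(features, n_fft=16, hop=4):
    """Map frame-level features of shape (frames, 2 * bins) to a waveform.

    Assumed layout: first half of each row is log-magnitude, second half is phase.
    """
    bins = n_fft // 2 + 1
    log_mag, phase = features[:, :bins], features[:, bins:]
    # Magnitude via exp (always positive); phase as a point on the unit circle.
    spec = np.exp(log_mag) * (np.cos(phase) + 1j * np.sin(phase))

    # Inverse real FFT of every frame, then windowed overlap-add.
    frames = np.fft.irfft(spec, n=n_fft, axis=1)  # shape (frames, n_fft)
    window = np.hanning(n_fft)
    length = hop * (len(frames) - 1) + n_fft
    audio = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        start = i * hop
        audio[start:start + n_fft] += frame * window
        norm[start:start + n_fft] += window ** 2
    # Normalize by the summed squared window, guarding against division by zero.
    return audio / np.maximum(norm, 1e-8)

# Random stand-in for the network output: 10 frames, hypothetical sizes.
rng = np.random.default_rng(0)
num_frames, n_fft, hop = 10, 16, 4
features = rng.normal(size=(num_frames, 2 * (n_fft // 2 + 1)))
audio = istft_head(features, n_fft=n_fft, hop=hop)
print(features.shape, audio.shape)
```

Here 10 frames become 52 samples, and all of the temporal upsampling comes from the fixed inverse transform and overlap-add, with no learned upsampling layers involved.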

The model utilizes ConvNeXt blocks, which effectively model spatially local input patterns, offering superior performance over conventional ResBlocks with dilated convolutions. By strategically maintaining low temporal resolution and leveraging a frequency-aware generator, Vocos attains state-of-the-art audio quality without compromising efficiency.
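As a rough picture of what a ConvNeXt-style block looks like when applied along the time axis, here is a simplified NumPy sketch. It follows the general ConvNeXt recipe (depthwise convolution, layer normalization, pointwise expansion, GELU, pointwise projection, residual connection). The kernel size of 7, the channel counts, and the omission of learnable normalization parameters and biases are assumptions made for brevity, so it should not be read as the exact block used in Vocos.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def convnext_block_1d(x, dw_kernel, w_expand, w_project, eps=1e-6):
    """Simplified ConvNeXt-style block. x has shape (channels, frames)."""
    channels, _ = x.shape
    pad = dw_kernel.shape[1] // 2
    padded = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise convolution: each channel is filtered with its own kernel.
    # The kernel is reversed so np.convolve performs cross-correlation.
    h = np.stack([
        np.convolve(padded[c], dw_kernel[c][::-1], mode="valid")
        for c in range(channels)
    ])
    # Layer normalization across channels at each frame (no learned scale/bias).
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    # Inverted bottleneck: pointwise expand, GELU, pointwise project.
    h = w_project @ gelu(w_expand @ h)
    # Residual connection keeps the output shape equal to the input shape.
    return x + h

# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
channels, hidden, frames = 8, 24, 20
x = rng.normal(size=(channels, frames))
dw_kernel = rng.normal(size=(channels, 7)) * 0.1
w_expand = rng.normal(size=(hidden, channels)) * 0.1
w_project = rng.normal(size=(channels, hidden)) * 0.1
y = convnext_block_1d(x, dw_kernel, w_expand, w_project)
print(x.shape, y.shape)
```

Input and output both have shape (8, 20), so such blocks can be stacked without changing the temporal resolution, which is consistent with the constant-resolution design described above.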

Comparative Evaluations

The paper evaluates Vocos against established models such as HiFi-GAN, iSTFTNet, and BigVGAN using objective metrics such as UTMOS, VISQOL, and PESQ, as well as subjective listening tests (MOS and SMOS). Vocos shows competitive performance, especially in VISQOL and PESQ scores, and exhibits fewer of the periodicity artifacts common in time-domain models. It also generalizes to out-of-distribution data, performing robustly across diverse acoustic scenarios.

Implications and Future Directions

Vocos has notable implications for neural vocoding: it suggests that Fourier-based representations, when properly harnessed, can match time-domain approaches in audio quality while being substantially more computationally efficient. Its approach to modeling phase and magnitude offers promising directions for future research. Additionally, the open-source code and model weights encourage further development and experimentation, potentially impacting a broad range of audio synthesis and reconstruction tasks.

Future developments could explore deeper integrations with advanced learning paradigms, potentially enhancing Fourier-based models' adaptability and robustness. Moreover, Vocos' performance in general audio coding contexts, as illustrated by its adaptation to neural audio codecs, signals potential growth into other domains, such as end-to-end text-to-speech applications.

Conclusion

Overall, this paper makes significant contributions by proposing a Fourier-based framework that reconciles the demands of high-quality audio synthesis with practical computational considerations. Vocos stands as a testament to the potential of alternative spectro-temporal models in enhancing the fidelity and efficiency of neural vocoders, laying the groundwork for future innovations in this dynamic field.
