
BigVGAN: A Universal Neural Vocoder with Large-Scale Training (2206.04658v2)

Published 9 Jun 2022 in cs.SD, cs.CL, cs.LG, and eess.AS

Abstract: Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well for various out-of-distribution scenarios without fine-tuning. We introduce periodic activation function and anti-aliased representation into the GAN generator, which brings the desired inductive bias for audio synthesis and significantly improves audio quality. In addition, we train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature. We identify and address the failure modes in large-scale GAN training for audio, while maintaining high-fidelity output without over-regularization. Our BigVGAN, trained only on clean speech (LibriTTS), achieves the state-of-the-art performance for various zero-shot (out-of-distribution) conditions, including unseen speakers, languages, recording environments, singing voices, music, and instrumental audio. We release our code and model at: https://github.com/NVIDIA/BigVGAN

Citations (181)

Summary

  • The paper introduces BigVGAN, advancing GAN-based vocoders with periodic activations for enhanced audio synthesis.
  • It employs an anti-aliased multi-periodicity module and scales to 112M parameters, reducing artifacts and stabilizing training.
  • Empirical evaluations show improved metrics (M-STFT, MCD, PESQ) and robust performance in diverse out-of-distribution audio scenarios.

BigVGAN: A Comprehensive Evaluation of a Universal Neural Vocoder

The paper introduces BigVGAN, a significant advancement in GAN-based vocoders that addresses the challenge of synthesizing high-fidelity audio across diverse speakers and recording environments. The authors propose a vocoder that generalizes to a wide range of out-of-distribution scenarios without fine-tuning, a capability that previous vocoders have struggled to deliver.

Core Contributions

  1. Inductive Bias for Audio Synthesis: A pivotal innovation in BigVGAN is the introduction of periodic activation functions. These functions bring the desired inductive bias for audio synthesis, significantly enhancing audio quality. The periodic activations allow the model to effectively capture the multi-periodic structures inherent in audio waveforms.
  2. Anti-aliased Representation: The authors incorporate an anti-aliased multi-periodicity composition (AMP) module into the generator. This module effectively reduces high-frequency artifacts by employing low-pass filters to maintain the integrity of signal components, ensuring a cleaner audio output.
  3. Scaling of Model Parameters: BigVGAN distinguishes itself by scaling up to 112M parameters, a first in GAN vocoder literature. The model adeptly manages the failure modes associated with large-scale GAN training, achieving high-fidelity outputs without over-regularization or destabilization.
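The periodic activation underlying contribution 1 is the Snake function, f(x) = x + (1/α)·sin²(αx), where α controls the frequency of the periodic component (in BigVGAN, α is learned per channel; the scalar α below is a simplification for illustration):

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    The sin^2 term injects a periodic inductive bias suited to
    audio waveforms, while the identity term preserves gradient
    flow much like a residual connection. In BigVGAN, alpha is a
    learnable per-channel parameter; here it is a fixed scalar.
    """
    return x + (1.0 / alpha) * np.sin(alpha * x) ** 2
```

Note that snake(0) = 0 and the function reduces to the identity plus a bounded periodic ripple, which is why large-scale training remains stable compared with purely periodic activations such as sin.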

Numerical and Qualitative Evaluation

The empirical results presented in the paper are compelling. BigVGAN outperforms state-of-the-art models across several objective metrics, including multi-resolution STFT (M-STFT) distance, Mel-cepstral distortion (MCD), and Perceptual Evaluation of Speech Quality (PESQ). Notably, BigVGAN also shows significant improvements in periodicity error and voiced/unvoiced classification accuracy, reflecting its robustness in out-of-distribution (OOD) scenarios such as unseen languages and recording environments.
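The M-STFT metric compares magnitude spectrograms of reference and synthesized audio at several FFT resolutions, so that both fine temporal detail and fine frequency detail are penalized. A minimal sketch using SciPy follows; the FFT sizes, overlap, and normalization are illustrative and may differ from the paper's exact evaluation setup:

```python
import numpy as np
from scipy.signal import stft

def mstft_distance(x, y, fft_sizes=(512, 1024, 2048), sr=24000):
    """Average spectral-convergence + log-magnitude L1 distance
    over several STFT resolutions (hyperparameters illustrative)."""
    total = 0.0
    for n_fft in fft_sizes:
        _, _, X = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft * 3 // 4)
        _, _, Y = stft(y, fs=sr, nperseg=n_fft, noverlap=n_fft * 3 // 4)
        mx, my = np.abs(X), np.abs(Y)
        # Spectral convergence: relative Frobenius error of magnitudes.
        sc = np.linalg.norm(my - mx) / (np.linalg.norm(my) + 1e-8)
        # Log-magnitude L1: perceptually motivated log-domain error.
        lm = np.mean(np.abs(np.log(my + 1e-8) - np.log(mx + 1e-8)))
        total += sc + lm
    return total / len(fft_sizes)
```

Identical signals yield a distance near zero; mismatched harmonics or noise inflate both terms, which is what makes the metric sensitive to the high-frequency artifacts the AMP module is designed to suppress.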

Subjective evaluations reinforce these findings, as BigVGAN scores highly on mean opinion scores (MOS) and similarity mean opinion scores (SMOS). These results highlight its capacity to maintain speaker identity and audio quality across diverse settings and applications.

Theoretical Implications

From a theoretical perspective, BigVGAN's use of periodic activations in a GAN framework can pave the way for more generalized applications in time-series prediction and signal processing domains. The strategic combination of architecture sophistication with model scaling presents a new paradigm in audio synthesis, enhancing the potential for cross-domain applications.

Practical Implications and Future Directions

Practically, BigVGAN sets a new benchmark for universal vocoding. Its high-speed synthesis makes it suitable for real-time applications such as text-to-speech, speech-to-speech translation, and other dynamic audio-synthesis tasks. Its ability to handle new languages and recording conditions without retraining is particularly advantageous, enabling broader and more versatile deployment in global applications.

Future developments could explore further integration of improved anti-aliased representations and optimization techniques to enhance training stability and efficiency. Additionally, leveraging BigVGAN for multimodal systems that synthesize audio alongside video or text could expand its applicability.

In conclusion, BigVGAN represents a substantial step forward in universal neural vocoding. By holistically addressing architectural, numerical, and optimization challenges, it sets a robust foundation for future research and deployment in versatile audio synthesis tasks.
