HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
The paper "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis" by Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae from Kakao Enterprise, introduces a method leveraging Generative Adversarial Networks (GANs) to achieve rapid and high-quality generation of speech waveforms. The proposed HiFi-GAN architecture addresses the limitations of existing autoregressive (AR) and flow-based models by improving both computational efficiency and synthesis fidelity.
Background and Motivation
The field of speech synthesis has advanced rapidly with the development of neural networks. Standard approaches use a two-stage pipeline: first, a low-resolution intermediate representation such as a mel-spectrogram is predicted from text; then, raw audio waveforms are synthesized from that representation. AR models such as WaveNet, although known for high-quality output, suffer from significant latency due to their inherently sequential generation process. Flow-based models like WaveGlow and Parallel WaveNet mitigate this by enabling parallel sampling, but often at the cost of greater model complexity and parameter count.
The authors target the second stage of this pipeline: efficient, high-fidelity waveform generation from mel-spectrograms. HiFi-GAN aims to close the gap in sample quality between GAN-based models and their AR and flow-based counterparts.
Methodology
HiFi-GAN's architecture centers on a generator and two primary discriminators: the Multi-Scale Discriminator (MSD) and the Multi-Period Discriminator (MPD). The generator leverages a fully convolutional neural network, which takes mel-spectrograms as input and upsamples them through transposed convolutions until the temporal resolution matches that of the raw waveforms.
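To make the upsampling concrete, here is a minimal PyTorch sketch of a transposed-convolution stack that expands mel-spectrogram frames to waveform resolution. The channel widths and upsampling rates are illustrative placeholders rather than the paper's exact V1/V2/V3 configurations, and the MRF blocks that HiFi-GAN interleaves between upsampling layers (described in the next section) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplerSketch(nn.Module):
    """Illustrative mel-to-waveform upsampler: each ConvTranspose1d raises the
    temporal resolution until it matches raw audio. The rates and channel
    widths here are placeholders, not HiFi-GAN's exact configurations."""
    def __init__(self, n_mels=80, base_channels=512, up_rates=(8, 8, 2, 2)):
        super().__init__()
        self.pre = nn.Conv1d(n_mels, base_channels, kernel_size=7, padding=3)
        ups = []
        ch = base_channels
        for r in up_rates:  # total upsampling = product of rates (256 here)
            ups.append(nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                          stride=r, padding=r // 2))
            ch //= 2
        self.ups = nn.ModuleList(ups)
        self.post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel):  # mel: (batch, n_mels, frames)
        x = self.pre(mel)
        for up in self.ups:
            x = up(F.leaky_relu(x, 0.1))
        return torch.tanh(self.post(x))  # (batch, 1, frames * 256)
```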
Multi-Receptive Field Fusion
A notable innovation in HiFi-GAN is its Multi-Receptive Field Fusion (MRF) module, which runs diverse residual blocks in parallel, each observing patterns of a different length via different kernel sizes and dilation rates. This helps the generator model the varied periodic patterns of speech signals more effectively.
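The following is a minimal sketch of the MRF idea, assuming a sum-then-average fusion of the parallel residual branches; the specific kernel sizes and dilation rates are placeholders rather than the paper's tuned hyperparameters.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlockSketch(nn.Module):
    """One residual branch: stacked dilated convs widen the receptive field."""
    def __init__(self, channels, kernel_size, dilations=(1, 3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=d * (kernel_size - 1) // 2)
            for d in dilations)

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))  # residual connection
        return x

class MRFSketch(nn.Module):
    """Multi-Receptive Field Fusion: parallel branches observe patterns of
    different lengths on the same input; their outputs are fused by averaging."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            ResBlockSketch(channels, k) for k in kernel_sizes)

    def forward(self, x):
        return sum(b(x) for b in self.branches) / len(self.branches)
```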
Discriminators
The discriminators play a pivotal role in enhancing speech quality:
- MSD: It assesses audio at multiple scales. Adopted from MelGAN, it runs sub-discriminators on the raw waveform and on 2x and 4x average-pooled versions of it.
- MPD: This consists of several sub-discriminators, each handling a specific periodic component of the input by reshaping the one-dimensional audio into two dimensions according to a fixed period. This design captures the diverse periodic patterns inherent in speech signals (see the sketch of both input transformations after this list).
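A brief sketch of the two input transformations, assuming hypothetical helper names (mpd_reshape, msd_inputs); the sub-discriminators' convolutional stacks themselves are omitted, and the MSD pooling hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def mpd_reshape(audio, period):
    """Reshape 1D audio (batch, 1, T) to 2D (batch, 1, T // period, period),
    so a 2D-conv sub-discriminator sees equally spaced samples in each column.
    Reflection-pad so T becomes divisible by the period."""
    b, c, t = audio.shape
    if t % period != 0:
        pad = period - (t % period)
        audio = F.pad(audio, (0, pad), mode="reflect")
        t = t + pad
    return audio.view(b, c, t // period, period)

def msd_inputs(audio):
    """MSD operates on raw, 2x, and 4x average-pooled audio."""
    half = F.avg_pool1d(audio, kernel_size=4, stride=2, padding=1)
    quarter = F.avg_pool1d(half, kernel_size=4, stride=2, padding=1)
    return [audio, half, quarter]

# One MPD sub-discriminator per period; the paper uses a set of primes.
periods = (2, 3, 5, 7, 11)
x = torch.randn(1, 1, 8192)
views = [mpd_reshape(x, p) for p in periods]
```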
Training and Evaluation
The training objective of HiFi-GAN combines adversarial, mel-spectrogram, and feature matching losses. The adversarial terms follow the least-squares GAN (LSGAN) formulation, which helps avoid vanishing gradients, while the mel-spectrogram loss aligns the generator's output with perceptually motivated features of human hearing.
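As a rough illustration, the generator-side objective might be assembled as below. The least-squares adversarial term and the L1 feature-matching and mel-spectrogram terms follow the paper's description, and the weights of 2 and 45 are the values the paper reports; the tensor layout of the discriminator outputs and features is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_outputs, disc_feats_real, disc_feats_fake,
                   mel_real, mel_fake, lambda_fm=2.0, lambda_mel=45.0):
    """Generator objective (sketch): an LSGAN adversarial term that pushes
    D(fake) toward 1, plus feature-matching and mel-spectrogram L1 terms.
    The weights 2 and 45 are those reported in the paper."""
    # LSGAN adversarial loss, summed over all sub-discriminators
    adv = sum(torch.mean((d - 1.0) ** 2) for d in disc_outputs)

    # Feature matching: L1 between real/fake intermediate discriminator features
    fm = sum(F.l1_loss(fr, ff)
             for fr_list, ff_list in zip(disc_feats_real, disc_feats_fake)
             for fr, ff in zip(fr_list, ff_list))

    # Mel-spectrogram loss: L1 between mels of real and generated waveforms
    mel = F.l1_loss(mel_real, mel_fake)

    return adv + lambda_fm * fm + lambda_mel * mel
```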
The authors conducted extensive evaluations on the LJSpeech dataset, comparing HiFi-GAN against state-of-the-art models such as WaveNet, WaveGlow, and MelGAN. In Mean Opinion Score (MOS) tests, HiFi-GAN's V1 variant achieved a MOS of 4.36, remarkably close to the ground-truth score of 4.45. HiFi-GAN also synthesized high-fidelity audio significantly faster than real time, outperforming its counterparts in both quality and efficiency. The V3 variant, designed for on-device applications, generated speech at 13.44 times real-time speed on a CPU.
Ablation Study
An ablation study was performed to ascertain the contributions of the MPD, the MRF module, and the mel-spectrogram loss. Removing any of these components caused a noticeable decline in perceptual quality, underscoring their importance to the overall architecture.
Generalization and Practical Implications
HiFi-GAN was also tested on unseen speakers from the VCTK dataset, where its performance remained robust, indicating good generalization. Furthermore, fine-tuning on mel-spectrograms predicted by Tacotron2 improved end-to-end synthesis results, demonstrating the model's flexibility and its potential for integration into broader TTS systems.
Conclusion and Future Directions
HiFi-GAN significantly advances the state of the art in speech synthesis, providing a model that is both computationally efficient and capable of generating near-human-quality audio. That the same discriminator architecture works across different generator configurations without recalibration suggests a versatile framework for various applications. Future work could explore optimizing HiFi-GAN for multilingual and emotion-aware speech synthesis, or extending it to other domains requiring high-fidelity signal generation.
In summary, HiFi-GAN makes a substantial contribution to the field of speech synthesis, offering practical and theoretical advances that pave the way for future innovations. The authors' open-source implementation further supports ongoing research and development in this area.