MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
The paper "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis" advances the domain of raw audio waveform synthesis using GANs. This work presents MelGAN, a non-autoregressive, fully-convolutional GAN architecture tailored to efficiently simulate and reproduce high-quality audio signals based on input mel-spectrograms. The authors delineate notable improvements over previous GAN models by introducing architectural modifications and innovative training strategies.
Introduction and Motivation
Modelling raw audio waveforms is difficult because of their high temporal resolution (often at least 16,000 samples per second) and their structure at multiple timescales, with both short- and long-term dependencies. Most existing techniques avoid direct waveform synthesis; they instead generate an intermediate, lower-resolution representation such as a mel-spectrogram and then invert it back into a waveform. MelGAN targets this latter stage, mel-spectrogram inversion, using a GAN. Specifically, the paper addresses the efficiency constraints and quality issues of traditional signal-processing approaches (e.g., Griffin-Lim), autoregressive models, and flow-based non-autoregressive methods.
Key Contributions
- Introduction of MelGAN: The paper introduces MelGAN, a fully convolutional feed-forward model that reliably generates coherent waveforms. In contrast to autoregressive and flow-based approaches, it achieves high-quality synthesis without requiring distillation or additional perceptual loss functions.
- Enhanced Efficiency: MelGAN is strikingly fast, running more than 100x faster than real time on GPU and more than 2x faster than real time on CPU. It reaches synthesis speeds of about 2,500 kHz on a GTX 1080 Ti GPU, significantly outpacing state-of-the-art models such as WaveGlow (a worked conversion follows this list).
- Generalization and Parallelizability: Being fully convolutional, MelGAN is easily parallelizable, and its architecture generalizes to unseen speakers. The authors demonstrate it as a drop-in decoder for tasks such as text-to-speech synthesis, universal music translation, and unconditional music synthesis.
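To put the reported synthesis rate in perspective, it can be converted into a real-time factor, assuming the 22.05 kHz audio sampling rate commonly used in speech synthesis benchmarks:

$$\frac{2{,}500{,}000\ \text{samples/s}}{22{,}050\ \text{samples/s}} \approx 113\times\ \text{real time}$$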
Model Architecture
Generator
MelGAN's generator is a fully convolutional network that takes a mel-spectrogram as input and outputs a raw waveform. The model employs (a minimal sketch follows this list):
- Transposed Convolutions: These layers upsample the input's temporal resolution in stages (8x, 8x, 2x, 2x, for 256x overall) to bridge the gap between the frame rate of the mel-spectrogram and the sample rate of the waveform.
- Residual Blocks with Dilated Convolutions: Placed after each upsampling layer so that distant output timesteps have overlapping receptive fields, encouraging long-range correlation in the generated waveform. Together with choosing kernel sizes as multiples of the strides, this avoids artifacts such as the "checkerboard" patterns typical of transposed-convolution layers.
- Weight Normalization: Chosen over instance normalization, which the authors found washed out important pitch information; weight normalization stabilizes training without degrading sample quality.
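The following PyTorch sketch illustrates this upsampling pattern. It is a minimal illustration, not the authors' released implementation: the channel widths, residual-stack depth, and module names (Generator, ResidualDilatedBlock) are assumptions made for the example, while the 8x/8x/2x/2x strides, kernel sizes equal to twice the stride, growing dilations, and weight normalization follow the paper's description.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class ResidualDilatedBlock(nn.Module):
    """Residual block with dilated convolutions, placed after each upsampling layer."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=3,
                                  dilation=dilation, padding=dilation)),
            nn.LeakyReLU(0.2),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=1)),
        )

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    def __init__(self, mel_channels=80, base_channels=512):
        super().__init__()
        layers = [weight_norm(nn.Conv1d(mel_channels, base_channels,
                                        kernel_size=7, padding=3))]
        ch = base_channels
        for stride in (8, 8, 2, 2):  # 8 * 8 * 2 * 2 = 256x total upsampling
            layers += [
                nn.LeakyReLU(0.2),
                # kernel size is a multiple of the stride (here 2x), which the
                # paper uses to help avoid checkerboard artifacts
                weight_norm(nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * stride,
                                               stride=stride, padding=stride // 2)),
            ]
            ch //= 2
            # stack of dilated residual blocks (dilations 1, 3, 9) after each upsampling
            layers += [ResidualDilatedBlock(ch, dilation=3 ** i) for i in range(3)]
        layers += [nn.LeakyReLU(0.2),
                   weight_norm(nn.Conv1d(ch, 1, kernel_size=7, padding=3)),
                   nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):      # mel: (batch, mel_channels, frames)
        return self.net(mel)     # waveform: (batch, 1, frames * 256)
```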
Discriminator
The discriminator uses a multi-scale architecture: three structurally identical discriminators operate on the raw audio and on versions downsampled 2x and 4x by strided average pooling, so that each scale learns features for a different frequency band. Each discriminator is window-based, classifying small overlapping audio chunks rather than the entire sequence at once, which encourages the generator to maintain coherence across the synthesized waveform.
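A minimal sketch of this multi-scale, window-based design is shown below. The layer widths, kernel sizes, and class names (WindowDiscriminator, MultiScaleDiscriminator) are illustrative assumptions; the three-scale structure, average-pooling downsampling, and per-window score maps follow the paper.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class WindowDiscriminator(nn.Module):
    """Outputs a grid of real/fake scores, one per overlapping audio window."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(weight_norm(nn.Conv1d(1, 16, 15, stride=1, padding=7)),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(weight_norm(nn.Conv1d(16, 64, 41, stride=4, padding=20, groups=4)),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(weight_norm(nn.Conv1d(64, 256, 41, stride=4, padding=20, groups=16)),
                          nn.LeakyReLU(0.2)),
            nn.Sequential(weight_norm(nn.Conv1d(256, 512, 5, stride=1, padding=2)),
                          nn.LeakyReLU(0.2)),
            weight_norm(nn.Conv1d(512, 1, 3, stride=1, padding=1)),  # per-window scores
        ])

    def forward(self, x):
        features = []            # intermediate feature maps, reused for feature matching
        for layer in self.layers:
            x = layer(x)
            features.append(x)
        return features          # the last entry is the score map

class MultiScaleDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.discriminators = nn.ModuleList([WindowDiscriminator() for _ in range(3)])
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)  # halves resolution

    def forward(self, audio):    # audio: (batch, 1, samples)
        outputs = []
        for d in self.discriminators:
            outputs.append(d(audio))
            audio = self.pool(audio)  # next discriminator sees downsampled audio
        return outputs
```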
Training and Results
Training Objective
The paper adopts a hinge-loss GAN objective, supplemented with a feature matching loss: the L1 distance between the discriminator's intermediate feature maps for real and generated audio, computed across all layers of all three discriminators and weighted by a coefficient λ = 10.
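A compact sketch of these two losses, assuming the MultiScaleDiscriminator above, where each discriminator returns its intermediate feature maps with the final score map last (the helper names discriminator_loss and generator_loss are illustrative):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real, fake):
    loss = 0.0
    for feats_real, feats_fake in zip(disc(real), disc(fake.detach())):
        score_real, score_fake = feats_real[-1], feats_fake[-1]
        loss += F.relu(1.0 - score_real).mean()   # hinge loss on real audio
        loss += F.relu(1.0 + score_fake).mean()   # hinge loss on generated audio
    return loss

def generator_loss(disc, real, fake, lambda_fm=10.0):
    adv, fm = 0.0, 0.0
    for feats_real, feats_fake in zip(disc(real), disc(fake)):
        adv += -feats_fake[-1].mean()             # generator's hinge objective
        for fr, ff in zip(feats_real[:-1], feats_fake[:-1]):
            fm += F.l1_loss(ff, fr.detach())      # L1 feature matching per layer
    return adv + lambda_fm * fm
```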
Quantitative Evaluation
Through extensive MOS (Mean Opinion Score) listening tests, MelGAN performed comparably to the high-capacity autoregressive WaveNet vocoder and the flow-based WaveGlow in audio quality, despite having far fewer parameters. Moreover, ablation studies revealed the importance of specific architectural choices:
- Multi-scale discriminators and dilated convolutions were essential in preventing high-frequency artifacts.
- Weight normalization was critical for preserving audio sample quality.
Broader Implications and Future Work
In practical terms, MelGAN's speed and quality make it a compelling component for broader audio synthesis pipelines, including real-time text-to-speech systems. Its results suggest that GAN-based vocoders can not only replace autoregressive models but do so more efficiently and with fewer computational resources.
The paper points to future directions, particularly extending the approach toward unconditional audio synthesis and further improving quality by refining the training dynamics and architecture. This opens the door to broader adoption of GANs in audio synthesis tasks and to still more efficient synthesis techniques.
Conclusion
MelGAN sets a new benchmark for efficient, high-quality synthesis of raw audio waveforms from intermediate representations. Its compact, fully convolutional architecture and straightforward training strategy represent a significant step forward in the use of GANs for conditional waveform synthesis.