Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2005.05106v2)

Published 11 May 2020 in cs.SD and eess.AS

Abstract: In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting to high-quality text-to-speech. Specifically, we improve the original MelGAN by the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN has achieved high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our Pytorch implementation, which will be open-resourced shortly, can achieve a real-time factor of 0.03 on CPU without hardware specific optimization.

Authors (6)
  1. Geng Yang (7 papers)
  2. Shan Yang (58 papers)
  3. Kai Liu (391 papers)
  4. Peng Fang (5 papers)
  5. Wei Chen (1290 papers)
  6. Lei Xie (337 papers)
Citations (190)

Summary

Multi-band MelGAN: A New Approach to Efficient Waveform Generation for Text-to-Speech

In this paper, the authors present Multi-band MelGAN, a fast waveform generation model aimed at improving both the quality and the efficiency of text-to-speech (TTS) synthesis. Unlike autoregressive (AR) models such as WaveNet and WaveRNN, Multi-band MelGAN uses a non-AR architecture, whose parallelizable generation yields much faster inference.

Key Enhancements to MelGAN

The authors introduce several crucial improvements to the existing MelGAN model:

  1. Receptive Field Expansion: The receptive field of the generator is significantly enlarged, which helps the model capture longer-term dependencies in the audio and is shown to benefit speech generation quality (the dilated residual stack in the generator sketch below illustrates how dilation grows the receptive field).
  2. Multi-Resolution STFT Loss: The feature matching loss of MelGAN is replaced with a multi-resolution Short-Time Fourier Transform (STFT) loss, which measures the discrepancy between synthetic and real speech more directly in the spectral domain. Combined with generator pre-training, this improves both training stability and output quality (a minimal sketch of the loss follows this list).
  3. Multi-band Processing: The paper's central contribution is multi-band generation: the generator takes mel-spectrograms as input and predicts sub-band signals, which are recombined into a full-band waveform before being passed to the discriminator. This reduces the total computational complexity from 5.85 GFLOPS to 0.95 GFLOPS without sacrificing audio quality (see the generator sketch after this list).
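
To make the second point concrete, the following is a minimal PyTorch sketch of a multi-resolution STFT loss. The FFT sizes, hop sizes, and window lengths are illustrative choices, not necessarily the exact configuration used in the paper.

```python
# Minimal sketch of a multi-resolution STFT loss (illustrative settings).
import torch
import torch.nn.functional as F


def stft_magnitude(x, fft_size, hop_size, win_length):
    """Magnitude spectrogram of a batch of waveforms with shape (B, T)."""
    window = torch.hann_window(win_length, device=x.device)
    spec = torch.stft(x, fft_size, hop_size, win_length, window,
                      return_complex=True)
    return spec.abs().clamp(min=1e-7)


def single_stft_loss(fake, real, fft_size, hop_size, win_length):
    """Spectral-convergence + log-magnitude loss for one STFT resolution."""
    mag_f = stft_magnitude(fake, fft_size, hop_size, win_length)
    mag_r = stft_magnitude(real, fft_size, hop_size, win_length)
    sc_loss = torch.norm(mag_r - mag_f, p="fro") / torch.norm(mag_r, p="fro")
    mag_loss = F.l1_loss(torch.log(mag_f), torch.log(mag_r))
    return sc_loss + mag_loss


def multi_resolution_stft_loss(fake, real,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    """Average the single-resolution loss over several STFT configurations."""
    losses = [single_stft_loss(fake, real, *r) for r in resolutions]
    return sum(losses) / len(losses)
```

Each resolution contributes a spectral-convergence term and a log-magnitude term; averaging over several resolutions keeps the generator from overfitting to a single time-frequency trade-off. In the paper, this kind of loss is applied to both the full-band and the sub-band signals.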

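To illustrate the first and third points, here is a rough PyTorch sketch of a sub-band generator. The channel widths, upsampling factors, and dilation pattern below are assumptions for illustration, not the paper's exact hyperparameters; the structural ideas are that dilated residual stacks enlarge the receptive field, and that predicting several sub-band channels means the network only has to upsample by hop_size / subbands rather than by the full hop size.

```python
# Rough sketch of a sub-band generator (illustrative hyperparameters).
import torch
import torch.nn as nn


class ResidualStack(nn.Module):
    """Stack of dilated 1-D convolutions.

    Growing dilations (1, 3, 9, 27) enlarge the generator's receptive field,
    which is the first enhancement discussed above.
    """

    def __init__(self, channels, dilations=(1, 3, 9, 27)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=d, padding=d),
                nn.LeakyReLU(0.2),
                nn.Conv1d(channels, channels, kernel_size=1),
            )
            for d in dilations
        ])

    def forward(self, x):
        for block in self.blocks:
            x = x + block(x)  # residual connection around each dilated block
        return x


class MultiBandGenerator(nn.Module):
    """Maps a mel-spectrogram to `subbands` waveforms at 1/subbands sample rate.

    Because the output is multi-channel, the total upsampling factor only
    needs to reach hop_size / subbands (e.g. 64 instead of 256), which is
    where most of the reduction in computation comes from.
    """

    def __init__(self, mel_channels=80, channels=384, subbands=4,
                 upsample_factors=(8, 4, 2)):
        super().__init__()
        layers = [nn.Conv1d(mel_channels, channels, kernel_size=7, padding=3)]
        for factor in upsample_factors:
            layers += [
                nn.LeakyReLU(0.2),
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=2 * factor, stride=factor,
                                   padding=factor // 2),
                ResidualStack(channels // 2),
            ]
            channels //= 2
        layers += [
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, subbands, kernel_size=7, padding=3),
            nn.Tanh(),
        ]
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        # mel: (batch, mel_channels, frames)
        # returns: (batch, subbands, frames * prod(upsample_factors))
        return self.net(mel)


# Example: 100 mel frames -> 4 sub-band signals of 6400 samples each.
# gen = MultiBandGenerator()
# subband_audio = gen(torch.randn(1, 80, 100))  # shape (1, 4, 6400)
```

The predicted sub-bands are then recombined into the full-band waveform (the paper uses a synthesis filter bank for this step) before being fed to the discriminator and the full-band STFT loss.
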
Performance Evaluation

The effectiveness of these changes is demonstrated empirically. Multi-band MelGAN achieves mean opinion scores (MOS) of 4.34 and 4.22 for waveform generation and TTS, respectively, attesting to its high-fidelity synthesis. Furthermore, with only 1.91M parameters the model is lightweight enough for real-time use, reaching a real-time factor of 0.03 on CPU without hardware-specific optimization.

Practical and Theoretical Implications

Practically, Multi-band MelGAN offers a substantial advance for TTS systems: it provides high-quality speech synthesis at low computational cost, making it suitable for deployment in resource-constrained environments such as mobile devices. Theoretically, its multi-band formulation may stimulate further research into similar signal decompositions in waveform generation and other applications where parallel processing can be exploited for efficiency.

Future Directions

The work opens several avenues for future exploration. One is to evaluate the multi-resolution STFT loss in other waveform-generation settings and domains. Another is to apply the architecture to multilingual TTS or other generative audio tasks where autoregressive models are too slow for practical use. Finally, integrating these improvements into end-to-end TTS frameworks could further narrow the fidelity gap between synthesized and human speech.

In conclusion, Multi-band MelGAN demonstrates significant progress in TTS waveform generation, combining strong perceptual quality with an efficient computational profile. Its gains in speed and quality suggest applications across many audio and speech processing tasks and provide a solid foundation for future work on efficient waveform generation.