Overview of Statistical Parametric Speech Synthesis with GANs
The paper by Saito et al. introduces a statistical parametric speech synthesis (SPSS) method that leverages generative adversarial networks (GANs) to improve synthetic speech quality, in particular by counteracting the over-smoothing effect. The research provides a framework that integrates GANs into acoustic model training, aiming to make synthesized speech sound more natural than the output of conventional methods.
Technical Contributions
The primary contribution of this research is the introduction of GANs into the training of acoustic models for SPSS. Conventional approaches that use deep neural networks (DNNs) for text-to-speech (TTS) synthesis and voice conversion (VC) suffer from over-smoothing, which degrades the quality of synthetic speech. By integrating a GAN, whose generator tries to fool a discriminator into classifying synthetic speech as natural, the authors propose a training paradigm that minimizes the divergence between the distributions of natural and generated speech parameters.
The discriminator in the GAN setup distinguishes natural from synthetic speech parameters, while the acoustic model is trained adversarially to minimize both the conventional minimum generation error (MGE) loss and an adversarial loss. This dual objective effectively alleviates over-smoothing, yielding synthetic spectral and pitch (F0) parameters that better match natural speech.
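The dual objective can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the function names are invented here, and the heuristic of rescaling the adversarial term by the ratio of the two loss magnitudes (so neither term dominates) is a simplified stand-in for the paper's expectation-based scale factor.

```python
import numpy as np

def mge_loss(y_nat, y_gen):
    """Minimum generation error: mean squared error between natural
    and generated acoustic parameter sequences."""
    return float(np.mean((y_nat - y_gen) ** 2))

def adversarial_loss(d_out_gen):
    """Cross-entropy term that pushes the discriminator's output for
    generated parameters toward 1 ("natural")."""
    eps = 1e-12
    return float(-np.mean(np.log(d_out_gen + eps)))

def combined_loss(y_nat, y_gen, d_out_gen, omega=1.0):
    """Dual objective: MGE loss plus a rescaled adversarial loss.
    The scale ratio keeps both terms at comparable magnitude; omega
    weights the adversarial contribution."""
    l_mge = mge_loss(y_nat, y_gen)
    l_adv = adversarial_loss(d_out_gen)
    scale = l_mge / (l_adv + 1e-12)
    return l_mge + omega * scale * l_adv
```

With a perfect generator (generated parameters equal to natural ones) the MGE term, and hence the whole objective, collapses to zero; as generation error grows, both the MGE term and the weighted adversarial term contribute to the gradient that the acoustic model is trained against.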
Key Findings
The research finds the Wasserstein GAN (W-GAN), which minimizes the Earth-Mover's distance (the Wasserstein-1 distance), to be the most effective variant, providing the largest improvements in synthetic speech quality. Evaluations show that speech synthesized with the proposed GAN-based methods surpasses conventional MGE training in naturalness as judged by human listeners, with improved spectral parameter generation and prosody.
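To make the Earth-Mover's distance concrete: for two one-dimensional empirical distributions with equally many samples, the Wasserstein-1 distance has a closed form as the mean absolute difference between sorted samples. The sketch below (illustrative only; the W-GAN estimates this distance with a learned critic rather than computing it directly) shows the quantity being minimized.

```python
import numpy as np

def wasserstein_1(a, b):
    """Earth-Mover's (Wasserstein-1) distance between two 1-D empirical
    distributions with the same number of samples: after sorting, it
    reduces to the mean absolute difference of order statistics."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    assert a.shape == b.shape, "equal sample counts assumed in this sketch"
    return float(np.mean(np.abs(a - b)))
```

Shifting one distribution by a constant shifts the distance by exactly that constant, which is part of why the W-GAN objective provides smooth, informative gradients even when the two distributions barely overlap.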
The results also show favorable spoofing rates across different hyper-parameter settings: the GAN-based training proved robust to varying configurations without extensive hyper-parameter tuning. Comparisons across GAN formulations further suggest that minimizing the KL and JS divergences yields less desirable outcomes than the W-GAN and LS-GAN.
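The compared GAN variants differ chiefly in the discriminator (or critic) objective, which determines the divergence being minimized. A minimal sketch of the three objectives discussed, assuming scalar discriminator outputs per frame (function names are illustrative):

```python
import numpy as np

def gan_d_loss(d_nat, d_gen):
    """Standard GAN discriminator loss; minimizing the resulting minimax
    game relates to the Jensen-Shannon divergence between natural and
    generated data. d_* are discriminator probabilities in (0, 1)."""
    eps = 1e-12
    return float(-np.mean(np.log(d_nat + eps))
                 - np.mean(np.log(1.0 - d_gen + eps)))

def lsgan_d_loss(d_nat, d_gen):
    """Least-squares GAN discriminator loss: regress natural outputs
    toward 1 and generated outputs toward 0."""
    return float(np.mean((d_nat - 1.0) ** 2) + np.mean(d_gen ** 2))

def wgan_critic_loss(c_nat, c_gen):
    """W-GAN critic loss: the negated estimate of the Wasserstein-1
    distance. The critic outputs are unbounded scores, and the critic
    must be kept Lipschitz (e.g. via weight clipping)."""
    return float(np.mean(c_gen) - np.mean(c_nat))
```

The paper's finding is that the W-GAN and LS-GAN objectives, which avoid the saturating log terms of the standard (JS-related) formulation, lead to better synthetic speech than training against the KL or JS divergences.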
Implications and Future Directions
The implications of this work extend beyond the immediate improvements in speech synthesis quality. Integrating GANs into the parameter generation process moves beyond traditional Gaussian modeling by capturing more complex distribution characteristics, making SPSS techniques more resilient and flexible for diverse applications, including nuanced voice conversion across different genders and linguistic contexts.
Moreover, the paper opens avenues for future work in machine learning and AI. Subsequent research could explore how discriminator design interacts with the acoustic model, potentially incorporating linguistic dependencies or latent variables to further improve performance. These techniques could also be extended to real-time applications such as interactive voice response systems, personal assistants, and multilingual text-to-speech systems.
In summary, the researchers provide a robust methodological advance in statistical parametric speech synthesis, demonstrating the benefit of generative adversarial networks in overcoming conventional hurdles in speech generation fidelity.