Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network (2007.12955v3)

Published 25 Jul 2020 in eess.AS and cs.SD

Abstract: In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real time because of its compact model and non-autoregressive (non-AR) and non-causal mechanisms. Although PWG achieves high-fidelity speech generation, the generic and simple network architecture lacks pitch controllability for an unseen auxiliary fundamental frequency ($F_{0}$) feature such as a scaled $F_{0}$. To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture corresponding to the auxiliary $F_{0}$ feature. Both objective and subjective experimental results show that QPPWG outperforms PWG when the auxiliary $F_{0}$ feature is scaled. Moreover, analyses of the intermediate outputs of QPPWG also show better tractability and interpretability of QPPWG, which respectively models spectral and excitation-like signals using the cascaded fixed and adaptive blocks of the QP structure.

Authors (5)

Yi-Chiao Wu
Tomoki Hayashi
Takuma Okamoto
Hisashi Kawai
Tomoki Toda

Summary

The paper presents a novel QPPWG that integrates pitch-dependence via PDCNNs to enhance pitch control in waveform generation.
It dynamically adjusts receptive fields to model both harmonic and non-harmonic speech components, boosting overall speech quality.
Objective and subjective evaluations show that QPPWG outperforms PWG in pitch accuracy when the auxiliary F0 feature is scaled, while using a smaller model.

Analysis of Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-Dependent Dilated Convolution Neural Network

The paper addresses pitch controllability in neural vocoding. It introduces the Quasi-Periodic Parallel WaveGAN (QPPWG), a non-autoregressive model that uses pitch-dependent dilated convolution neural networks (PDCNNs) to bring pitch information into the parallel WaveGAN (PWG) framework.

Methodology and Core Innovations

QPPWG builds on PWG, a compact GAN-based model that generates high-fidelity speech much faster than real time thanks to its non-autoregressive and non-causal design. A primary limitation of PWG is its lack of pitch controllability when the auxiliary fundamental frequency (F0) feature is unseen in training, such as a scaled F0, which restricts its use in scenarios that require pitch manipulation.

To address this, QPPWG introduces a quasi-periodic (QP) structure that lets the network architecture change dynamically with the input F0 feature. This is achieved through PDCNNs, whose dilation sizes, and therefore receptive fields, depend on the pitch. The QP structure cascades fixed blocks, which model spectral information, with adaptive blocks, which model excitation-like (periodic) signals, so that QPPWG covers both the harmonic and non-harmonic parts of speech.
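As a rough illustration of how a pitch-dependent dilation might work, the sketch below scales a base dilation by the ratio of sampling rate to F0 and then applies a non-causal, kernel-size-3 convolution whose tap spacing varies per sample. The scaling formula, the `dense_factor` parameter, the fallback to the base dilation for unvoiced samples, the zero padding, and all function names are assumptions made for this example rather than details stated in the summary above; scalar weights stand in for learned kernels.

```python
import numpy as np


def pitch_dependent_dilation(f0, sample_rate, dense_factor, base_dilation=1):
    """Return one integer dilation per sample.

    Assumed rule: dilation = base_dilation * sample_rate / (f0 * dense_factor),
    rounded to the nearest integer and never smaller than 1.
    Unvoiced samples (f0 <= 0) keep the fixed base dilation (an assumption).
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    scale = np.ones_like(f0)
    scale[voiced] = sample_rate / (f0[voiced] * dense_factor)
    return np.maximum(1, np.rint(scale * base_dilation)).astype(int)


def pd_dilated_conv(x, dilation, w_past, w_cur, w_future):
    """Non-causal 3-tap convolution with a per-sample dilation.

    y[t] = w_past * x[t - d[t]] + w_cur * x[t] + w_future * x[t + d[t]],
    with taps that fall outside the signal treated as zero.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    past = t - dilation
    future = t + dilation
    last = len(x) - 1
    x_past = np.where(past >= 0, x[np.clip(past, 0, last)], 0.0)
    x_future = np.where(future <= last, x[np.clip(future, 0, last)], 0.0)
    return w_past * x_past + w_cur * x + w_future * x_future


# Halving F0 (200 Hz -> 100 Hz) doubles the tap spacing; f0 = 0 marks unvoiced.
dilations = pitch_dependent_dilation([200.0, 100.0, 0.0], 24000, dense_factor=4)
print(dilations.tolist())  # [30, 60, 1]
```

Because the tap spacing follows the pitch period, scaling the auxiliary F0 changes which past and future samples each layer sees, which is one way to read the claim that the architecture changes dynamically with F0.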

Experimental Evaluation

Objective and subjective evaluations compare QPPWG against the baseline PWG and QPNet models. Several QPPWG configurations were explored, varying hyperparameters such as network depth, dilation cycles, and block configurations. The results indicated:

QPPWG shows higher pitch accuracy and better speech quality than PWG, particularly when the auxiliary F0 is scaled, a condition under which PWG tends to degrade.
Although QPPWG is smaller than PWG, it performs comparably or better, which points to an efficient model design.
Analysis of the intermediate outputs shows that the fixed and adaptive blocks model distinct components of speech, which improves tractability and interpretability.
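To make the role of network depth and dilation cycles more concrete, the sketch below computes the receptive field of a stack of dilated convolutions. The exponentially growing schedule that restarts each cycle, the kernel size of 3, the `pitch_scale` multiplier for adaptive layers, and the function names are illustrative assumptions, not configurations reported in the summary.

```python
def dilation_schedule(num_layers, num_cycles):
    """Dilations 1, 2, 4, ... that restart at the beginning of each cycle."""
    if num_layers % num_cycles != 0:
        raise ValueError("num_layers must be divisible by num_cycles")
    per_cycle = num_layers // num_cycles
    return [2 ** (i % per_cycle) for i in range(num_layers)]


def receptive_field(dilations, kernel_size=3, pitch_scale=1):
    """Receptive field in samples of stacked dilated convolutions.

    Each layer adds (kernel_size - 1) * dilation samples. For adaptive
    (pitch-dependent) layers, pitch_scale multiplies every dilation,
    assuming one constant scale over the whole span.
    """
    return 1 + (kernel_size - 1) * pitch_scale * sum(dilations)


fixed = dilation_schedule(num_layers=30, num_cycles=3)
print(receptive_field(fixed))  # 6139

adaptive = dilation_schedule(num_layers=4, num_cycles=2)
print(receptive_field(adaptive, pitch_scale=30))  # 361
```

Under these assumptions, a few adaptive layers with a pitch-scaled dilation can span several pitch periods with far fewer layers than a fixed stack would need, which is consistent with the report that a smaller model can match or exceed PWG.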

Implications and Future Directions

The work has both practical and theoretical implications. QPPWG shows that incorporating pitch information into a neural vocoder's architecture gives finer control over the speech generation process. Because PDCNNs operate in a non-autoregressive setting, the approach may carry over to other neural network-based models that require pitch adaptation.

The research opens new avenues in high-quality speech synthesis and vocoding, addressing a crucial need for better pitch manipulation tools in real-time applications. Future research may explore the integration of QPPWG with other GAN-based vocoders and extend its applications across multi-lingual and multi-speaker scenarios. Furthermore, model optimization and reduction of computational costs while maintaining generative capacity remain pivotal areas for ongoing investigation.

Overall, this research advances neural vocoding by adding pitch controllability to a fast, compact waveform generator while improving both its performance and its interpretability.
