
Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network (2007.05663v3)

Published 11 Jul 2020 in eess.AS and cs.SD

Abstract: In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency ($F_{0}$) features are outside the $F_{0}$ range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary $F_{0}$ features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependencies of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary $F_{0}$ features and the effectiveness of the cascaded structure for speech generation.

Citations (18)

Summary

  • The paper introduces Quasi-Periodic WaveNet (QPNet), an autoregressive model incorporating pitch-dependent dilated convolution neural networks (PDCNNs) to improve pitch controllability in waveform generation.
  • Experiments show QPNet significantly outperforms baseline WaveNet models in pitch accuracy for both sinusoidal and speech generation, especially with unseen pitch features.
  • QPNet's ability to achieve precise pitch control is crucial for applications like voice conversion and expressive speech synthesis, expanding the practical use of neural vocoders.

Quasi-Periodic WaveNet: Enhancing Pitch Controllability in Autoregressive Audio Generation Models

The paper "Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network" presents a novel approach to improving the pitch controllability of deep learning-based waveform generative models, particularly WaveNet, with applications in speech generation. The authors propose the Quasi-Periodic WaveNet (QPNet), which incorporates a pitch-dependent dilated convolutional neural network (PDCNN) to dynamically adjust network pathways based on pitch information, thereby addressing the pitch-controllability challenges encountered in conventional WaveNet vocoders.

Key Innovations

The innovations introduced in QPNet stem from the combination of pitch knowledge and autoregressive modeling to enhance the generative capacity and control in raw audio waveform synthesis:

  1. Pitch-dependent Dilated Convolution Neural Networks (PDCNNs): By adapting the dilation size of the convolutional layers based on auxiliary fundamental frequency ($F_{0}$) features, the network captures pitch-proximal information efficiently, extending the effective receptive field when necessary. This design effectively curbs pitch discrepancies seen when $F_{0}$ values fall outside the training data's observed range.
  2. Cascaded Network Architecture: QPNet cascades two sections to model both the short-term and long-term dependencies of quasi-periodic signals such as speech. The fixed section handles local, sample-level dependencies, while the adaptive section uses PDCNNs to follow the pitch-dependent periodicity of the signal.
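
To make the pitch-dependent dilation idea concrete, here is a minimal NumPy sketch. The summary above does not give the exact scaling rule, so the formula used here (dilation = base dilation × round(sample rate / ($F_{0}$ × dense factor))) is an assumption, as are the names `pitch_dependent_dilations`, `pd_causal_tap`, `dense_factor`, and the choice to fall back to a scale of 1 for unvoiced samples. It also assumes $F_{0}$ has already been upsampled to one value per waveform sample.

```python
import numpy as np

def pitch_dependent_dilations(f0_hz, sample_rate=16000, dense_factor=4, base_dilation=1):
    """Per-sample dilation sizes that shrink as F0 rises (assumed rule, see lead-in)."""
    f0 = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0 > 0
    # Avoid division by zero on unvoiced samples (F0 == 0).
    safe_f0 = np.where(voiced, f0, 1.0)
    # Assumed rule: the scale is a fraction of the pitch period, in samples.
    scale = np.where(voiced, np.round(sample_rate / (safe_f0 * dense_factor)), 1.0)
    scale = np.maximum(scale, 1.0)  # never dilate by less than one sample
    return (scale * base_dilation).astype(np.int64)

def pd_causal_tap(x, dilations):
    """Return x[t - d_t] for each t, with zeros where the index precedes the signal.

    This is the 'past' input of a kernel-size-2 causal convolution whose
    dilation d_t changes at every time step.
    """
    x = np.asarray(x, dtype=np.float64)
    past_idx = np.arange(len(x)) - np.asarray(dilations, dtype=np.int64)
    gathered = x[np.maximum(past_idx, 0)]
    return np.where(past_idx >= 0, gathered, 0.0)

if __name__ == "__main__":
    f0 = np.array([100.0, 200.0, 400.0, 0.0])
    print(pitch_dependent_dilations(f0, base_dilation=2))
```

Under this assumed rule, doubling $F_{0}$ halves the dilation, so each layer keeps looking back over the same fraction of a pitch period regardless of the pitch value.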

Experimental Results

QPNet was evaluated on two tasks, sinusoidal signal generation and speech generation:

  • Sinusoidal Evaluation: In generating periodic waveforms, QPNet is more accurate than the baseline WaveNet models, including when the auxiliary pitch values lie outside the range seen in training. The PDCNN improves spectral and pitch accuracy by exploiting prior $F_{0}$ knowledge.
  • Speech Generation Evaluation: In speech synthesis tasks, QPNet significantly outperformed both the compact-size and full-size WaveNet models in terms of pitch accuracy and unvoiced/voiced (U/V) decision error. Although the full-size WaveNet attained slightly better spectral modeling capacity, measured by mel-cepstral distortion (MCD), due to its complexity, QPNet achieved a balanced trade-off between size, computation, and output quality, especially with unseen auxiliary pitch features.
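
The two pitch-related measures mentioned above can be sketched as follows. The summary does not say how the paper computes them, so this follows one common convention rather than the authors' exact protocol: $F_{0}$ RMSE is taken over frames that are voiced in both contours, and U/V error is the percentage of frames whose voicing decisions disagree. The function name and the convention that $F_{0}$ = 0 marks an unvoiced frame are assumptions.

```python
import numpy as np

def f0_rmse_and_uv_error(f0_ref, f0_gen):
    """Return (F0 RMSE in Hz over co-voiced frames, U/V error in percent)."""
    ref = np.asarray(f0_ref, dtype=np.float64)
    gen = np.asarray(f0_gen, dtype=np.float64)
    if ref.shape != gen.shape:
        raise ValueError("F0 contours must have the same number of frames")
    # Assumed convention: a frame with F0 == 0 is unvoiced.
    ref_voiced = ref > 0
    gen_voiced = gen > 0
    both_voiced = ref_voiced & gen_voiced
    # U/V error: share of frames where the voicing decisions differ.
    uv_error = float(np.mean(ref_voiced != gen_voiced) * 100.0)
    if both_voiced.any():
        diff = ref[both_voiced] - gen[both_voiced]
        rmse = float(np.sqrt(np.mean(diff ** 2)))
    else:
        rmse = float("nan")  # no frame is voiced in both contours
    return rmse, uv_error

if __name__ == "__main__":
    rmse, uv = f0_rmse_and_uv_error([100, 0, 200, 150], [103, 0, 0, 146])
    print(f"F0 RMSE: {rmse:.2f} Hz, U/V error: {uv:.1f}%")
```

Restricting the RMSE to co-voiced frames keeps voicing mistakes from inflating the pitch error, which is why the two numbers are usually reported side by side.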

Implications and Future Directions

The introduction of QPNet is a notable contribution to the field of waveform generation and vocoder design, specifically targeting one of the limitations of generic WaveNet implementations: systematic pitch control. By embedding domain knowledge into the network architecture, QPNet widens the applicability of neural vocoders to scenarios demanding precise pitch manipulation, which is critical for tasks like voice conversion and expressive speech synthesis.

Looking forward, improvements could focus on further optimizing the architecture to reduce computational overhead without sacrificing performance. Additionally, exploring variations of PDCNN within other neural network architectures could potentially yield improvements in related generative modeling domains like music synthesis or environmental sound processing.

In summary, the Quasi-Periodic WaveNet offers a promising direction for achieving high-quality, pitch-controllable audio synthesis through strategic convolutional adaptations and cascaded network structures. This work empowers applications that require finely-tuned pitch modulation, enhancing both theoretical understanding and practical implementations in audio signal processing.
