LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (1810.11846v2)

Published 28 Oct 2018 in eess.AS, cs.LG, and cs.SD

Abstract: Neural speech synthesis models have recently demonstrated the ability to synthesize high quality speech for text-to-speech and compression applications. These new models often require powerful GPUs to achieve real-time operation, so being able to reduce their complexity would open the way for many new applications. We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis. We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS. This makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.

Citations (439)

Summary

  • The paper introduces LPCNet, reducing computational demands in speech synthesis by integrating linear prediction with deep learning.
  • It leverages block-sparse GRU layers and dual fully-connected layers for efficient modeling of excitation signals.
  • Experimental evaluations show LPCNet delivers competitive audio quality at low complexity, enabling real-time synthesis on mobile devices.

LPCNet: Improving Neural Speech Synthesis Through Linear Prediction

The paper "LPCNet: Improving Neural Speech Synthesis Through Linear Prediction" presents an enhanced version of the WaveRNN architecture, named LPCNet. The primary objective is to address the high computational demands of real-time neural speech synthesis, enabling such technology to function on lower-power devices, including mobile phones and embedded systems.

The authors, Jean-Marc Valin from Mozilla and Jan Skoglund from Google LLC, propose a system that leverages the well-established technique of linear prediction, integrated into a deep learning framework. LPCNet achieves improved efficiency by offloading the task of spectral envelope modeling to traditional linear prediction techniques, thereby allowing the recurrent neural network to focus on modeling the excitation signal. This division of labor between linear prediction and neural networks is a key innovation of the paper.

Technical Contributions and Methodology

LPCNet improves on prior neural vocoders by incorporating:

  • Linear Prediction Coefficients (LPCs): a simple linear filter predicts the current sample from past samples, so the network only has to model the excitation (the residual). This frees network capacity and yields efficiency gains (a minimal sketch follows this list).
  • Block-Sparse Matrices: the sample-rate GRU uses block sparsity, which allows efficient vectorization while keeping complexity below 3 GFLOPS, substantially lower than comparable models such as WaveRNN and SampleRNN (see the block-sparse sketch below).
  • Pre-emphasis and Quantization: a first-order pre-emphasis filter is applied before 8-bit μ-law quantization of the output signal; the matching de-emphasis filter at the output attenuates high-frequency quantization noise, reducing perceived noise and making 8 bits sufficient for high-quality output (see the pre-emphasis/μ-law sketch below).
  • Dual Fully-Connected Layers: a dual fully-connected output layer improves estimation of the sample probability distribution compared with a single layer.
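
To make the prediction/excitation split concrete, here is a minimal numpy sketch of how an LPC filter predicts each sample from its recent past and leaves a residual for the network to model. The function name and the use of clean input samples are illustrative assumptions; at synthesis time the paper predicts from previously generated samples instead.

```python
import numpy as np

def lpc_split(samples, lpc):
    """Split speech into an LPC prediction and an excitation residual.

    samples : 1-D float array of speech samples
    lpc     : coefficients a_1..a_M of an order-M linear predictor
    """
    prediction = np.zeros_like(samples)
    for i, a in enumerate(lpc, start=1):
        # p_t += a_i * s_{t-i}; samples before the start are treated as zero
        prediction[i:] += a * samples[:-i]
    excitation = samples - prediction  # the part left for the network to model
    return prediction, excitation
```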
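
The block-sparsity idea can be illustrated with a small numpy routine: only the non-zero blocks of a weight matrix are stored, and each surviving block contributes one small dense matrix-vector product that vectorizes well, while zeroed blocks cost nothing. The block shape and sizes here are illustrative, not the paper's exact configuration.

```python
import numpy as np

def block_sparse_matvec(blocks, positions, x, block=16):
    """Multiply a block-sparse weight matrix by a vector.

    blocks    : (n_blocks, block, block) array holding the dense blocks
    positions : list of (row_block, col_block) index pairs, one per block
    x         : input vector
    """
    out_rows = (max(r for r, _ in positions) + 1) * block
    y = np.zeros(out_rows)
    for b, (r, c) in zip(blocks, positions):
        # one small dense GEMV per surviving block; absent blocks are skipped
        y[r * block:(r + 1) * block] += b @ x[c * block:(c + 1) * block]
    return y

# e.g. a 64x64 matrix keeping only 3 of its 16 blocks (~80% sparse)
rng = np.random.default_rng(0)
blocks = rng.standard_normal((3, 16, 16))
y = block_sparse_matvec(blocks, [(0, 0), (1, 2), (3, 3)], rng.standard_normal(64))
```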
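
A hedged sketch of the pre-emphasis and 8-bit μ-law steps follows; the filter coefficient is an assumed typical value, while the μ-law companding formula itself is standard.

```python
import numpy as np

def preemphasis(x, alpha=0.85):
    # First-order pre-emphasis E(z) = 1 - alpha * z^{-1}
    # (alpha = 0.85 is an assumed typical value, not confirmed from the paper).
    y = x.copy()
    y[1:] -= alpha * x[:-1]
    return y

def mulaw_encode(x, mu=255):
    # Standard mu-law companding of [-1, 1] floats to 256 discrete levels.
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)
```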

The network combines convolutional layers operating at the frame rate with recurrent layers operating at the sample rate, yielding an architecture that synthesizes audio at low computational cost. The system predicts excitation residuals rather than raw waveform samples, a task that is easier for the network to model.
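
For orientation, here is a rough PyTorch sketch of that two-rate structure: a frame-rate conditioning network feeding a sample-rate recurrent network with a dual fully-connected output. The layer sizes, the three scalar sample inputs, and the simplified dual-FC combination are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LPCNetSketch(nn.Module):
    def __init__(self, n_feat=20, cond_dim=128, gru_a=384, gru_b=16,
                 levels=256, samples_per_frame=160):
        super().__init__()
        self.samples_per_frame = samples_per_frame  # e.g. 10 ms frames at 16 kHz
        # Frame-rate network: convolutions over per-frame acoustic features.
        self.frame_net = nn.Sequential(
            nn.Conv1d(n_feat, cond_dim, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv1d(cond_dim, cond_dim, kernel_size=3, padding=1), nn.Tanh(),
        )
        # Sample-rate network: GRUs over [prediction, last sample, last excitation]
        # plus the upsampled conditioning (the paper embeds mu-law values;
        # raw scalars are used here for simplicity).
        self.gru_a = nn.GRU(3 + cond_dim, gru_a, batch_first=True)
        self.gru_b = nn.GRU(gru_a, gru_b, batch_first=True)
        # Simplified stand-in for the paper's dual fully-connected output:
        # two parallel tanh branches summed before the softmax.
        self.fc1 = nn.Linear(gru_b, levels)
        self.fc2 = nn.Linear(gru_b, levels)

    def forward(self, feats, sample_inputs):
        # feats: (batch, n_feat, n_frames); sample_inputs: (batch, n_samples, 3)
        cond = self.frame_net(feats)                  # (batch, cond_dim, n_frames)
        cond = cond.repeat_interleave(self.samples_per_frame, dim=2)
        x = torch.cat([sample_inputs, cond.transpose(1, 2)], dim=-1)
        h, _ = self.gru_a(x)
        h, _ = self.gru_b(h)
        # Logits over the mu-law excitation levels (softmax applied by the loss).
        return torch.tanh(self.fc1(h)) + torch.tanh(self.fc2(h))

model = LPCNetSketch()
logits = model(torch.randn(1, 20, 4), torch.randn(1, 640, 3))  # 4 frames -> 640 samples
```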

Evaluation and Implications

In subjective evaluations using MUSHRA testing, LPCNet outperformed an optimized WaveRNN variant in perceived quality at the same computational budget, and retained high quality at significantly reduced complexity. The test material came from speakers held out of the training data, supporting the model's speaker-independent operation.

The results further suggest that blending classical signal processing with modern neural techniques holds substantial potential for efficient speech synthesis and coding. The authors also point to integrating techniques such as long-term (pitch) prediction as a possible way to drive complexity down even further.

Conclusion

This paper presents a coherent framework for merging traditional signal processing with neural networks to improve the efficiency and practicality of neural speech synthesis. LPCNet marks a significant step toward real-time, low-complexity, high-quality audio synthesis and compression, while ongoing research may extend these ideas and explore post-processing methods for reducing remaining artifacts.

Overall, the presented results suggest that LPCNet delivers an effective balance between performance and computational efficiency, making it a promising candidate for future deployment in mobile and other resource-constrained environments.