- The paper presents WaveRNN, a compact single-layer RNN that cuts the number of sequential operations per 16-bit sample from WaveNet's 60 to 5 while preserving high audio quality.
- The paper introduces Sparse WaveRNN, showing that large models pruned to roughly 96% sparsity outperform smaller dense networks with the same parameter count and enable real-time synthesis on mobile CPUs.
- The paper presents Subscale WaveRNN, a parallel generation scheme that produces 16 samples per step, paving the way for efficient GPU inference.
Efficient Neural Audio Synthesis: An Expert Overview
In "Efficient Neural Audio Synthesis," Kalchbrenner et al. explore the challenges of making neural audio generation fast. Their research focuses on optimizing the sampling efficiency of sequential generative models, particularly for text-to-speech (TTS) synthesis, while maintaining high-quality audio output.
Introduction and Problem Setting
Autoregressive sequence models built on RNNs and CNNs achieve state-of-the-art results on audio, textual, and visual data by factorizing the data distribution into a product of conditional probabilities. A significant limitation of these models, however, is that this factorization makes sampling slow and computationally intensive: samples must be drawn one at a time, each conditioned on those before it. The authors address this inefficiency with three complementary techniques that increase sampling speed without degrading audio quality.
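Concretely, for a waveform $\mathbf{x} = (x_1, \dots, x_T)$, these models define

$$
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),
$$

so all T samples are generated in strict sequence. At 24kHz that is 24,000 sequential steps per second of audio, which is why the per-step cost (the number of operations, their individual latency, and the sampling overhead) directly determines whether synthesis can run in real time.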
WaveRNN: A Compact and Efficient Model
The authors introduce WaveRNN, a single-layer recurrent neural network (RNN) with a dual softmax layer: each 16-bit sample is split into a coarse (high 8 bits) and a fine (low 8 bits) part, so each softmax ranges over only 256 values rather than 65,536. The model matches the audio quality of WaveNet while drastically reducing the work per sample: WaveRNN requires only 5 sequential matrix-vector operations per 16-bit sample, compared to 60 in WaveNet. It generates 24kHz 16-bit audio at 96,000 samples per second on an Nvidia P100 GPU, 4 times faster than real time.
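The following NumPy sketch illustrates the dual-softmax idea. It is a simplification, not the paper's exact cell: the weight names are hypothetical, the real model uses a gated (GRU-like) unit with masked weights, and it conditions the fine prediction on the coarse value sampled at the same step.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def wavernn_step(h, c_prev, f_prev, p):
    """One sampling step of a simplified dual-softmax WaveRNN cell.

    The 16-bit sample is split into a coarse (high 8 bits) and a fine
    (low 8 bits) byte, each drawn from its own 256-way softmax. The
    weight names R, I, O1..O4 are illustrative placeholders.
    """
    x = np.array([c_prev, f_prev]) / 127.5 - 1.0      # bytes -> [-1, 1]

    # Op 1 of 5: the single recurrent matrix-vector product.
    h = np.tanh(p["R"] @ h + p["I"] @ x)

    # The state splits in half; each half feeds a 2-layer output head.
    y_c, y_f = np.split(h, 2)
    p_c = softmax(p["O2"] @ np.maximum(p["O1"] @ y_c, 0))  # ops 2-3
    c = int(np.random.choice(256, p=p_c))
    p_f = softmax(p["O4"] @ np.maximum(p["O3"] @ y_f, 0))  # ops 4-5
    f = int(np.random.choice(256, p=p_f))

    return h, c, f, (c << 8) | f                      # 16-bit sample
```

Splitting the state and the output heads this way is what keeps the count at 5 matrix-vector products per sample: one recurrent update plus two per output head.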
Sparse WaveRNN: Balancing Parameters and Performance
To further enhance efficiency, the authors apply weight pruning to WaveRNN, producing Sparse WaveRNN models in which most weights are zeroed out during training. They find that large sparse networks consistently outperform small dense networks with the same parameter count, maintaining high audio fidelity even at sparsity levels around 96%. Pruning weights in structured blocks (such as 4x4 or 16x1) keeps the sparse matrix-vector products cache-friendly, enabling real-time audio synthesis on off-the-shelf mobile CPUs such as the Snapdragon 808 and 835.
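The pruning mechanics can be sketched as follows. The cubic sparsity ramp is the gradual-pruning schedule of Zhu & Gupta (2017) that the paper builds on; the step counts and block shape here are placeholder values, not the paper's settings.

```python
import numpy as np

def target_sparsity(step, t0=1_000, S=200_000, Z=0.96):
    """Cubic ramp from 0 to final sparsity Z between steps t0 and t0+S
    (the gradual-pruning schedule of Zhu & Gupta, 2017)."""
    if step < t0:
        return 0.0
    frac = min((step - t0) / S, 1.0)
    return Z * (1.0 - (1.0 - frac) ** 3)

def block_prune_mask(W, sparsity, block=(4, 4)):
    """Zero out the lowest-magnitude blocks of weight matrix W.

    Blocks are scored by mean absolute weight; block shapes like 4x4
    or 16x1 keep the surviving weights in contiguous chunks so that
    the sparse matrix-vector products stay fast on CPU.
    """
    bh, bw = block
    rows, cols = W.shape[0] // bh, W.shape[1] // bw
    # Mean |w| per block, shape (rows, cols).
    scores = np.abs(W.reshape(rows, bh, cols, bw)).mean(axis=(1, 3))
    k = min(int(sparsity * scores.size), scores.size - 1)
    cutoff = np.sort(scores, axis=None)[k]   # k-th smallest block score
    mask = (scores >= cutoff).astype(W.dtype)
    # Expand the block mask back to the full weight shape.
    return np.kron(mask, np.ones((bh, bw), dtype=W.dtype))

# During training one would periodically recompute and apply the mask,
# e.g. W *= block_prune_mask(W, target_sparsity(step)), every few
# hundred steps, so the network adapts to the growing sparsity.
```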
Subscale WaveRNN: Parallelizing Sequential Generation
The final technique, Subscale WaveRNN, introduces a novel generation scheme based on subscaling: a sequence of length L is folded into B interleaved sub-sequences of length L/B. Each sub-sequence is conditioned on the past of all sub-sequences and on a limited future horizon of the previously generated ones, so after a short ramp-up the model emits B = 16 samples per step in parallel without sacrificing quality. Because the B sub-sequences can be processed as a batch, Subscale WaveRNN supports batched sampling, paving the way for real-time synthesis on GPUs.
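A minimal NumPy sketch of the fold/unfold reordering and the staggered schedule that makes batched generation possible; the horizon F and the lengths used here are illustrative, not the paper's settings.

```python
import numpy as np

def subscale_fold(x, B=16):
    """Fold a waveform of length T into B interleaved sub-sequences.

    Sub-sequence s holds samples x[s], x[s+B], x[s+2B], ...; the first
    sub-sequence is a 1/B-rate "subscale" version of the signal.
    """
    T = len(x) - len(x) % B
    return x[:T].reshape(-1, B).T          # shape (B, T // B)

def subscale_unfold(subs):
    """Inverse of subscale_fold: re-interleave the B sub-sequences."""
    return subs.T.reshape(-1)

def generation_schedule(B=16, L=64, F=2):
    """List, per step, the (sub-sequence, position) pairs emitted.

    Sub-sequence s may generate position i once sub-sequence s-1 has
    reached position i + F, where F is the future horizon it is
    conditioned on, so s runs F steps behind s-1. After a ramp-up of
    (B - 1) * F steps all B sub-sequences emit in parallel; since in
    practice L >> B * F, nearly every step yields B samples.
    """
    steps = []
    for t in range(L + (B - 1) * F):
        steps.append([(s, t - s * F) for s in range(B)
                      if 0 <= t - s * F < L])
    return steps
```

The key point is that the dependency structure is preserved: every emitted sample conditions only on samples produced at earlier steps, so parallelism comes from reordering the sequence rather than from weakening the model.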
Experimental Results
The authors validate their models on a text-to-speech benchmark using a single-speaker North American English dataset. The evaluation metrics include Negative Log-Likelihood (NLL), Mean Opinion Scores (MOS), and A/B comparison tests rated by human listeners. Key quantitative results include:
- WaveRNN-896 achieves an NLL of 5.42 and MOS of 4.37, comparable to the state-of-the-art WaveNet model.
- Sparse WaveRNN with 96% sparsity retains high audio quality and achieves real-time performance on mobile CPUs.
- Subscale WaveRNN matches the audio fidelity of WaveRNN-896 in human evaluations while generating 16 samples per step in parallel.
Implications and Future Directions
The implications of this research extend beyond TTS synthesis. The methodologies proposed by the authors, particularly WaveRNN and its sparse and subscale variants, provide a foundation for real-time inference in other sequential generation tasks, such as music synthesis and video frame generation. The success of block-sparse models suggests that further exploration into different sparsity structures could yield even more efficient architectures. Additionally, the subscale dependency scheme opens avenues for parallel processing in other domains where sequential models are traditionally bottlenecked by the serial nature of their sampling process.
Conclusion
Kalchbrenner et al. contribute significantly to the field of neural audio synthesis by addressing the bottleneck of inefficient sampling in sequential models. Their development of WaveRNN, Sparse WaveRNN, and Subscale WaveRNN demonstrates that it is possible to achieve high-quality audio synthesis with computational efficiency, thereby enabling real-time applications on a range of devices from high-performance GPUs to mobile CPUs. The approaches detailed in their work suggest promising future research directions in optimizing sequential generative models for various artificial intelligence applications.