
FloWaveNet : A Generative Flow for Raw Audio (1811.02155v3)

Published 6 Nov 2018 in cs.SD and eess.AS

Abstract: Most modern text-to-speech architectures use a WaveNet vocoder for synthesizing high-fidelity waveform audio, but there have been limitations, such as high inference time, in its practical application due to its ancestral sampling scheme. The recently suggested Parallel WaveNet and ClariNet have achieved real-time audio synthesis capability by incorporating inverse autoregressive flow for parallel sampling. However, these approaches require a two-stage training pipeline with a well-trained teacher network and can only produce natural sound by using probability distillation along with auxiliary loss terms. We propose FloWaveNet, a flow-based generative model for raw audio synthesis. FloWaveNet requires only a single-stage training procedure and a single maximum likelihood loss, without any additional auxiliary terms, and it is inherently parallel due to the characteristics of generative flow. The model can efficiently sample raw audio in real-time, with clarity comparable to previous two-stage parallel models. The code and samples for all models, including our FloWaveNet, are publicly available.

Analysis of FloWaveNet: A Generative Flow for Raw Audio

FloWaveNet presents a novel approach to raw audio synthesis within text-to-speech systems, distinguished by its use of generative flow models and a single-stage training procedure. Unlike preceding methods such as WaveNet, Parallel WaveNet, or ClariNet, which suffered from high inference times or complex training pipelines involving auxiliary loss terms, FloWaveNet gains efficiency through flow-based generative modeling. The model improves waveform generation and serves as a drop-in replacement for the WaveNet vocoders commonly used in state-of-the-art text-to-speech architectures.

The detailed comparison indicates that FloWaveNet performs on par with two-stage models like ClariNet, achieving similar audio clarity with a simpler training methodology. Most notably, FloWaveNet's training framework requires neither a pre-trained teacher network nor additional auxiliary losses. In terms of sound fidelity, the model matches these earlier approaches despite its streamlined design.

FloWaveNet's architecture, built from context blocks and flow operations, features WaveNet-based affine coupling layers and activation normalization (ActNorm) for stabilized training. Because each flow operation is an efficient invertible transformation, the model can sample audio signals in parallel with a single pass. Experimental findings show the real-world impact of this design: WaveNet achieved sampling speeds of approximately 172 samples per second, a major bottleneck, while FloWaveNet's non-autoregressive framework reaches roughly 420,000 samples per second.
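
To make the coupling mechanism concrete, the sketch below implements a generic affine coupling layer in PyTorch. It is a minimal illustration of the standard technique, not the paper's exact layer: in FloWaveNet the network predicting log-scale and shift is a non-causal WaveNet stack, whereas here a small convolutional net stands in, and the half-channel split is an assumption made for brevity.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling layer (a sketch, not FloWaveNet's exact module).

    In FloWaveNet the `net` below would be a WaveNet-style dilated
    convolution stack; a small conv net is used here as a placeholder.
    """
    def __init__(self, channels, hidden=64):
        super().__init__()
        # Predicts log-scale and shift for the transformed half
        # from the untouched half.
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Split channels; x_a passes through unchanged.
        x_a, x_b = x.chunk(2, dim=1)
        log_s, t = self.net(x_a).chunk(2, dim=1)
        y_b = x_b * torch.exp(log_s) + t
        # The Jacobian is triangular, so log|det J| is just the
        # sum of log-scales: cheap and exact.
        logdet = log_s.flatten(1).sum(dim=1)
        return torch.cat([x_a, y_b], dim=1), logdet

    def inverse(self, y):
        # Exact analytic inverse: this is what enables fully
        # parallel sampling from the prior.
        y_a, y_b = y.chunk(2, dim=1)
        log_s, t = self.net(y_a).chunk(2, dim=1)
        x_b = (y_b - t) * torch.exp(-log_s)
        return torch.cat([y_a, x_b], dim=1)
```

Because one half of the channels passes through unchanged, both directions of the transform cost a single network evaluation, which is why stacking such layers keeps sampling non-autoregressive.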

At the core of the generative flow model is the ability to transform samples from a simple known distribution, such as a Gaussian, into highly complex data through transformations that are both tractable and invertible. This is what lets FloWaveNet perform direct maximum likelihood estimation, reducing training complexity while enabling fast inference.
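
For reference, this is the standard change-of-variables objective that flow models maximize; here $x$ is a waveform segment, $h_0 = x$, $h_k = f_k(h_{k-1})$ for the composed flows $f_1, \dots, f_K$, $z = h_K$, and $p_Z$ is a simple Gaussian prior (conditioning on mel-spectrogram features is omitted for brevity):

$$
\log p_X(x) = \log p_Z(z) + \sum_{k=1}^{K} \log \left| \det \frac{\partial h_k}{\partial h_{k-1}} \right|
$$

Affine coupling layers make every Jacobian term in the sum a triangular determinant, so the whole objective is computable in a single forward pass.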

Furthermore, the paper demonstrates that lowering the temperature of the prior, i.e., scaling down the standard deviation of the latent distribution at sampling time, can enhance sound quality, a notable empirical insight for audio synthesis applications. When evaluated under objective measures such as conditional log-likelihood and subjective measures such as mean opinion score (MOS), FloWaveNet falls just short of autoregressive models in fidelity while sampling orders of magnitude faster.
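
As a sketch of what temperature-scaled sampling looks like in practice (the `model.inverse` call, the tensor shapes, and the temperature value are illustrative assumptions, not the paper's API):

```python
import torch

def sample(model, cond, temperature=0.8):
    """Draw a waveform by pushing a reduced-variance Gaussian latent
    through the inverse flow.

    `model.inverse` and the shape handling here are hypothetical;
    a temperature < 1.0 shrinks the prior's standard deviation,
    which empirically trades sample diversity for cleaner audio.
    """
    # Latent with the same temporal length as the conditioning features.
    z = torch.randn(cond.shape[0], 1, cond.shape[-1]) * temperature
    with torch.no_grad():
        x = model.inverse(z, cond)  # one parallel pass, no ancestral loop
    return x
```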

The implications are significant for both AI research and practical applications in audio synthesis. Future work could refine how the sampling temperature is selected, potentially yielding further gains in auditory quality. Additionally, the discussion of causality in dilated convolutions opens avenues for improved architectures that exploit the bi-directional receptive field available to non-causal models.

In conclusion, FloWaveNet marks a significant step toward more efficient, stable, and high-fidelity audio synthesis in neural vocoder systems. The findings of this paper offer substantial opportunities for extending generative models to real-time applications. The practical benefits of streamlined training, rapid inference, and comparable fidelity present a compelling case for wider adoption of, and continued investigation into, flow-based audio modeling.

Authors (5)
  1. Sungwon Kim (32 papers)
  2. Sang-gil Lee (15 papers)
  3. Jongyoon Song (10 papers)
  4. Jaehyeon Kim (16 papers)
  5. Sungroh Yoon (163 papers)
Citations (166)