
Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing (2211.06989v2)

Published 13 Nov 2022 in cs.SD and eess.AS

Abstract: Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fast DSP operation, but generating a high-quality waveform from a mel-spectrogram requires computationally expensive machine learning: a neural vocoder. Our proposed "autovocoder" reverses this arrangement. We use machine learning to obtain a representation that replaces the mel-spectrogram, and that can be inverted back to a waveform using simple, fast operations including a differentiable implementation of the inverse STFT. The autovocoder generates a waveform 5 times faster than the DSP-based Griffin-Lim algorithm, and 14 times faster than the neural vocoder HiFi-GAN. We provide perceptual listening test results to confirm that the speech is of comparable quality to HiFi-GAN in the copy synthesis task.
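The core idea in the abstract — learn the representation, then recover the waveform with a cheap inverse STFT rather than a neural vocoder — can be sketched as follows. This is a minimal illustration, not the paper's architecture: the `encode` function here is just the STFT itself (so decoding reduces to copy synthesis), standing in for the learned encoder; all names and settings are assumptions.

```python
import torch

# Illustrative sketch of the autovocoder decoding path: waveform
# generation is a single differentiable inverse STFT (torch.istft),
# with no neural vocoder in the loop. The encoder shown here is a
# plain STFT as a placeholder for the learned representation.
N_FFT, HOP = 1024, 256

def encode(wav: torch.Tensor) -> torch.Tensor:
    """Stand-in for the learned encoder (hypothetical)."""
    return torch.stft(wav, n_fft=N_FFT, hop_length=HOP,
                      window=torch.hann_window(N_FFT),
                      return_complex=True)

def decode(rep: torch.Tensor, length: int) -> torch.Tensor:
    """Fast, differentiable waveform generation via inverse STFT."""
    return torch.istft(rep, n_fft=N_FFT, hop_length=HOP,
                       window=torch.hann_window(N_FFT), length=length)

wav = torch.randn(1, 16000)           # 1 s of dummy audio at 16 kHz
rep = encode(wav)
recon = decode(rep, wav.shape[-1])
print(torch.allclose(wav, recon, atol=1e-4))  # near-perfect reconstruction
```

Because a Hann window with a hop of n_fft/4 satisfies the constant-overlap-add condition, the round trip is near-lossless; in the actual system the encoder/decoder networks are trained so that the learned representation remains invertible by this cheap iSTFT step.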

Authors (5)
  1. Jacob J Webber (2 papers)
  2. Cassia Valentini-Botinhao (5 papers)
  3. Evelyn Williams (1 paper)
  4. Gustav Eje Henter (51 papers)
  5. Simon King (28 papers)
Citations (8)
