
Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis (2401.10460v1)

Published 19 Jan 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even highly efficient ones, such as MB-MelGAN and LPCNet, fail to run in real time on a low-end device like a smartglass. A pure digital signal processing (DSP) vocoder can be implemented with lightweight fast Fourier transforms (FFTs) and is therefore orders of magnitude faster than any neural vocoder. A DSP vocoder, however, often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate vocal tract representations. In this paper, we propose an ultra-lightweight differentiable DSP (DDSP) vocoder that jointly optimizes an acoustic model with a DSP vocoder and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders, with a high average MOS of 4.36, while remaining as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, runs at 15 MFLOPS, surpassing MB-MelGAN by 340 times in FLOPS, and achieves a vocoder-only RTF of 0.003 and an overall RTF of 0.044 while running single-threaded on a 2 GHz Intel Xeon CPU.
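The abstract's core claim rests on the source-filter structure of a classical DSP vocoder: a pulse-train or noise excitation shaped by a spectral envelope via cheap FFT-based filtering, which is why it runs orders of magnitude faster than a neural waveform model. The sketch below illustrates that structure only; it is not the paper's implementation, and the function name `synthesize`, the frame/hop sizes, and the Hann-windowed overlap-add scheme are all assumptions made for illustration.

```python
import numpy as np

def synthesize(f0_per_frame, env_per_frame, sr=24000, hop=240, n_fft=512, seed=0):
    """Minimal source-filter vocoder sketch (hypothetical simplification).

    Each frame's excitation is an impulse train at the pitch period (voiced,
    f0 > 0) or white noise (unvoiced). The excitation is shaped by a per-frame
    magnitude spectral envelope via multiplication in the FFT domain, then
    reconstructed with overlap-add.
    """
    rng = np.random.default_rng(seed)
    n_frames = len(f0_per_frame)
    out = np.zeros(n_frames * hop + n_fft)
    window = np.hanning(n_fft)
    for i, (f0, env) in enumerate(zip(f0_per_frame, env_per_frame)):
        frame = np.zeros(n_fft)
        if f0 > 0:
            # Voiced: impulses spaced one pitch period apart
            period = int(sr / f0)
            frame[::period] = 1.0
        else:
            # Unvoiced: white-noise excitation
            frame = rng.standard_normal(n_fft) * 0.1
        # Filter in the frequency domain: multiply by the magnitude envelope
        # (env has n_fft // 2 + 1 bins, matching rfft output)
        spec = np.fft.rfft(frame * window) * env
        out[i * hop : i * hop + n_fft] += np.fft.irfft(spec, n_fft)
    return out
```

The cost per frame is one FFT pair plus a vector multiply, which is what makes a DSP vocoder so much cheaper in FLOPS than sample-level neural generation; the paper's contribution is making this pipeline differentiable so the acoustic model can be trained through it end to end.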

References (30)
  1. “WaveNet: A generative model for raw audio,” arXiv, 2016.
  2. “SampleRNN: An unconditional end-to-end neural audio generation model,” 2017.
  3. “Efficient neural audio synthesis,” in International Conference on Machine Learning. PMLR, 2018, pp. 2410–2419.
  4. “MelGAN: Generative adversarial networks for conditional waveform synthesis,” 2019.
  5. “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” 2020.
  6. “WaveGlow: A flow-based generative network for speech synthesis,” 2018.
  7. “Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 492–498.
  8. “LPCNet: Improving neural speech synthesis through linear prediction,” 2019.
  9. “DDSP: Differentiable digital signal processing,” 2020.
  10. “Neural homomorphic vocoder,” in Interspeech, 2020.
  11. International Phonetic Association, “The international phonetic alphabet,” https://www.internationalphoneticassociation.org/sites/default/files/IPA_Kiel_2015.pdf, 2015.
  12. “Multi-rate attention architecture for fast streamable text-to-speech spectrum modeling,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 5689–5693.
  13. “Massively multilingual word embeddings,” 2016.
  14. “Cross-lingual models of word embeddings: An empirical comparison,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, Aug. 2016, pp. 1661–1670, Association for Computational Linguistics.
  15. “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6783–6787.
  16. Theory and Applications of Digital Speech Processing, Prentice Hall Press, USA, 1st edition, 2010.
  17. “Mixed excitation for HMM-based speech synthesis,” Sep. 2001, pp. 2263–2266.
  18. “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
  19. D. Griffin and Jae Lim, “A new model-based speech analysis/synthesis system,” in ICASSP ’85. IEEE International Conference on Acoustics, Speech, and Signal Processing, 1985, vol. 10, pp. 513–516.
  20. “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
  21. “Spectral envelope estimation and representation for sound analysis-synthesis,” Proc. ICMC, 09 1999.
  22. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall PTR, USA, 1st edition, 2001.
  23. “Image-to-image translation with conditional adversarial networks,” CoRR, vol. abs/1611.07004, 2016.
  24. “Least squares generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794–2802.
  25. “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, 2015.
  26. “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” Advances in neural information processing systems, vol. 29, 2016.
  27. “Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with PyTorch,” https://github.com/kan-bayashi/ParallelWaveGAN, Accessed: 2023-08-31.
  28. “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” https://github.com/jik876/hifi-gan, Accessed: 2023-08-31.
  29. “TorchScript - PyTorch 2.0 documentation,” https://pytorch.org/docs/stable/jit.html, Accessed: 2023-08-31.
  30. Vladislav Sovrasov, “ptflops: a flops counting tool for neural networks in pytorch framework,” 2018-2023.
