Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis (2401.10460v1)
Abstract: Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even highly efficient ones, such as MB-MelGAN and LPCNet, fail to run in real time on a low-end device like smart glasses. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT) and is therefore an order of magnitude faster than any neural vocoder. However, a DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations of the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that jointly optimizes an acoustic model with a DSP vocoder and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders, with a high average MOS of 4.36, while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, runs at 15 MFLOPS, requires 340 times fewer FLOPS than MB-MelGAN, and achieves a vocoder-only RTF of 0.003 and an overall RTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.
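To make the reported real-time factors concrete, the short sketch below converts them into wall-clock synthesis time. It assumes the common definition of RTF as processing time divided by the duration of the generated audio, which the abstract does not spell out; the 10-second utterance and the helper names `real_time_factor` and `synthesis_seconds` are hypothetical and chosen only for illustration.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of audio at a given RTF."""
    return rtf * audio_seconds


# RTF values quoted in the abstract.
VOCODER_RTF = 0.003
OVERALL_RTF = 0.044

# Hypothetical utterance length, used only for illustration.
utterance_seconds = 10.0

vocoder_time = synthesis_seconds(VOCODER_RTF, utterance_seconds)
overall_time = synthesis_seconds(OVERALL_RTF, utterance_seconds)

print(f"vocoder only: {vocoder_time:.3f} s for {utterance_seconds:.0f} s of audio")
print(f"full pipeline: {overall_time:.3f} s for {utterance_seconds:.0f} s of audio")
```

An RTF below 1 means synthesis finishes faster than playback. Under this definition, the quoted figures imply that the vocoder accounts for only a small share of the overall synthesis time on the reported CPU.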