
Incremental FastPitch: Chunk-based High Quality Text to Speech (2401.01755v1)

Published 3 Jan 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Parallel text-to-speech models have been widely applied for real-time speech synthesis, offering more controllability and a much faster synthesis process than conventional auto-regressive models. Although parallel models have benefits in many respects, their fully parallel architectures, such as the Transformer, make them naturally unsuited to incremental synthesis. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and performing inference with fixed-size past model states. Experimental results show that our proposal produces speech quality comparable to parallel FastPitch, with significantly lower latency that allows even lower response times for real-time speech applications.
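
The receptive-field constrained chunk attention mask mentioned in the abstract can be illustrated with a minimal sketch: frames within a Mel chunk attend only to their own chunk and to a fixed number of past chunks, which bounds the receptive field and keeps the cached past state fixed in size during incremental inference. The function name and parameters below (chunk_size, past_chunks) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int, past_chunks: int = 1) -> torch.Tensor:
    """Sketch of a receptive-field constrained chunk attention mask.

    Each query frame may attend to key frames in its own chunk and in up to
    `past_chunks` preceding chunks; all future chunks are masked out.
    Hypothetical helper -- not the authors' exact API.
    """
    chunk_ids = torch.arange(seq_len) // chunk_size   # chunk index of each frame, shape (T,)
    q_chunk = chunk_ids.unsqueeze(1)                   # (T, 1) query-side chunk indices
    k_chunk = chunk_ids.unsqueeze(0)                   # (1, T) key-side chunk indices
    # Allow keys that are not in a future chunk and lie within the receptive field.
    allowed = (k_chunk <= q_chunk) & (k_chunk >= q_chunk - past_chunks)
    return allowed                                     # (T, T) boolean mask

# Example: 12 frames, chunks of 4 frames, attend to the current chunk plus 1 past chunk.
mask = chunk_attention_mask(seq_len=12, chunk_size=4, past_chunks=1)
# The mask can be applied in scaled dot-product attention by setting
# disallowed positions to -inf before the softmax.
```
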

