EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech (2403.08164v2)

Published 13 Mar 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Recently, deep learning-based Text-to-Speech (TTS) systems have achieved high-quality speech synthesis. Recurrent neural networks have become a standard modeling technique for sequential data in TTS systems and are widely used. However, training a TTS model that includes RNN components demands a powerful GPU and takes a long time. In contrast, CNN-based sequence synthesis techniques can significantly reduce the parameter count and training time of a TTS model while maintaining a certain level of performance thanks to their high parallelism, which alleviates the economic cost of training. In this paper, we propose a lightweight TTS system based on deep convolutional neural networks: a two-stage, end-to-end trained model that does not employ any recurrent units. Our model consists of two stages, Text2Spectrum and SSRN. The former encodes phonemes into a coarse mel spectrogram, and the latter synthesizes the complete spectrum from that coarse mel spectrogram. In addition, we improve the robustness of our model with a series of data augmentations, such as noise suppression, time warping, frequency masking and time masking, to address the low-resource Mongolian setting. Experiments show that, compared with mainstream TTS models, our model reduces training time and parameter count while preserving the quality and naturalness of the synthesized speech. We validate our method on the NCMMSC2022-MTTSC Challenge dataset, significantly reducing training time while maintaining reasonable accuracy.
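The abstract lists frequency masking and time masking among the augmentations used to compensate for the small Mongolian corpus. The sketch below illustrates how such SpecAugment-style masks can be applied to a mel spectrogram; the mask widths, mask counts, and the NumPy implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def frequency_mask(mel: np.ndarray, max_width: int = 8, rng=None) -> np.ndarray:
    """Zero out one random band of mel channels (SpecAugment-style frequency mask)."""
    rng = rng if rng is not None else np.random.default_rng()
    mel = mel.copy()
    n_mels = mel.shape[0]
    width = int(rng.integers(0, max_width + 1))            # band height, chosen at random
    start = int(rng.integers(0, max(1, n_mels - width)))   # band position
    mel[start:start + width, :] = 0.0
    return mel

def time_mask(mel: np.ndarray, max_width: int = 20, rng=None) -> np.ndarray:
    """Zero out one random span of frames (SpecAugment-style time mask)."""
    rng = rng if rng is not None else np.random.default_rng()
    mel = mel.copy()
    n_frames = mel.shape[1]
    width = int(rng.integers(0, max_width + 1))            # span length, chosen at random
    start = int(rng.integers(0, max(1, n_frames - width))) # span position
    mel[:, start:start + width] = 0.0
    return mel

# Example: augment a hypothetical 80-bin x 400-frame mel spectrogram.
mel = np.random.rand(80, 400).astype(np.float32)
augmented = time_mask(frequency_mask(mel))
```

Noise suppression and time warping, also mentioned in the abstract, would typically be applied to the waveform or spectrogram before masking; their parameters are not specified here.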

Authors (4)
  1. Ziqi Liang (10 papers)
  2. Haoxiang Shi (13 papers)
  3. Jiawei Wang (128 papers)
  4. Keda Lu (3 papers)
