
Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation (2306.10152v1)

Published 16 Jun 2023 in eess.AS and cs.SD

Abstract: Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated and high-quality speech data. Compiling such large databases for every new voice requires a lot of time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis using a limited amount of specific training data. In contrast to elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners to be on par with the baseline Tacotron-2 trained with 23.5 hours of LJSpeech data. In addition, we tested our model with a semantically unpredictable sentences test, which showed that both models exhibit similar intelligibility levels.
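The method sketched in the abstract hinges on mixing simple stationary noises into a small single-speaker training corpus. As a rough illustration of what such an augmentation step can look like, the Python sketch below mixes white Gaussian noise into a speech waveform at a target signal-to-noise ratio. The choice of white noise, the `snr_db` parameter, and the helper name `mix_stationary_noise` are illustrative assumptions, not the authors' implementation; the abstract only states that simple stationary noises are used.

```python
import numpy as np

def mix_stationary_noise(speech: np.ndarray, snr_db: float = 20.0, rng=None) -> np.ndarray:
    """Mix stationary white Gaussian noise into a waveform at `snr_db` dB SNR.

    Hypothetical helper for illustration only; the paper does not specify
    the noise type or mixing levels at this granularity.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Passing each clean utterance through a few such noise types and SNR levels yields several augmented copies per recording, which is one way a two-hour corpus can be stretched without collecting any new speech.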

References (33)
  1. “Deep Voice 2: Multi-speaker neural text-to-speech,” in Advances in Neural Information Processing Systems, 2017, vol. 30.
  2. “Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783.
  3. “LRSpeech: Extremely low-resource speech synthesis and recognition,” in Proc. ACM Intl. Conf. on Knowledge Discovery & Data Mining (SIGKDD), 2020, pp. 2802–2812.
  4. K. Ito and L. Johnson, “The LJ Speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  5. “Exploring transfer learning for low resource emotional TTS,” in Proc. SAI Intelligent Systems Conference. Springer, 2019, pp. 52–60.
  6. “Combining speakers of multiple languages to improve quality of neural voices,” in Proc. ISCA Speech Synthesis Workshop, 2021, pp. 37–42.
  7. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech, 2019, pp. 2613–2617.
  8. “Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification,” in Proc. Interspeech, 2019, pp. 1163–1167.
  9. “Non-autoregressive TTS with explicit duration modelling for low-resource highly expressive speech,” in Proc. ISCA Speech Synthesis Workshop, 2021, pp. 96–101.
  10. “Low-resource expressive text-to-speech using data augmentation,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6593–6597.
  11. “CopyCat: Many-to-many fine-grained prosody transfer for neural text-to-speech,” in Proc. Interspeech, 2020, pp. 4387–4391.
  12. “StrawNet: Self-Training WaveNet for TTS in Low-Data Regimes,” in Proc. Interspeech, 2020, pp. 3550–3554.
  13. “TTS-by-TTS: TTS-driven data augmentation for fast and high-quality speech synthesis,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6598–6602.
  14. “TTS-by-TTS 2: Data-Selective Augmentation for Neural Speech Synthesis Using Ranking Support Vector Machine with Variational Autoencoder,” in Proc. Interspeech, 2022, pp. 1941–1945.
  15. “Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?,” in Proc. Interspeech, 2020, pp. 3979–3983.
  16. “Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis,” in Proc. IEEE-SPS European Signal Processing Conf., 2021, pp. 26–30.
  17. “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6940–6944.
  18. “Distribution augmentation for low-resource expressive text-to-speech,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8307–8311.
  19. “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems, 2018, vol. 31.
  20. M. Brookes, “Voicebox: Speech processing toolbox for MATLAB,” http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html, 2000.
  21. “StyleMelGAN: An efficient high-fidelity adversarial vocoder with temporal adaptive normalization,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6034–6038.
  22. “The PAVOQUE corpus as a resource for analysis and synthesis of expressive speech,” Proc. Phonetik & Phonologie, vol. 9, 2013.
  23. K. Park and T. Mulc, “CSS10: A collection of single speaker speech datasets for 10 languages,” Proc. Interspeech, pp. 1566–1570, 2019.
  24. “The BITS speech synthesis corpus for German,” in Proc. Intl. Conf. on Language Resources and Evaluation (LREC), 2004.
  25. J. Kominek and A. W. Black, “The CMU Arctic speech databases,” in Proc. ISCA Speech Synthesis Workshop, 2004.
  26. C. Schäfer, “ForwardTacotron,” https://github.com/as-ideas/ForwardTacotron, 2020.
  27. ITU-T Rec. P.808, “Subjective evaluation of speech quality with a crowdsourcing approach,” ITU-T, Geneva, 2018.
  28. “webMUSHRA—a comprehensive framework for web-based listening tests,” Journal of Open Research Software, vol. 6, no. 1, 2018.
  29. “IEEE recommended practice for speech quality measurements,” IEEE Transactions on Audio and Electroacoustics, vol. 17, 1969.
  30. “The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences,” Speech Communication, vol. 18, no. 4, pp. 381–392, 1996.
  31. A. Black and K. Tokuda, “The Blizzard Challenge 2005: Evaluating corpus-based speech synthesis on common databases,” in Proc. Interspeech, 2005, pp. 77–80.
  32. “SpeechBrain: A general-purpose speech toolkit,” 2021, arXiv:2106.04624.
  33. Henneke, “How to captivate readers with a dazzling loooong sentence,” https://www.enchantingmarketing.com/how-to-write-a-long-sentence/, 2021.
Authors (4)
  1. Kishor Kayyar Lakshminarayana (3 papers)
  2. Christian Dittmar (3 papers)
  3. Nicola Pia (15 papers)
  4. Emanuël Habets (4 papers)
