
Prior-agnostic Multi-scale Contrastive Text-Audio Pre-training for Parallelized TTS Frontend Modeling (2404.09192v1)

Published 14 Apr 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Over the past decade, sustained effort has been devoted to developing highly expressive and controllable text-to-speech (TTS) systems. A complete TTS system typically comprises two interconnected components: a frontend module and a backend module. The frontend extracts linguistic representations from the raw text input, while the backend converts those linguistic cues into speech. The research community has shown growing interest in the frontend, recognizing the pivotal role of its tasks, including Text Normalization (TN), Prosody Boundary Prediction (PBP), and Polyphone Disambiguation (PD). However, the scarcity of annotated text data and the reliance on homogeneous text signals significantly limit the effectiveness of supervised learning for these tasks. To overcome this obstacle, this paper proposes a novel two-stage TTS frontend prediction pipeline named TAP-FM. In the first stage, a Multi-scale Contrastive Text-Audio Pre-training protocol (MC-TAP) acquires richer representations via multi-granularity contrastive pre-training in an unsupervised manner. Unlike prior pre-training approaches that mine only homogeneous features, the framework captures both global and local text-audio semantic and acoustic representations. In the second stage, a parallelized TTS frontend model performs the TN, PD, and PBP prediction tasks. Extensive experiments demonstrate the superiority of the proposed method, which achieves state-of-the-art performance.
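To make the two stages concrete, here is a minimal sketch of a multi-granularity contrastive objective of the kind the abstract describes. It assumes a symmetric InfoNCE loss combining a sentence-level (global) term with a token/frame-level (local) term; the function names, tensor shapes, weighting `lam`, and the flattening of aligned local units are illustrative assumptions, not the paper's exact MC-TAP formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/audio embeddings.

    text_emb, audio_emb: (B, D) tensors; row i of each is a matched pair.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; every other entry is a negative.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def multi_scale_loss(text_global, audio_global, text_local, audio_local, lam=0.5):
    """Combine a global (utterance-level) term with a local term.

    text_local, audio_local: (B, T, D), one embedding per aligned unit
    (e.g. character/phone vs. pooled audio frames). Flattening all units
    into one batch of positive pairs is a simplification for this sketch.
    """
    global_term = info_nce(text_global, audio_global)
    B, T, D = text_local.shape
    local_term = info_nce(text_local.reshape(B * T, D),
                          audio_local.reshape(B * T, D))
    return lam * global_term + (1 - lam) * local_term
```

For the second stage, a shared pre-trained text encoder with parallel task-specific heads is one plausible realization of a "parallelized" frontend that predicts TN, PD, and PBP jointly; the hidden size and label-set sizes below are placeholders, not the paper's configuration.

```python
import torch.nn as nn

class ParallelFrontend(nn.Module):
    """Shared encoder with three parallel per-token classification heads."""
    def __init__(self, encoder, hidden=768, n_tn=10, n_pd=5, n_pbp=4):
        super().__init__()
        self.encoder = encoder                    # e.g. the MC-TAP text encoder
        self.tn_head = nn.Linear(hidden, n_tn)    # Text Normalization tags
        self.pd_head = nn.Linear(hidden, n_pd)    # Polyphone Disambiguation labels
        self.pbp_head = nn.Linear(hidden, n_pbp)  # Prosody Boundary classes

    def forward(self, tokens):
        h = self.encoder(tokens)  # (B, T, hidden) contextual token states
        return self.tn_head(h), self.pd_head(h), self.pbp_head(h)
```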
