Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech (2401.10465v1)

Published 19 Jan 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Grapheme-to-Phoneme (G2P) conversion is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. First, the lexicons are built on a fixed phoneme set, usually ARPABET or IPA, which may not be the optimal way to represent phonemes for all languages. Second, the man-hours required to produce such an expert lexicon are very high. In this paper, we eliminate both issues by using recent advances in self-supervised learning to obtain data-driven phoneme representations instead of fixed ones. We compare our lexicon-free approach against strong baselines that utilize a well-crafted lexicon. We show that our data-driven, lexicon-free method performs as well as, or even marginally better than, conventional rule-based or lexicon-based neural G2P systems in terms of Mean Opinion Score (MOS), while using no prior language lexicon or phoneme set, i.e., no linguistic expertise.
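The core idea of replacing a fixed phoneme inventory with data-driven units can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: frame-level self-supervised speech features (in practice, hidden states from a model such as HuBERT or wav2vec 2.0; here, random vectors) are clustered into discrete pseudo-phoneme units, and consecutive repeats are collapsed into a phoneme-like sequence that a G2P model could be trained to predict from graphemes. The toy k-means below is a stand-in for the quantization step.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(frames, k, iters=10, seed=0):
    """Toy k-means: returns one discrete unit id per input frame."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(frames, k)]
    assign = [0] * len(frames)
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist2(f, centroids[c]))
                  for f in frames]
        # Recompute each centroid as the mean of its assigned frames.
        for c in range(k):
            members = [f for f, a in zip(frames, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assign

# Hypothetical stand-in for frame-level self-supervised features
# (e.g., HuBERT / wav2vec 2.0 hidden states); here just random vectors.
rng = random.Random(1)
frames = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]

# Discrete, data-driven "pseudo-phoneme" inventory: no hand-crafted
# phoneme set such as ARPABET or IPA is assumed.
units = kmeans(frames, k=8)

# Collapsing consecutive repeats yields a phoneme-like target sequence
# that a G2P model could learn to predict directly from graphemes.
dedup = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
```

The key design point is that the unit inventory (here, the 8 cluster ids) emerges from the speech data itself rather than from linguistic expertise, which is what removes the need for an expert-built lexicon.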
