Exploring speech style spaces with language models: Emotional TTS without emotion labels (2405.11413v1)
Abstract: Many frameworks for emotional text-to-speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents a tough challenge due to the subjective nature of emotions. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels or text prompts. We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs. Our proposed method performs knowledge transfer between the linguistic space learned by BERT and the emotional style space constructed by global style tokens. Our experimental results demonstrate the effectiveness of our proposed framework, showcasing improvements in emotional accuracy and naturalness. This is one of the first studies to leverage the emotional correlation between spoken content and expressive delivery for emotional TTS.
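To make the core idea concrete, below is a minimal PyTorch sketch of how a BERT sentence embedding might query a bank of global style tokens (GSTs) to produce a style embedding without emotion labels, in the spirit of the knowledge transfer the abstract describes. This is our illustrative assumption, not the authors' implementation: all module names, dimensions, and the single-head attention formulation are stand-ins.

```python
# Hypothetical sketch: predict a GST-based style embedding from a BERT
# text embedding. Dimensions and the attention formulation are assumptions
# for illustration, not TEMOTTS's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StyleTokenLayer(nn.Module):
    """Bank of learnable global style tokens; a query attends over them."""

    def __init__(self, num_tokens: int = 10, token_dim: int = 256, query_dim: int = 768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(query_dim, token_dim)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, query_dim) -> attention weights over the token bank
        q = self.query_proj(query)                          # (batch, token_dim)
        scores = q @ torch.tanh(self.tokens).T              # (batch, num_tokens)
        weights = F.softmax(scores / q.shape[-1] ** 0.5, dim=-1)
        return weights @ torch.tanh(self.tokens)            # (batch, token_dim)


class TextToStylePredictor(nn.Module):
    """Maps a BERT text embedding to a style embedding in the GST space.

    Assumed two-stage recipe: stage 1 learns the GST space from reference
    audio; stage 2 trains this predictor so the text embedding alone
    reproduces the style embedding, enabling inference without auxiliary
    inputs or emotion labels.
    """

    def __init__(self, bert_dim: int = 768, token_dim: int = 256):
        super().__init__()
        self.gst = StyleTokenLayer(token_dim=token_dim, query_dim=bert_dim)

    def forward(self, bert_cls_embedding: torch.Tensor) -> torch.Tensor:
        return self.gst(bert_cls_embedding)


if __name__ == "__main__":
    # Stand-in for a real BERT [CLS] embedding (e.g., from Hugging Face
    # transformers); shape (batch, 768) for bert-base.
    fake_bert_embedding = torch.randn(4, 768)
    style = TextToStylePredictor()(fake_bert_embedding)
    print(style.shape)  # torch.Size([4, 256])
```

The softmax attention over a tanh-squashed token bank mirrors the standard GST formulation; the key property matching the abstract is that at inference time only text is needed, since the style embedding is queried from the BERT embedding rather than from reference audio or an emotion label.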
Authors: Shreeram Suresh Chandra, Zongyang Du, Berrak Sisman