Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis (2405.09171v1)
Abstract: Effectively controlling emotion rendering in text-to-speech (TTS) synthesis remains a challenge. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that captures intensity variations of emotions at multiple levels of granularity: phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At inference time, the TTS model generates emotional speech and simultaneously provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework for emotion prediction and control.
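To make the notion of a hierarchical ED concrete, below is a minimal sketch of how per-phoneme emotion-intensity scores could be aggregated to word and utterance level. The function name, the mean-pooling aggregation, and the input layout are illustrative assumptions, not the paper's implementation; how the phoneme-level intensity scores are obtained in the first place is outside the scope of this sketch.

```python
import numpy as np

def build_hierarchical_ed(phoneme_scores, word_spans):
    """Aggregate per-phoneme emotion-intensity scores into a
    hierarchical emotion distribution (ED).

    phoneme_scores: (num_phonemes, num_emotions) array with
        intensity values in [0, 1].
    word_spans: list of (start, end) phoneme index pairs, one
        per word, covering the utterance.
    """
    # Word-level ED: pool the phonemes belonging to each word
    # (mean pooling is an assumed, illustrative choice).
    word_scores = np.stack(
        [phoneme_scores[s:e].mean(axis=0) for s, e in word_spans]
    )
    # Utterance-level ED: pool over all phonemes.
    utterance_score = phoneme_scores.mean(axis=0)
    return {
        "phoneme": phoneme_scores,
        "word": word_scores,
        "utterance": utterance_score,
    }

# Toy usage: 5 phonemes, 4 emotion categories, two words.
scores = np.random.rand(5, 4)
ed = build_hierarchical_ed(scores, [(0, 2), (2, 5)])
print(ed["word"].shape, ed["utterance"].shape)  # (2, 4) (4,)
```

A structure like this makes the abstract's constituent-level control tangible: scaling one entry (say, the intensity of a single emotion for one word) before conditioning the TTS model is the kind of quantitative, per-constituent adjustment the framework is described as providing.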
- Sho Inoue
- Kun Zhou
- Shuai Wang
- Haizhou Li