Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis (2405.09171v1)
Abstract: Effectively controlling emotion rendering in text-to-speech (TTS) synthesis remains a challenge. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that captures intensity variations of emotions at multiple levels of granularity: phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At inference time, the TTS model generates emotional speech and simultaneously provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework for emotion prediction and control.
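To make the notion of a hierarchical ED concrete, below is a minimal sketch of how per-phoneme emotion-intensity scores could be aggregated to word and utterance level. The function name, the mean-pooling aggregation, and the input layout are illustrative assumptions, not the paper's implementation; how the phoneme-level intensity scores are obtained in the first place is outside the scope of this sketch.

```python
import numpy as np

def build_hierarchical_ed(phoneme_scores, word_spans):
    """Aggregate per-phoneme emotion-intensity scores into a
    hierarchical emotion distribution (ED).

    phoneme_scores: (num_phonemes, num_emotions) array with
        intensity values in [0, 1].
    word_spans: list of (start, end) phoneme index pairs, one
        per word, covering the utterance.
    """
    # Word-level ED: pool the phonemes belonging to each word
    # (mean pooling is an assumed, illustrative choice).
    word_scores = np.stack(
        [phoneme_scores[s:e].mean(axis=0) for s, e in word_spans]
    )
    # Utterance-level ED: pool over all phonemes.
    utterance_score = phoneme_scores.mean(axis=0)
    return {
        "phoneme": phoneme_scores,
        "word": word_scores,
        "utterance": utterance_score,
    }

# Toy usage: 5 phonemes, 4 emotion categories, two words.
scores = np.random.rand(5, 4)
ed = build_hierarchical_ed(scores, [(0, 2), (2, 5)])
print(ed["word"].shape, ed["utterance"].shape)  # (2, 4) (4,)
```

A structure like this makes the abstract's constituent-level control tangible: scaling one entry (say, the intensity of a single emotion for one word) before conditioning the TTS model is the kind of quantitative, per-constituent adjustment the framework is described as providing.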
- Sho Inoue
- Kun Zhou
- Shuai Wang
- Haizhou Li