CAMP: a Two-Stage Approach to Modelling Prosody in Context (2011.01175v2)

Published 2 Nov 2020 in eess.AS

Abstract: Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.

Citations (28)

View on Semantic Scholar

Summary

We haven't generated a summary for this paper yet.

Summarize Now

CAMP: a Two-Stage Approach to Modelling Prosody in Context (2011.01175v2)

Summary

Related Papers