Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech (2011.02252v1)

Published 4 Nov 2020 in eess.AS, cs.CL, and cs.SD

Abstract: In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.

Citations (19)


Authors (7)