
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Published 24 Mar 2018 in cs.CL, cs.LG, cs.SD, and eess.AS | arXiv:1803.09047v1

Abstract: We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

Citations (530)

Summary

  • The paper introduces a Tacotron extension that learns a latent prosody embedding from reference audio to transfer expressive speech features.
  • It integrates a reference encoder with mel spectrogram inputs to capture prosodic attributes like intonation and rhythm without explicit labels.
  • Experimental results demonstrate reduced MCD and FFE, indicating improved naturalness and prosodic alignment in synthesized speech.

End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

This paper extends the Tacotron architecture to enhance expressive speech synthesis through prosody transfer. The authors learn a latent embedding space for prosody from reference acoustic signals, allowing synthesized audio to capture the fine-grained prosodic attributes of a reference signal even when synthesizing speech in a different speaker's voice.

Methodology

The methodology augments the Tacotron model with a reference encoder that extracts a prosody embedding from reference audio. Learned in a data-driven manner, this embedding captures prosodic variation such as intonation, stress, and rhythm without requiring explicit labels. The reference encoder operates on mel spectrograms, which provide a rich representation of the speech dynamics.
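The reference encoder can be sketched roughly as follows: a stack of strided 2D convolutions compresses the reference mel spectrogram, a GRU summarizes it over time, and the GRU's final state is projected to a fixed-length embedding. This is a minimal PyTorch illustration, not the authors' implementation; the channel counts, kernel sizes, and 128-dimensional output are assumptions based on the architecture described in the paper.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Sketch of a prosody reference encoder: a conv stack over the
    reference mel spectrogram followed by a GRU, whose final hidden
    state becomes a fixed-length prosody embedding. Layer sizes here
    are illustrative assumptions."""

    def __init__(self, n_mels=80, embedding_dim=128):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1],
                          kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            )
            for i in range(6)
        ])
        # Each stride-2 conv halves the mel-frequency axis (rounding up).
        freq = n_mels
        for _ in range(6):
            freq = (freq + 1) // 2
        self.gru = nn.GRU(input_size=channels[-1] * freq,
                          hidden_size=embedding_dim, batch_first=True)
        self.proj = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, mel):
        # mel: (batch, time, n_mels)
        x = mel.unsqueeze(1)                        # (B, 1, T, n_mels)
        for conv in self.convs:
            x = conv(x)                             # (B, C, T', F')
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)  # (B, T', C*F')
        _, h = self.gru(x)                          # h: (1, B, emb_dim)
        # tanh bounds the embedding, as described for the paper's encoder.
        return torch.tanh(self.proj(h.squeeze(0)))  # (B, emb_dim)
```

Because the GRU consumes a variable-length sequence and only its final state is kept, references of any duration map to the same fixed-size embedding.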

The proposed architecture involves the concatenation of a speaker-independent prosody embedding with Tacotron's text and speaker identity representations. When deployed, this mechanism effectively facilitates prosody transfer across different speakers or text inputs.
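The conditioning step above amounts to broadcasting the fixed-length prosody embedding across the text encoder's time axis and concatenating it to each timestep. A minimal sketch (function and variable names are ours, not the paper's):

```python
import torch

def condition_on_prosody(text_encodings, prosody_embedding):
    """Tile a fixed-length prosody embedding over the text-encoder
    timesteps and concatenate it channel-wise, so every decoder
    attention step sees the prosodic conditioning signal.

    text_encodings:    (batch, text_len, text_dim)
    prosody_embedding: (batch, prosody_dim)
    returns:           (batch, text_len, text_dim + prosody_dim)
    """
    b, t, _ = text_encodings.shape
    tiled = prosody_embedding.unsqueeze(1).expand(b, t, -1)
    return torch.cat([text_encodings, tiled], dim=-1)
```

A speaker embedding can be concatenated the same way, which is what makes it possible to pair one speaker's prosody with another speaker's identity.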

Experimental Results

The authors conducted extensive experiments using both single-speaker and multi-speaker datasets. Objective measures such as mel cepstral distortion (MCD) and F0 frame error (FFE) show a clear improvement for the prosody-augmented model over baseline Tacotron implementations. Subjective evaluations corroborate these findings, with raters perceiving the synthesized speech as more prosodically aligned to the reference signal.
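Of the two objective measures, FFE is the one most directly tied to prosody: it counts frames where the synthesized pitch track disagrees with the reference, either in voicing or in pitch value. A NumPy sketch of the standard definition (this is not the authors' evaluation code):

```python
import numpy as np

def f0_frame_error(ref_f0, syn_f0, tolerance=0.2):
    """F0 Frame Error (FFE): the fraction of frames with either a
    voicing-decision error or a gross pitch error (both frames voiced
    but F0 deviating by more than 20% of the reference). F0 arrays use
    0.0 to mark unvoiced frames."""
    ref_f0 = np.asarray(ref_f0, dtype=float)
    syn_f0 = np.asarray(syn_f0, dtype=float)
    ref_voiced = ref_f0 > 0
    syn_voiced = syn_f0 > 0
    # Voicing error: one signal voiced where the other is not.
    voicing_error = ref_voiced != syn_voiced
    # Gross pitch error: both voiced, but F0 off by > tolerance.
    both_voiced = ref_voiced & syn_voiced
    pitch_error = both_voiced & (
        np.abs(syn_f0 - ref_f0) > tolerance * np.maximum(ref_f0, 1e-8))
    return float(np.mean(voicing_error | pitch_error))
```

A lower FFE means the synthesized pitch contour tracks the reference more closely, which is why it serves as a proxy for prosodic alignment here.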

Implications and Future Directions

By achieving successful prosody transfer, this work unlocks new potential for improving naturalness and expressiveness in TTS systems. The primary implication is that TTS systems can adapt prosodic styles from diverse contexts, elevating user interactions in AI-driven applications like virtual assistants and automated customer service.

Future research could focus on refining the disentanglement between text and prosody, allowing greater flexibility in synthesizing varied expressions from limited data inputs. Further exploration might also address the entanglement of speaker identity and prosody, especially regarding pitch variations, to ensure optimal speaker identity preservation.

In conclusion, this research represents a meaningful advance in speech synthesis technology, providing a robust framework for prosody transfer that promises enhanced expressiveness and adaptability in TTS systems.
