Multilingual Speech Synthesis and Voice Cloning
This paper introduces a multilingual text-to-speech (TTS) synthesis model with support for cross-language voice cloning. The work builds on the Tacotron framework, modified to produce high-quality speech in multiple languages, notably by leveraging phonemic input representations and an adversarial loss. These changes enable transfer of speaker identity across languages without any bilingual training examples, demonstrated most notably by synthesizing Spanish speech in an English speaker's voice, as well as by English-Mandarin transfer.
Model Design and Architecture
The proposed model uses Tacotron as its foundation, with several modifications to accommodate multilingual input and cross-lingual voice cloning. The system employs phonemic input representations so that model capacity is shared across languages, and introduces adversarial training to disentangle speaker identity from language characteristics. An autoencoding input component further stabilizes training, encouraging more consistent synthesis across languages and accents.
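To make the shared phonemic representation concrete, the following toy sketch maps words from two languages onto a single phoneme inventory and encodes them as IDs from one shared vocabulary. The lexicon entries are invented for illustration; a real front end would use per-language grapheme-to-phoneme conversion rather than a hand-written dictionary.

```python
# Toy illustration of a shared phonemic input representation.
# The lexicons below are invented; a real system would use a per-language
# grapheme-to-phoneme front end, not a hand-written dictionary.
LEXICONS = {
    "en": {"hello": ["HH", "AH", "L", "OW"]},
    "es": {"hola": ["O", "L", "A"]},
}

def build_phoneme_vocab(lexicons):
    """Collect one phoneme inventory shared by all languages (0 = padding)."""
    phonemes = sorted({p for lex in lexicons.values()
                       for pron in lex.values() for p in pron})
    return {p: i + 1 for i, p in enumerate(phonemes)}

def encode(text, lang, lexicons, vocab):
    """Map text to phoneme IDs drawn from the shared vocabulary."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab[p] for p in lexicons[lang][word])
    return ids

vocab = build_phoneme_vocab(LEXICONS)
print(encode("hello", "en", LEXICONS, vocab))  # [3, 2, 4, 6]
print(encode("hola", "es", LEXICONS, vocab))   # [5, 4, 1] -- "L" shares ID 4
```

Because both languages draw IDs from one inventory, a phoneme that is rare in one language still benefits from training examples of the same phoneme in the others.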
Key components of the model include:
- Phonemic Input Representation: Phonemic inputs let the model share capacity across languages with minimal reliance on language-dependent text encoders, and make it straightforward to handle languages with different writing systems.
- Adversarial Loss Integration: An adversarial loss term encourages the separation of speaker identity from linguistic content, which is critical for cross-language voice cloning; a gradient-reversal sketch of this idea follows the list.
- Residual Encoder: Inspired by variational autoencoders, the residual encoder compresses each training utterance's audio into a latent embedding that conditions the decoder, improving synthesis stability and helping preserve speaker identity across language shifts; a sketch also follows the list.
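The adversarial component can be illustrated with the standard gradient-reversal trick from domain-adversarial training. The sketch below is an assumption-laden illustration, not the paper's exact architecture: a speaker classifier reads the text-encoder outputs through a layer that reverses gradients, so the encoder is trained to strip speaker information while the classifier tries to recover it. All dimensions and the per-frame labeling scheme are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in backward.
    This is the standard gradient-reversal trick; the paper's exact setup
    may differ."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SpeakerAdversary(nn.Module):
    """Classifies the speaker from text-encoder outputs. Because gradients
    are reversed before reaching the encoder, the encoder is pushed to
    remove speaker information from its representation."""
    def __init__(self, enc_dim, num_speakers, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Linear(enc_dim, num_speakers)

    def forward(self, encoder_outputs):
        # encoder_outputs: (batch, time, enc_dim)
        reversed_feats = GradReverse.apply(encoder_outputs, self.lam)
        return self.classifier(reversed_feats)  # per-frame speaker logits

# Sketch of the loss term (all dimensions are assumptions):
enc_out = torch.randn(2, 50, 256, requires_grad=True)
speaker_ids = torch.randint(0, 10, (2,))
adversary = SpeakerAdversary(enc_dim=256, num_speakers=10)
logits = adversary(enc_out)                        # (2, 50, 10)
targets = speaker_ids.unsqueeze(1).expand(-1, 50)  # broadcast label to frames
adv_loss = nn.functional.cross_entropy(logits.reshape(-1, 10),
                                       targets.reshape(-1))
adv_loss.backward()  # encoder gradients now point away from speaker info
```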
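The residual encoder can likewise be sketched as a small VAE over the training utterance's mel spectrogram. Layer sizes and the GRU summarizer below are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

class ResidualEncoder(nn.Module):
    """VAE-style residual encoder sketch: compresses a training utterance's
    mel spectrogram into a small latent, with a KL penalty toward a standard
    normal prior. All sizes here are illustrative assumptions."""
    def __init__(self, mel_dim=80, latent_dim=16, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(mel_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, mel):
        # mel: (batch, frames, mel_dim)
        _, h = self.rnn(mel)            # final hidden state summarizes the clip
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

# During training, z conditions the decoder alongside speaker and language
# embeddings; at inference time z can simply be set to the prior mean (zeros).
enc = ResidualEncoder()
mel = torch.randn(4, 120, 80)
z, kl = enc(mel)  # z: (4, 16); kl is added to the total loss
```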
Experimental Evaluations and Results
The experiments compare input representations, including UTF-8 byte encodings and phonemic inputs, measuring their impact on synthesis quality and speaker similarity across languages. Evaluated with both one-speaker-per-language and multi-speaker datasets, the multilingual model produced high-quality speech that remained consistent with the target voice identity regardless of language.
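The UTF-8 byte representation is easy to illustrate: every script maps into the same fixed 256-symbol vocabulary without any language-specific front end, at the cost of longer, more language-entangled input sequences.

```python
# UTF-8 bytes give a fixed 256-symbol vocabulary that covers every script,
# at the cost of longer, language-entangled input sequences.
texts = {"en": "hello", "es": "hola", "zh": "你好"}
for lang, text in texts.items():
    print(lang, list(text.encode("utf-8")))
# en [104, 101, 108, 108, 111]
# es [104, 111, 108, 97]
# zh [228, 189, 160, 229, 165, 189]
```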
The empirical findings underscore the model's cross-language cloning capability, with phoneme-based inputs yielding the most natural synthesis. The model achieves consistently high mean opinion scores (MOS) for speaker similarity, indicating effective retention of voice identity across synthesized languages.
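For context, a MOS is simply the mean of listener ratings on a 1-to-5 scale, conventionally reported with a 95% confidence interval. A minimal sketch with made-up ratings (illustrative only, not the paper's data):

```python
import statistics

def mos(ratings):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half_width

# Made-up listener ratings on a 1-5 similarity scale, for illustration only.
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
mean, ci = mos(ratings)
print(f"MOS = {mean:.2f} ± {ci:.2f}")
```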
Implications and Future Directions
This work has substantial implications for AI-driven speech synthesis, pointing toward multilingual models that do not require bilingual training data. The techniques introduced here can extend voice cloning to languages underrepresented in digital spaces, offering a path forward for low-resource linguistic scenarios.
Future work could expand the diversity of training data and explore more scalable models that capture a broader range of linguistic characteristics. Investigating reinforcement learning methods and better handling of code-switching, where a single utterance mixes languages, are also pertinent directions.
This paper advances speech synthesis and voice cloning by demonstrating that speech can be synthesized in multiple languages and voice identity transferred across linguistic boundaries, using adversarial training and phonemic processing within a TTS model.