Multilingual Speech Synthesis and Voice Cloning
This paper introduces a multilingual text-to-speech (TTS) synthesis model with support for cross-language voice cloning. The work builds on the Tacotron framework, modified to produce high-quality speech in multiple languages, notably by leveraging phonemic input representations and an adversarial loss. These changes enable transfer of speaker identity across languages without any bilingual training examples, demonstrated most notably by synthesizing Spanish speech in an English speaker's voice, as well as by English-Mandarin transfer.
Model Design and Architecture
The proposed model uses Tacotron as its foundation, with several modifications to accommodate multilingual input and cross-lingual voice cloning. The system employs phonemic input representations so that model capacity is shared across languages, and introduces adversarial training to disentangle speaker identity from language characteristics. An autoencoding input component further stabilizes training, encouraging more consistent synthesis across languages and accents.
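To make the shared phonemic representation concrete, the following toy sketch maps words from two languages onto a single phoneme inventory and encodes them as IDs from one shared vocabulary. The lexicon entries are invented for illustration; a real front end would use per-language grapheme-to-phoneme conversion rather than a hand-written dictionary.

```python
# Toy illustration of a shared phonemic input representation.
# The lexicons below are invented; a real system would use a per-language
# grapheme-to-phoneme front end, not a hand-written dictionary.
LEXICONS = {
    "en": {"hello": ["HH", "AH", "L", "OW"]},
    "es": {"hola": ["O", "L", "A"]},
}

def build_phoneme_vocab(lexicons):
    """Collect one phoneme inventory shared by all languages (0 = padding)."""
    phonemes = sorted({p for lex in lexicons.values()
                       for pron in lex.values() for p in pron})
    return {p: i + 1 for i, p in enumerate(phonemes)}

def encode(text, lang, lexicons, vocab):
    """Map text to phoneme IDs drawn from the shared vocabulary."""
    ids = []
    for word in text.lower().split():
        ids.extend(vocab[p] for p in lexicons[lang][word])
    return ids

vocab = build_phoneme_vocab(LEXICONS)
print(encode("hello", "en", LEXICONS, vocab))  # [3, 2, 4, 6]
print(encode("hola", "es", LEXICONS, vocab))   # [5, 4, 1] -- "L" shares ID 4
```

Because both languages draw IDs from one inventory, a phoneme that is rare in one language still benefits from training examples of the same phoneme in the others.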
Key components of the model include:
- Phonemic Input Representation: Phonemic inputs let the model share capacity across languages with minimal reliance on language-dependent text encoders, and make it straightforward to handle languages with different writing systems.
- Adversarial Loss Integration: An adversarial loss term encourages the separation of speaker identity from linguistic content, which is critical for cross-language voice cloning; a gradient-reversal sketch of this idea follows the list.
- Residual Encoder: Inspired by variational autoencoders, the residual encoder compresses each training utterance's audio into a latent embedding that conditions the decoder, improving synthesis stability and helping preserve speaker identity across language shifts; a sketch also follows the list.
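The adversarial component can be illustrated with the standard gradient-reversal trick from domain-adversarial training. The sketch below is an assumption-laden illustration, not the paper's exact architecture: a speaker classifier reads the text-encoder outputs through a layer that reverses gradients, so the encoder is trained to strip speaker information while the classifier tries to recover it. All dimensions and the per-frame labeling scheme are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in backward.
    This is the standard gradient-reversal trick; the paper's exact setup
    may differ."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class SpeakerAdversary(nn.Module):
    """Classifies the speaker from text-encoder outputs. Because gradients
    are reversed before reaching the encoder, the encoder is pushed to
    remove speaker information from its representation."""
    def __init__(self, enc_dim, num_speakers, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Linear(enc_dim, num_speakers)

    def forward(self, encoder_outputs):
        # encoder_outputs: (batch, time, enc_dim)
        reversed_feats = GradReverse.apply(encoder_outputs, self.lam)
        return self.classifier(reversed_feats)  # per-frame speaker logits

# Sketch of the loss term (all dimensions are assumptions):
enc_out = torch.randn(2, 50, 256, requires_grad=True)
speaker_ids = torch.randint(0, 10, (2,))
adversary = SpeakerAdversary(enc_dim=256, num_speakers=10)
logits = adversary(enc_out)                        # (2, 50, 10)
targets = speaker_ids.unsqueeze(1).expand(-1, 50)  # broadcast label to frames
adv_loss = nn.functional.cross_entropy(logits.reshape(-1, 10),
                                       targets.reshape(-1))
adv_loss.backward()  # encoder gradients now point away from speaker info
```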
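The residual encoder can likewise be sketched as a small VAE over the training utterance's mel spectrogram. Layer sizes and the GRU summarizer below are assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

class ResidualEncoder(nn.Module):
    """VAE-style residual encoder sketch: compresses a training utterance's
    mel spectrogram into a small latent, with a KL penalty toward a standard
    normal prior. All sizes here are illustrative assumptions."""
    def __init__(self, mel_dim=80, latent_dim=16, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(mel_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, mel):
        # mel: (batch, frames, mel_dim)
        _, h = self.rnn(mel)            # final hidden state summarizes the clip
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

# During training, z conditions the decoder alongside speaker and language
# embeddings; at inference time z can simply be set to the prior mean (zeros).
enc = ResidualEncoder()
mel = torch.randn(4, 120, 80)
z, kl = enc(mel)  # z: (4, 16); kl is added to the total loss
```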
Experimental Evaluations and Results
The experiments compare input representations, including UTF-8 byte encodings and phonemic inputs, measuring their impact on synthesis quality and speaker similarity across languages. Evaluated with both one-speaker-per-language and multi-speaker datasets, the multilingual model produced high-quality speech that remained consistent with the target voice identity regardless of language.
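The UTF-8 byte representation is easy to illustrate: every script maps into the same fixed 256-symbol vocabulary without any language-specific front end, at the cost of longer, more language-entangled input sequences.

```python
# UTF-8 bytes give a fixed 256-symbol vocabulary that covers every script,
# at the cost of longer, language-entangled input sequences.
texts = {"en": "hello", "es": "hola", "zh": "你好"}
for lang, text in texts.items():
    print(lang, list(text.encode("utf-8")))
# en [104, 101, 108, 108, 111]
# es [104, 111, 108, 97]
# zh [228, 189, 160, 229, 165, 189]
```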
The empirical findings underscore the model's cross-language cloning capability, with phoneme-based inputs yielding the most natural synthesis. The model achieves consistently high mean opinion scores (MOS) for speaker similarity, indicating effective retention of voice identity across synthesized languages.
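For context, a MOS is simply the mean of listener ratings on a 1-to-5 scale, conventionally reported with a 95% confidence interval. A minimal sketch with made-up ratings (illustrative only, not the paper's data):

```python
import statistics

def mos(ratings):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half_width

# Made-up listener ratings on a 1-5 similarity scale, for illustration only.
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
mean, ci = mos(ratings)
print(f"MOS = {mean:.2f} ± {ci:.2f}")
```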
Implications and Future Directions
This work has substantial implications for AI-driven speech synthesis, pointing toward multilingual models that do not require bilingual training data. The techniques introduced here can extend voice cloning to languages underrepresented in digital spaces, offering a path forward for low-resource linguistic scenarios.
Future work could expand the diversity of training data and explore more scalable models that capture a broader range of linguistic characteristics. Investigating reinforcement learning methods and better handling of code-switching, where a single utterance mixes languages, are also pertinent directions.
This paper advances speech synthesis and voice cloning by demonstrating that speech can be synthesized in multiple languages and voice identity transferred across linguistic boundaries, using adversarial training and phonemic processing within a TTS model.