- The paper presents XTTS, a zero-shot TTS model supporting 16 languages, built on the Tortoise framework with architectural modifications such as a filtered VQ-VAE codebook and a GPT-2-based acoustic model.
- The model achieves naturalness competitive with state-of-the-art systems and superior character error rates in multilingual evaluation.
- XTTS’s efficient design paves the way for future research in speech synthesis, particularly for low- and medium-resource languages.
XTTS: A Massively Multilingual Zero-Shot Text-to-Speech Model
This paper introduces XTTS, a novel zero-shot text-to-speech (ZS-TTS) system designed for multilingual use. XTTS addresses a limitation of existing ZS-TTS models, which focus primarily on a single language or a small set of high-resource languages. Building on the Tortoise framework, XTTS introduces architectural modifications that enable multilingual training across 16 languages, improve voice cloning, and make both training and inference more efficient.
Model Architecture and Techniques
XTTS extends the foundational Tortoise model for multilingual training. A Vector Quantised-Variational AutoEncoder (VQ-VAE) compresses audio into a sequence of discrete codes, giving the model a compact acoustic representation that remains tractable even in complex multilingual scenarios. After training, the VQ-VAE codebook is filtered to retain only the most frequently used codes, which the authors report improves expressiveness. A GPT-2-based model then predicts these audio codes from text, conditioned on embeddings derived from reference audio, and a HiFi-GAN vocoder converts the model's output into waveforms. Linguistic preprocessing, such as romanization of Korean, Japanese, and Chinese text, unifies input processing across scripts.
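The codebook filtration step can be illustrated with a short sketch: encode a large audio corpus with the trained VQ-VAE, count how often each code appears, and keep only the top entries. This is a minimal sketch; the `filter_codebook` name and the cutoff of 1024 codes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def filter_codebook(code_ids: np.ndarray, keep: int = 1024) -> np.ndarray:
    """Return the `keep` most frequently used VQ-VAE code indices.

    `code_ids` is a flat array of discrete codes produced by encoding a
    large audio corpus with the trained VQ-VAE; codes outside the
    returned set would be pruned before training the GPT-2 stage.
    """
    ids, counts = np.unique(code_ids, return_counts=True)
    order = np.argsort(counts)[::-1]  # sort codes by usage, descending
    return ids[order][:keep]
```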
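Similarly, the romanization step amounts to a thin preprocessing wrapper applied before tokenization, so that a single tokenizer covers all scripts. The library choices below (pypinyin for Chinese, pykakasi for Japanese) and the `romanize` helper are assumptions for illustration; the paper does not prescribe this exact code.

```python
from pypinyin import lazy_pinyin  # Chinese characters -> pinyin syllables
import pykakasi                   # Japanese text -> Hepburn romaji

def romanize(text: str, lang: str) -> str:
    """Romanize non-Latin-script input so all languages share one tokenizer."""
    if lang == "zh":
        return " ".join(lazy_pinyin(text))
    if lang == "ja":
        kks = pykakasi.kakasi()
        return " ".join(item["hepburn"] for item in kks.convert(text))
    # Korean can be handled analogously with a Hangul romanizer;
    # Latin-script languages pass through unchanged.
    return text
```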
Experimental Setup and Results
XTTS was evaluated against state-of-the-art (SOTA) competitors including StyleTTS 2, Mega-TTS 2, and HierSpeech++, using metrics such as UTMOS for naturalness and SECS for speaker similarity. It remained competitive even against monolingual models trained on extensive datasets. Notably, XTTS achieved superior character error rates (CER) across its supported languages, indicating highly intelligible synthesized speech. The multilingual evaluation positions XTTS as a SOTA candidate, particularly for lesser-resourced languages, making it versatile across global linguistic contexts.
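For context, CER is the Levenshtein (edit) distance between a reference transcript and a hypothesis, normalized by the reference length; in ZS-TTS evaluation the hypothesis is typically an ASR transcription of the synthesized audio. Below is a minimal self-contained implementation; the `cer` helper is illustrative, not the authors' evaluation code.

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance(ref[:i], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds distance(ref[:i-1], hyp[:j-1])
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

# e.g. cer("hello world", "helo world") == 1 / 11
```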
Implications and Future Directions
XTTS marks a significant advancement in large-scale multilingual TTS. By supporting a broad range of languages, including low- and medium-resource ones, it widens the applicability of TTS systems across diverse linguistic settings and improves accessibility for digitally underserved populations. From a theoretical perspective, XTTS's architecture may inform future work on efficient multilingual model training, underscoring the value of linguistic preprocessing and codebook optimization in multilingual TTS.
Future research could further disentangle prosody from speaker characteristics to enable cross-speaker prosody transfer. Investigating tighter integration of the VQ-VAE decoder into the synthesis pipeline could also streamline inference and improve real-time scalability. Addressing these avenues would continue to extend the scope and efficiency of multilingual speech synthesis.
XTTS thus sets a precedent for forthcoming advances in TTS, balancing performance and resource efficiency while embracing linguistic diversity.