XTTS: Multilingual Zero-Shot TTS
- XTTS is a zero-shot text-to-speech model that synthesizes high-fidelity speech across multiple languages and dialects without requiring speaker-specific fine-tuning.
- It integrates a VQ-VAE encoder/decoder, GPT-2 based autoregressive prior, and HiFi-GAN vocoder to optimize sequence modeling and enhance voice cloning capabilities.
- Robust performance in low- and medium-resource languages is achieved through phoneme augmentation and dialect token injection, ensuring broad linguistic coverage.
XTTS is an extensible, massively multilingual zero-shot text-to-speech (ZS-TTS) model designed to synthesize high-fidelity speech in a large variety of speaker identities and languages without explicit speaker- or language-specific fine-tuning. Built on the Tortoise TTS framework and integrating advances from YourTTS, VALL-E X, Mega-TTS 2, and Voicebox, XTTS introduces architectural modifications and training optimizations enabling robust voice cloning, rapid inference, and broad language support—critically extending coverage to low- and medium-resource languages such as Bangla and Arabic dialects (Casanova et al., 2024, Basher et al., 9 Feb 2025, Doan et al., 2024).
1. Core Architecture
XTTS utilizes a five-component pipeline inherited from Tortoise, but leverages three principal modules for synthesis:
- VQ-VAE Audio Encoder/Decoder: Converts raw waveform inputs (or mel-spectrograms) to continuous latent vectors, discretizes them via a learned codebook, and reconstructs the audio from the resulting code sequence.
- GPT-2-Based Autoregressive Prior: Acts as a joint “acoustic LLM,” autoregressively predicting both text tokens (t_1…t_n) and discrete audio tokens (a_1…a_l) using self-attention over concatenated linguistic and acoustic streams. Speaker identity is injected via a Perceiver Resampler producing a fixed-length sequence of speaker embeddings (s_1…s_k).
- HiFi-GAN Vocoder: Decodes latent representations or predicted audio tokens into high-fidelity waveforms, further conditioned on speaker embeddings.
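The VQ-VAE's discretization step can be sketched as a nearest-neighbor codebook lookup. The following is a minimal numpy illustration; the shapes, function name, and random codebook are hypothetical stand-ins, not XTTS's actual 13M-parameter module:

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (T, D) encoder outputs
    codebook: (K, D) learned code vectors
    Returns (indices, quantized): discrete audio tokens and their vectors.
    """
    # Squared Euclidean distance between every latent and every code vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)    # (T,) discrete token ids
    quantized = codebook[indices]     # (T, D) vectors fed to the decoder
    return indices, quantized

# Toy usage: 4 latent frames, a codebook of 8 entries, dimension 16.
rng = np.random.default_rng(0)
idx, q = quantize(rng.normal(size=(4, 16)), rng.normal(size=(8, 16)))
```

In training, the discrete indices become the audio tokens the GPT-2 prior predicts, while the quantized vectors drive reconstruction.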
The input sequence structure is:
```
[BOS_SPK], s_1…s_k, [BOS_TTS], [lang], ([dialect]), t_1…t_n, [EOS_TTS], [BOS_AUD], a_1…a_l, [EOS_AUD]
```
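This layout can be assembled programmatically. The sketch below uses hypothetical token ids and special-token names purely for illustration; the real XTTS BPE vocabulary defines its own entries:

```python
# Hypothetical special-token ids; the actual XTTS vocabulary defines its own.
SPECIALS = {"[BOS_SPK]": 0, "[BOS_TTS]": 1, "[EOS_TTS]": 2,
            "[BOS_AUD]": 3, "[EOS_AUD]": 4, "[lang:bn]": 5, "[dialect:msa]": 6}

def build_sequence(spk_tokens, text_tokens, audio_tokens,
                   lang="[lang:bn]", dialect=None):
    """Concatenate speaker prompt, text, and audio streams in the XTTS layout:
    [BOS_SPK] s_1..s_k [BOS_TTS] [lang] ([dialect]) t_1..t_n [EOS_TTS]
    [BOS_AUD] a_1..a_l [EOS_AUD]
    """
    seq = [SPECIALS["[BOS_SPK]"], *spk_tokens,
           SPECIALS["[BOS_TTS]"], SPECIALS[lang]]
    if dialect is not None:  # the dialect token is optional, per the layout
        seq.append(SPECIALS[dialect])
    seq += [*text_tokens, SPECIALS["[EOS_TTS]"],
            SPECIALS["[BOS_AUD]"], *audio_tokens, SPECIALS["[EOS_AUD]"]]
    return seq

seq = build_sequence([100, 101], [200, 201, 202], [300, 301],
                     dialect="[dialect:msa]")
```

At inference time, the model is prompted with everything up to `[BOS_AUD]` and autoregressively generates the audio tokens.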
The modules are summarized in the table below:
| Module | Function | Key Innovations |
|---|---|---|
| VQ-VAE | Quantizes waveform/mel into codebook tokens | Aggressive codebook pruning |
| GPT-2 Autoregressive Prior | Models text+audio code sequences | Perceiver multi-slice prompt |
| HiFi-GAN Vocoder | Generates waveform from tokens+embedding | Speaker embedding addition |
XTTS parameterization includes 13M for VQ-VAE, 443M for GPT-2 prior, and 26M for HiFi-GAN (Casanova et al., 2024).
2. Multilingual and Dialectal Adaptation
XTTS was pre-trained on 27.3k hours of speech across 16 languages, leveraging a single shared BPE vocabulary of 6.6k tokens and romanizing CJK scripts to prevent script drift. No explicit language embedding is used; instead, the model learns pronunciation and prosody via joint text/audio conditioning. Language batch balancing prevents English over-representation (Casanova et al., 2024).
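The language batch balancing described above can be approximated by a two-stage sampler: pick a language uniformly, then an utterance within it. This is an illustrative sketch with invented data, not the paper's exact scheme:

```python
import random

def balanced_batch(corpora, batch_size, rng):
    """Draw a batch by first sampling a language uniformly, then an
    utterance within it, so high-resource languages (e.g. English with
    far more hours) cannot dominate the batch.
    `corpora` maps language code -> list of utterance ids (toy stand-ins)."""
    langs = list(corpora)
    batch = []
    for _ in range(batch_size):
        lang = rng.choice(langs)
        batch.append((lang, rng.choice(corpora[lang])))
    return batch

# English has 100x more utterances, yet each language is drawn equally often.
corpora = {"en": list(range(1000)), "bn": list(range(10)), "ar": list(range(50))}
batch = balanced_batch(corpora, 64, random.Random(0))
```

Proportional sampling would put roughly 94% English in every batch; uniform language sampling keeps the expected share at one third each.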
To extend capabilities for Bangla (in BnTTS) and Arabic dialects:
- Bangla Adaptation:
- Augments phoneme inventory to include aspirated stops, retroflex consonants, and inherent vowels (Basher et al., 9 Feb 2025).
- Two-stage pretraining—five epochs of partial-audio prompting, one epoch of full prompt—enhances modeling for short and long utterances.
- No architectural changes for language addition; the phoneme set is merged into the token codebook.
- Arabic Dialects Adaptation:
- Vocabulary expanded with 22 new dialect tokens (21 regional, 1 MSA), growing from 6,681 to 6,703 entries (Doan et al., 2024).
- Dialect information is injected via these token embeddings, conditioning synthesis on the target dialect.
- Pseudo-labeling via ensemble voting from eight dialect ID models assigns dialect conditioning.
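The ensemble pseudo-labeling step reduces to a majority vote over the dialect-ID models' predictions. A minimal sketch (the actual eight models and their tie-breaking policy are not specified here; `Counter` breaks ties by first occurrence):

```python
from collections import Counter

def vote_dialect(predictions):
    """Assign a dialect label by majority vote over an ensemble of
    dialect-ID model outputs (eight models in the paper's setup).
    Ties fall to the earliest-seen label, a simplification."""
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical ensemble outputs for one utterance.
ensemble = ["egy", "msa", "egy", "gulf", "egy", "egy", "msa", "egy"]
label = vote_dialect(ensemble)  # "egy" wins 5 of 8 votes
```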
3. Training Objectives and Optimization
XTTS optimization involves several loss terms, often applied in parallel channels (the formulas below are the standard formulations of each component):
- VQ-VAE Loss: reconstruction error plus codebook and commitment terms, L_VQ = ||x − x̂||² + ||sg[z_e(x)] − e||² + β||z_e(x) − sg[e]||², where sg[·] denotes stop-gradient.
- Autoregressive Cross-Entropy Loss (GPT-2): next-token negative log-likelihood over the concatenated text and audio streams, L_AR = −Σ_i log p(u_i | u_<i), with u the interleaved token sequence.
- HiFi-GAN Generator Loss: a weighted sum of adversarial, feature-matching, and L1 mel-spectrogram reconstruction losses.
- Speaker Consistency Loss (SCL, from YourTTS): negative cosine similarity between speaker embeddings extracted from reference and generated audio.
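The SCL term can be illustrated in a few lines of numpy. This is a sketch of the YourTTS-style formulation; the embedding dimension and the speaker encoder producing the embeddings are placeholders:

```python
import numpy as np

def speaker_consistency_loss(ref_emb, gen_emb):
    """SCL as used in YourTTS-style training: negative cosine similarity
    between speaker embeddings of the reference and generated audio,
    so minimizing the loss maximizes speaker similarity."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    gen = gen_emb / np.linalg.norm(gen_emb)
    return -float(np.dot(ref, gen))

# Identical embeddings give the minimum loss of -1.0; opposed ones give +1.0.
e = np.ones(192)  # 192 is a placeholder embedding dimension
loss_same = speaker_consistency_loss(e, e)
loss_diff = speaker_consistency_loss(e, -e)
```

In training, the generated audio is passed back through a frozen speaker encoder to obtain `gen_emb`, and the loss is added to the generator objective.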
In few-shot speaker adaptation (BnTTS), all but the audio and speaker encoders are updated on multi-speaker studio data; no explicit speaker classification loss is required (Basher et al., 9 Feb 2025).
4. Evaluation Metrics and Empirical Performance
XTTS and its fine-tuned derivatives are assessed by:
- Word/Character Error Rate (WER/CER): edit distance between the input text and an ASR transcript of the synthesized speech, normalized at the word or character level; WER computed via Whisper-Large and the jiwer toolkit.
- Speaker Embedding Cosine Similarity (SECS): cosine similarity between speaker embeddings extracted from the reference and synthesized speech; higher indicates closer voice match.
- Mean Opinion Score (MOS/SMOS): 1–5, 0.5 increments, scored by native annotators.
- SpeechBERTScore: Maximum cosine similarity over SSL embeddings between generated and reference speech.
- Duration Equality: agreement between the generated and reference utterance durations.
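CER, for instance, reduces to a normalized Levenshtein distance. A minimal pure-Python sketch (the actual evaluation uses Whisper-Large for transcription and jiwer for scoring):

```python
def cer(reference, hypothesis):
    """Character Error Rate: Levenshtein edit distance between the
    reference text and the ASR transcript of synthesized speech,
    normalized by the reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m

score = cer("hello world", "helo world")  # one deletion over 11 chars
```

WER is the same computation applied to word sequences instead of character sequences.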
Reported results (Doan et al., 2024, Basher et al., 9 Feb 2025, Casanova et al., 2024):
| Task / Dataset | Model | WER/CER | SECS | MOS/SMOS |
|---|---|---|---|---|
| QASR (Arabic, zero-shot) | Baseline XTTS | 6.42% | 0.755 | — |
| QASR (fine-tuned, w/ dialect) | Fine-tuned | 17.96% | 0.766 | — |
| In-house dialects | Baseline XTTS | 47.16% | 0.790 | 3.61 |
| In-house dialects | Fine-tuned, w/ dialect | 62.74% | 0.825 | 3.19 |
| FLORES+ (multi-lang) | XTTS (avg 16 lang.) | CER=2.06 | 0.50 | — |
| Bangla (BnTTS-n, few-shot) | Ground truth | — | 0.548 | 4.809 (SMOS) |
| Bangla (BnTTS-n) | Fine-tuned | — | 0.586 | 4.601 (SMOS) |
MOS in Bangla attains 4.601 for few-shot adaptation, approaching ground truth (4.809) (Basher et al., 9 Feb 2025). Speaker similarity is enhanced by dialect conditioning (ΔSECS ≈ +0.035), notably for clean, in-house dialect data (Doan et al., 2024).
5. Zero-Shot and Few-Shot Speaker Adaptation
XTTS operates zero-shot by encoding a short reference utterance through the Perceiver Conditioning Encoder, synthesizing target text in the given speaker's voice with no model updates (Casanova et al., 2024, Basher et al., 9 Feb 2025). The adoption of multi-slice audio conditioning and Speaker Consistency Loss allows accurate speaker mimicry, including cross-lingual and cross-dialect transfer.
In few-shot settings (BnTTS), the pipeline is briefly fine-tuned on specific speaker data (20 min, four speakers, 10 epochs), yielding improved MOS, naturalness, and clarity over baseline zero-shot results (Basher et al., 9 Feb 2025).
Strengths include:
- High-fidelity cross-lingual voice cloning without explicit fine-tuning per speaker.
- Robust handling of unseen languages/dialects via vocabulary/phoneme augmentation and dialect token injection.
Limitations highlighted:
- Increased WER after fine-tuning, suggesting degraded phoneme-to-waveform alignment (Doan et al., 2024).
- Gender imbalance in source datasets resulting in reduced voice quality for under-represented groups.
- Pseudo-labeling for dialect annotation may introduce noise; properly annotated dialect data could further enhance performance.
6. Training, Deployment, and Model Optimizations
XTTS training leverages mixed GPU parallelism and large-batch gradient accumulation to stabilize large-model optimization, typically on clusters of NVIDIA A100 GPUs. Hyperparameters (e.g., AdamW optimizer, learning-rate schedules) are inherited from base XTTS defaults; Bangla-specific models tune hyperparameters to mitigate overrunning on short inputs (Doan et al., 2024, Basher et al., 9 Feb 2025). Inference is tuned for output diversity and stability via top-k and top-p sampling, temperature, and length and repetition penalties.
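The sampling controls can be illustrated with a toy top-k/top-p decoder over next-token logits. The hyperparameter values here are illustrative defaults, not XTTS's shipped settings:

```python
import numpy as np

def sample_token(logits, top_k=50, top_p=0.85, temperature=0.75, rng=None):
    """Top-k plus nucleus (top-p) sampling over next-token logits, as used
    when decoding an autoregressive prior. Values are illustrative."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    # Keep only the top_k highest-scoring tokens.
    kth = np.sort(logits)[-top_k] if top_k < len(logits) else -np.inf
    logits = np.where(logits >= kth, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Nucleus filtering: smallest token set with cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    mask /= mask.sum()
    return int(rng.choice(len(probs), p=mask))

tok = sample_token([2.0, 1.0, 0.1, -1.0], top_k=3,
                   rng=np.random.default_rng(0))
```

Lower temperature sharpens the distribution for stability; top-k and top-p trim the low-probability tail that causes babbling or skipped words.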
Key optimizations:
- Perceiver Resampler: Replaces the single style token paradigm with 32 embeddings for richer speaker/style representation without sequence-length inflation.
- Low Frame-Rate Tokenization: At 21.53 Hz (vs. 75 Hz for VALL-E), drastically reduces autoregressive sequence length.
- Shared BPE Vocabulary: Enables efficient multilingual pretraining; romanization of non-Latin scripts normalizes input representations.
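A quick back-of-envelope calculation shows why the lower token rate matters for the autoregressive prior:

```python
# Autoregressive sequence length for 10 seconds of speech at each token rate.
xtts_rate, valle_rate = 21.53, 75.0  # audio tokens per second
secs = 10
xtts_len = round(xtts_rate * secs)   # ~215 tokens for XTTS
valle_len = round(valle_rate * secs) # 750 tokens at VALL-E's rate
reduction = valle_len / xtts_len     # ~3.5x shorter sequences
```

Since self-attention cost grows quadratically with sequence length, a ~3.5x reduction in token count cuts attention compute by roughly an order of magnitude per utterance.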
7. Comparative Analysis and Future Directions
XTTS establishes state-of-the-art (SOTA) performance for zero-shot multilingual TTS, surpassing Mega-TTS 2, StyleTTS 2, and YourTTS in most supported languages on CER and SECS. Subjective MOS and speaker similarity remain competitive but modestly lower than heavily fine-tuned monolingual systems (Casanova et al., 2024).
Proposed future work includes:
- End-to-End VQ-VAE Vocoding: Improving the vector-quantized decoder to bypass the external HiFi-GAN stage and enhance synthesis quality.
- Disentangled Prosody/Speaker Embeddings: Factoring out prosody from speaker identity for flexible, cross-speaker prosody transfer.
XTTS's extensible architecture, demonstrated by adaptations for Bangla and Arabic dialects, positions the model as a foundation for continued expansion to additional languages and dialects with minimal architectural modifications. The public availability of code and models supports reproducibility and community-driven enhancements (Casanova et al., 2024, Basher et al., 9 Feb 2025, Doan et al., 2024).