
XTTS: Multilingual Zero-Shot TTS

Updated 31 January 2026
  • XTTS is a zero-shot text-to-speech model that synthesizes high-fidelity speech across multiple languages and dialects without requiring speaker-specific fine-tuning.
  • It integrates a VQ-VAE encoder/decoder, GPT-2 based autoregressive prior, and HiFi-GAN vocoder to optimize sequence modeling and enhance voice cloning capabilities.
  • Robust performance in low- and medium-resource languages is achieved through phoneme augmentation and dialect token injection, ensuring broad linguistic coverage.

XTTS is an extensible, massively multilingual zero-shot text-to-speech (ZS-TTS) model designed to synthesize high-fidelity speech across a wide range of speaker identities and languages without explicit speaker- or language-specific fine-tuning. Built on the Tortoise TTS framework and integrating advances from YourTTS, VALL-E X, Mega-TTS 2, and Voicebox, XTTS introduces architectural modifications and training optimizations that enable robust voice cloning, rapid inference, and broad language support, critically extending coverage to low- and medium-resource languages such as Bangla and Arabic dialects (Casanova et al., 2024, Basher et al., 9 Feb 2025, Doan et al., 2024).

1. Core Architecture

XTTS inherits Tortoise's five-component pipeline but relies on three principal modules for synthesis:

  • VQ-VAE Audio Encoder/Decoder: Converts raw waveform inputs (or mel-spectrograms $\mathbf{M}\in\mathbb{R}^{T\times F}$) to continuous latent vectors, discretizes them via a learned codebook, and reconstructs $\hat{\mathbf{x}} = D(z)$.
  • GPT-2-Based Autoregressive Prior: Acts as a joint “acoustic LLM,” autoregressively predicting both text tokens ($t_1 \ldots t_n$) and discrete audio tokens ($a_1 \ldots a_l$) using self-attention over concatenated linguistic and acoustic streams. Speaker identity is injected via a Perceiver Resampler producing a fixed-length speaker embedding $\mathbf{h}$.
  • HiFi-GAN Vocoder: Decodes latent representations or predicted audio tokens into high-fidelity waveforms, further conditioned on speaker embeddings.

The input sequence structure is:

[BOS_SPK], s_1…s_k, [BOS_TTS], [lang], ([dialect]), t_1…t_n, [EOS_TTS], [BOS_AUD], a_1…a_l, [EOS_AUD]
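
As a minimal illustration, the sequence layout above can be assembled as follows; the helper function and token strings are hypothetical stand-ins, not the actual XTTS tokenizer API.

```python
# Hypothetical sketch of the XTTS prompt layout; token names and the
# build_prompt helper are illustrative, not the real XTTS tokenizer API.
def build_prompt(speaker_latents, text_tokens, audio_tokens, lang, dialect=None):
    """Assemble the interleaved conditioning / text / audio token sequence."""
    seq = ["[BOS_SPK]"] + list(speaker_latents) + ["[BOS_TTS]", f"[{lang}]"]
    if dialect is not None:                      # the dialect token is optional
        seq.append(f"[{dialect}]")
    seq += list(text_tokens) + ["[EOS_TTS]"]
    seq += ["[BOS_AUD]"] + list(audio_tokens) + ["[EOS_AUD]"]
    return seq

prompt = build_prompt(["s1", "s2"], ["t1", "t2", "t3"], ["a1", "a2"], "en")
```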

The modules are summarized in the table below:

| Module | Function | Key Innovations |
|---|---|---|
| VQ-VAE | Quantizes waveform/mel into codebook tokens | Aggressive codebook pruning |
| GPT-2 Autoregressive Prior | Models text+audio code sequences | Perceiver multi-slice prompt |
| HiFi-GAN Vocoder | Generates waveform from tokens+embedding | Speaker embedding addition |

XTTS comprises approximately 13M parameters in the VQ-VAE, 443M in the GPT-2 prior, and 26M in the HiFi-GAN vocoder (Casanova et al., 2024).

2. Multilingual and Dialectal Adaptation

XTTS was pre-trained on 27.3k hours of speech across 16 languages, leveraging a single shared BPE vocabulary of 6.6k tokens and romanizing CJK scripts to prevent script drift. No explicit language embedding is used; instead, the model learns pronunciation and prosody via joint text/audio conditioning. Language batch balancing prevents English over-representation (Casanova et al., 2024).
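
A minimal sketch of language batch balancing, assuming a simple round-robin draw over languages; the actual XTTS weighting scheme is not specified here.

```python
import random

# Illustrative language-balanced batch sampler (assumption: round-robin over
# languages with uniform sampling inside each language pool; the real XTTS
# balancing scheme may weight languages differently).
def balanced_batch(datasets_by_lang, batch_size, rng=random.Random(0)):
    langs = list(datasets_by_lang)
    batch = []
    for i in range(batch_size):
        lang = langs[i % len(langs)]          # round-robin over languages
        batch.append(rng.choice(datasets_by_lang[lang]))
    return batch
```

Round-robin guarantees no single language (e.g., English) dominates a batch regardless of corpus size.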

To extend capabilities for Bangla (in BnTTS) and Arabic dialects:

  • Bangla Adaptation:
    • Augments phoneme inventory to include aspirated stops, retroflex consonants, and inherent vowels (Basher et al., 9 Feb 2025).
    • Two-stage pretraining (five epochs of partial-audio prompting followed by one epoch with full prompts) improves modeling of both short and long utterances.
    • No architectural changes for language addition; the phoneme set $\mathcal{P}_{bn}$ is merged into the token codebook.
  • Arabic Dialects Adaptation:
    • Vocabulary expanded with 22 new dialect tokens (21 regional, 1 MSA), growing from 6,681 to 6,703 entries (Doan et al., 2024).
    • Dialect information injected as token embeddings initialized from $\mathcal{N}(0,1)$.
    • Pseudo-labeling via ensemble voting from eight dialect ID models assigns dialect conditioning.
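
The vocabulary expansion can be sketched numerically; the embedding dimension of 1024 is an assumption, while the token counts match the figures above.

```python
import numpy as np

# Sketch of extending a token embedding table with new dialect tokens
# initialized from N(0, 1). The sizes match the paper (6,681 -> 6,703 after
# adding 21 regional + 1 MSA token); the 1024 embedding dim is assumed.
rng = np.random.default_rng(0)
old_embed = rng.standard_normal((6681, 1024))   # existing BPE embeddings
new_rows = rng.standard_normal((22, 1024))      # 22 new dialect tokens
embed = np.concatenate([old_embed, new_rows], axis=0)
```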

3. Training Objectives and Optimization

XTTS optimization involves several loss terms, often in parallel channels:

  • VQ-VAE Loss:

$$\mathcal{L}_{\mathrm{VQ\text{-}VAE}} = \mathbb{E}_x \left[ \|x - D(\mathrm{quantize}(E(x)))\|^2 \right] + \beta\, \|\mathrm{sg}(E(x)) - z\|^2 + \|E(x) - \mathrm{sg}(z)\|^2$$
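
A forward-only numeric sketch of this objective, with toy encoder/decoder stand-ins and stop-gradient treated as the identity (as it is in the forward pass):

```python
import numpy as np

# Numeric sketch of the VQ-VAE objective. E, D, and the codebook are toy
# stand-ins; sg(.) is the identity in this forward-only demo, so the
# commitment and codebook terms coincide numerically.
def vq_vae_loss(x, E, D, codebook, beta=0.25):
    z_e = E(x)                                            # encoder output (B, d)
    idx = np.argmin(((codebook - z_e[:, None]) ** 2).sum(-1), axis=1)
    z_q = codebook[idx]                                   # nearest codebook entries
    recon = ((x - D(z_q)) ** 2).mean()                    # reconstruction term
    commit = beta * ((z_e - z_q) ** 2).mean()             # commitment (sg on z_q)
    codebk = ((z_e - z_q) ** 2).mean()                    # codebook term (sg on z_e)
    return recon + commit + codebk
```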

  • Autoregressive Cross-Entropy Loss (GPT-2):

$$\mathcal{L}_{CE} = -\sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, s, \mathrm{lang}, [\mathrm{dialect}])$$
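
A toy NumPy computation of this cross-entropy from per-step logits; the conditioning on $s$, language, and dialect is implicit in the logits here.

```python
import numpy as np

# Toy autoregressive cross-entropy: sum of negative log-probabilities of the
# target tokens under per-step logits. Conditioning (speaker, lang, dialect)
# is assumed to be baked into the logits by the model.
def ar_cross_entropy(logits, targets):
    """-sum_t log P(y_t | y_<t, conditioning), logits shape (T, vocab)."""
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))  # log-softmax
    return -logp[np.arange(len(targets)), targets].sum()
```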

  • HiFi-GAN Generator Loss (adversarial, feature matching, and mel-reconstruction):

$$\mathcal{L}_G = \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{FM}}\, \mathcal{L}_{\mathrm{FM}} + \lambda_{\mathrm{mel}}\, \|\mathrm{STFT}(x) - \mathrm{STFT}(\hat{x})\|_1$$

  • Speaker Consistency Loss (SCL, from YourTTS):

$$\mathcal{L}_{\mathrm{SCL}} = 1 - \cos\left( S(x_{\mathrm{ref}}), S(\hat{x}) \right)$$
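
A minimal NumPy version of the SCL, with the speaker encoder $S(\cdot)$ replaced by precomputed embeddings:

```python
import numpy as np

# Speaker Consistency Loss sketch: 1 - cosine similarity between the speaker
# embeddings of reference and synthesized audio. The embeddings stand in for
# the outputs of the speaker encoder S(.).
def scl(e_ref, e_hat):
    cos = e_ref @ e_hat / (np.linalg.norm(e_ref) * np.linalg.norm(e_hat))
    return 1.0 - cos
```

Identical embeddings give zero loss; orthogonal embeddings give a loss of 1.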

In few-shot speaker adaptation (BnTTS), all but the audio and speaker encoders are updated on multi-speaker studio data; no explicit speaker classification loss is required (Basher et al., 9 Feb 2025).

4. Evaluation Metrics and Empirical Performance

XTTS and its fine-tuned derivatives are assessed by:

  • Word/Character Error Rate (WER/CER): $\mathrm{CER} = \frac{S+D+I}{N}$; WER computed via Whisper-Large and Jiwer.
  • Speaker Embedding Cosine Similarity (SECS):

$$\mathrm{SECS}(x,y) = \frac{ \langle s(x), s(y) \rangle }{ \| s(x) \|\, \| s(y) \| }$$

  • Mean Opinion Score (MOS/SMOS): 1–5, 0.5 increments, scored by native annotators.
  • SpeechBERTScore: Maximum cosine similarity over SSL embeddings between generated and reference speech.
  • Duration Equality: $1 / \max(a/b,\, b/a)$ for durations $a, b$.
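
Toy reference implementations of three of these metrics (a minimal edit-distance CER, SECS, and duration equality); these are illustrative, not the Jiwer or evaluation-suite implementations.

```python
import numpy as np

# Minimal Levenshtein-based CER: (S + D + I) / N via dynamic programming.
# Illustrative only; real evaluations use Jiwer / Whisper pipelines.
def cer(ref, hyp):
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                       # deletion
                          d[i, j - 1] + 1,                       # insertion
                          d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]))  # sub
    return d[-1, -1] / len(ref)

# Cosine similarity between speaker embeddings (SECS).
def secs(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Duration equality: 1 for equal durations, < 1 otherwise.
def duration_equality(a, b):
    return 1.0 / max(a / b, b / a)
```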

Reported results (Doan et al., 2024, Basher et al., 9 Feb 2025, Casanova et al., 2024):

| Task / Dataset | Model | WER/CER | SECS | MOS/SMOS |
|---|---|---|---|---|
| QASR (Arabic, zero-shot) | Baseline XTTS | 6.42% | 0.755 | |
| QASR (fine-tuned, w/ dialect) | Fine-tuned | 17.96% | 0.766 | |
| In-house dialects | Baseline XTTS | 47.16% | 0.790 | 3.61 |
| In-house dialects | Fine-tuned, w/ dialect | 62.74% | 0.825 | 3.19 |
| FLORES+ (multi-lang) | XTTS (avg 16 lang.) | 2.06 (CER) | 0.50 | |
| Bangla (BnTTS-n, few-shot) | Ground truth | | 0.548 | 4.809 (SMOS) |
| Bangla (BnTTS-n) | Fine-tuned | | 0.586 | 4.601 (SMOS) |

SMOS for Bangla reaches 4.601 under few-shot adaptation, approaching the ground-truth score of 4.809 (Basher et al., 9 Feb 2025). Speaker similarity (SECS) improves with dialect conditioning ($\Delta$SECS ≈ 0.035), notably on clean, in-house dialect data (Doan et al., 2024).

5. Zero-Shot and Few-Shot Speaker Adaptation

XTTS operates zero-shot by encoding a short reference utterance through the Perceiver Conditioning Encoder, synthesizing target text in the given speaker's voice with no model updates (Casanova et al., 2024, Basher et al., 9 Feb 2025). The adoption of multi-slice audio conditioning and Speaker Consistency Loss allows accurate speaker mimicry, including cross-lingual and cross-dialect transfer.

In few-shot settings (BnTTS), the pipeline is briefly fine-tuned on specific speaker data (20 min, four speakers, 10 epochs), yielding improved MOS, naturalness, and clarity over baseline zero-shot results (Basher et al., 9 Feb 2025).

Strengths include:

  • High-fidelity cross-lingual voice cloning without explicit fine-tuning per speaker.
  • Robust handling of unseen languages/dialects via vocabulary/phoneme augmentation and dialect token injection.

Limitations highlighted:

  • Increased WER after fine-tuning, suggesting degraded phoneme-to-waveform alignment (Doan et al., 2024).
  • Gender imbalance in source datasets resulting in reduced voice quality for under-represented groups.
  • Pseudo-labeling for dialect annotation may introduce noise; properly annotated dialect data could further enhance performance.

6. Training, Deployment, and Model Optimizations

XTTS training leverages mixed GPU parallelism and large-batch gradient accumulation to stabilize large-model optimization, typically on clusters of NVIDIA A100 GPUs. Hyperparameters (e.g., the AdamW optimizer and learning-rate schedules) are inherited from base XTTS defaults; Bangla-specific models tune hyperparameters to mitigate overruns on short inputs (Doan et al., 2024, Basher et al., 9 Feb 2025). Inference is tuned for output diversity and stability via top-$k$, top-$p$, temperature, and length and repetition penalties.
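
Top-k and top-p (nucleus) filtering of a logits vector can be sketched as follows; this is the standard technique, written independently of any XTTS code.

```python
import numpy as np

# Standard top-k / top-p (nucleus) filtering of a logits vector before
# sampling; disallowed tokens are set to -inf so they get zero probability.
def filter_logits(logits, top_k=0, top_p=1.0):
    logits = logits.copy()
    if top_k > 0:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf           # keep only the k best tokens
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]         # descending by logit
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        keep = np.cumsum(probs) <= top_p         # smallest nucleus covering top_p
        keep[0] = True                           # always keep the best token
        mask = np.full_like(logits, -np.inf)
        mask[order[keep]] = logits[order[keep]]
        logits = mask
    return logits
```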

Key optimizations:

  • Perceiver Resampler: Replaces the single style token paradigm with 32 embeddings for richer speaker/style representation without sequence-length inflation.
  • Low Frame-Rate Tokenization: At 21.53 Hz (vs. 75 Hz for VALL-E), drastically reduces autoregressive sequence length.
  • Shared BPE Vocabulary: Enables efficient multilingual pretraining; romanization of non-Latin scripts normalizes input representations.
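
The sequence-length saving from the low frame rate is simple arithmetic; for a 10-second utterance:

```python
# Sequence-length arithmetic for the low frame-rate tokenizer: audio token
# count for a 10-second utterance at XTTS's 21.53 Hz vs. VALL-E's 75 Hz.
def num_tokens(duration_s, frame_rate_hz):
    return round(duration_s * frame_rate_hz)

xtts_len = num_tokens(10, 21.53)   # ~215 tokens
valle_len = num_tokens(10, 75)     # 750 tokens
```

The autoregressive prior thus attends over roughly 3.5x fewer audio tokens per second of speech.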

7. Comparative Analysis and Future Directions

XTTS establishes state-of-the-art (SOTA) performance for zero-shot multilingual TTS, surpassing Mega-TTS 2, StyleTTS 2, and YourTTS in most supported languages on CER and SECS. Subjective MOS and speaker similarity remain competitive but modestly lower than heavily fine-tuned monolingual systems (Casanova et al., 2024).

Proposed future work includes:

  • End-to-End VQ-VAE Vocoding: Improving the vector-quantized decoder to bypass the external HiFi-GAN stage and enhance synthesis quality.
  • Disentangled Prosody/Speaker Embeddings: Factoring out prosody from speaker identity for flexible, cross-speaker prosody transfer.

XTTS's extensible architecture, demonstrated by adaptations for Bangla and Arabic dialects, positions the model as a foundation for continued expansion to additional languages and dialects with minimal architectural modifications. The public availability of code and models supports reproducibility and community-driven enhancements (Casanova et al., 2024, Basher et al., 9 Feb 2025, Doan et al., 2024).
