
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone (2112.02418v4)

Published 4 Dec 2021 in cs.SD, cs.CL, and eess.AS

Abstract: YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.

Authors (6)
  1. Edresson Casanova (20 papers)
  2. Julian Weber (5 papers)
  3. Christopher Shulby (7 papers)
  4. Eren Gölge (7 papers)
  5. Moacir Antonelli Ponti (20 papers)
  6. Arnaldo Candido Junior (16 papers)
Citations (328)

Summary

  • The paper presents a state-of-the-art YourTTS model that excels in zero-shot multi-speaker TTS and voice conversion by outperforming previous methods on the VCTK dataset.
  • The paper demonstrates novel multilingual zero-shot capabilities, enabling natural-sounding speech synthesis in low-resource languages with minimal training data.
  • The paper achieves efficient speaker adaptation by fine-tuning with less than a minute of data, effectively supporting personalized voice synthesis and cross-lingual conversion.

Overview of the YourTTS Paper

The paper, "YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone," presents an advanced approach for tackling the challenges in zero-shot multi-speaker Text-to-Speech (TTS) and voice conversion tasks. The authors introduce the YourTTS model, which expands upon the existing VITS model with critical innovations designed to enhance zero-shot learning for multilingual multi-speaker scenarios. The paper situates itself within the TTS domain, where synthesizing speech for new, previously unseen speakers in a zero-shot framework is a pressing challenge.

Contributions and Methodology

The paper details several key contributions:

  1. State-of-the-Art Performance in Zero-Shot Multi-Speaker TTS for English: The YourTTS model achieves state-of-the-art results on the VCTK dataset, outperforming prior models like Attentron and SC-GlowTTS. This is a notable advancement in handling previously unseen speaker data, producing natural-sounding speech in the English language.
  2. Multilingual Zero-Shot Capabilities: For the first time in this domain, the authors propose a multilingual approach, demonstrating YourTTS's ability to synthesize speech in a target language with minimal training exposure, i.e., using a single-speaker dataset. This innovation extends the utility of YourTTS to low-resource languages, a significant step towards universal TTS solutions.
  3. Efficient Speaker Adaptation: The YourTTS model allows fine-tuning with less than a minute of additional data, adapting successfully to speakers with distinct vocal characteristics and recording conditions not represented in the initial training datasets.

Experimental Findings

The paper reports on extensive experimentation across several datasets, including VCTK for English, TTS-Portuguese for Portuguese, and fr_FR from the M-AILABS dataset for French, integrating multiple languages to evaluate YourTTS's cross-lingual proficiency. The metrics used include Mean Opinion Scores (MOS) for quality assessment and Speaker Encoder Cosine Similarity (SECS) alongside Similarity MOS (Sim-MOS) for measuring voice similarity.
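The SECS metric mentioned above is simply the cosine similarity between speaker embeddings of the reference and synthesized audio. The paper computes embeddings with a pretrained speaker encoder; the sketch below replaces that encoder with placeholder vectors and shows only the similarity computation itself.

```python
import numpy as np

def secs(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Speaker Encoder Cosine Similarity between two speaker embeddings.

    Ranges from -1 to 1; higher means the synthesized voice lies closer
    to the reference speaker in the encoder's embedding space.
    """
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

# Toy example: random vectors standing in for a real speaker encoder's
# output (e.g. 512-dim utterance-level embeddings).
rng = np.random.default_rng(0)
ref = rng.standard_normal(512)                 # reference speaker
gen = ref + 0.1 * rng.standard_normal(512)     # similar synthesized voice
similarity = secs(ref, gen)
```

In the paper's evaluation, higher SECS on unseen speakers indicates better zero-shot voice cloning; MOS and Sim-MOS complement it with human judgments.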

  1. VCTK Dataset Results: In English, YourTTS achieved a high MOS and Sim-MOS comparable to ground truth recordings, affirming its state-of-the-art status in zero-shot settings. The use of the Speaker Consistency Loss (SCL) improved SECS scores, though its effect on Sim-MOS was less conclusive.
  2. Multilingual and Low-Resource Language Performance: For Portuguese, the model demonstrated noteworthy MOS and Sim-MOS, especially given the limited data. Experimentation with French was less emphasized due to dataset limitations but provided preliminary insights into multilingual training's impact.
  3. Applications in Voice Conversion: YourTTS also facilitates cross-lingual zero-shot voice conversion, achieving MOS and Sim-MOS comparable to ground truth for intra-lingual conversions. The authors highlight challenges in converting between languages like English and Portuguese, particularly concerning gender discrepancies due to the lack of female voice data in training.
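The Speaker Consistency Loss noted in the VCTK results can be sketched as a negative mean cosine similarity between speaker embeddings of generated and ground-truth audio, scaled by a weight alpha. The paper drives this with a pretrained speaker encoder; here, placeholder embedding matrices stand in for encoder outputs, and the specific alpha value is illustrative rather than the paper's setting.

```python
import numpy as np

def speaker_consistency_loss(gen_embs: np.ndarray,
                             ref_embs: np.ndarray,
                             alpha: float = 9.0) -> float:
    """Negative mean cosine similarity between speaker embeddings of
    generated and reference utterances, scaled by alpha. Minimizing it
    pushes the synthesized voice toward the reference speaker.
    """
    g = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    cos_sim = np.sum(g * r, axis=1)   # per-utterance cosine similarity
    return float(-alpha * np.mean(cos_sim))

# Toy batch: 4 utterances with 256-dim stand-in embeddings.
rng = np.random.default_rng(1)
ref_batch = rng.standard_normal((4, 256))
gen_batch = ref_batch + 0.05 * rng.standard_normal((4, 256))
scl = speaker_consistency_loss(gen_batch, ref_batch)
```

When generated and reference embeddings coincide, the loss reaches its minimum of `-alpha`; during training this term is added to the model's existing objective.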

Implications and Future Directions

The YourTTS framework promises considerable practical applications, including the potential for developing robust TTS systems in low-resource languages. It may significantly benefit industries relying on multilingual voice synthesis and personalization of voice assistants and other vocal applications. However, limitations remain, such as instability in duration prediction and pronunciation challenges when phonetic input is not used.

Future research should focus on refining the stochastic duration predictor and expanding the linguistic diversity of training datasets to enhance the model's robustness. Furthermore, integrating phonetic information could alleviate pronunciation issues. The adaptation capabilities of YourTTS provide an exciting avenue for further innovation, potentially reducing the data requirements for high-quality speaker adaptation drastically.

In summary, YourTTS represents a substantial contribution to the TTS field, advancing the capabilities of zero-shot learning in a multilingual context while highlighting the potential for real-world applications in diverse linguistic environments.
