
The THU-HCSI Multi-Speaker Multi-Lingual Few-Shot Voice Cloning System for LIMMITS'24 Challenge (2404.16619v1)

Published 25 Apr 2024 in cs.SD and eess.AS

Abstract: This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix them with the pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and obtains a considerable naturalness MOS of 3.97.
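The abstract does not spell out how the speaker-balanced sampling is done. One common reading, given here only as a minimal sketch and not as the authors' implementation, is to weight each utterance inversely to its speaker's utterance count, so that a few-shot target speaker is drawn as often as a data-rich pre-training speaker once the two sets are mixed. The function names, speaker labels, and toy data below are all hypothetical.

```python
import random
from collections import Counter


def speaker_balanced_weights(speaker_ids):
    """Return one sampling weight per utterance.

    Assumption (not from the paper): each speaker gets the same total
    probability mass, split evenly across that speaker's utterances.
    """
    counts = Counter(speaker_ids)
    n_speakers = len(counts)
    return [1.0 / (n_speakers * counts[s]) for s in speaker_ids]


def sample_batch(utterances, speaker_ids, batch_size, rng=None):
    """Draw a batch with replacement using the speaker-balanced weights."""
    rng = rng or random.Random()
    weights = speaker_balanced_weights(speaker_ids)
    return rng.choices(utterances, weights=weights, k=batch_size)


# Hypothetical mixed corpus: two pre-training speakers and one few-shot target.
speaker_ids = ["pretrain_a"] * 6 + ["pretrain_b"] * 3 + ["target"] * 1
utterances = [f"utt_{i}" for i in range(len(speaker_ids))]

weights = speaker_balanced_weights(speaker_ids)
batch = sample_batch(utterances, speaker_ids, batch_size=4, rng=random.Random(0))
```

With these weights the single target utterance carries the same total probability as all six utterances of `pretrain_a` combined, which is what keeps the few-shot speaker from being drowned out during fine-tuning.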

References (5)
  1. “LIMMITS’24: Multi-speaker, Multi-lingual Indic TTS with voice cloning,” submitted to ICASSP 2024, 2024, https://sites.google.com/view/limmits24/.
  2. “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone,” in International Conference on Machine Learning. PMLR, 2022, pp. 2709–2720.
  3. “VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design,” in Proc. INTERSPEECH 2023, 2023, pp. 4374–4378.
  4. “FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7857–7861.
  5. “Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation,” arXiv preprint arXiv:2210.15868, 2022.
Authors (5)
  1. Yixuan Zhou (30 papers)
  2. Shuoyi Zhou (4 papers)
  3. Shun Lei (21 papers)
  4. Zhiyong Wu (171 papers)
  5. Menglin Wu (3 papers)