
Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages (2401.13851v2)

Published 24 Jan 2024 in cs.SD, cs.LG, and eess.AS

Abstract: In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions. RAD-MMM performs competitively on Tracks 1 and 2, while P-Flow ranks first on Track 3, with a mean opinion score (MOS) of 4.4 and a speaker similarity score (SMOS) of 3.62.
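The abstract describes a standard two-stage neural TTS pipeline: an acoustic model (RAD-MMM for few-shot, P-Flow for zero-shot) predicts a mel-spectrogram conditioned on the target text and a speaker reference, and a HiFi-GAN vocoder converts that mel-spectrogram into a waveform. The sketch below illustrates only the shape of that pipeline; the ToyAcousticModel and ToyVocoder classes are hypothetical stand-ins, not NVIDIA's released models or the paper's actual code.

```python
# Minimal sketch of the two-stage TTS pipeline described in the abstract.
# Assumptions: ToyAcousticModel and ToyVocoder are illustrative stand-ins for
# the acoustic model (RAD-MMM / P-Flow) and the HiFi-GAN vocoder, respectively.
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Stand-in for a zero-shot TTS acoustic model (e.g., P-Flow).
    Maps token IDs plus a short speaker-reference mel to a mel-spectrogram."""
    def __init__(self, vocab_size=100, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_mels)

    def forward(self, tokens, speaker_ref_mel):
        # Real models condition on the reference through a speaker/prompt encoder;
        # here the reference mel's mean is added as a crude conditioning signal.
        mel = self.embed(tokens).transpose(1, 2)              # (B, n_mels, T)
        return mel + speaker_ref_mel.mean(dim=2, keepdim=True)

class ToyVocoder(nn.Module):
    """Stand-in for HiFi-GAN: upsamples a mel-spectrogram to audio samples."""
    def __init__(self, n_mels=80, hop_length=256):
        super().__init__()
        self.proj = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop_length, stride=hop_length)

    def forward(self, mel):
        return self.proj(mel).squeeze(1)                      # (B, T * hop_length)

acoustic = ToyAcousticModel()
vocoder = ToyVocoder()
tokens = torch.randint(0, 100, (1, 50))   # phoneme/character IDs for the target text
speaker_ref = torch.randn(1, 80, 120)     # mel frames from a short target-speaker reference clip
with torch.no_grad():
    mel = acoustic(tokens, speaker_ref)   # stage 1: text + reference -> mel-spectrogram
    audio = vocoder(mel)                  # stage 2: mel-spectrogram -> waveform
print(audio.shape)                        # (1, 50 * 256) samples
```

In the few-shot setting (Tracks 1 and 2), the acoustic model is additionally fine-tuned on roughly 5 minutes of target-speaker data before inference; in the zero-shot setting (Track 3), the speaker-reference prompt alone supplies the speaker identity.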

Authors (5)
  1. Akshit Arora (2 papers)
  2. Rohan Badlani (13 papers)
  3. Sungwon Kim (32 papers)
  4. Rafael Valle (31 papers)
  5. Bryan Catanzaro (123 papers)

