Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training (2201.08124v1)

Published 20 Jan 2022 in cs.SD, cs.AI, and eess.AS

Abstract: Cross-lingual speech synthesis aims to synthesize speech in multiple languages for a monoglot speaker. Normally, only data from monoglot speakers is available for model training, so the speaker similarity between the synthesized cross-lingual speech and the speaker's native-language recordings is relatively low. Building on a multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve cross-lingual speaker similarity. To improve speaker similarity further, joint training with a speaker classifier is proposed. A scheme similar to parallel scheduled sampling is used to train the transformer model efficiently, so that introducing joint training does not break the parallel training mechanism. In both subjective and objective evaluations, multi-task learning and speaker classifier joint training consistently improve cross-lingual speaker similarity for speakers seen and unseen during training.
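
The abstract mentions a scheme "similar to parallel scheduled sampling" without giving details. As a rough illustration of the general idea only, and not the authors' scheme, the sketch below runs a decoder twice: a first teacher-forced pass produces predictions for all time steps at once, then each gold frame is kept with some probability or swapped for its first-pass prediction, and a second pass runs on the mixed sequence. Both passes process every time step together, which is why training stays parallel. All names here (`shift_right`, `two_pass_decode`, `toy_decoder`, `gold_prob`, `go_frame`) are hypothetical, frames are plain floats rather than mel-spectrogram vectors, and the decoder is a toy stand-in rather than a transformer.

```python
import random


def shift_right(frames, go_frame):
    """Build decoder inputs: a start frame followed by all but the last frame."""
    return [go_frame] + list(frames[:-1])


def two_pass_decode(decoder, targets, go_frame, gold_prob, rng):
    """Two-pass decoding in the spirit of parallel scheduled sampling.

    decoder   -- hypothetical callable mapping a list of input frames to a
                 list of predicted frames of the same length, all at once
    targets   -- gold frames (floats here, for illustration only)
    gold_prob -- probability of keeping the gold frame at each time step
    """
    # Pass 1: ordinary teacher forcing; first_pass[t] estimates targets[t].
    first_pass = decoder(shift_right(targets, go_frame))

    # Mix gold frames and first-pass predictions independently per time step.
    mixed = [
        gold if rng.random() < gold_prob else pred
        for gold, pred in zip(targets, first_pass)
    ]

    # Pass 2: decode again from the mixed history, still in one parallel call.
    return decoder(shift_right(mixed, go_frame))


def toy_decoder(inputs):
    """Stand-in for a real decoder: adds a fixed offset to every input frame."""
    return [x + 0.5 for x in inputs]


if __name__ == "__main__":
    rng = random.Random(0)
    print(two_pass_decode(toy_decoder, [1.0, 2.0, 3.0], 0.0, 0.7, rng))
```

With `gold_prob` set to 1.0 the second pass reduces to plain teacher forcing; lowering it exposes the decoder to more of its own predictions. In a joint-training setup, the second-pass output would be the natural input to an auxiliary module such as a speaker classifier, though how the paper wires this up is not stated in the abstract.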

Authors (2)

J. Yang (281 papers)
Lei He (120 papers)

Citations (10)
