Handling overlapping character sets in multilingual TTS

Determine effective strategies to prevent performance degradation in CosyVoice 2's multilingual text-to-speech synthesis for languages with overlapping character sets (for example, the characters shared by Chinese and Japanese), so that pronunciation remains accurate and speech remains natural for such inputs.

Background

The paper introduces CosyVoice 2, a streaming-capable zero-shot TTS system trained on large-scale multilingual data. While it achieves near human-parity in many settings, the authors note that performance varies across languages.

In the evaluation on Japanese and Korean benchmarks, the authors attribute the weaker Japanese results partly to the overlap between the Japanese and Chinese character sets, which led the model to produce Chinese pronunciations in Japanese contexts. This motivates a broader concern about handling cross-language character ambiguity in multilingual TTS.

The Limitations section explicitly highlights this issue as an open challenge, indicating the need for methods that robustly address overlapping character sets to maintain synthesis quality across languages.
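
The ambiguity described above can be made concrete with a minimal sketch. It is only an illustration of why text consisting solely of shared Han characters carries no script-level cue about its language; it is not the method used in CosyVoice 2, and the function names `script_profile` and `guess_language` are hypothetical. The sketch assumes that the presence of kana or Hangul is a sufficient signal for Japanese or Korean, which is a simplification.

```python
import unicodedata


def script_profile(text):
    """Count non-space characters by coarse script class, using Unicode names."""
    counts = {"han": 0, "kana": 0, "hangul": 0, "other": 0}
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            counts["han"] += 1      # shared by Chinese and Japanese
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            counts["kana"] += 1     # Japanese-specific scripts
        elif name.startswith("HANGUL"):
            counts["hangul"] += 1   # Korean-specific script
        else:
            counts["other"] += 1
    return counts


def guess_language(text):
    """Heuristic guess (an assumption of this sketch, not a real language ID)."""
    profile = script_profile(text)
    if profile["kana"] > 0:
        return "ja"
    if profile["hangul"] > 0:
        return "ko"
    if profile["han"] > 0:
        # Han-only text: the characters alone do not say Chinese or Japanese.
        return "ambiguous"
    return "unknown"


if __name__ == "__main__":
    for sample in ["大学", "大学に行く", "한국어"]:
        print(sample, script_profile(sample), guess_language(sample))
```

The word 大学 is written identically in Japanese and in Simplified Chinese but is read differently in each, so a system that sees only the characters must obtain the language from somewhere else, such as surrounding kana or an explicit language label supplied with the input.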

References

For languages with overlapping character sets, synthesis performance may degrade, presenting an open challenge for future research.

— CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models (2412.10117 - Du et al., 2024) in Section Limitations
