- The paper introduces a novel three-step cycle-consistent method that effectively separates timbre from content for multi-lingual voice conversion.
- It achieves superior speaker similarity and intelligibility, with a SIM score of 0.395 and a WER of 2.24 on the VCTK dataset.
- The approach generalizes well to unseen languages, opening new applications in multilingual communication and personalized speech synthesis.
Essay on MulliVC: Multi-lingual Voice Conversion With Cycle Consistency
Introduction
Voice conversion (VC) is a core task in speech processing that transforms a source speaker's voice so that it resembles a target speaker while preserving the original speech content. Advances in speech representations and synthesis models have significantly propelled the field. VC systems traditionally use either parallel or non-parallel training, depending on the available data; the non-parallel setting, which does not require different speakers to utter identical phrases, dominates current practice because parallel data is hard to acquire. Multi-lingual voice conversion, which spans both monolingual and cross-lingual scenarios, remains particularly challenging, chiefly because prosody and articulation vary across languages and because paired multi-lingual recordings from the same speaker are scarce.
The Proposed Approach: MulliVC
The authors introduce MulliVC, a novel architecture that addresses multi-lingual voice conversion without relying on paired data from bilingual speakers. The core innovation is a three-step training cycle that disentangles timbre from other speech attributes such as content and prosody. The training methodology breaks down as follows (a minimal code sketch follows the list):
- Monolingual Step: the model first processes monolingual data, synthesizing speech from content and timbre features that originate from the same language and speaker.
- Cross-lingual Cycle Step 1: content and timbre features are drawn from different languages. Forcing the model to preserve one speaker's timbre while the content comes from another language strengthens timbre disentanglement and cross-lingual conversion.
- Cross-lingual Cycle Step 2: the speech produced in the previous step serves as new input, and the model reconstructs speech that combines the converted timbre with content in the original language, enforcing cycle consistency.
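To make the cycle concrete, here is a minimal, hypothetical PyTorch sketch of the three-step objective. The toy linear encoders and every name in it (ToyVC, mel_a, mel_b) are illustrative placeholders, not the paper's architecture; only the structure of the losses mirrors the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVC(nn.Module):
    """Toy stand-in for the MulliVC generator: (content, timbre) -> mel."""

    def __init__(self, n_mels: int = 80, hid: int = 128):
        super().__init__()
        self.content_enc = nn.Linear(n_mels, hid)  # placeholder content encoder
        self.timbre_enc = nn.Linear(n_mels, hid)   # placeholder timbre encoder
        self.dec = nn.Linear(2 * hid, n_mels)      # placeholder decoder

    def forward(self, content_mel, timbre_mel):
        c = self.content_enc(content_mel)                          # frame-level content
        t = self.timbre_enc(timbre_mel).mean(dim=1, keepdim=True)  # utterance-level timbre
        return self.dec(torch.cat([c, t.expand_as(c)], dim=-1))


model = ToyVC()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy mel batches: speaker A speaks language 1, speaker B speaks language 2.
mel_a = torch.randn(4, 100, 80)  # (batch, frames, mel bins)
mel_b = torch.randn(4, 100, 80)

# Step 1 (monolingual): content and timbre come from the same utterance.
loss_mono = F.l1_loss(model(mel_a, mel_a), mel_a)

# Step 2 (cross-lingual): content from speaker A, timbre from speaker B.
# No paired target exists for this combination, so in this sketch the
# supervision comes only from the cycle closed in step 3.
converted = model(mel_a, mel_b)

# Step 3 (cycle): convert back with speaker A's timbre; the result should
# reconstruct the original utterance, enforcing cycle consistency.
loss_cycle = F.l1_loss(model(converted, mel_a), mel_a)

(loss_mono + loss_cycle).backward()
opt.step()
```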
Results and Evaluation
The paper reports comprehensive evaluations across three datasets: VCTK (English), Aishell-1 (Mandarin Chinese), and EMIME (bilingual English-Chinese). The authors conducted both subjective evaluations (nMOS and sMOS scores) and objective ones (WER, CER, and speaker similarity). MulliVC consistently outperformed baseline models such as FreeVC and ConsistencyVC on multiple metrics (a sketch of how such objective metrics are computed follows the list):
- Speaker Similarity (SIM): MulliVC achieved higher speaker-similarity scores, indicating superior timbre adaptation. For instance, its SIM score was 0.395 on the VCTK dataset, compared with 0.376 for FreeVC, illustrating its robustness in retaining speaker-specific characteristics.
- Intelligibility: the model showed substantial improvements, with a WER of 2.24 on the VCTK dataset, demonstrating that it preserves linguistic content effectively.
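For context, these two objective metrics are typically computed as sketched below. This is a minimal illustration, not the paper's evaluation code: `speaker_embed` is a hypothetical placeholder for a pre-trained speaker-verification model, the ASR transcription step behind WER is elided, and the jiwer package supplies the WER computation.

```python
import numpy as np
from jiwer import wer  # pip install jiwer


def speaker_embed(wav: np.ndarray) -> np.ndarray:
    """Placeholder: a real SIM score uses a pre-trained SV model's embedding."""
    # Deterministic dummy embedding so the sketch runs end to end.
    rng = np.random.default_rng(int(np.abs(wav).sum() * 1000) % (2**32))
    return rng.standard_normal(256)


def sim_score(converted: np.ndarray, target: np.ndarray) -> float:
    """Speaker similarity: cosine similarity between SV embeddings."""
    a, b = speaker_embed(converted), speaker_embed(target)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Intelligibility: WER between the source transcript and an ASR transcript
# of the converted audio (the ASR step itself is elided here).
reference = "please call stella"
hypothesis = "please call stellar"
print("WER:", wer(reference, hypothesis))  # 1 error over 3 words = 0.33

rng = np.random.default_rng(1)
print("SIM:", sim_score(rng.standard_normal(16000), rng.standard_normal(16000)))
```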
On unseen languages (French and German), MulliVC exhibited superior generalization, further substantiating the efficacy of the proposed cross-lingual training strategy.
Theoretical and Practical Implications
Theoretically, the cycle-consistency training strategy advances timbre disentanglement and cross-lingual adaptability. Practically, MulliVC's efficacy in zero-shot scenarios opens avenues for applications in diverse linguistic environments without requiring extensive paired datasets, with significant implications for multilingual communication systems, dubbing in the entertainment industry, and personalized speech synthesis.
Future Directions
Future research may expand the training corpora to cover more diverse languages and expressive speech, addressing current dataset limitations. Refining the content encoder to minimize leakage of prosody and timbre information could also yield cleaner conversions. Moreover, exploring domain-adaptation techniques to fine-tune pre-trained speaker-verification (SV) models could further improve their accuracy and, consequently, the overall performance of MulliVC.
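As one illustration of that last direction, a simple domain-adaptation recipe freezes a pre-trained speaker encoder's backbone and fine-tunes only its projection head on in-domain speech pairs. Everything below is a hypothetical sketch under that assumption (PretrainedSV is a stand-in, not a released checkpoint), not a method from the paper.

```python
import torch
import torch.nn as nn


class PretrainedSV(nn.Module):
    """Hypothetical stand-in for a pre-trained speaker-verification encoder."""

    def __init__(self, n_mels: int = 80, emb: int = 256):
        super().__init__()
        self.backbone = nn.GRU(n_mels, emb, batch_first=True)
        self.head = nn.Linear(emb, emb)

    def forward(self, mel):
        out, _ = self.backbone(mel)
        return self.head(out[:, -1])  # last frame -> utterance embedding


sv = PretrainedSV()
for p in sv.backbone.parameters():  # freeze the pre-trained backbone
    p.requires_grad = False

opt = torch.optim.Adam(sv.head.parameters(), lr=1e-5)

# Dummy adaptation batch: two utterances per speaker; same-speaker pairs
# should map to nearby embeddings (cosine-embedding objective).
mel1, mel2 = torch.randn(8, 100, 80), torch.randn(8, 100, 80)
same_speaker = torch.ones(8)  # 1 = same speaker, -1 = different
loss = nn.CosineEmbeddingLoss()(sv(mel1), sv(mel2), same_speaker)
loss.backward()
opt.step()
```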
Conclusion
MulliVC represents a noteworthy advance in multi-lingual voice conversion, using cycle consistency to disentangle timbre from content effectively. Its robust performance across datasets underscores both practical applicability and theoretical significance, marking a step forward in speech synthesis and conversion. The promising results not only reinforce the potential of carefully designed training strategies for overcoming data limitations but also set the stage for future work in AI-driven voice processing.