An Analysis of the Voice Conversion Challenge 2020: Advancements and Challenges
The Voice Conversion Challenge 2020 (VCC2020) offers unique insights into the progress and open problems of voice conversion (VC) technology by providing a common evaluation platform. The challenge compared VC systems on shared datasets across two tasks: intra-lingual semi-parallel and cross-lingual voice conversion. The evaluation yielded substantial findings on model performance, revealing clear gains in speaker similarity and naturalness while also highlighting the persistent difficulty of achieving human-like voice synthesis.
Methodology and Tasks
The VCC2020 was structured around two distinct tasks: intra-lingual semi-parallel VC, in which source and target speakers share a language and the training data are only partially parallel, and cross-lingual VC, in which conversion spans different languages with entirely nonparallel data. Submitted systems leveraged advanced deep learning approaches, including neural vocoders, encoder-decoder networks, and generative adversarial networks (GANs).
A significant innovation in this edition was the inclusion of cross-lingual VC, a formidable test because no parallel linguistic information exists between the source and target languages. The datasets included English speech for both tasks, plus Finnish, German, and Mandarin for cross-lingual VC, enabling exploration of VC across diverse linguistic contexts.
Evaluation and Results
The VCC2020 received 33 submissions spanning a wide array of methodologies, ranging from phonetic posteriorgram (PPG)-based methods to cascades of automatic speech recognition (ASR) and text-to-speech (TTS) systems. Systems were evaluated on naturalness and speaker similarity through perceptual listening tests conducted with both native and non-native English-speaking listeners.
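The ASR-TTS cascade mentioned above can be sketched schematically. Every function below is a hypothetical stub standing in for a trained ASR or TTS model; only the pipeline structure, not any real implementation, is illustrated:

```python
# Schematic of an ASR-TTS cascade for voice conversion.
# All components are toy stand-ins for trained models; the speaker ID
# "target_A" is an invented placeholder, not a VCC2020 identifier.

def recognize(source_audio: list[float]) -> str:
    """Stub ASR: maps source speech to a textual/linguistic representation."""
    return "hello world"  # placeholder transcript

def synthesize(text: str, target_speaker: str) -> list[float]:
    """Stub TTS: renders the text in the target speaker's voice."""
    return [0.0] * len(text)  # placeholder waveform

def convert(source_audio: list[float], target_speaker: str) -> list[float]:
    # The cascade discards the source speaker's identity by passing
    # through text, then re-imposes the target identity at synthesis time.
    transcript = recognize(source_audio)
    return synthesize(transcript, target_speaker)

converted = convert([0.1, -0.2, 0.05], target_speaker="target_A")
```

The design point this sketch captures is that going through a text bottleneck naturally removes source speaker characteristics, at the cost of discarding prosodic detail that PPG-based systems can retain.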
For intra-lingual VC, significant improvements were recorded in speaker similarity, with the best systems reaching human-level performance. Systems employing PPG-based and combined ASR-TTS approaches generally led this task. However, no system achieved human-like naturalness, indicating continued room for improving audio quality.
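A PPG itself is simply a per-frame posterior distribution over phone classes, intended as a speaker-independent representation of linguistic content. The following minimal sketch illustrates that structure; the phone set and the "acoustic model" producing logits are invented stand-ins, since a real PPG comes from a trained speaker-independent ASR acoustic model:

```python
import math

# Toy phone inventory (placeholder, not a real phone set).
PHONES = ["sil", "ah", "l", "ow"]

def softmax(logits: list[float]) -> list[float]:
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_logits(frame_index: int) -> list[float]:
    """Stub acoustic model: deterministic fake logits per frame."""
    return [math.sin(frame_index + k) for k in range(len(PHONES))]

def extract_ppg(num_frames: int) -> list[list[float]]:
    # One posterior distribution over phone classes per audio frame.
    return [softmax(frame_logits(t)) for t in range(num_frames)]

ppg = extract_ppg(3)  # 3 frames x 4 phone classes; each row sums to 1
```

Because the posteriors describe *what* is being said rather than *who* is saying it, a conversion model can map PPG frames to the target speaker's acoustics without parallel data, which is why this representation transfers to the cross-lingual setting.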
The cross-lingual VC task, as anticipated, proved more challenging. While no system matched human-level speaker similarity or naturalness, the top systems achieved MOS scores exceeding 4. The task also widened the spread in system performance, with PPG-based models proving notably stronger than ASR-TTS cascades in the cross-lingual setting.
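The MOS figures cited above come from 5-point listening tests, where each listener rates an utterance from 1 (bad) to 5 (excellent) and ratings are averaged per system. A minimal sketch of that aggregation, with a normal-approximation confidence interval (the ratings below are invented placeholders, not VCC2020 data):

```python
from math import sqrt
from statistics import mean, stdev

def mos_with_ci(ratings: list[int], z: float = 1.96) -> tuple[float, tuple[float, float]]:
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, (m - half_width, m + half_width)

# Hypothetical 1-5 naturalness ratings for one system's converted utterances.
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]
score, (lo, hi) = mos_with_ci(ratings)
print(f"MOS = {score:.2f} (95% CI: {lo:.2f}-{hi:.2f})")  # MOS = 4.20
```

Reporting the interval alongside the mean matters here: with small listener pools, two systems whose MOS differ by a few tenths of a point may not be reliably distinguishable.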
Implications and Future Directions
The outcomes of VCC2020 reinforce the notion that while VC technology has made great strides, especially in speaker similarity, naturalness on par with human speech remains elusive. PPG-based models have shown particular promise in cross-lingual contexts, highlighting a fruitful direction for VC research. Furthermore, the divergence between native and non-native listener judgments underscores the complexity of cross-lingual VC, underlining the need for more sophisticated linguistic feature mapping and synthesis techniques.
These insights should guide future research, particularly toward transformer models and more expressive neural architectures that better capture the nuances of voice characteristics across languages. The central role of subjective evaluation in assessing system performance also highlights the need for diverse, comprehensive datasets that account for linguistic variability.
In conclusion, the VCC2020 has played a crucial role in mapping the current capabilities of VC systems and exposing their remaining challenges. It provides a foundation for advancing VC technology toward natural, human-like voice conversion across diverse linguistic scenarios. Future work must prioritize overcoming the bottleneck in audio naturalness while expanding the applicability of VC to multilingual contexts.