Zero-shot Cross-lingual Voice Transfer for TTS (2409.13910v1)

Published 20 Sep 2024 in eess.AS and cs.SD

Abstract: In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multi-lingual Text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker-encoder that processes reference speech, a bottleneck layer, and residual adapters, connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and Speaker Similarity across languages. Using a single English reference speech per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss, impacting one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available - a valuable utility for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples, plus videos demonstrating voice restoration for dysarthric speakers are available here (google.github.io/tacotron/publications/zero_shot_voice_transfer).


Summary

  • The paper introduces a zero-shot voice transfer module that integrates with TTS systems to create high-fidelity, cross-lingual synthesized speech from a single reference sample.
  • It evaluates several bottleneck layers, including VAE, SharedGST, MultiGST, and SegmentGST, comparing their impact on speaker similarity and naturalness across languages.
  • The study demonstrates practical applications in voice restoration and enhanced accessibility, benefiting both typical and atypical speech users.

Zero-shot Cross-lingual Voice Transfer for TTS

"Zero-shot Cross-lingual Voice Transfer for TTS" by Biadsy et al. addresses the challenge of transferring an individual's voice to synthesized speech in multiple languages using a zero-shot learning approach. The research introduces a novel Voice Transfer (VT) module that can be seamlessly incorporated into existing multilingual Text-to-Speech (TTS) systems, enhancing the ability to produce high-quality, cross-lingual synthesized speech using only a short reference speech sample from an unseen speaker.

Key Contributions

The paper makes four primary contributions:

  1. VT Module Integration: The authors describe a zero-shot VT module designed for easy integration into state-of-the-art TTS systems. It requires only a single, brief reference utterance to produce high-fidelity voice transfer across languages.
  2. Cross-lingual Capability: The VT module transfers voices even when the reference speech and the target speech are in different languages.
  3. Bottleneck Layers: The paper introduces and evaluates several bottleneck layers within the VT module (VAE, SharedGST, MultiGST, and SegmentGST), measuring their impact on TTS quality and speaker similarity; see the sketch after this list.
  4. Voice Restoration: The model can restore voices for individuals with atypical speech, a potentially invaluable utility for people who have never had typical speech or did not bank their voice.
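
To make the bottleneck idea concrete, below is a minimal PyTorch sketch of a GST-style bottleneck: a speaker embedding attends over a small bank of learned style tokens, so the voice representation is confined to a compact learned space. The class name, dimensions, and token count are illustrative assumptions; the paper's SharedGST, MultiGST, and SegmentGST variants differ in how tokens are shared and where they are applied.

```python
import torch
import torch.nn as nn

class GSTBottleneck(nn.Module):
    """Minimal Global Style Token (GST) style bottleneck sketch.

    A speaker embedding attends over a small bank of learned style
    tokens; the output is a weighted combination of tokens, which
    confines the voice representation to a compact learned space.
    Token count and dimensions are illustrative, not the paper's.
    """

    def __init__(self, embed_dim: int = 256, num_tokens: int = 32):
        super().__init__()
        # Learned token bank shared across all speakers.
        self.tokens = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)
        self.query = nn.Linear(embed_dim, embed_dim)

    def forward(self, speaker_embed: torch.Tensor) -> torch.Tensor:
        # speaker_embed: (batch, embed_dim) from the speaker encoder.
        q = self.query(speaker_embed)                              # (B, D)
        scores = q @ self.tokens.t() / self.tokens.size(1) ** 0.5  # (B, T)
        weights = torch.softmax(scores, dim=-1)                    # attention over tokens
        return weights @ torch.tanh(self.tokens)                   # (B, D) bottlenecked embedding
```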

Methodology

The VT module comprises a speaker-encoder, a bottleneck layer, and residual adapters, interfacing with the TTS system's preexisting layers. The research team explored different bottleneck layer configurations to determine their impact on TTS quality. They utilized techniques such as the Variational Autoencoder (VAE) and Global Style Tokens (GST) to capture high-level voice representations, constrained within an embedding space to ensure consistent and high-quality outputs.
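
As a rough illustration of the adapter wiring, here is a minimal sketch of a residual adapter conditioned on the bottlenecked speaker embedding. The sizes, the concatenation-based conditioning, and the zero-initialized up-projection are assumptions for this sketch, not the paper's exact design; the key idea is that only the small adapter is trained while the preexisting TTS layers stay frozen.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Minimal residual adapter sketch.

    A small bottleneck MLP, conditioned on the speaker embedding, whose
    output is added to the hidden states of a frozen TTS layer. Only
    the adapter is trained, so the preexisting TTS model stays intact.
    """

    def __init__(self, hidden_dim: int = 512, speaker_dim: int = 256,
                 adapter_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim + speaker_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, hidden_dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, hidden_dim) TTS layer output; speaker: (B, speaker_dim).
        spk = speaker.unsqueeze(1).expand(-1, hidden.size(1), -1)
        delta = self.up(torch.relu(self.down(torch.cat([hidden, spk], dim=-1))))
        return hidden + delta            # residual connection into the frozen layer
```

Zero-initializing the up-projection means training begins from the unmodified TTS model, a common choice for adapter-style fine-tuning.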

Experimental Evaluation

The authors conducted extensive experimental evaluations, employing datasets from various sources, including the VCTK corpus for typical speech and the Euphonia corpus for atypical speech. The key metrics used for evaluation were Mean Opinion Score (MOS) for naturalness and human-judged speaker similarity.
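
The similarity scores in the paper come from human raters. A common automatic proxy in voice-transfer work, shown below as a hedged sketch, compares embeddings of the reference and synthesized audio produced by a pretrained speaker-verification model (which embedding model to use is an assumption left open here):

```python
import torch
import torch.nn.functional as F

def embedding_similarity(ref_embed: torch.Tensor, syn_embed: torch.Tensor) -> float:
    """Cosine similarity between speaker embeddings of the reference and
    the synthesized audio, both assumed to come from the same pretrained
    speaker-verification model. Note: the paper's 73% figure is
    human-judged similarity, not this automatic proxy."""
    return F.cosine_similarity(ref_embed, syn_embed, dim=-1).mean().item()
```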

Findings:

  • Typical Speech: Across nine languages, the SegmentGST bottleneck achieved the highest MOS (average 3.89) and a speaker similarity of 73%, indicating effective cross-lingual voice transfer.
  • Atypical Speech: For reference speech from speakers with dysarthria, the SharedGST and MultiGST bottleneck layers substantially improved intelligibility while keeping Word Error Rates (WER) low; WER is sketched below.
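
For reference, WER is the word-level edit distance between an ASR transcript of the synthesized speech and the reference text, normalized by the reference length; a self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER via word-level Levenshtein distance:
    (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```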

Implications

The research has both practical and theoretical implications:

  1. Practical Applications: The VT module can be utilized for applications in personal voice assistants, automated customer service, and accessibility tools for individuals with speech impairments. It allows for high-quality voice synthesis across languages, making technology more inclusive.
  2. Voice Restoration: The model's ability to restore voices for individuals with atypical speech offers significant benefits for those with degenerative diseases or congenital conditions affecting speech, granting them a sense of normalcy and improved communication capabilities.
  3. Theoretical Insights: The findings on bottleneck layer configurations contribute to the understanding of effective voice representation in zero-shot learning contexts, potentially guiding future developments in TTS and VC systems.

Future Directions

Building on these findings, future research could explore the following avenues:

  1. Enhanced Bottleneck Designs: Further refinement of bottleneck layers could enhance speaker similarity and naturalness, especially for atypical reference speech.
  2. Robustness Across More Languages: Expanding the system to support more languages, especially those with less available training data, could improve the inclusivity and usability of VT systems.
  3. Mitigating Misuse: Investigating and advancing audio watermarking techniques to reliably mark and detect synthesized speech is crucial for protecting individuals' identities and preventing fraudulent use; a toy illustration follows this list.
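
As a toy illustration of the watermarking idea only (real systems, including learned neural watermarks, are far more robust to compression and editing), a key-seeded spread-spectrum scheme adds a low-amplitude pseudo-random carrier that the key holder can later correlate against:

```python
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 1e-3) -> np.ndarray:
    """Add a key-seeded pseudo-random carrier to mono audio at low
    amplitude. Toy example only; not robust to re-encoding or editing."""
    rng = np.random.default_rng(key)
    carrier = rng.standard_normal(audio.shape[0])
    return audio + strength * carrier

def detect_watermark(audio: np.ndarray, key: int) -> float:
    """Normalized correlation with the key's carrier; values well above
    zero suggest the watermark is present."""
    rng = np.random.default_rng(key)
    carrier = rng.standard_normal(audio.shape[0])
    return float(audio @ carrier / (np.linalg.norm(audio) * np.linalg.norm(carrier) + 1e-12))
```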

In sum, Biadsy et al.'s work offers significant advancements in TTS technology, providing robust cross-lingual voice transfer capabilities and opening new possibilities for accessible and inclusive speech technology. The demonstrated potential in voice restoration underscores the societal value of such advancements, particularly for individuals with impaired or atypical speech.