VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis (2403.00529v1)
Abstract: Achieving nuanced and accurate emulation of the human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, mainstream speech synthesis models still rely on supervised speaker modeling and explicit reference utterances. However, many aspects of the human voice, such as emotion, intonation, and speaking style, are hard to label accurately. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice-editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. At inference time, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, exploring the latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions. We conduct extensive experiments to evaluate the proposed VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than previous approaches. We also show that latent-space manipulation produces consistent and human-identifiable effects that do not degrade speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: \url{https://bit.ly/VoxGenesis}.
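The two inference-time operations the abstract describes — sampling a novel speaker from the Gaussian prior, and editing a voice by moving its latent code along a discovered direction — can be sketched as follows. This is an illustrative sketch only: the latent dimensionality, the `pitch_direction` vector, and the edit strength are assumptions for demonstration, not the authors' implementation, and the generator that maps the code to a waveform is omitted.

```python
import numpy as np

LATENT_DIM = 256  # assumed speaker-latent dimensionality (illustrative)
rng = np.random.default_rng(0)

# 1. Create a novel speaker: draw a latent code from the Gaussian prior.
z = rng.standard_normal(LATENT_DIM)

# 2. Edit the voice: shift the code along an interpretable direction.
#    `pitch_direction` is a stand-in for a direction found by the paper's
#    unsupervised latent-space analysis (e.g. one controlling pitch).
pitch_direction = rng.standard_normal(LATENT_DIM)
pitch_direction /= np.linalg.norm(pitch_direction)  # unit-norm direction

alpha = 2.0  # edit strength: larger alpha -> stronger attribute shift
z_edited = z + alpha * pitch_direction

# Both codes would then be fed to the conditional generator (not shown)
# together with semantic tokens to synthesize speech.
print(z.shape, z_edited.shape)
```

Because the edit is a small translation in latent space, the edited code remains near the prior's support, which is consistent with the paper's observation that such manipulations shift a targeted attribute without degrading speech quality.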
Authors: Weiwei Lin, Chenhang He, Man-Wai Mak, Jiachen Lian, Kong Aik Lee