
VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis (2403.00529v1)

Published 1 Mar 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Achieving nuanced and accurate emulation of human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, the mainstream of speech synthesis models still relies on supervised speaker modeling and explicit reference utterances. However, there are many aspects of human voice, such as emotion, intonation, and speaking style, for which it is hard to obtain accurate labels. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. During the inference, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, the exploration of latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender attributes, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions. We conduct extensive experiments to evaluate the proposed VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than the previous approaches. We also show that latent space manipulation produces consistent and human-identifiable effects that are not detrimental to the speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: \url{https://bit.ly/VoxGenesis}.

Authors (5)
  1. Weiwei Lin
  2. Chenhang He
  3. Man-Wai Mak
  4. Jiachen Lian
  5. Kong Aik Lee

Summary

  • The paper demonstrates a novel unsupervised GAN-based framework that generates diverse speaker voices without relying on supervised speaker models.
  • It unveils an editable latent space using PCA, enabling precise adjustments of voice attributes such as pitch, tone, and gender.
  • Extensive experiments show that VoxGenesis outperforms existing models in speaker diversity, fidelity, and naturalness, with direct applications to TTS and voice conversion.

Unveiling VoxGenesis: A Novel Framework for Unsupervised Speech Synthesis and Voice Editing

Introducing VoxGenesis

VoxGenesis represents a significant shift in the landscape of speech synthesis and voice generation. By learning to transform a Gaussian distribution into a diverse speech distribution conditioned on semantic tokens, VoxGenesis steps away from conventional deterministic speaker modeling. The framework enables not only the generation of new speakers with distinct characteristics but also the identification and manipulation of semantically meaningful latent directions corresponding to specific speaker attributes.
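
To make this concrete, the following is a minimal sketch of the conditioning scheme in PyTorch. The ToyGenerator class, its architecture, and all dimensions are illustrative assumptions for exposition, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Toy stand-in for the VoxGenesis generator: it fuses a Gaussian speaker
    latent z with a sequence of semantic tokens (the content) and emits a
    waveform. Architecture and dimensions are illustrative only."""
    def __init__(self, num_tokens=100, token_dim=128, latent_dim=256, hop=320):
        super().__init__()
        self.embed = nn.Embedding(num_tokens, token_dim)      # content embedding
        self.speaker_proj = nn.Linear(latent_dim, token_dim)  # speaker conditioning
        self.to_wave = nn.Linear(token_dim, hop)              # frames -> samples

    def forward(self, tokens, z):
        h = self.embed(tokens) + self.speaker_proj(z).unsqueeze(1)  # add speaker code to every frame
        return self.to_wave(torch.tanh(h)).flatten(1)               # (batch, frames * hop)

gen = ToyGenerator()
tokens = torch.randint(0, 100, (1, 200))  # semantic tokens from a real utterance
z = torch.randn(1, 256)                   # z ~ N(0, I): sampling yields a novel speaker
waveform = gen(tokens, z)                 # same content rendered in a new voice
```

Because the speaker code is sampled rather than extracted from a reference utterance, every fresh draw of z corresponds to a voice the model has never been shown.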

Key Contributions

VoxGenesis's design is distinctive and multifaceted:

  • Generative Framework for Voice: At its core, VoxGenesis leverages a conditional Generative Adversarial Network (GAN) framework, circumventing the need for supervised speaker modeling and setting the stage for unsupervised discovery of speaker manifolds.
  • Editable Latent Space: By applying Principal Component Analysis (PCA) to the latent representations, VoxGenesis uncovers interpretable latent directions. This allows nuanced voice editing across attributes such as pitch, tone, and gender characteristics without compromising speaker identity (a minimal sketch of this procedure follows this list).
  • Versatility in Application: Beyond speaker generation, VoxGenesis performs strongly in voice conversion and multi-speaker Text-to-Speech (TTS) tasks, preserving speaker fidelity and speech naturalness.
  • Extensibility with Speaker Encoders: The framework is compatible with a Gaussian-constrained Neural Factor Analysis (NFA) encoder and Gaussian-constrained discriminative speaker encoders, enabling both novel speaker generation and specific speaker encoding.
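
The editable-latent-space idea above can be sketched in a few lines. The component index, edit strength, and the use of random stand-in latents are assumptions; the paper identifies which directions are meaningful empirically:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for latent speaker codes collected from the model; in practice
# these would be the codes VoxGenesis samples or infers for many utterances.
latents = np.random.randn(10000, 256)

pca = PCA(n_components=10)
pca.fit(latents)

# Each principal component is a candidate editing direction. Which component
# controls pitch, gender, etc. is identified empirically (e.g., by listening).
direction = pca.components_[0]    # hypothetically, a direction affecting pitch

z = np.random.randn(256)          # a sampled (or encoded) speaker code
alpha = 2.0                       # hand-tuned edit strength
z_edited = z + alpha * direction  # shift one attribute; identity is largely preserved
```

Editing in this way requires no attribute labels at any point: the directions emerge from the statistics of the learned latent space itself.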

Experimental Findings

VoxGenesis was methodically evaluated across multiple dimensions:

  • Speaker Generation: It consistently outperforms comparative models such as TacoSpawn in FID score, speaker diversity, and speech naturalness, demonstrating its efficacy in generating high-quality, diverse voices (a short sketch of the FID computation follows this list).
  • Editable Latent Space: Experimental evaluations highlight the framework's ability to manipulate latent space for voice editing with minimal impact on speech quality and speaker recognizability. This editable latent space proves consistent for both internal and external speaker representations.
  • Application Performance: In voice conversion and multi-speaker TTS tasks, VoxGenesis demonstrates notable success, particularly when coupled with the NFA encoder, excelling in speaker fidelity and overall speech quality.
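
For reference, the FID metric mentioned above compares Gaussian fits to two sets of embeddings. Below is a standard, self-contained sketch; the embedding dimensionality and random stand-in data are assumptions, and in practice the embeddings would come from a pretrained speaker encoder applied to real and generated utterances:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb, gen_emb):
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)),
    computed between Gaussian fits to the two embedding sets."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# Stand-in data; real usage would extract speaker embeddings (e.g., x-vectors)
# from genuine and generated utterances before comparing them.
real = np.random.randn(500, 192)
fake = np.random.randn(500, 192) + 0.1
print(frechet_distance(real, fake))  # lower = generated speakers closer to real ones
```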

Theoretical and Practical Implications

VoxGenesis marks a shift in how speech synthesis can be approached, moving toward unsupervised learning and editable voice generation. This has implications not just for the development of more advanced and customizable speech synthesis systems but also for our understanding of the interplay between linguistic content and speaker-specific vocal attributes. Moreover, the framework’s ability to generate and manipulate voices without direct supervision opens new avenues in voice interaction technologies, personalized voice synthesis, and beyond.

Future Directions

While VoxGenesis represents a substantial leap forward, it also opens the door to further exploration. Future work could involve refining the latent space for even more granular control over speaker attributes, expanding the framework to encompass emotional expressivity explicitly, or integrating VoxGenesis with larger, more heterogeneous datasets to explore its scalability and versatility further.

In conclusion, VoxGenesis not only advances the field of speech synthesis by providing a novel unsupervised framework for speaker generation and voice editing but also deepens our understanding of the latent representations of speech, paving the way for exciting developments in voice technology.
