VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis (2403.00529v1)
Abstract: Achieving nuanced and accurate emulation of the human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, mainstream speech synthesis models still rely on supervised speaker modeling and explicit reference utterances. However, many aspects of the human voice, such as emotion, intonation, and speaking style, are hard to label accurately. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice-editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. At inference time, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, exploring the latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions. We conduct extensive experiments to evaluate the proposed VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than previous approaches. We also show that latent-space manipulation produces consistent and human-identifiable effects that do not degrade speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: \url{https://bit.ly/VoxGenesis}.
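The two inference-time operations the abstract describes — sampling a novel speaker from the Gaussian prior, and editing a voice by moving its latent code along a discovered direction — can be sketched as follows. This is an illustrative sketch only: the latent dimensionality, the `pitch_direction` vector, and the edit strength are assumptions for demonstration, not the authors' implementation, and the generator that maps the code to a waveform is omitted.

```python
import numpy as np

LATENT_DIM = 256  # assumed speaker-latent dimensionality (illustrative)
rng = np.random.default_rng(0)

# 1. Create a novel speaker: draw a latent code from the Gaussian prior.
z = rng.standard_normal(LATENT_DIM)

# 2. Edit the voice: shift the code along an interpretable direction.
#    `pitch_direction` is a stand-in for a direction found by the paper's
#    unsupervised latent-space analysis (e.g. one controlling pitch).
pitch_direction = rng.standard_normal(LATENT_DIM)
pitch_direction /= np.linalg.norm(pitch_direction)  # unit-norm direction

alpha = 2.0  # edit strength: larger alpha -> stronger attribute shift
z_edited = z + alpha * pitch_direction

# Both codes would then be fed to the conditional generator (not shown)
# together with semantic tokens to synthesize speech.
print(z.shape, z_edited.shape)
```

Because the edit is a small translation in latent space, the edited code remains near the prior's support, which is consistent with the paper's observation that such manipulations shift a targeted attribute without degrading speech quality.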
Authors: Weiwei Lin, Chenhang He, Man-Wai Mak, Jiachen Lian, Kong Aik Lee