
VoiceLens: Controllable Speaker Generation and Editing with Flow (2309.14094v1)

Published 25 Sep 2023 in cs.SD and eess.AS

Abstract: Currently, many multi-speaker speech synthesis and voice conversion systems address speaker variation with an embedding vector. Modeling this embedding space directly allows new voices outside the training data to be synthesized. GMM-based approaches such as TacoSpawn are favored in the literature for this generation task, but they still face limitations when difficult conditionings are involved. In this paper, we propose VoiceLens, a semi-supervised flow-based approach that models speaker embedding distributions for multi-conditional speaker generation. VoiceLens maps speaker embeddings into a combination of independent attributes and residual information. It allows new voices associated with certain attributes to be *generated* for existing TTS models, and attributes of known voices to be meaningfully *edited*. We show that VoiceLens matches TacoSpawn's unconditional generation capacity while offering higher controllability and flexibility when used conditionally. In addition, we show that less noisy speech can be synthesized from known noisy speakers, without re-training the TTS model, solely by editing their embeddings with an SNR-conditioned VoiceLens model. Demos are available at sos1sos2sixteen.github.io/voicelens.
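The abstract's core workflow can be illustrated with a toy sketch: an invertible map sends a speaker embedding to (attributes, residual), an attribute is edited, and the inverse map yields the modified embedding. Note this is a hypothetical stand-in, not the paper's trained model: the real VoiceLens uses a learned normalizing flow, whereas this sketch substitutes a fixed affine transform, and all names (`to_latent`, `to_embedding`, the attribute layout) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy invertible "flow": z = W @ e + b, inverted by e = W^{-1} (z - b).
# The first n_attrs latent dimensions stand in for controllable attributes
# (e.g. an SNR-like factor); the rest carry residual speaker information.
dim, n_attrs = 8, 2
W = rng.standard_normal((dim, dim)) + dim * np.eye(dim)  # well-conditioned
b = rng.standard_normal(dim)

def to_latent(e):
    """Map a speaker embedding to (attributes, residual)."""
    z = W @ e + b
    return z[:n_attrs], z[n_attrs:]

def to_embedding(attrs, residual):
    """Invert the map: (attributes, residual) -> speaker embedding."""
    z = np.concatenate([attrs, residual])
    return np.linalg.solve(W, z - b)

speaker = rng.standard_normal(dim)       # embedding of a known speaker
attrs, residual = to_latent(speaker)

attrs_edited = attrs.copy()
attrs_edited[0] = 2.0                    # push one attribute to a target value
edited = to_embedding(attrs_edited, residual)

# An unedited round trip recovers the original embedding exactly
# (up to numerical error), since the map is invertible.
assert np.allclose(to_embedding(attrs, residual), speaker)
```

In the paper's setting, `edited` would be fed to an existing TTS model unchanged, which is what allows, for example, cleaner speech from a noisy speaker without retraining.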

Authors (2)
  1. Yao Shi (14 papers)
  2. Ming Li (787 papers)
Citations (1)
