OpenVoice: Versatile Instant Voice Cloning (2312.01479v6)

Published 3 Dec 2023 in cs.SD, cs.LG, and eess.AS

Abstract: We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from and constrained by the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require an extensive massive-speaker multi-lingual (MSML) dataset for all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer even inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results in our demo website. OpenVoice has been used by more than 2M users worldwide as the voice engine of MyShell.ai.

Summary

  • The paper introduces a method for replicating voices instantly from brief audio samples with granular control over style, emotion, accent, and intonation.
  • It demonstrates zero-shot cross-lingual voice cloning, enabling speech synthesis in languages absent from the massive-speaker training set.
  • The research offers a real-time, accessible solution with publicly available source code, driving further studies in speech synthesis and customization.

Introduction to OpenVoice

OpenVoice is an instant voice cloning method: it replicates a person's voice from a brief audio sample without any additional training specific to that speaker. This has implications for a range of applications, from media content creation to personalized chatbots and human-computer interfaces. Unlike previous methods, OpenVoice offers granular control over speech style, including emotion, accent, rhythm, pauses, and intonation, while also cloning the tone color of the reference speaker. It also works across languages, including languages not present in the massive-speaker training set.

Versatility and Control

The method allows flexible manipulation of voice style parameters beyond tone color, which is typically the main focus of other voice cloning techniques. Because the styles are not copied from, or constrained by, the reference speaker, users can customize the generated voice for nuanced and natural speech. This level of control goes beyond monotonous reading of text and matters for conversational and context-specific audio output.

Cross-Lingual Capabilities

A standout feature of OpenVoice is zero-shot cross-lingual voice cloning for languages not covered by the massive-speaker training set. Where conventional approaches typically require an extensive massive-speaker multi-lingual (MSML) dataset for every supported language, OpenVoice can transfer a cloned voice into a new language without any massive-speaker training data for that language. This makes OpenVoice well suited to global applications where language barriers are common.

Real-time Performance and Accessibility

Beyond its style and language capabilities, OpenVoice is designed for high-speed performance suitable for commercial production environments, without compromising quality. According to the abstract, it costs tens of times less than commercially available APIs. Reflecting its real-world viability, an internal version of OpenVoice has already been deployed extensively, processing tens of millions of user interactions. The research team has also made the source code and trained model publicly available, encouraging follow-up studies and technological advancements.

Conclusion

OpenVoice represents a significant step forward in instant voice cloning. It decouples voice components such as tone color, style, and language, which lets users modulate voice attributes independently and facilitates speech generation in multiple languages. The approach simplifies the integration of new languages and voice styles, broadening the scope of voice cloning technology. With its pre-trained models and publicly accessible source code, OpenVoice invites further exploration and development in speech synthesis.
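
The decoupling described above can be pictured with a small toy sketch. This is not the OpenVoice API and it produces no audio: every name here (StyleParams, SpeechRequest, base_speech, convert_tone_color, synthesize) is hypothetical, and the two-stage split is an assumption made for illustration. It only shows the idea that style and language are chosen independently of the reference speaker, whose clip contributes tone color alone.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StyleParams:
    """Hypothetical container for the style attributes listed in the summary."""
    emotion: str = "neutral"
    accent: str = "default"
    rhythm: float = 1.0      # assumed relative multiplier, illustrative only
    pauses: float = 1.0      # assumed relative multiplier, illustrative only
    intonation: float = 1.0  # assumed relative multiplier, illustrative only


@dataclass(frozen=True)
class SpeechRequest:
    """Hypothetical request: each voice component is a separate field."""
    text: str
    language: str
    style: StyleParams
    tone_color_ref: str  # identifier of the short reference clip


def base_speech(text: str, language: str, style: StyleParams) -> dict:
    """Assumed stage 1: fix language and style with no reference speaker involved."""
    return {"text": text, "language": language, "style": style, "tone_color": "base"}


def convert_tone_color(speech: dict, reference_clip: str) -> dict:
    """Assumed stage 2: change only the tone color, leaving all other fields intact."""
    return {**speech, "tone_color": reference_clip}


def synthesize(req: SpeechRequest) -> dict:
    """Compose the two assumed stages into one description of the output."""
    staged = base_speech(req.text, req.language, req.style)
    return convert_tone_color(staged, req.tone_color_ref)


request = SpeechRequest("Hello there", "en", StyleParams(emotion="cheerful"), "ref_clip.wav")
english = synthesize(request)
# Switching language leaves tone color and style untouched in this sketch.
japanese = synthesize(replace(request, language="ja"))
```

Because each component is a separate field, changing one (here the language) does not require touching the others, which is the property the summary attributes to OpenVoice.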

HackerNews

  1. OpenVoice: Versatile Instant Voice Cloning (397 points, 190 comments)
  2. OpenVoice: Instant Voice Cloning (268 points, 152 comments)
  3. OpenVoice V2 Released (3 points, 0 comments)