Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion (2308.06382v2)

Published 11 Aug 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method, Phoneme Hallucinator, that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g., 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Objective and subjective evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity.
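
The abstract describes a two-stage pipeline: first expand the handful of speech frames available from a roughly 3-second target clip into a much larger set of target-speaker frames (set expansion), then convert by replacing each source frame with its nearest neighbors in that expanded set. The sketch below illustrates only the neighbor-matching stage and is not the authors' code; `hallucinate_phonemes` is a hypothetical placeholder for the paper's set-expansion model, and the content-feature extractor and vocoder (e.g., a self-supervised encoder such as WavLM and a neural vocoder such as HiFi-GAN) are assumed to exist outside the function.

```python
# Minimal sketch of neighbor-based voice conversion over an expanded target set.
# Assumptions: features are (frames, dim) arrays from a speech encoder;
# `hallucinate_phonemes` is a hypothetical stand-in for the set-expansion model.
import numpy as np


def knn_voice_conversion(source_feats: np.ndarray,
                         target_feats: np.ndarray,
                         hallucinate_phonemes,
                         k: int = 4) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest target frames.

    source_feats: (T_src, D) content features of the source utterance.
    target_feats: (T_tgt, D) features from a short (~3 s) target-speaker clip.
    hallucinate_phonemes: callable expanding the small target set into a larger,
        more diverse set of target-speaker frames (the set-expansion step).
    Returns converted features of shape (T_src, D); a vocoder would then map
    them back to a waveform.
    """
    # Expand the sparse target set so that rare phonemes have plausible neighbors.
    expanded = hallucinate_phonemes(target_feats)            # (N, D), N >> T_tgt

    # Cosine similarity between every source frame and every expanded target frame.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = expanded / np.linalg.norm(expanded, axis=1, keepdims=True)
    sim = src @ tgt.T                                         # (T_src, N)

    # Average the k most similar target frames for each source frame.
    idx = np.argsort(-sim, axis=1)[:, :k]                     # (T_src, k)
    converted = expanded[idx].mean(axis=1)                    # (T_src, D)
    return converted
```

The expansion step is what resolves the dilemma the abstract points to: a 3-second clip rarely covers the full phoneme inventory, so without hallucinated frames many source frames have no good match and intelligibility suffers, while with them the neighbor search can preserve both intelligibility and speaker similarity.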

Authors (4)
  1. Siyuan Shan (10 papers)
  2. Yang Li (1142 papers)
  3. Amartya Banerjee (4 papers)
  4. Junier B. Oliva (27 papers)
Citations (3)
