Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion (2308.06382v2)
Abstract: Voice conversion (VC) aims to alter a person's voice so that it sounds like another person's voice while preserving the linguistic content. Existing methods face a dilemma between content intelligibility and speaker similarity: methods with higher intelligibility usually have lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose *Phoneme Hallucinator*, a novel method that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it uses a novel set-expansion model to hallucinate diversified, high-fidelity target-speaker phoneme representations from only a short target speaker recording (e.g., 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Objective and subjective evaluations show that *Phoneme Hallucinator* outperforms existing VC methods in both intelligibility and speaker similarity.
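To make the "neighbor-based voice conversion" step concrete, here is a minimal sketch of kNN matching over pre-extracted self-supervised speech features (e.g., WavLM frames). It assumes the target-speaker frame set has already been expanded by the hallucination model; the function and variable names are illustrative, not the authors' implementation.

```python
# Minimal sketch of neighbor-based conversion, assuming pre-extracted
# speech features (e.g. WavLM frames). Names here are hypothetical.
import numpy as np

def knn_convert(source_feats: np.ndarray,
                target_feats: np.ndarray,
                k: int = 4) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest
    target-speaker frames under cosine similarity.

    source_feats: (T_src, D) frames of the utterance to convert.
    target_feats: (T_tgt, D) target-speaker frames; in Phoneme Hallucinator
        this set is expanded from a ~3 s sample before matching.
    """
    # L2-normalize so dot products equal cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sims = src @ tgt.T                          # (T_src, T_tgt) similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]   # indices of k best matches
    # Average the matched target frames; a neural vocoder would then
    # synthesize a waveform from these converted frames.
    return target_feats[nn_idx].mean(axis=1)
```

The intuition is that a larger, more diverse target frame set gives every source frame a close phonetic match, which is exactly what the hallucination step is meant to provide when only a few seconds of target speech are available.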
Authors: Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva