SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention (2312.08676v2)
Abstract: Zero-shot voice conversion (VC) aims to convert the timbre of source speech to that of an arbitrary unseen target speaker while keeping the linguistic content unchanged. Although the voice of the generated speech can be controlled by providing a speaker embedding of the target speaker, the speaker similarity of converted speech still lags behind that of ground-truth recordings. In this paper, we propose SEF-VC, a speaker-embedding-free voice conversion model that learns and incorporates speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism and then reconstructs the waveform from HuBERT semantic tokens in a non-autoregressive manner. The concise design of SEF-VC enhances its training stability and voice conversion performance. Objective and subjective evaluations demonstrate that SEF-VC generates high-quality speech with higher similarity to the target reference than strong zero-shot VC baselines, even for very short reference utterances.
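The mechanism described above is a cross-attention layer in which embedded HuBERT content tokens act as queries and frame-level features of the reference speech act as keys and values, with no positional encoding applied on the reference side. The following is a minimal PyTorch sketch of that idea; the module name, feature dimension, and head count are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PositionAgnosticCrossAttention(nn.Module):
    """Content frames (queries) attend to reference frames (keys/values).

    No positional encoding is added to the reference side, so the attended
    timbre depends only on the content of the reference frames, not on
    their order or number. Hyperparameters below are illustrative.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, content: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # content:   (B, T_content, d_model) embedded HuBERT semantic tokens
        # reference: (B, T_ref, d_model)     frame-level reference-speech features
        timbre, _ = self.attn(query=content, key=reference, value=reference)
        # Residual connection preserves the linguistic content while injecting timbre.
        return self.norm(content + timbre)


if __name__ == "__main__":
    layer = PositionAgnosticCrossAttention()
    content = torch.randn(2, 100, 256)    # e.g. 100 embedded content tokens
    reference = torch.randn(2, 150, 256)  # e.g. 150 reference frames
    print(layer(content, reference).shape)  # torch.Size([2, 100, 256])
```

Because attention is permutation-invariant over its keys and values, omitting positional encodings on the reference side makes the extracted timbre insensitive to the order and length of the reference frames, which is consistent with the paper's claim that the model remains effective even for very short references.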