VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics (2010.02977v3)
Abstract: In this paper, we propose VoiceGrad, a non-parallel any-to-many voice conversion (VC) method. Inspired by WaveGrad, a recently introduced waveform generation method, VoiceGrad is built upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator: a fully convolutional network with a U-Net structure that predicts the gradient of the log density of the speech feature sequences of multiple speakers. At conversion time, it performs VC through annealed Langevin dynamics, using the trained score approximator to iteratively update an input feature sequence toward the nearest stationary point of the target distribution. Owing to this formulation, VoiceGrad enables any-to-many VC, a scenario in which the speaker of the input speech can be arbitrary, and supports non-parallel training, requiring neither parallel utterances nor transcriptions.
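The abstract compresses two mechanisms worth making concrete: a weighted denoising score matching objective for training the score approximator, and annealed Langevin dynamics for conversion. Below is a minimal PyTorch sketch of both, not the authors' implementation: the score-network interface `score_net(x, sigma_idx, speaker)`, the 10-level geometric noise schedule, and the step-size parameters are assumptions made purely for illustration, following the generic formulation of Song and Ermon rather than the paper's exact configuration.

```python
# Illustrative sketch only; `score_net`, the noise schedule, and all
# hyperparameters are assumptions, not the paper's actual settings.
import torch

# Assumed geometric noise schedule: sigma_1 = 1.0 down to sigma_10 = 0.01.
sigmas = torch.logspace(0.0, -2.0, steps=10)

def dsm_loss(score_net, x, speaker):
    """Weighted denoising score matching over a batch of feature sequences x.

    With weights lambda(sigma) = sigma^2, the objective reduces to matching
    sigma * score to the negated injected noise z.
    """
    idx = torch.randint(len(sigmas), (x.shape[0],))
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))
    z = torch.randn_like(x)
    x_noisy = x + sigma * z                    # perturb the clean features
    score = score_net(x_noisy, idx, speaker)   # predicted grad of log density
    return 0.5 * ((sigma * score + z) ** 2).flatten(1).sum(dim=1).mean()

@torch.no_grad()
def annealed_langevin_vc(score_net, x_src, target_speaker,
                         eps=2e-5, steps_per_level=100):
    """Convert source features by annealed Langevin dynamics.

    The chain starts from the source speaker's features rather than from
    noise, so the updates move x toward a nearby stationary point of the
    target speaker's feature distribution.
    """
    x = x_src.clone()
    for i, sigma in enumerate(sigmas):
        alpha = eps * (sigma / sigmas[-1]) ** 2   # per-level step size
        idx = torch.full((x.shape[0],), i, dtype=torch.long)
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            x = (x + 0.5 * alpha * score_net(x, idx, target_speaker)
                 + alpha.sqrt() * z)
    return x
```

Initializing the Langevin chain at the source speaker's features rather than at pure noise is what the abstract means by updating the input toward the nearest stationary point of the target distribution: since only the target speakers' distributions need to be modeled, the source speaker can be arbitrary (any-to-many) and training requires no parallel data.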
- A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1998, pp. 285–288.
- A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Communication, vol. 49, no. 9, pp. 743–759, 2007.
- K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Communication, vol. 54, no. 1, pp. 134–146, 2012.
- Z. Inanoglu and S. Young, “Data-driven emotion conversion in spoken English,” Speech Communication, vol. 51, no. 3, pp. 268–283, 2009.
- O. Türk and M. Schröder, “Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 965–973, 2010.
- T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, pp. 2505–2517, 2012.
- P. Jax and P. Vary, “Artificial bandwidth extension of speech signals using MMSE estimation based on a hidden Markov model,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003, pp. 680–683.
- D. Felps, H. Bortfeld, and R. Gutierrez-Osuna, “Foreign accent conversion in computer assisted pronunciation training,” Speech Communication, vol. 51, no. 10, pp. 920–932, 2009.
- D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. International Conference on Learning Representations (ICLR), 2014.
- D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in Adv. Neural Information Processing Systems (NIPS), 2014, pp. 3581–3589.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Adv. Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
- L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” in Proc. International Conference on Learning Representations (ICLR), 2015.
- L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” in Proc. International Conference on Learning Representations (ICLR), 2017.
- D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in Adv. Neural Information Processing Systems (NeurIPS), 2018, pp. 10215–10224.
- C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2016, pp. 1–6.
- C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 3364–3368.
- A. van den Oord and O. Vinyals, “Neural discrete representation learning,” in Adv. Neural Information Processing Systems (NIPS), 2017, pp. 6309–6318.
- W.-C. Huang, H.-T. Hwang, Y.-H. Peng, Y. Tsao, and H.-M. Wang, “Voice conversion based on cross-domain features using variational auto encoders,” in Proc. International Symposium on Chinese Spoken Language Processing (ISCSLP), 2018, pp. 165–169.
- Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5274–5278.
- H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, 2019.
- K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “AutoVC: Zero-shot voice style transfer with only autoencoder loss,” in Proc. International Conference on Machine Learning (ICML), 2019, pp. 5210–5219.
- L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 4879–4883.
- T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,” in Proc. European Signal Processing Conference (EUSIPCO), 2018, pp. 2100–2104.
- T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 6820–6824.
- J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. International Conference on Computer Vision (ICCV), 2017, pp. 2223–2232.
- T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proc. International Conference on Machine Learning (ICML), 2017, pp. 1857–1865.
- Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. International Conference on Computer Vision (ICCV), 2017, pp. 2849–2857.
- P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, “Non-parallel voice conversion with cyclic variational autoencoder,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 674–678.
- H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,” in Proc. IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 266–273.
- T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 679–683.
- H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “Nonparallel voice conversion with augmented classifier star generative adversarial networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2982–2995, 2020.
- Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv:1711.09020 [cs.CV], Nov. 2017.
- J. Serrà, S. Pascual, and C. Segura, “Blow: A single-scale hyperconditioned flow for non-parallel raw-audio voice conversion,” arXiv:1906.00794 [cs.LG], Jun. 2019.
- K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 6805–6809.
- H. Kameoka, K. Tanaka, D. Kwaśny, and N. Hojo, “ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1849–1863, 2020.
- W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” arXiv:1912.06813 [eess.AS], Dec. 2019.
- H. Kameoka, W.-C. Huang, K. Tanaka, T. Kaneko, N. Hojo, and T. Toda, “Many-to-many voice transformer network,” arXiv:2005.08445 [eess.AS], 2020.
- H. Zheng, W. Cai, T. Zhou, S. Zhang, and M. Li, “Text-independent voice conversion using deep neural network based phonetic level features,” in Proc. International Conference on Pattern Recognition (ICPR), 2016, pp. 2872–2877.
- L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2016, pp. 1–6.
- H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, “Voice conversion using sequence-to-sequence learning of context posterior probabilities,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 1268–1272.
- L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, “WaveNet vocoder with limited training data for voice conversion,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2018, pp. 1983–1987.
- S. Liu, J. Zhong, L. Sun, X. Wu, X. Liu, and H. Meng, “Voice conversion across arbitrary speakers based on a single target-speaker utterance,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2018, pp. 496–500.
- J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations,” arXiv:1906.10508 [eess.AS], 2019.
- J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Recognition-synthesis based non-parallel voice conversion with adversarial learning,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2020, pp. 771–775.
- S. Liu, Y. Cao, D. Wang, X. Wu, X. Liu, and H. Meng, “Any-to-many voice conversion with location-relative sequence-to-sequence modeling,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1717–1728, 2021.
- Y. Zhao, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kinnunen, Z. Ling, and T. Toda, “Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion,” in Proc. Joint workshop for the Blizzard Challenge and Voice Conversion Challenge, 2020, pp. 80–98.
- Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” in Adv. Neural Information Processing Systems (NeurIPS), 2019, pp. 11918–11930.
- Y. Song and S. Ermon, “Improved techniques for training score-based generative models,” arXiv:2006.09011 [cs.LG], 2020.
- J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Adv. Neural Information Processing Systems (NeurIPS), 2020, pp. 6840–6851.
- A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in Proc. International Conference on Machine Learning (ICML), 2021, pp. 8162–8171.
- A. Hyvärinen, “Estimation of non-normalized statistical models using score matching,” Journal of Machine Learning Research, vol. 6, pp. 695–709, 2005.
- P. Vincent, “A connection between score matching and denoising autoencoders,” Neural Computation, vol. 23, no. 7, pp. 1661–1674, 2011.
- N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” arXiv:2009.00713 [eess.AS], 2020.
- S. Liu, Y. Cao, D. Su, and H. Meng, “DiffSVC: A diffusion probabilistic model for singing voice conversion,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 741–748.
- V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, M. Kudinov, and J. Wei, “Diffusion-based voice conversion with fast maximum likelihood sampling scheme,” in Proc. International Conference on Learning Representations (ICLR), 2022.
- H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and S. Seki, “VoiceGrad: Non-parallel any-to-many voice conversion with annealed Langevin dynamics,” arXiv:2010.02977 [cs.SD], 2020.
- J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Adv. Neural Information Processing Systems (NeurIPS), 2020, pp. 17022–17033.
- S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 4835–4839.
- O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” arXiv:1505.04597 [cs.CV], 2015.
- Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. International Conference on Machine Learning (ICML), 2017, pp. 933–941.
- X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
- T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Adv. Neural Information Processing Systems (NIPS), 2016.
- J. Kominek and A. W. Black, “The CMU Arctic speech databases,” in Proc. ISCA Speech Synthesis Workshop (SSW), 2004, pp. 223–224.
- https://github.com/auspicious3000/autovc.
- https://github.com/liusongxiang/ppg-vc.
- https://github.com/kamepong/StarGAN-VC.
- M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” in Proc. International Conference on Learning Representations (ICLR), 2017.
- I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” in Adv. Neural Information Processing Systems (NIPS), 2017, pp. 5769–5779.
- Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/stargan-vc2/.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. International Conference on Learning Representations (ICLR), 2015.
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Adv. Neural Information Processing Systems (NeurIPS), 2020, pp. 12449–12460.
- J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux, “Libri-Light: A benchmark for ASR with limited or no supervision,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020, pp. 7669–7673.
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 5206–5210.
- T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2022, pp. 4521–4525.
- W. C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “The VoiceMOS Challenge 2022,” in Proc. Annual Conference of the International Speech Communication Association (Interspeech), 2022, pp. 4536–4540.
- http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/voicegrad2/.