
AE-Flow: AutoEncoder Normalizing Flow (2312.16552v1)

Published 27 Dec 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Recently, normalizing flows have been gaining traction in text-to-speech (TTS) and voice conversion (VC) due to their state-of-the-art (SOTA) performance. Normalizing flows are unsupervised generative models. In this paper, we introduce supervision to the training process of normalizing flows, without the need for parallel data. We call this training paradigm AutoEncoder Normalizing Flow (AE-Flow). It adds a reconstruction loss, forcing the model to use information from the conditioning to reconstruct an audio sample. Our goal is to understand the impact of each component and find the right combination of the negative log-likelihood (NLL) and the reconstruction loss in training normalizing flows with coupling blocks. For that reason, we compare flow-based mapping models trained with: (i) NLL loss, (ii) NLL and reconstruction losses, as well as (iii) reconstruction loss only. Additionally, we compare our model with a SOTA VC baseline. The models are evaluated in terms of naturalness, speaker similarity, and intelligibility in many-to-many and many-to-any VC settings. The results show that the proposed training paradigm systematically improves speaker similarity and naturalness when compared to regular training methods of normalizing flows. Furthermore, we show that our method improves speaker similarity and intelligibility over the state of the art.
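
The paradigm described in the abstract combines the standard flow NLL with a reconstruction term that ties the conditioning c (e.g. a speaker embedding) to the reconstructed sample. A minimal sketch of such a combined objective, assuming an additive weighting \lambda and a generic distance d (neither the weighting nor the distance is specified in the abstract), is:

\mathcal{L}(\theta) = \underbrace{-\log p_Z\big(f_\theta(x; c)\big) - \log\left|\det \frac{\partial f_\theta(x; c)}{\partial x}\right|}_{\text{NLL}} + \lambda \, \underbrace{d\big(x,\; f_\theta^{-1}(\hat{z}; c)\big)}_{\text{reconstruction}}

where f_\theta is the coupling-block flow, p_Z a standard-normal prior, and \hat{z} the latent used for the reconstruction pass; how \hat{z} is obtained is not stated in the abstract, so its exact choice should be taken from the paper itself rather than this sketch.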
