Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks (2306.06514v1)

Published 10 Jun 2023 in cs.SD and eess.AS

Abstract: Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignment. However, most methods decouple the conversion of acoustic features from the synthesis of the audio signal, using separate models for conversion and waveform generation. This work unifies conversion and synthesis into a single model, thereby eliminating the need for a separate vocoder. By leveraging cycle-consistent training and a self-supervised auxiliary training task, our model efficiently generates high-quality converted raw audio waveforms. Subjective listening tests show that our method outperforms the baseline in whispered speech conversion (up to 6.7% relative improvement), and mean opinion score predictions yield competitive results in conventional VC (between 0.5% and 2.4% relative improvement).
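The core idea of the training objective (a cycle-consistent mapping combined with a self-supervised masking task) can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendition: the generator interfaces (`g_src2tgt`, `g_tgt2src`), the feature shape, and the contiguous-block masking scheme are illustrative assumptions in the spirit of the "filling in frames" task of MaskCycleGAN-VC, not the authors' exact model.

```python
import torch
import torch.nn.functional as F

def masked_cycle_loss(x_src, g_src2tgt, g_tgt2src, mask_ratio=0.25):
    """Cycle-consistency loss with a self-supervised frame-masking task.

    A contiguous block of input frames is zeroed out; the generators must
    fill in the missing content so that the full conversion cycle
    reconstructs the unmasked original.
    """
    b, c, t = x_src.shape
    mask = torch.ones_like(x_src)
    span = int(t * mask_ratio)
    start = torch.randint(0, t - span + 1, (1,)).item()
    mask[..., start:start + span] = 0.0  # hide a contiguous frame block

    x_fwd = g_src2tgt(x_src * mask)   # source -> target domain
    x_cyc = g_tgt2src(x_fwd)          # target -> back to source domain
    # L1 against the *unmasked* original: the cycle must restore the
    # masked frames, which is what makes the auxiliary task self-supervised.
    return F.l1_loss(x_cyc, x_src)

# Toy usage with identity "generators" on a fake feature batch.
if __name__ == "__main__":
    x = torch.randn(2, 80, 128)  # (batch, feature bins, frames); illustrative
    ident = torch.nn.Identity()
    print(masked_cycle_loss(x, ident, ident).item())
```

Computing the loss against the unmasked original forces the cycle to inpaint the hidden frames rather than merely copy its input, which provides a training signal without any parallel or aligned data.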

