Using joint training speaker encoder with consistency loss to achieve cross-lingual voice conversion and expressive voice conversion (2307.00393v1)

Published 1 Jul 2023 in eess.AS

Abstract: Voice conversion systems have made significant advancements in terms of naturalness and similarity in common voice conversion tasks. However, their performance in more complex tasks such as cross-lingual voice conversion and expressive voice conversion remains imperfect. In this study, we propose a novel approach that combines a jointly trained speaker encoder and content features extracted from the cross-lingual speech recognition model Whisper to achieve high-quality cross-lingual voice conversion. Additionally, we introduce a speaker consistency loss to the joint encoder, which improves the similarity between the converted speech and the reference speech. To further explore the capabilities of the joint speaker encoder, we use the phonetic posteriorgram as the content feature, which enables the model to effectively reproduce both the speaker characteristics and the emotional aspects of the reference speech.
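The abstract does not give the exact form of the speaker consistency loss, so below is a minimal PyTorch sketch of one plausible formulation: the jointly trained speaker encoder embeds both the converted speech and the reference speech, and the loss pulls the two embeddings together via cosine similarity. The `SpeakerEncoder` architecture, the mel-spectrogram input, and the cosine form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Toy jointly trained speaker encoder: mel spectrogram -> fixed-size embedding.
    (Hypothetical architecture; the paper does not specify one in the abstract.)"""
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.proj = nn.Linear(emb_dim, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> (batch, emb_dim), L2-normalized
        _, h = self.rnn(mel)
        return F.normalize(self.proj(h[-1]), dim=-1)

def speaker_consistency_loss(encoder: SpeakerEncoder,
                             converted_mel: torch.Tensor,
                             reference_mel: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between converted- and reference-speech embeddings."""
    e_conv = encoder(converted_mel)
    e_ref = encoder(reference_mel)
    return (1.0 - F.cosine_similarity(e_conv, e_ref, dim=-1)).mean()

# Usage: add this term to the main voice-conversion training objective so the
# converted speech keeps the reference speaker's characteristics.
enc = SpeakerEncoder()
converted = torch.randn(4, 120, 80)   # dummy converted-speech mels
reference = torch.randn(4, 120, 80)   # dummy reference-speech mels
loss = speaker_consistency_loss(enc, converted, reference)
loss.backward()
```

In the paper's setup the content representation fed to the conversion model comes from Whisper (or, in the expressive variant, a phonetic posteriorgram), while the loss above only constrains speaker identity; how the two terms are weighted is not stated in the abstract.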

Authors (4)
  1. Houjian Guo (2 papers)
  2. Chaoran Liu (9 papers)
  3. Carlos Toshinori Ishi (4 papers)
  4. Hiroshi Ishiguro (19 papers)
Citations (5)