Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation (2405.00603v1)

Published 1 May 2024 in cs.SD and eess.AS

Abstract: Voice conversion is the task of transforming the voice characteristics of source speech while preserving its content. Self-supervised representation learning models are now widely used for content extraction. However, these representations retain substantial speaker information, which causes timbre leakage, while the prosodic information carried by the hidden units goes largely unused. To address these issues, we propose "SAVC", a novel framework for expressive voice conversion built on soft speech units from HuBERT-Soft. Taking soft speech units as input, we design an attribute encoder that extracts content and prosody features separately. Specifically, we first introduce statistical perturbation through adversarial style augmentation to eliminate speaker information. The prosody is then modeled implicitly on the soft speech units via knowledge distillation. Experimental results show that the converted speech surpasses previous work in intelligibility and naturalness.
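
The page gives no implementation details beyond the abstract, but the statistic-perturbation step it describes belongs to the instance-normalization family of style augmentations. Below is a minimal PyTorch sketch of that idea, under stated assumptions: the function name, tensor shapes, Gaussian perturbation, and noise scale are all illustrative, not the paper's exact SAVC formulation (which imposes the perturbation adversarially).

```python
# Minimal sketch: speaker timbre is assumed to live largely in the per-channel
# mean/std of the soft-unit features, so perturbing those statistics during
# training pushes the attribute encoder to ignore them.
import torch

def perturb_feature_statistics(x: torch.Tensor, noise_scale: float = 0.5) -> torch.Tensor:
    """x: soft speech units with shape (batch, time, channels)."""
    mu = x.mean(dim=1, keepdim=True)            # per-utterance channel means, (B, 1, C)
    sigma = x.std(dim=1, keepdim=True) + 1e-5   # per-utterance channel stds,  (B, 1, C)
    normalized = (x - mu) / sigma               # strip utterance-level statistics ("style")

    # Re-inject randomly perturbed statistics; an adversarial variant would pick the
    # perturbation that most confuses a speaker classifier rather than sampling it.
    new_mu = mu * (1.0 + noise_scale * torch.randn_like(mu))
    new_sigma = sigma * (1.0 + noise_scale * torch.randn_like(sigma))
    return normalized * new_sigma.clamp_min(1e-5) + new_mu

# Example: a batch of 4 utterances, 200 frames, 256-dim soft units.
units = torch.randn(4, 200, 256)
augmented = perturb_feature_statistics(units)
print(augmented.shape)  # torch.Size([4, 200, 256])
```

An encoder trained on such augmented units sees the same content under many different channel statistics, so the content and prosody features it extracts become less dependent on speaker statistics; this is a plausible reading of the abstract's "statistical perturbation", not a description of the released system.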

Authors (5)
  1. Yimin Deng (9 papers)
  2. Jianzong Wang (144 papers)
  3. Xulong Zhang (60 papers)
  4. Ning Cheng (96 papers)
  5. Jing Xiao (267 papers)
