Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model (2405.01730v1)
Abstract: Expressive voice conversion (VC) jointly converts speaker identity and emotional style, enabling identity conversion for emotional speakers. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, making speech quality heavily dependent on vocoder performance. A further major challenge of expressive VC lies in modeling emotional prosody. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We use speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity, respectively. Objective and subjective evaluations show the effectiveness of our framework. Code and samples are publicly available.
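To make the conditioning scheme concrete, below is a minimal PyTorch sketch of a DDPM denoiser conditioned on discrete self-supervised content units plus emotion and speaker embeddings, paired with the standard DDPM noise-regression loss. This is an illustration under stated assumptions, not the authors' implementation: all module names and dimensions (`n_units`, `emo_dim`, `spk_dim`, the frame-level feature representation) are hypothetical, and the sketch operates on feature frames for brevity whereas the paper's model is end-to-end on waveforms.

```python
# Hypothetical sketch: a conditional DDPM denoiser for expressive VC.
# Content units (from an SSL model), an emotion embedding (from an SER
# system), and a speaker embedding (from a speaker-verification system)
# jointly condition the network that predicts the added noise.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, n_units=100, unit_dim=256, emo_dim=128,
                 spk_dim=256, feat_dim=80, hidden=512, n_steps=1000):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)   # SSL content units
        self.step_emb = nn.Embedding(n_steps, hidden)     # diffusion timestep
        self.cond_proj = nn.Linear(unit_dim + emo_dim + spk_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, feat_dim),                  # predicts noise eps
        )

    def forward(self, x_t, t, units, emo, spk):
        # x_t: (B, T, feat_dim) noisy frames; t: (B,) timesteps
        # units: (B, T) unit ids; emo: (B, emo_dim); spk: (B, spk_dim)
        T = units.size(1)
        cond = torch.cat([
            self.unit_emb(units),                         # per-frame content
            emo.unsqueeze(1).expand(-1, T, -1),           # utterance-level style
            spk.unsqueeze(1).expand(-1, T, -1),           # utterance-level identity
        ], dim=-1)
        h = self.cond_proj(cond) + self.step_emb(t).unsqueeze(1)
        return self.net(torch.cat([x_t, h], dim=-1))

def ddpm_loss(model, x0, units, emo, spk, alphas_cumprod):
    # Standard DDPM training step (Ho et al., 2020): sample a timestep,
    # noise the clean frames, and regress the injected noise.
    t = torch.randint(0, alphas_cumprod.size(0), (x0.size(0),))
    a = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return nn.functional.mse_loss(model(x_t, t, units, emo, spk), noise)
```

Because the emotion and speaker embeddings enter as separate, utterance-level conditions while content arrives per frame, identity and emotional style can in principle be swapped independently at inference time; how the actual framework disentangles them is detailed in the paper.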
Authors: Zongyang Du, Junchen Lu, Kun Zhou, Lakshmish Kaushik, Berrak Sisman