PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion (2403.01494v1)
Abstract: In this paper, we propose Prosody-aware VITS (PAVITS) for emotional voice conversion (EVC), aiming to achieve the two major objectives of EVC: high content naturalness and high emotional naturalness, both of which are crucial to human perception of converted speech. To improve the content naturalness of converted audio, we develop an end-to-end EVC architecture inspired by the high audio quality of VITS. By seamlessly integrating the acoustic converter and vocoder, we address the mismatch between emotional prosody training and run-time conversion that is prevalent in existing EVC models. To further enhance emotional naturalness, we introduce an emotion descriptor to model the subtle prosody variations across speech emotions. Additionally, we propose a prosody predictor, which predicts prosody features from text conditioned on the given emotion label. Notably, we introduce a prosody alignment loss that connects the latent prosody features of the two modalities (speech and text), ensuring effective training. Experimental results show that PAVITS outperforms state-of-the-art EVC methods. Speech samples are available at https://jeremychee4.github.io/pavits4EVC/ .
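To make the prosody alignment idea concrete, below is a minimal PyTorch sketch of such a loss. The abstract does not specify the exact form of the loss, so the distance metric, the tensor shapes, and the names (`ProsodyAlignmentLoss`, `z_text`, `z_audio`) are illustrative assumptions: the audio-side latents stand in for the emotion descriptor's output, and the text-side latents for the prosody predictor's output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProsodyAlignmentLoss(nn.Module):
    """Illustrative sketch: align latent prosody features from two modalities.

    z_audio is assumed to come from an emotion descriptor applied to speech;
    z_text from a prosody predictor conditioned on text and an emotion label.
    Minimizing their distance lets the text branch approximate the audio
    branch at conversion time, when reference audio prosody is unavailable.
    """

    def forward(self, z_text: torch.Tensor, z_audio: torch.Tensor) -> torch.Tensor:
        # Both tensors: (batch, frames, dim). Detaching the audio branch is a
        # design choice assumed here: it pulls the text-side predictor toward
        # the audio-derived target without disturbing the audio encoder.
        return F.mse_loss(z_text, z_audio.detach())


# Toy usage with random latents in place of real model outputs.
loss_fn = ProsodyAlignmentLoss()
z_text = torch.randn(2, 100, 192, requires_grad=True)  # from prosody predictor
z_audio = torch.randn(2, 100, 192)                     # from emotion descriptor
loss = loss_fn(z_text, z_audio)
loss.backward()
```

An L2 distance is only one plausible choice; a KL divergence between Gaussian posteriors, in the spirit of VITS's variational framework, would serve the same aligning role.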