Fine-Grained Quantitative Emotion Editing for Speech Generation (2403.02002v2)

Published 4 Mar 2024 in cs.SD and eess.AS

Abstract: Quantitatively controlling the expressiveness of emotion in generated speech remains a significant challenge. In this work, we present a novel approach for manipulating the rendering of emotions in speech generation. We propose a hierarchical emotion distribution extractor, Hierarchical ED, that quantifies emotion intensity at multiple levels of granularity. Support vector machines (SVMs) are employed to rank emotion intensity, yielding a hierarchical emotional embedding. Hierarchical ED is then integrated into the FastSpeech2 framework, guiding the model to learn emotion intensity at the phoneme, word, and utterance levels. During synthesis, users can manually edit the emotional intensity of the generated voices. Both objective and subjective evaluations demonstrate the effectiveness of the proposed network for fine-grained quantitative emotion editing.
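The abstract's core mechanism, ranking emotion intensity with SVMs in the style of relative attributes, can be illustrated with a minimal sketch. This is not the authors' implementation: the acoustic features here are random stand-ins for real extractor output (the paper uses features at phoneme, word, and utterance levels), and all names and dimensions are illustrative. A linear SVM is trained on pairwise feature differences between emotional and neutral utterances, and its signed distance to the hyperplane is normalized into an editable intensity score.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Toy acoustic features standing in for real extractor output
# (e.g. openSMILE-style features per phoneme/word/utterance).
neutral = rng.normal(0.0, 1.0, size=(50, 16))
emotional = rng.normal(1.0, 1.0, size=(50, 16))

# Relative-attribute-style ranking: train on pairwise differences,
# labeling (emotional - neutral) as +1 and the reverse as -1.
diffs = np.vstack([e - n for e in emotional for n in neutral][:500])
X = np.vstack([diffs, -diffs])
y = np.concatenate([np.ones(len(diffs)), -np.ones(len(diffs))])

ranker = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

def intensity(feats, ranker, lo, hi):
    """Map the SVM's signed distance to a [0, 1] intensity score."""
    raw = ranker.decision_function(feats)
    return np.clip((raw - lo) / (hi - lo), 0.0, 1.0)

# Normalize against the score range observed on the training data.
all_feats = np.vstack([neutral, emotional])
raw = ranker.decision_function(all_feats)
lo, hi = raw.min(), raw.max()
scores = intensity(all_feats, ranker, lo, hi)
```

In the paper's pipeline, scores like these would be computed at each granularity level and concatenated into the hierarchical embedding that conditions FastSpeech2; at synthesis time a user edits the scores directly rather than recomputing them.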

Authors (4)
  1. Sho Inoue
  2. Kun Zhou
  3. Shuai Wang
  4. Haizhou Li
Citations (2)
