
A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis (2309.11849v1)

Published 21 Sep 2023 in cs.SD, cs.CL, and eess.AS

Abstract: This paper explores predicting suitable prosodic features for fine-grained emotion analysis from discourse-level text. To obtain fine-grained emotional prosodic features as prediction targets for our model, we extract a phoneme-level Local Prosody Embedding sequence (LPEs) and a Global Style Embedding as prosodic speech features from speech with the help of a style transfer model. We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features. The proposed model can be used to better analyze emotional prosodic features and thus guide a speech synthesis model to synthesize more expressive speech. To quantitatively evaluate the proposed model, we contribute a new, large-scale Discourse-level Chinese Audiobook (DCA) dataset with more than 13,000 annotated utterances. Experimental results on the DCA dataset show that multi-scale text information effectively helps to predict prosodic features, and that discourse-level text improves both overall coherence and the user experience. More interestingly, although we aim to match the synthesis quality of the style transfer model, speech synthesized from the proposed text prosodic analysis model is even better on some user evaluation indicators than style transfer from the original speech.
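The pipeline described in the abstract has two parts: a reference encoder (the style transfer model) that distills each training utterance into a phoneme-level LPE sequence plus one Global Style Embedding, and the D-MPM text model that learns to predict both of these from multi-scale text alone. The sketch below illustrates that two-headed prediction task in PyTorch; the module names, dimensions, and the BiLSTM-plus-gather fusion of sentence-scale and phoneme-scale features are illustrative assumptions, not the authors' published architecture.

```python
# Illustrative sketch (not the paper's architecture): predict, from
# multi-scale text features, (a) a phoneme-level Local Prosody Embedding
# (LPE) sequence and (b) a single Global Style Embedding per discourse.
import torch
import torch.nn as nn

class DMPMSketch(nn.Module):
    def __init__(self, phone_vocab=100, text_dim=768, hidden=256,
                 lpe_dim=32, style_dim=128):
        super().__init__()
        # Fine scale: one learned embedding per phoneme in the discourse.
        self.phone_emb = nn.Embedding(phone_vocab, hidden)
        # Coarse scale: pretrained sentence-level text features
        # (e.g., BERT sentence vectors), projected to the model width.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.encoder = nn.LSTM(2 * hidden, hidden, batch_first=True,
                               bidirectional=True)
        # Head (a): phoneme-level LPE sequence.
        self.lpe_head = nn.Linear(2 * hidden, lpe_dim)
        # Head (b): Global Style Embedding via mean pooling over time.
        self.style_head = nn.Linear(2 * hidden, style_dim)

    def forward(self, phone_ids, sent_feat, phone2sent):
        # phone_ids:  (B, T)            phoneme indices
        # sent_feat:  (B, S, text_dim)  sentence-level text features
        # phone2sent: (B, T)            parent-sentence index per phoneme
        ph = self.phone_emb(phone_ids)                  # (B, T, H)
        sent = self.text_proj(sent_feat)                # (B, S, H)
        # Align the coarse scale to the phoneme scale by gathering each
        # phoneme's parent-sentence vector, then fuse the two scales.
        idx = phone2sent.unsqueeze(-1).expand(-1, -1, sent.size(-1))
        sent_per_phone = sent.gather(1, idx)            # (B, T, H)
        h, _ = self.encoder(torch.cat([ph, sent_per_phone], dim=-1))
        lpe_seq = self.lpe_head(h)                      # (B, T, lpe_dim)
        style = self.style_head(h.mean(dim=1))          # (B, style_dim)
        return lpe_seq, style
```

Under this reading, training would regress both heads (e.g., with L1/L2 losses) against the LPE sequence and Global Style Embedding that the style transfer model extracts from the reference audio; at synthesis time, the text-predicted embeddings would condition the acoustic model in place of a reference recording.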

