
Disentangling Prosody Representations with Unsupervised Speech Reconstruction (2212.06972v2)

Published 14 Dec 2022 in cs.SD, cs.CL, and eess.AS

Abstract: Human speech can be characterized by different components, including semantic content, speaker identity and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging research question because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust large-scale and speaker-independent ASR. The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction. Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model to generate speaker identity embeddings, and (3) a trainable prosody encoder to learn prosody representations. We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and unweighted accuracies) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial to the performance of widely used speech pretraining models, and surpass state-of-the-art methods when Prosody2Vec is combined with HuBERT representations.
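
The abstract outlines a three-part reconstruction architecture: discrete content units, a fixed speaker embedding, and a trainable prosody encoder, which together condition a decoder that rebuilds the speech signal. The sketch below illustrates that idea in PyTorch under stated assumptions; the module names, dimensions, GRU decoder, and L1 reconstruction loss are illustrative choices, not the paper's actual implementation, and the discrete units and speaker embedding are treated as given inputs (in the paper they come from a unit encoder and a pretrained speaker verification model).

```python
# Minimal sketch of a Prosody2Vec-style reconstruction model (illustrative only).
# All module names, sizes, and the GRU decoder are assumptions made for clarity.
import torch
import torch.nn as nn


class ProsodyEncoder(nn.Module):
    """Trainable encoder mapping acoustic frames to an utterance-level prosody vector."""

    def __init__(self, n_mels=80, hidden=128, prosody_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, prosody_dim)

    def forward(self, mel):                      # mel: (B, T, n_mels)
        out, _ = self.rnn(mel)
        return self.proj(out.mean(dim=1))        # (B, prosody_dim)


class ReconstructionDecoder(nn.Module):
    """Decodes content units + speaker + prosody embeddings back into a mel-spectrogram."""

    def __init__(self, n_units=100, unit_dim=128, spk_dim=192,
                 prosody_dim=64, n_mels=80, hidden=256):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)
        self.rnn = nn.GRU(unit_dim + spk_dim + prosody_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, units, spk, prosody):      # units: (B, T) integer codes
        T = units.size(1)
        cond = torch.cat([spk, prosody], dim=-1).unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([self.unit_emb(units), cond], dim=-1)
        out, _ = self.rnn(x)
        return self.out(out)                     # predicted mel: (B, T, n_mels)


if __name__ == "__main__":
    B, T, n_mels = 2, 120, 80
    mel = torch.randn(B, T, n_mels)              # target mel-spectrogram
    units = torch.randint(0, 100, (B, T))        # discrete content units (assumed given)
    spk = torch.randn(B, 192)                    # frozen speaker embedding (assumed given)

    prosody_enc = ProsodyEncoder(n_mels=n_mels)
    decoder = ReconstructionDecoder(n_mels=n_mels)

    prosody = prosody_enc(mel)
    mel_hat = decoder(units, spk, prosody)
    loss = nn.functional.l1_loss(mel_hat, mel)   # unsupervised reconstruction objective
    loss.backward()
    print(mel_hat.shape, float(loss))
```

In the full approach described in the abstract, this kind of reconstruction objective would be optimized on unlabelled emotional speech during pretraining, after which the learned prosody representations are fine-tuned for SER and EVC.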

Authors (6)
  1. Leyuan Qu
  2. Taihao Li
  3. Cornelius Weber
  4. Theresa Pekarek-Rosin
  5. Fuji Ren
  6. Stefan Wermter
