
Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks (2312.01568v1)

Published 4 Dec 2023 in cs.HC, cs.SD, and eess.AS

Abstract: Emotion recognition is a topic of significant interest in assistive robotics due to the need to equip robots with the ability to comprehend human behavior, facilitating their effective interaction in our society. Consequently, efficient and dependable emotion recognition systems supporting optimal human-machine communication are required. Multi-modality (including speech, audio, text, images, and videos) is typically exploited in emotion recognition tasks. Much relevant research is based on merging multiple data modalities and training deep learning models utilizing low-level data representations. However, most existing emotion databases are not large (or complex) enough to allow machine learning approaches to learn detailed representations. This paper explores modality-specific pre-trained transformer frameworks for self-supervised learning of speech and text representations for data-efficient emotion recognition while achieving state-of-the-art performance. The model applies feature-level fusion, using nonverbal cue data points from motion capture, to provide multimodal speech emotion recognition. The model was trained on the publicly available IEMOCAP dataset, achieving an overall accuracy of 77.58% for four emotions and outperforming state-of-the-art approaches.
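As a rough illustration of the pipeline the abstract describes (modality-specific self-supervised encoders combined by feature-level fusion with motion-capture cues), the sketch below pools wav2vec 2.0 speech embeddings and BERT text embeddings and concatenates them with a MoCap feature vector before classification. It is a minimal sketch, not the authors' released implementation: the checkpoint names, mean pooling, hidden size, and MoCap dimensionality are assumptions for illustration.

```python
# Minimal sketch of feature-level fusion over modality-specific
# self-supervised encoders (assumed checkpoints and dimensions).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class FusionSER(nn.Module):
    def __init__(self, mocap_dim=189, num_emotions=4, hidden=256):
        super().__init__()
        # Pre-trained, modality-specific encoders (checkpoint names assumed).
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        fused_dim = (self.speech_encoder.config.hidden_size
                     + self.text_encoder.config.hidden_size + mocap_dim)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, hidden), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(hidden, num_emotions),
        )

    def forward(self, waveform, input_ids, attention_mask, mocap_feats):
        # Mean-pool frame/token embeddings into one vector per modality.
        speech = self.speech_encoder(waveform).last_hidden_state.mean(dim=1)
        text = self.text_encoder(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state.mean(dim=1)
        # Feature-level fusion: concatenate modality vectors with MoCap cues.
        fused = torch.cat([speech, text, mocap_feats], dim=-1)
        return self.classifier(fused)
```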

Authors (6)
  1. Rutherford Agbeshi Patamia (1 paper)
  2. Paulo E. Santos (10 papers)
  3. Kingsley Nketia Acheampong (2 papers)
  4. Favour Ekong (1 paper)
  5. Kwabena Sarpong (1 paper)
  6. She Kun (1 paper)
Citations (1)
