EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech (2403.02167v3)

Published 4 Mar 2024 in eess.AS, cs.AI, cs.CL, and cs.SD

Abstract: Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios such as TV shows, limiting their application in real-world contexts. We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, comprising 999 voice messages from real conversations of 100 Spanish speakers on a messaging app, labeled with continuous and discrete emotions by expert and non-expert annotators. We evaluated speaker-independent SER models, using acoustic features as a baseline and transformer-based models. We compared the results with reference datasets containing acted and elicited speech, and analyzed the influence of annotators and gender fairness. The pre-trained UniSpeech-SAT-Large model achieved the highest results on EMOVOME, 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. EMOVOME performed lower than the acted RAVDESS dataset. The elicited IEMOCAP dataset also outperformed EMOVOME in predicting emotion categories, while similar results were obtained for valence and arousal. EMOVOME outcomes varied with the annotator labels, showing better results and fairness when combining expert and non-expert annotations. This study highlights the gap between controlled and real-life scenarios, supporting further advances in recognizing genuine emotions.
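The abstract reports results in Unweighted Accuracy (UA), the mean of per-class recalls, using utterance-level representations from a pre-trained UniSpeech-SAT-Large encoder. The sketch below is a minimal, hypothetical illustration of that kind of pipeline, not the authors' code: it assumes the Hugging Face transformers checkpoint microsoft/unispeech-sat-large, torchaudio for audio loading, and scikit-learn for the metric; the paper's actual fine-tuning and evaluation setup may differ.

```python
# Minimal sketch (not the authors' code): extract an utterance-level embedding
# with a pre-trained UniSpeech-SAT-Large encoder and score 3-class predictions
# with Unweighted Accuracy (UA), i.e. the mean of per-class recalls.
# Assumes the Hugging Face checkpoint "microsoft/unispeech-sat-large" and
# torchaudio for audio I/O; the paper's fine-tuning setup may differ.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, UniSpeechSatModel
from sklearn.metrics import balanced_accuracy_score

feature_extractor = Wav2Vec2FeatureExtractor()  # raw-waveform front end, 16 kHz by default
model = UniSpeechSatModel.from_pretrained("microsoft/unispeech-sat-large")
model.eval()

def utterance_embedding(wav_path: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states into one vector per utterance."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz
    inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, n_frames, 1024)
    return hidden.mean(dim=1).squeeze(0)             # shape (1024,)

# Unweighted Accuracy equals scikit-learn's balanced accuracy.
y_true = [0, 1, 2, 1, 0, 2]  # hypothetical 3-class valence labels (negative/neutral/positive)
y_pred = [0, 1, 1, 1, 0, 2]
print(f"UA = {balanced_accuracy_score(y_true, y_pred):.4f}")
```

In the paper the encoder feeds a speaker-independent classifier trained on the labeled data; the sketch only illustrates embedding extraction and how UA is computed.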
