Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech (2309.11014v1)

Published 20 Sep 2023 in eess.AS, cs.SD, and eess.SP

Abstract: Speech emotion recognition has evolved from a research topic into practical applications. Previous studies of emotion recognition from speech have focused on developing models for specific datasets such as IEMOCAP. The scarcity of data for emotion modeling makes it challenging to evaluate models across datasets, as well as to evaluate speech emotion recognition models in a multilingual setting. This paper proposes ensemble learning to fuse the outputs of pre-trained models for emotion share recognition from speech. The models were chosen to accommodate multilingual data in English and Spanish. The results show that ensemble learning improves on both the single-model baseline and the previous best late-fusion model. Performance is measured with the Spearman rank correlation coefficient, since the task is a regression problem over ranked values. A Spearman rank correlation coefficient of 0.537 is reported for the test set and 0.524 for the development set. These scores are higher than those of a previous fusion method trained on monolingual data, which achieved 0.476 on the test set and 0.470 on the development set.
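
The fusion and scoring pipeline described in the abstract can be illustrated with a short Python sketch: per-emotion predictions from several pre-trained models are combined by late fusion (plain equal-weight averaging here, one common choice) and evaluated with the mean Spearman rank correlation across emotion dimensions. The number of emotion targets, the random stand-in predictions, and the equal weights are illustrative assumptions, not the paper's exact configuration.

import numpy as np
from scipy.stats import spearmanr

def ensemble_average(predictions):
    # Late fusion by averaging per-model outputs of shape (n_samples, n_emotions).
    return np.mean(np.stack(predictions, axis=0), axis=0)

def mean_spearman(y_true, y_pred):
    # Spearman's rho per emotion dimension, averaged over dimensions.
    rhos = []
    for k in range(y_true.shape[1]):
        rho, _ = spearmanr(y_true[:, k], y_pred[:, k])
        rhos.append(rho)
    return float(np.mean(rhos))

# Illustrative data: 100 utterances, 9 emotion-share targets (dimension count assumed),
# and three stand-in "pre-trained model" prediction matrices with added noise.
rng = np.random.default_rng(0)
y_true = rng.random((100, 9))
preds = [y_true + 0.3 * rng.normal(size=y_true.shape) for _ in range(3)]

print("single model rho:", mean_spearman(y_true, preds[0]))
print("fused ensemble rho:", mean_spearman(y_true, ensemble_average(preds)))

Because averaging tends to cancel uncorrelated errors across models, the fused score is typically higher than any single model's, which mirrors the gain the abstract reports over the single-model baseline.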
