A Multi-Task, Multi-Modal Approach for Predicting Categorical and Dimensional Emotions (2401.00536v1)
Abstract: Speech emotion recognition (SER) has received a great deal of attention in recent years in the context of spontaneous conversations. While notable results have been achieved on datasets such as IEMOCAP, the well-known corpus of naturalistic dyadic conversations, for both categorical and dimensional emotions, few papers attempt to predict both paradigms at the same time. In this work, we therefore aim to highlight the performance contribution of multi-task learning by proposing a multi-task, multi-modal system that predicts categorical and dimensional emotions. The results emphasise the importance of cross-regularisation between the two types of emotions. Our approach consists of a multi-task, multi-modal architecture that refines the features of each modality in parallel through self-attention. To fuse the features, our model introduces a set of learnable bridge tokens that merge the acoustic and linguistic features with the help of cross-attention. Our experiments on categorical emotions with 10-fold cross-validation yield results comparable to the current state of the art. In our configuration, the multi-task approach outperforms learning each paradigm separately. Moreover, our best-performing model achieves a high result for valence compared to previous multi-task experiments.
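The fusion scheme described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature dimensions, sequence lengths, number of bridge tokens, and the single-head, projection-free attention are all simplifying assumptions made here for illustration. It shows the three stages the abstract names: per-modality self-attention refinement, bridge tokens gathering cross-modal information via cross-attention, and a shared fused representation feeding two task heads (categorical and dimensional).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; single head, no learned
    # projections, purely for illustration.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 64                                      # hypothetical feature dimension
acoustic = rng.standard_normal((50, d))     # stand-in for acoustic frame features
linguistic = rng.standard_normal((20, d))   # stand-in for token embeddings

# 1) Parallel self-attention refinement of each modality.
acoustic_ref = attention(acoustic, acoustic, acoustic)
linguistic_ref = attention(linguistic, linguistic, linguistic)

# 2) Learnable bridge tokens (randomly initialised here) merge the two
#    modalities: they act as queries cross-attending over the
#    concatenated acoustic and linguistic features.
n_bridge = 4                                # hypothetical token count
bridge = rng.standard_normal((n_bridge, d))
context = np.concatenate([acoustic_ref, linguistic_ref], axis=0)
fused = attention(bridge, context, context)

# 3) Multi-task heads on the shared fused representation:
#    one for categorical emotions, one for valence/arousal/dominance.
pooled = fused.mean(axis=0)
W_cat = rng.standard_normal((d, 4))         # e.g. 4 emotion categories
W_dim = rng.standard_normal((d, 3))         # valence, arousal, dominance
cat_logits = pooled @ W_cat
vad_preds = pooled @ W_dim
```

In a trained model the bridge tokens and attention projections would be learned parameters, and the two heads would be optimised jointly so that each paradigm regularises the other, which is the cross-regularisation effect the abstract highlights.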
Authors: Alex-Răzvan Ispas, Théo Deschamps-Berger, Laurence Devillers