Papers
Topics
Authors
Recent
Search
2000 character limit reached

M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition

Published 23 Sep 2025 in cs.HC | (2509.18706v1)

Abstract: Multimodal speech emotion recognition (SER) has emerged as pivotal for improving human-machine interaction. Researchers are increasingly leveraging both speech and textual information obtained through automatic speech recognition (ASR) to comprehensively recognize emotional states from speakers. Although this approach reduces reliance on human-annotated text data, ASR errors possibly degrade emotion recognition performance. To address this challenge, in our previous work, we introduced two auxiliary tasks, namely, ASR error detection and ASR error correction, and we proposed a novel multimodal fusion (MF) method for learning modality-specific and modality-invariant representations across different modalities. Building on this foundation, in this paper, we introduce two additional training strategies. First, we propose an adversarial network to enhance the diversity of modality-specific representations. Second, we introduce a label-based contrastive learning strategy to better capture emotional features. We refer to our proposed method as M4SER and validate its superiority over state-of-the-art methods through extensive experiments using IEMOCAP and MELD datasets.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.