Multimodal Emotion Recognition from Raw Audio with Sinc-convolution (2402.11954v1)
Abstract: Speech Emotion Recognition (SER) remains a difficult task for machines, with average recall rates typically around 70% on the most realistic datasets. Most SER systems use hand-crafted features extracted from the audio signal, such as energy, zero-crossing rate, spectral information, prosodic features, and mel-frequency cepstral coefficients (MFCCs). More recently, training neural networks directly on the raw waveform has become an emerging trend. This approach is advantageous because it eliminates the hand-crafted feature extraction pipeline. Learning from the time-domain signal has shown good results for tasks such as speech recognition and speaker verification. In this paper, we use a Sinc-convolution layer, an efficient architecture for processing raw speech waveforms, to extract acoustic features from the raw audio signal, followed by a long short-term memory (LSTM) network. We also incorporate linguistic features and apply a dialogical emotion decoding (DED) strategy. Our approach achieves a weighted accuracy of 85.1% on four-class emotion recognition on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
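The abstract only names the building blocks, so the following is a minimal PyTorch sketch of what a sinc-convolution front end followed by an LSTM emotion classifier could look like. The class names (`SincConv`, `SincLSTMEmotionModel`), filter count, kernel size, pooling, and hidden size are illustrative assumptions, not the paper's reported configuration; the linguistic-feature branch and the DED post-processing step are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv(nn.Module):
    """Band-pass filterbank with learnable cutoff frequencies (SincNet-style).

    Each filter is parameterized only by its low and high cutoffs f1 < f2 and
    realized in the time domain as the difference of two ideal low-pass
    responses, g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), which uses
    far fewer parameters than a standard 1-D convolution.
    """

    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Illustrative initialization: low cutoffs spread over [30 Hz, 7 kHz]
        # with ~400 Hz bandwidths, stored as frequencies normalized by fs.
        low = torch.linspace(30, 7000, n_filters) / sample_rate
        band = torch.full((n_filters,), 400.0) / sample_rate
        self.low_hz = nn.Parameter(low)
        self.band_hz = nn.Parameter(band)
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                          # x: (batch, 1, samples)
        f1 = torch.abs(self.low_hz)
        f2 = f1 + torch.abs(self.band_hz)
        t = self.n.unsqueeze(0)                    # (1, kernel_size)
        # torch.sinc is the normalized sinc, so the pi factor is absorbed.
        low_pass1 = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        low_pass2 = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = (low_pass2 - low_pass1) * self.window   # (n_filters, kernel_size)
        filters = filters.unsqueeze(1)                    # (n_filters, 1, kernel_size)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)


class SincLSTMEmotionModel(nn.Module):
    """Sinc-convolution front end, LSTM encoder, and a 4-class emotion head."""

    def __init__(self, n_filters=80, hidden=128, n_classes=4):
        super().__init__()
        self.sinc = SincConv(n_filters=n_filters)
        self.pool = nn.MaxPool1d(kernel_size=160, stride=160)  # ~10 ms frames at 16 kHz
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        frames = self.pool(torch.relu(self.sinc(wav)))   # (batch, n_filters, frames)
        _, (h, _) = self.lstm(frames.transpose(1, 2))    # last hidden state
        return self.head(h[-1])                          # (batch, n_classes) logits


if __name__ == "__main__":
    model = SincLSTMEmotionModel()
    logits = model(torch.randn(2, 1, 16000))       # two 1-second utterances at 16 kHz
    print(logits.shape)                            # torch.Size([2, 4])
```

In a full system along the lines described above, these acoustic logits would be fused with a text-based (linguistic) branch and then re-scored across the dialogue with a DED strategy.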
Authors: Xiaohui Zhang, Wenjie Fu, Mangui Liang