
Speech Emotion Recognition with Dual-Sequence LSTM Architecture (1910.08874v4)

Published 20 Oct 2019 in eess.AS, cs.LG, and cs.SD

Abstract: Speech Emotion Recognition (SER) has emerged as a critical component of the next generation human-machine interfacing technologies. In this work, we propose a new dual-level model that predicts emotions based on both MFCC features and mel-spectrograms produced from raw audio signals. Each utterance is preprocessed into MFCC features and two mel-spectrograms at different time-frequency resolutions. A standard LSTM processes the MFCC features, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3%---a 6% improvement over current state-of-the-art unimodal models---and is comparable with multimodal models that leverage textual information as well as audio signals.
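
The abstract describes a late-fusion design: the MFCC-LSTM branch and the DS-LSTM branch each produce a class-probability vector, and the two are averaged to yield the final label. The sketch below illustrates only that fusion step with placeholder branch outputs (the real model would compute them from MFCC features and mel-spectrograms); the four-way emotion set is an assumption for illustration, not taken from the abstract.

```python
# Illustrative sketch of the late-fusion step: each branch emits class
# probabilities, which are averaged before the final argmax. The branch
# outputs below are hypothetical placeholders.

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed 4-class SER setup

def fuse_predictions(p_mfcc, p_dslstm):
    """Average the two branches' probability vectors and pick the argmax."""
    assert len(p_mfcc) == len(p_dslstm) == len(EMOTIONS)
    avg = [(a + b) / 2.0 for a, b in zip(p_mfcc, p_dslstm)]
    return EMOTIONS[max(range(len(avg)), key=avg.__getitem__)], avg

# Hypothetical branch outputs for one utterance:
p_mfcc = [0.10, 0.55, 0.25, 0.10]    # MFCC-LSTM branch
p_dslstm = [0.05, 0.60, 0.20, 0.15]  # DS-LSTM branch
label, avg = fuse_predictions(p_mfcc, p_dslstm)
print(label)  # "happy"
```

Averaging probabilities (rather than, say, concatenating hidden states) keeps the two branches independently trainable and makes the fusion step trivially cheap at inference time.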

Authors (6)
  1. Jianyou Wang (9 papers)
  2. Michael Xue (1 paper)
  3. Ryan Culhane (1 paper)
  4. Enmao Diao (25 papers)
  5. Jie Ding (123 papers)
  6. Vahid Tarokh (144 papers)
Citations (101)
