Multimodal Emotion Recognition from Raw Audio with Sinc-convolution (2402.11954v1)
Abstract: Speech Emotion Recognition (SER) remains a difficult task for machines, with average recall rates typically around 70% on the most realistic datasets. Most SER systems use hand-crafted features extracted from the audio signal, such as energy, zero-crossing rate, spectral information, prosodic features, and mel-frequency cepstral coefficients (MFCCs). More recently, training neural networks directly on the raw waveform has become an emerging trend. This approach is advantageous because it eliminates the hand-crafted feature extraction pipeline. Learning from the time-domain signal has shown good results for tasks such as speech recognition and speaker verification. In this paper, we use a Sinc-convolution layer, an efficient architecture for processing raw speech waveforms, to extract acoustic features from the raw audio signal, followed by a long short-term memory (LSTM) network. We also incorporate linguistic features and apply a dialogical emotion decoding (DED) strategy. Our approach achieves a weighted accuracy of 85.1% on four-class emotion recognition on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
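The abstract only names the building blocks, so the following is a minimal PyTorch sketch of what a sinc-convolution front end followed by an LSTM emotion classifier could look like. The class names (`SincConv`, `SincLSTMEmotionModel`), filter count, kernel size, pooling, and hidden size are illustrative assumptions, not the paper's reported configuration; the linguistic-feature branch and the DED post-processing step are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv(nn.Module):
    """Band-pass filterbank with learnable cutoff frequencies (SincNet-style).

    Each filter is parameterized only by its low and high cutoffs f1 < f2 and
    realized in the time domain as the difference of two ideal low-pass
    responses, g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), which uses
    far fewer parameters than a standard 1-D convolution.
    """

    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Illustrative initialization: low cutoffs spread over [30 Hz, 7 kHz]
        # with ~400 Hz bandwidths, stored as frequencies normalized by fs.
        low = torch.linspace(30, 7000, n_filters) / sample_rate
        band = torch.full((n_filters,), 400.0) / sample_rate
        self.low_hz = nn.Parameter(low)
        self.band_hz = nn.Parameter(band)
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                          # x: (batch, 1, samples)
        f1 = torch.abs(self.low_hz)
        f2 = f1 + torch.abs(self.band_hz)
        t = self.n.unsqueeze(0)                    # (1, kernel_size)
        # torch.sinc is the normalized sinc, so the pi factor is absorbed.
        low_pass1 = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        low_pass2 = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = (low_pass2 - low_pass1) * self.window   # (n_filters, kernel_size)
        filters = filters.unsqueeze(1)                    # (n_filters, 1, kernel_size)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)


class SincLSTMEmotionModel(nn.Module):
    """Sinc-convolution front end, LSTM encoder, and a 4-class emotion head."""

    def __init__(self, n_filters=80, hidden=128, n_classes=4):
        super().__init__()
        self.sinc = SincConv(n_filters=n_filters)
        self.pool = nn.MaxPool1d(kernel_size=160, stride=160)  # ~10 ms frames at 16 kHz
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        frames = self.pool(torch.relu(self.sinc(wav)))   # (batch, n_filters, frames)
        _, (h, _) = self.lstm(frames.transpose(1, 2))    # last hidden state
        return self.head(h[-1])                          # (batch, n_classes) logits


if __name__ == "__main__":
    model = SincLSTMEmotionModel()
    logits = model(torch.randn(2, 1, 16000))       # two 1-second utterances at 16 kHz
    print(logits.shape)                            # torch.Size([2, 4])
```

In a full system along the lines described above, these acoustic logits would be fused with a text-based (linguistic) branch and then re-scored across the dialogue with a DED strategy.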
Authors: Xiaohui Zhang, Wenjie Fu, Mangui Liang