MS-SENet: Enhancing Speech Emotion Recognition Through Multi-Scale Feature Fusion With Squeeze-and-Excitation Blocks (2312.11974v2)
Abstract: Speech Emotion Recognition (SER) has become a growing focus of research in human-computer interaction. Spatiotemporal features play a crucial role in SER, yet current research lacks comprehensive spatiotemporal feature learning. To address this gap, we propose a novel approach: we employ Convolutional Neural Networks (CNNs) with varying kernel sizes for spatial and temporal feature extraction, and introduce Squeeze-and-Excitation (SE) modules to capture and fuse multi-scale features, facilitating effective information fusion for improved emotion recognition and a deeper understanding of the temporal evolution of speech emotion. Moreover, we employ skip connections and Spatial Dropout (SD) layers to prevent overfitting and increase the model's depth. Our method outperforms the previous state-of-the-art method, achieving average improvements of 1.62% in UAR (unweighted average recall) and 1.32% in WAR (weighted average recall) across six benchmark SER datasets. Further experiments demonstrate that our method can fully extract spatiotemporal features in low-resource conditions.
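The abstract names the main architectural ingredients (parallel convolutions of different kernel sizes, SE-gated fusion, a skip connection, and Spatial Dropout) without giving reference code. Below is a minimal PyTorch sketch of how such a block could be assembled; it is not the authors' MS-SENet, and the module names, kernel sizes, reduction ratio, and dropout rate (`MultiScaleSEBlock`, `kernel_sizes=(3, 5, 7)`, `reduction=16`, `p_drop=0.1`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: pool a per-channel descriptor, pass it
    through a two-layer bottleneck MLP, and reweight the channels."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # excitation: per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # channel-wise rescaling


class MultiScaleSEBlock(nn.Module):
    """Illustrative block: parallel convolutions with different kernel
    sizes capture features at multiple time-frequency scales; an SE
    block reweights the concatenated branches before a 1x1 fusion; a
    skip connection and Spatial Dropout (Dropout2d) round it out."""

    def __init__(self, in_ch: int, out_ch: int,
                 kernel_sizes=(3, 5, 7), p_drop: float = 0.1):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes]
        )
        self.se = SEBlock(out_ch * len(kernel_sizes))
        self.fuse = nn.Conv2d(out_ch * len(kernel_sizes), out_ch, 1)  # 1x1 projection
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.drop = nn.Dropout2d(p_drop)  # Spatial Dropout: drops whole feature maps
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # multi-scale features
        fused = self.fuse(self.se(multi))  # SE-gated fusion across scales
        return self.drop(self.act(fused + self.skip(x)))


if __name__ == "__main__":
    # e.g. a batch of log-Mel spectrograms: (batch, 1, mel_bins, frames)
    x = torch.randn(4, 1, 64, 256)
    y = MultiScaleSEBlock(1, 32)(x)
    print(y.shape)  # torch.Size([4, 32, 64, 256])
```

The design choice the abstract motivates is visible here: small kernels emphasize local spectral detail while larger kernels cover longer temporal context, and the SE gates let the network learn which scale to trust per channel before fusion.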
Authors: Mengbo Li, Yuanzhong Zheng, Dichucheng Li, Yulun Wu, Yaoxuan Wang, Haojun Fei