Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism (2403.04743v2)

Published 7 Mar 2024 in eess.AS

Abstract: Speech Emotion Recognition (SER) is crucial in human-machine interaction. Mainstream approaches use Convolutional Neural Networks or Recurrent Neural Networks to learn local energy-based feature representations of speech segments, but they struggle to capture global information such as the duration of energy in speech. Some approaches use Transformers to capture global information, but leave room for improvement in parameter count and performance. Furthermore, existing attention mechanisms focus on the spatial or channel dimensions, hindering the learning of important temporal information in speech. In this paper, to model local and global information at different levels of granularity and to capture temporal, spatial, and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on a CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks captures local information in speech from a time-frequency perspective. In addition, a time-channel-space attention mechanism enhances features across these three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on the IEMOCAP and Emo-DB datasets and show that our approach significantly improves performance over state-of-the-art methods.
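To make the described pipeline concrete, below is a minimal PyTorch sketch, not the authors' implementation: a small CNN front-end over log-Mel spectrograms, a simple channel/time-frequency gating module standing in for the multi-dimensional attention, a large-kernel depthwise separable convolution, and a lightweight Transformer encoder for global temporal context. All module names, layer sizes, and shapes are illustrative assumptions rather than values from the paper.

```python
# Hypothetical sketch of a CNN-Transformer SER model with multi-dimensional
# attention; sizes and structure are assumptions for illustration only.
import torch
import torch.nn as nn


class MultiDimAttention(nn.Module):
    """Gates features along the channel and time-frequency dimensions."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
        )
        # Pool over channels, then gate each (time, frequency) position.
        self.spatial_gate = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):  # x: (batch, channels, time, freq)
        x = x * self.channel_gate(x)
        x = x * self.spatial_gate(x.mean(dim=1, keepdim=True))
        return x


class CNNTransformerSER(nn.Module):
    def __init__(self, n_classes: int = 4, channels: int = 64, n_mels: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, channels, 3, stride=2, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            # Large-kernel depthwise separable convolution for wider local context.
            nn.Conv2d(channels, channels, 7, padding=3, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            MultiDimAttention(channels),
        )
        d_model = channels * (n_mels // 2)  # frequency axis halved by the stride-2 conv
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=2 * d_model, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, spec):  # spec: (batch, 1, time, n_mels) log-Mel spectrogram
        feats = self.cnn(spec)                                    # (B, C, T', F')
        b, c, t, f = feats.shape
        tokens = feats.permute(0, 2, 1, 3).reshape(b, t, c * f)   # time-major tokens
        ctx = self.transformer(tokens)                            # global temporal context
        return self.classifier(ctx.mean(dim=1))                   # utterance-level logits


if __name__ == "__main__":
    model = CNNTransformerSER()
    logits = model(torch.randn(2, 1, 128, 64))  # 2 utterances, 128 frames, 64 Mel bins
    print(logits.shape)                         # torch.Size([2, 4])
```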
