Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer (2403.17327v1)

Published 26 Mar 2024 in cs.SD, cs.CV, and eess.AS

Abstract: In this paper, we propose a method to improve the accuracy of speech emotion recognition (SER) by using a vision transformer (ViT) to attend to the correlation between frequency (y-axis) and time (x-axis) in a spectrogram, and by transferring positional information between ViTs through knowledge transfer. The proposed method is original in the following respects: i) We use vertically segmented patches of the log-Mel spectrogram to analyze the correlation of frequencies over time. This type of patch allows us to correlate the frequencies most relevant to a particular emotion with the times at which they were uttered. ii) We propose image coordinate encoding, an absolute positional encoding suited to ViT. By normalizing the x and y coordinates of the image to the range [-1, 1] and concatenating them to the image, we can provide valid absolute positional information to the ViT. iii) Through feature map matching, the locality and positional information of the teacher network are effectively transmitted to the student network. The teacher network is a ViT that captures locality through a convolutional stem and absolute positional information through image coordinate encoding, while the student network is a basic ViT without positional encoding. In the feature map matching stage, we train with the mean absolute error (L1 loss) to minimize the difference between the feature maps of the two networks. To validate the proposed method, three speech emotion datasets (SAVEE, EmoDB, and CREMA-D) were converted into log-Mel spectrograms for comparison experiments. The experimental results show that the proposed method significantly outperforms state-of-the-art methods in weighted accuracy while requiring significantly fewer floating-point operations (FLOPs). Overall, the proposed method offers a promising solution for SER by providing improved efficiency and performance.
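
The abstract describes three mechanisms that lend themselves to a concrete illustration: full-height vertical patching of the log-Mel spectrogram, image coordinate encoding, and L1 feature-map matching for knowledge transfer. Below is a minimal PyTorch sketch of these ideas. It is not the authors' implementation; all function names, tensor shapes, and the patch width are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def add_coordinate_channels(spec: torch.Tensor) -> torch.Tensor:
    """Image coordinate encoding (assumed form): concatenate x/y coordinate
    channels, normalized to [-1, 1], onto a (B, 1, H, W) log-Mel spectrogram."""
    b, _, h, w = spec.shape
    ys = torch.linspace(-1.0, 1.0, h, device=spec.device)  # frequency (y) axis
    xs = torch.linspace(-1.0, 1.0, w, device=spec.device)  # time (x) axis
    y_ch = ys.view(1, 1, h, 1).expand(b, 1, h, w)
    x_ch = xs.view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([spec, x_ch, y_ch], dim=1)  # (B, 3, H, W)

def vertical_patch_tokens(spec: torch.Tensor, patch_w: int = 16) -> torch.Tensor:
    """Vertically segmented patches: each token spans the full frequency range
    (height H) over a narrow time window of `patch_w` frames."""
    b, c, h, w = spec.shape
    n = w // patch_w
    patches = spec[..., : n * patch_w].unfold(3, patch_w, patch_w)  # (B, C, H, N, patch_w)
    return patches.permute(0, 3, 1, 2, 4).reshape(b, n, c * h * patch_w)  # (B, N, D)

def feature_map_matching_loss(student_feats, teacher_feats):
    """Knowledge transfer via feature map matching: mean absolute error (L1)
    between corresponding student and teacher feature maps, teacher frozen."""
    return sum(F.l1_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))

# Example: 128 Mel bins x 256 frames -> 16 tokens, each covering all frequencies
spec = torch.randn(8, 1, 128, 256)
tokens = vertical_patch_tokens(add_coordinate_channels(spec))
print(tokens.shape)  # torch.Size([8, 16, 6144])
```

With full-height patches, attention between tokens compares entire frequency profiles at different time steps, which is the temporal-frequency correlation the paper targets.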

