
Evaluating raw waveforms with deep learning frameworks for speech emotion recognition (2307.02820v1)

Published 6 Jul 2023 in cs.SD, cs.AI, and eess.AS

Abstract: Speech emotion recognition is a challenging task in speech processing, which makes the feature extraction stage crucial for representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks, without any feature extraction stage, to recognize emotions on six data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely the mel-scale spectrogram and mel-frequency cepstral coefficients, are combined with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques. Support vector machines, decision trees, naive Bayes, and random forests are evaluated as machine learning algorithms, while majority voting and stacking are assessed as ensemble learning techniques. Moreover, convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with the machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, a comparison with state-of-the-art studies is carried out. Based on the experimental results, the CNN model surpasses existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, thereby setting a new state of the art. In speaker-independent audio classification, the proposed approach achieves 90.34% accuracy on EMO-DB (CNN), 90.42% on RAVDESS (CNN), 99.48% on TESS (LSTM), 69.72% on CREMA (CNN), and 85.76% on SAVEE (CNN).
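The paper's core idea is to skip hand-crafted features such as MFCCs and mel-spectrograms and let a 1D convolutional network learn directly from waveform samples. The abstract does not specify the exact architecture, so the following is only a minimal PyTorch sketch of that general approach; the layer widths, the 16 kHz / 3-second input, and the eight-class output (matching RAVDESS's emotion labels) are illustrative assumptions, not the authors' configuration.

    import torch
    import torch.nn as nn

    class RawWaveformCNN(nn.Module):
        """Hypothetical 1D CNN classifying emotions from raw audio samples."""
        def __init__(self, num_classes: int = 8):
            super().__init__()
            self.features = nn.Sequential(
                # A wide first kernel acts as a learned filterbank over the raw
                # signal, replacing the fixed mel filterbank of an MFCC front end.
                nn.Conv1d(1, 64, kernel_size=80, stride=4),
                nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(4),
                nn.Conv1d(64, 128, kernel_size=3),
                nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(4),
                nn.Conv1d(128, 256, kernel_size=3),
                nn.BatchNorm1d(256), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # collapse time axis to a fixed embedding
            )
            self.classifier = nn.Linear(256, num_classes)

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, 1, samples), e.g. 3 s at 16 kHz = 48000 samples
            return self.classifier(self.features(waveform).flatten(1))

    model = RawWaveformCNN(num_classes=8)
    logits = model(torch.randn(2, 1, 48000))  # -> shape (2, 8)

Swapping the final pooling-plus-linear head for an LSTM over the convolutional feature sequence would give a hybrid CNN-LSTM variant of the kind the abstract also evaluates.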

