
Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations (2309.04849v2)

Published 9 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method only uses a stream of speech signals to perform unimodal SER, thus reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
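The training recipe described in the abstract — a supervised SER loss combined with embedding-level and logit-level distillation from a Prosodic and a Linguistic teacher — can be sketched as a single combined loss. The sketch below is illustrative only: the function name `emodistill_loss`, the MSE and temperature-softened-KL loss choices, the temperature value, and the equal loss weighting are assumptions rather than the paper's reported configuration, and it assumes teacher and student embeddings have already been projected to a shared dimension.

```python
import torch
import torch.nn.functional as F


def emodistill_loss(student_emb, student_logits,
                    teacher_embs, teacher_logits,
                    labels, temperature=2.0):
    """Task loss + logit- and embedding-level distillation from two teachers.

    student_emb    : (batch, dim) student utterance embedding
    student_logits : (batch, num_classes) student class logits
    teacher_embs   : list of (batch, dim) teacher embeddings
    teacher_logits : list of (batch, num_classes) teacher logits
    labels         : (batch,) integer emotion labels
    """
    # Supervised cross-entropy on the student's own predictions.
    task_loss = F.cross_entropy(student_logits, labels)

    # Logit-level distillation: temperature-softened KL divergence
    # (Hinton et al., 2015), summed over the two teachers.
    logit_loss = sum(
        F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(t / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        for t in teacher_logits
    )

    # Embedding-level distillation: pull the student embedding toward each
    # teacher's embedding. MSE is an assumed choice of distance here.
    emb_loss = sum(F.mse_loss(student_emb, e) for e in teacher_embs)

    # Equal weighting is an assumption; in practice these are hyperparameters.
    return task_loss + logit_loss + emb_loss


# Toy usage with random tensors: 4 emotion classes, as in common IEMOCAP setups.
if __name__ == "__main__":
    B, D, C = 8, 256, 4
    s_emb, s_log = torch.randn(B, D), torch.randn(B, C)
    t_embs = [torch.randn(B, D), torch.randn(B, D)]   # prosodic, linguistic teachers
    t_logs = [torch.randn(B, C), torch.randn(B, C)]
    y = torch.randint(0, C, (B,))
    print(emodistill_loss(s_emb, s_log, t_embs, t_logs, y))
```

Note that only the student branch is needed at inference time, which is what gives the method its unimodal, transcription-free deployment path.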

