Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations (2309.04849v2)
Abstract: We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only the speech signal stream to perform unimodal SER, reducing computational overhead and avoiding run-time transcription and prosodic feature-extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, achieving state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
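To make the training objective concrete: the student is supervised by the task labels while simultaneously matching each teacher at two levels, a softened KL divergence on logits (standard Hinton-style knowledge distillation) and a feature-matching loss on penultimate embeddings. The sketch below, in PyTorch, illustrates what such a combined loss could look like; the loss weights, temperature, and choice of cosine distance for embedding matching are illustrative assumptions, not the paper's exact formulation, and any dimension mismatch between student and teacher embeddings is assumed to be handled by projection layers omitted here.

```python
import torch
import torch.nn.functional as F

def emodistill_loss(student_logits, student_emb,
                    ling_logits, ling_emb,
                    pros_logits, pros_emb,
                    labels, T=2.0, alpha=1.0, beta=1.0):
    """Hypothetical combined loss: task cross-entropy + logit-level KD
    from two frozen teachers + embedding-level KD. Weights (alpha, beta)
    and temperature T are assumptions, not the paper's values."""
    # Supervised task loss on the student's own predictions.
    ce = F.cross_entropy(student_logits, labels)

    # Logit-level distillation: softened KL divergence against each
    # teacher's output distribution, scaled by T^2 as in standard KD.
    kd = 0.0
    for t_logits in (ling_logits, pros_logits):
        kd = kd + F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(t_logits.detach() / T, dim=-1),
            reduction="batchmean") * (T * T)

    # Embedding-level distillation: pull the student's embedding toward
    # each teacher's embedding (cosine distance here; the exact metric
    # is an illustrative choice).
    emb = 0.0
    for t_emb in (ling_emb, pros_emb):
        emb = emb + (1.0 - F.cosine_similarity(
            student_emb, t_emb.detach(), dim=-1)).mean()

    return ce + alpha * kd + beta * emb
```

Because both teachers are only consulted through this loss, they can be discarded after training, which is what allows the student to run on the speech stream alone at inference time.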