Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer
Abstract: Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition (SER) tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information, respectively. Additionally, a decoder reconstructs speech signals by conditioning on these two information streams in an unsupervised fashion. Once trained, EmoAug enriches the expressivity of emotional speech with different prosodic attributes, such as stress, rhythm, and intensity, by feeding different styles into the paralinguistic encoder. EmoAug also enables us to generate a comparable number of samples for each class, addressing data imbalance. Experimental results on the IEMOCAP dataset demonstrate that EmoAug successfully transfers different speaking styles while retaining speaker identity and semantic content. Furthermore, we train an SER model on data augmented by EmoAug and show that the augmented model not only surpasses state-of-the-art supervised and self-supervised methods but also overcomes overfitting caused by data imbalance. Some audio samples can be found on our demo website.
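To make the two-stream design described in the abstract concrete, the sketch below shows one way such a semantic/paralinguistic encoder pair and reconstruction decoder could be wired up. It is a minimal illustration, not the authors' implementation: all module choices, dimensions, the use of mel spectrograms, and the L1 reconstruction loss are assumptions for exposition.

```python
# Minimal sketch of the two-stream encoder/decoder idea in EmoAug's abstract.
# All modules and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class EmoAugSketch(nn.Module):
    def __init__(self, n_mels=80, sem_dim=256, para_dim=128):
        super().__init__()
        # Semantic encoder: frame-level verbal (linguistic) content.
        self.semantic_encoder = nn.GRU(n_mels, sem_dim, batch_first=True)
        # Paralinguistic encoder: utterance-level non-verbal style
        # (stress, rhythm, intensity).
        self.paralinguistic_encoder = nn.GRU(n_mels, para_dim, batch_first=True)
        # Decoder: reconstructs the spectrogram conditioned on both streams.
        self.decoder = nn.GRU(sem_dim + para_dim, sem_dim, batch_first=True)
        self.out = nn.Linear(sem_dim, n_mels)

    def forward(self, content_mels, style_mels):
        sem, _ = self.semantic_encoder(content_mels)        # (B, T, sem_dim)
        _, style = self.paralinguistic_encoder(style_mels)  # (1, B, para_dim)
        # Broadcast the utterance-level style vector across all frames.
        style = style[-1].unsqueeze(1).expand(-1, sem.size(1), -1)
        dec, _ = self.decoder(torch.cat([sem, style], dim=-1))
        return self.out(dec)                                 # (B, T, n_mels)

# Unsupervised training: reconstruct an utterance from itself (content == style).
# At augmentation time, a different style utterance would be fed to transfer prosody.
model = EmoAugSketch()
mels = torch.randn(4, 120, 80)             # batch of mel spectrograms (B, T, n_mels)
recon = model(mels, mels)                   # self-reconstruction pass
loss = nn.functional.l1_loss(recon, mels)
```

Under this reading, augmenting an emotion class amounts to pairing its content utterances with many different style utterances at inference time, which is how roughly equal sample counts per class could be produced.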