GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition (2306.07848v10)
Abstract: Contrastive cross-modality pretraining has recently shown impressive success in diverse fields, yet its merits for speech emotion recognition (SER) remain largely unexplored. In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER, built on pre-trained text and audio encoders. Second, given the importance of gender information in SER, we further propose two variants, a multi-task-learning-based GEmo-CLAP (ML-GEmo-CLAP) and a soft-label-based GEmo-CLAP (SL-GEmo-CLAP), which incorporate the gender attribute of speech signals to form more informative training objectives. Experiments on IEMOCAP show that both GEmo-CLAP variants consistently outperform Emo-CLAP across different pre-trained models. Notably, the proposed WavLM-based SL-GEmo-CLAP achieves the best weighted average recall (WAR) of 83.16%, surpassing state-of-the-art SER methods.
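To make the training objectives sketched in the abstract more concrete, the snippet below gives a minimal, illustrative PyTorch sketch of a CLAP-style symmetric contrastive loss between audio and text embeddings (as produced by pre-trained encoders such as WavLM and RoBERTa), together with a hypothetical soft-label variant in which samples sharing the same emotion, and to a lesser degree the same gender, soften the one-hot targets. The function names, the `gender_weight` parameter, and the specific weighting scheme are assumptions made for illustration; this is not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): a CLAP-style contrastive
# objective between audio and text embeddings, plus a hypothetical
# soft-label variant in which gender agreement softens the targets,
# loosely following the SL-GEmo-CLAP idea described in the abstract.
import torch
import torch.nn.functional as F


def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                       # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


def soft_label_contrastive_loss(audio_emb, text_emb, emotion_ids, gender_ids,
                                temperature=0.07, gender_weight=0.1):
    """Hypothetical soft-label variant: samples sharing the same emotion (and,
    with a smaller weight, the same gender) contribute to the target
    distribution instead of a strict one-hot diagonal."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    same_emotion = (emotion_ids[:, None] == emotion_ids[None, :]).float()
    same_gender = (gender_ids[:, None] == gender_ids[None, :]).float()
    soft = same_emotion + gender_weight * same_gender
    soft = soft / soft.sum(dim=-1, keepdim=True)          # row-normalized soft targets
    log_p = F.log_softmax(logits, dim=-1)
    return -(soft * log_p).sum(dim=-1).mean()
```

In a full training loop, `audio_emb` and `text_emb` would come from the respective pre-trained (frozen or fine-tuned) encoders, with the emotion label rendered as a text prompt on the text side, as is typical for CLAP-based classification.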
- A. Ando, R. Masumura, H. Kamiyama, et al., “Customer satisfaction estimation in contact center calls based on a hierarchical multi-task model,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 715–728, 2020.
- A. Aftab, A. Morsali, S. Ghaemmaghami, et al., “Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6912–6916.
- J. Ye, X.-C. Wen, Y. Wei, et al., “Temporal modeling matters: A novel temporal emotional modeling approach for speech emotion recognition,” 2023.
- L. Pepino, P. Riera, and L. Ferrer, “Emotion recognition from speech using wav2vec 2.0 embeddings,” arXiv preprint arXiv:2104.03502, 2021.
- L.-W. Chen and A. Rudnicky, “Exploring wav2vec 2.0 fine tuning for improved speech emotion recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- E. Morais, R. Hoory, W. Zhu, et al., “Speech emotion recognition using self-supervised features,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6922–6926.
- I. Gat, H. Aronowitz, W. Zhu, et al., “Speaker normalization for self-supervised speech emotion recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7342–7346.
- A. Baevski, Y. Zhou, A. Mohamed, et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
- W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- C. Busso, M. Bulut, C.-C. Lee, et al., “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, pp. 335–359, 2008.
- Y. Liu, M. Ott, N. Goyal, et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- S. Chen, C. Wang, Z. Chen, et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- A. Baevski, W.-N. Hsu, Q. Xu, et al., “Data2vec: A general framework for self-supervised learning in speech, vision and language,” in International Conference on Machine Learning. PMLR, 2022, pp. 1298–1312.
- Y. Pan, Y. Yang, Y. Huang, et al., “MSAC: Multiple speech attribute control method for reliable speech emotion recognition,” arXiv preprint arXiv:2308.04025, 2023.
- A. Radford, J. W. Kim, C. Hallacy, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- B. Elizalde, S. Deshmukh, M. Al Ismail, et al., “CLAP: Learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- Y. Wu, K. Chen, T. Zhang, et al., “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- Y. Meng, X. Li, Z. Wu, et al., “CALM: Contrastive cross-modal speaking style modeling for expressive text-to-speech synthesis,” Proc. Interspeech 2022, pp. 5533–5537, 2022.
- W. Chen, X. Xing, X. Xu, et al., “Key-sparse transformer for multimodal speech emotion recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6897–6901.
- D. Sun, Y. He, and J. Han, “Using auxiliary tasks in multimodal fusion of wav2vec 2.0 and bert for multimodal emotion recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- Z. Zhao, Y. Wang, and Y. Wang, “Knowledge-aware bayesian co-attention for multimodal emotion recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- S. Ghosh, U. Tyagi, S. Ramaneswaran, et al., “MMER: Multimodal multi-task learning for speech emotion recognition,” arXiv preprint arXiv:2203.16794, 2022.
- Y. Li, T. Zhao, T. Kawahara, et al., “Improved end-to-end speech emotion recognition using self attention mechanism and multitask learning.” in Interspeech, 2019, pp. 2803–2807.
- Y. Liu, H. Sun, W. Guan, et al., “A discriminative feature representation method based on cascaded attention network with adversarial strategy for speech emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1063–1074, 2023.
Authors: Yu Pan, Yanni Hu, Yuguang Yang, Wen Fei, Jixun Yao, Heng Lu, Lei Ma, Jianjun Zhao