Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition (2310.03992v1)
Abstract: In this paper, we propose a new unsupervised domain adaptation (DA) method called layer-adapted implicit distribution alignment networks (LIDAN) to address the challenge of cross-corpus speech emotion recognition (SER). LIDAN extends our previous ICASSP work, deep implicit distribution alignment networks (DIDAN), whose key contribution is a novel regularization term called implicit distribution alignment (IDA). This term allows a DIDAN trained on source (training) speech samples to remain applicable to predicting emotion labels for target (testing) speech samples, regardless of corpus variance in cross-corpus SER. To further enhance this method, we extend IDA to layer-adapted IDA (LIDA), resulting in LIDAN. This layer-adapted extension consists of three modified IDA terms that consider emotion labels at different levels of granularity. These terms are strategically placed at different fully connected layers in LIDAN, matching the increasing emotion-discriminative ability of deeper layers. This arrangement enables LIDAN to learn emotion-discriminative and corpus-invariant features for SER across various corpora more effectively than DIDAN. It is also worth noting that, unlike most existing methods that estimate statistical moments of pre-assumed explicit distributions, both IDA and LIDA take a different approach: they use the idea of target sample reconstruction to directly bridge the feature distribution gap without assuming a particular distribution type. Consequently, DIDAN and LIDAN can be viewed as implicit cross-corpus SER methods. To evaluate LIDAN, we conducted extensive cross-corpus SER experiments on the EmoDB, eNTERFACE, and CASIA corpora. The experimental results demonstrate that LIDAN surpasses recent state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER tasks.
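The abstract describes IDA as aligning source and target feature distributions through target sample reconstruction rather than moment matching, and LIDA as applying such terms at several fully connected layers. The PyTorch sketch below is only an illustration of that idea under stated assumptions: the similarity-softmax reconstruction weights, the specific layer placement, and the `layer_weights` trade-offs are hypothetical choices for exposition and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IDALoss(nn.Module):
    """Sketch of an IDA-style regularizer: each target-domain feature is
    reconstructed as a weighted combination of source-domain features and the
    residual is penalized, so no parametric (e.g. Gaussian) form of either
    distribution is assumed. The paper's exact objective may differ."""

    def forward(self, src_feats: torch.Tensor, tgt_feats: torch.Tensor) -> torch.Tensor:
        # src_feats: (Ns, d), tgt_feats: (Nt, d)
        # Soft reconstruction weights from scaled feature similarity
        # (an assumption, not necessarily the paper's coefficient model).
        sim = tgt_feats @ src_feats.t() / src_feats.size(1) ** 0.5   # (Nt, Ns)
        weights = F.softmax(sim, dim=1)
        recon = weights @ src_feats                                   # (Nt, d)
        return F.mse_loss(recon, tgt_feats)


class LIDANSketch(nn.Module):
    """Toy LIDAN-like model: stacked fully connected layers shared by both
    domains, with an IDA-style term attached at several depths."""

    def __init__(self, in_dim: int = 384, hid: int = 128, n_classes: int = 5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hid)
        self.fc2 = nn.Linear(hid, hid)
        self.fc3 = nn.Linear(hid, n_classes)
        self.ida = IDALoss()

    def forward(self, x_src, y_src, x_tgt, layer_weights=(0.1, 0.1, 0.1)):
        # Shared forward pass for source and target batches.
        hs1, ht1 = torch.relu(self.fc1(x_src)), torch.relu(self.fc1(x_tgt))
        hs2, ht2 = torch.relu(self.fc2(hs1)), torch.relu(self.fc2(ht1))
        logits_s, logits_t = self.fc3(hs2), self.fc3(ht2)

        # Supervised loss on labelled source samples only (target is unlabelled).
        cls_loss = F.cross_entropy(logits_s, y_src)

        # Layer-adapted alignment: one IDA-style term per fully connected layer.
        align = (layer_weights[0] * self.ida(hs1, ht1)
                 + layer_weights[1] * self.ida(hs2, ht2)
                 + layer_weights[2] * self.ida(logits_s, logits_t))
        return cls_loss + align, logits_t
```

A typical unsupervised DA training loop would feed labelled source utterance features and unlabelled target features into `LIDANSketch.forward` and back-propagate the combined loss; the per-layer weights would need tuning per corpus pair.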