
End-to-end transfer learning for speaker-independent cross-language and cross-corpus speech emotion recognition (2311.13678v2)

Published 22 Nov 2023 in eess.AS

Abstract: Data-driven models achieve strong results in Speech Emotion Recognition (SER). However, these models, which are often based on general acoustic feature sets or end-to-end approaches, perform poorly when the test set is in a different language than the training set or when the two sets are drawn from different corpora. To alleviate these problems, this paper presents an end-to-end Deep Neural Network (DNN) model based on transfer learning for cross-language and cross-corpus SER. We use the pre-trained wav2vec 2.0 model to transform audio time-domain waveforms from different languages, speakers, and recording conditions into a feature space shared by multiple languages, thereby reducing language variability in the speech embeddings. Next, we propose a new Deep-Within-Class Covariance Normalisation (Deep-WCCN) layer that can be inserted into the DNN model to reduce other sources of variability, including speaker and channel variability. The entire model is fine-tuned in an end-to-end manner on a combined loss and is validated on datasets from three languages (English, German, and Chinese). Experimental results show that our proposed method outperforms the baseline model based on common acoustic feature sets for SER in both the within-language and cross-language settings. In addition, we experimentally validate the effectiveness of Deep-WCCN, which further improves model performance. We then show that the proposed transfer learning method is data-efficient when target-language data is merged into the fine-tuning process: speaker-independent SER performance increases by up to 15.6% when only 160 s of target-language data is used. Finally, our proposed model shows significantly better performance than other state-of-the-art models in cross-language SER.
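
The abstract describes a pipeline of a pre-trained wav2vec 2.0 encoder, a within-class covariance normalisation step over the resulting embeddings, and an emotion classification head, fine-tuned end-to-end. Below is a minimal PyTorch sketch of that kind of pipeline, not the authors' implementation: the checkpoint name (facebook/wav2vec2-large-xlsr-53), the mean pooling, the offline batch-level covariance estimate, and the four-class emotion set are illustrative assumptions, and the paper's combined loss is not reproduced here.

```python
# Minimal sketch (not the authors' implementation) of a wav2vec 2.0 based SER model
# with a WCCN-style whitening of the pooled embeddings. Checkpoint name, pooling,
# and the 4-class emotion set are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class WCCNLayer(nn.Module):
    """Whitens embeddings with the inverse within-class covariance,
    estimated here offline from a set of labelled embeddings."""

    def __init__(self, dim, eps=1e-3):
        super().__init__()
        self.eps = eps
        self.register_buffer("whitener", torch.eye(dim))

    @torch.no_grad()
    def estimate(self, embeddings, labels):
        # Average the per-class covariances, then use the Cholesky factor of the
        # inverse as a linear whitening transform (the classic WCCN recipe).
        dim = embeddings.size(1)
        cov = torch.zeros(dim, dim, device=embeddings.device)
        classes = labels.unique()
        for c in classes:
            x = embeddings[labels == c]
            x = x - x.mean(dim=0, keepdim=True)
            cov += x.T @ x / max(x.size(0) - 1, 1)
        cov = cov / len(classes) + self.eps * torch.eye(dim, device=embeddings.device)
        self.whitener = torch.linalg.cholesky(torch.linalg.inv(cov))

    def forward(self, x):
        return x @ self.whitener


class EmotionClassifier(nn.Module):
    def __init__(self, num_emotions=4):
        super().__init__()
        # Cross-lingual wav2vec 2.0 encoder; any wav2vec 2.0 checkpoint would fit here.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
        hidden = self.encoder.config.hidden_size
        self.wccn = WCCNLayer(hidden)
        self.head = nn.Linear(hidden, num_emotions)

    def forward(self, waveform):                           # waveform: (batch, samples) at 16 kHz
        frames = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                        # utterance-level embedding
        return self.head(self.wccn(pooled))                # emotion logits
```

In the paper's setup the normalisation is a layer trained inside the network and the whole model is fine-tuned on a combined loss; this sketch only illustrates the overall data flow from waveform to emotion logits.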

Authors (4)
  1. Duowei Tang (1 paper)
  2. Peter Kuppens (1 paper)
  3. Luc Geurts (1 paper)
  4. Toon van Waterschoot (37 papers)
