
Improving Speaker-independent Speech Emotion Recognition Using Dynamic Joint Distribution Adaptation (2401.09752v1)

Published 18 Jan 2024 in cs.SD, cs.LG, and eess.AS

Abstract: In speaker-independent speech emotion recognition, the training and testing samples are collected from diverse speakers, leading to a multi-domain shift across the feature distributions of data from different speakers. Consequently, when the trained model is confronted with data from new speakers, its performance tends to degrade. To address this issue, we propose a Dynamic Joint Distribution Adaptation (DJDA) method under the framework of multi-source domain adaptation. DJDA first uses joint distribution adaptation (JDA), comprising marginal distribution adaptation (MDA) and conditional distribution adaptation (CDA), to measure more precisely the multi-domain distribution shifts caused by different speakers. This helps eliminate speaker bias in emotion features and allows the model to learn discriminative, speaker-invariant speech emotion features from the coarse level to the fine level. Furthermore, we quantify the adaptation contributions of MDA and CDA within JDA using a dynamic balance factor based on the $\mathcal{A}$-Distance, which promotes effective handling of the unknown distributions encountered in data from new speakers. Experimental results demonstrate the superior performance of DJDA compared to other state-of-the-art (SOTA) methods.
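
The combination of marginal and conditional adaptation weighted by an $\mathcal{A}$-Distance-based balance factor can be illustrated with a small sketch. The Python code below is not the authors' implementation: the function names (mmd, a_distance, dynamic_mu, jda_loss), the MMD discrepancy, the proxy A-distance classifier, and the simplified balance-factor estimate are assumptions based on the standard dynamic distribution adaptation formulation, shown here only to make the abstract's description concrete.

```python
# Illustrative sketch (not the paper's code) of joint distribution adaptation
# with a dynamic balance factor estimated from a proxy A-distance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def mmd(X, Y, gamma=1.0):
    """Squared MMD between two feature sets with an RBF kernel."""
    def k(A, B):
        d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()


def a_distance(X_src, X_tgt):
    """Proxy A-distance: 2 * (1 - 2 * err) of a source-vs-target classifier."""
    X = np.vstack([X_src, X_tgt])
    y = np.hstack([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    err = 1.0 - clf.score(Xte, yte)
    return 2.0 * (1.0 - 2.0 * err)


def dynamic_mu(Xs, ys, Xt, yt_pseudo, n_classes):
    """Balance factor between marginal and conditional adaptation (simplified)."""
    d_m = a_distance(Xs, Xt)                       # marginal (speaker-level) shift
    d_c = [a_distance(Xs[ys == c], Xt[yt_pseudo == c])
           for c in range(n_classes)
           if (ys == c).sum() > 1 and (yt_pseudo == c).sum() > 1]
    d_c = np.mean(d_c) if d_c else d_m             # per-emotion-class shift
    return d_c / (d_m + d_c + 1e-8)                # larger conditional shift -> larger mu


def jda_loss(Xs, ys, Xt, yt_pseudo, n_classes):
    """Joint discrepancy: (1 - mu) * MDA + mu * CDA."""
    mu = dynamic_mu(Xs, ys, Xt, yt_pseudo, n_classes)
    mda = mmd(Xs, Xt)
    cdas = [mmd(Xs[ys == c], Xt[yt_pseudo == c])
            for c in range(n_classes)
            if (ys == c).sum() > 1 and (yt_pseudo == c).sum() > 1]
    cda = np.mean(cdas) if cdas else 0.0
    return (1 - mu) * mda + mu * cda, mu
```

In the paper's setting the discrepancy terms would be differentiable losses over deep network features and the target emotion labels would be pseudo-labels from the classifier; the numpy version above only illustrates how the balance factor shifts weight toward whichever discrepancy, marginal or conditional, the estimated $\mathcal{A}$-Distance indicates is larger.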
