Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition (2401.11017v1)

Published 19 Jan 2024 in eess.AS, cs.LG, and cs.SD

Abstract: Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters. By conducting a thorough clustering analysis, we demonstrate that emotion information can be readily extracted from speaker embeddings. To leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition. The proposed approach samples positive and negative examples based on the intra-speaker clusters of speaker embeddings. The proposed strategy, which leverages extensive emotion-unlabeled data, leads to a significant improvement in SER performance, whether employed as a standalone pretraining task or integrated into a multi-task pretraining setting.
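The abstract only states that positives and negatives are drawn from intra-speaker clusters of speaker embeddings; it does not specify the clusterer or the sampling details. Below is a minimal sketch of that idea, assuming k-means as the clustering algorithm, a cluster count of 4, and synthetic 192-dimensional embeddings (ECAPA-TDNN-sized) as stand-ins; all of these are illustrative assumptions, not the authors' implementation.

```python
# Sketch: cluster each speaker's embeddings, then draw contrastive triplets
# so the positive shares the anchor's intra-speaker cluster and the negative
# comes from a different cluster of the SAME speaker. The contrast therefore
# targets (putative) emotion rather than speaker identity.
import numpy as np
from sklearn.cluster import KMeans

def intra_speaker_clusters(embeddings_by_speaker, n_clusters=4, seed=0):
    """Cluster each speaker's embeddings independently.

    embeddings_by_speaker: dict of speaker id -> (N_i, D) embedding array.
    Returns: dict of speaker id -> (N_i,) cluster labels.
    """
    labels = {}
    for spk, emb in embeddings_by_speaker.items():
        k = min(n_clusters, len(emb))  # guard speakers with few utterances
        labels[spk] = KMeans(n_clusters=k, n_init=10,
                             random_state=seed).fit_predict(emb)
    return labels

def sample_triplet(embeddings_by_speaker, labels, rng):
    """Draw (anchor, positive, negative) embeddings from one speaker."""
    speakers = list(embeddings_by_speaker)
    while True:  # retry until a speaker yields a valid triplet
        spk = speakers[rng.integers(len(speakers))]
        lab = labels[spk]
        anchor = rng.integers(len(lab))
        pos_pool = np.flatnonzero(lab == lab[anchor])
        pos_pool = pos_pool[pos_pool != anchor]   # exclude the anchor itself
        neg_pool = np.flatnonzero(lab != lab[anchor])
        if len(pos_pool) and len(neg_pool):
            emb = embeddings_by_speaker[spk]
            return (emb[anchor],
                    emb[rng.choice(pos_pool)],
                    emb[rng.choice(neg_pool)])

# Toy usage with fake embeddings; real inputs would be pretrained
# speaker embeddings extracted from emotion-unlabeled speech.
rng = np.random.default_rng(0)
data = {f"spk{i}": rng.normal(size=(40, 192)) for i in range(8)}
cluster_labels = intra_speaker_clusters(data)
anchor, positive, negative = sample_triplet(data, cluster_labels, rng)
```

In the paper's setup, such pairs would feed a contrastive pretraining objective over emotion-unlabeled data; the triplet form above is just one simple way to realize positive/negative sampling from the intra-speaker clusters.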

Authors (4)
  1. Ismail Rasim Ulgen (6 papers)
  2. Zongyang Du (7 papers)
  3. Carlos Busso (25 papers)
  4. Berrak Sisman (49 papers)