
Speech Emotion Diarization: Which Emotion Appears When? (2306.12991v2)

Published 22 Jun 2023 in cs.CL

Abstract: Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate performance evaluation and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually annotated boundaries of the emotion segments within each utterance. We provide competitive baselines and open-source the code and the pre-trained models.
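The abstract frames SED output as discrete emotion events with start and end times rather than a single utterance-level label. As a rough illustration of that output format only (a minimal sketch, not the authors' released implementation), the snippet below merges hypothetical frame-level emotion predictions into time-stamped segments; the 20 ms frame hop, the label values, and the helper name frames_to_segments are assumptions made for the example.

# Minimal sketch: collapse per-frame emotion labels into (start, end, emotion)
# segments, i.e. the kind of answer "which emotion appears when?" expects.
# Assumption (not from the paper): predictions arrive at a fixed 20 ms hop.
from itertools import groupby

FRAME_HOP_S = 0.02  # assumed time step between consecutive frame predictions

def frames_to_segments(frame_labels, hop=FRAME_HOP_S):
    """Collapse runs of identical frame labels into time-stamped segments."""
    segments = []
    t = 0.0
    for label, run in groupby(frame_labels):
        n = len(list(run))  # number of consecutive frames with this label
        segments.append((round(t, 3), round(t + n * hop, 3), label))
        t += n * hop
    return segments

# Hypothetical predictions: neutral speech, a burst of anger, then neutral again.
frame_labels = ["neutral"] * 50 + ["angry"] * 30 + ["neutral"] * 20
for start, end, emotion in frames_to_segments(frame_labels):
    print(f"{start:4.2f}s - {end:4.2f}s : {emotion}")
# 0.00s - 1.00s : neutral
# 1.00s - 1.60s : angry
# 1.60s - 2.00s : neutral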
