Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers (2307.06090v3)

Published 12 Jul 2023 in cs.SD and eess.AS

Abstract: Despite recent advancements in speech emotion recognition (SER) models, state-of-the-art deep learning (DL) approaches face the challenge of limited availability of annotated data. LLMs have revolutionised our understanding of natural language, introducing emergent properties that broaden comprehension in language, speech, and vision. This paper examines the potential of LLMs to annotate abundant speech data, aiming to enhance the state of the art in SER. We evaluate this capability across various settings using publicly available speech emotion classification datasets. Leveraging ChatGPT, we experimentally demonstrate the promising role of LLMs in speech emotion data annotation. Our evaluation encompasses single-shot and few-shot scenarios, revealing performance variability in SER. Notably, we achieve improved results through data augmentation, incorporating ChatGPT-annotated samples into existing datasets. Our work uncovers new frontiers in speech emotion classification, highlighting the increasing significance of LLMs in this field moving forward.
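As a rough sketch of the annotation workflow the abstract outlines, the Python snippet below shows how a few-shot prompt for emotion labelling might be assembled and how the returned labels could be filtered before being used for data augmentation. The four-class label set, the toy example utterances, and the query_llm helper (a stand-in for a ChatGPT-style API call) are illustrative assumptions, not details taken from the paper.

# Minimal sketch (not the authors' code) of LLM-based speech emotion annotation.
# Assumptions: a 4-class label set, toy few-shot examples, and a hypothetical
# `query_llm` callable that wraps a ChatGPT-style API and returns plain text.

EMOTIONS = ["angry", "happy", "neutral", "sad"]

FEW_SHOT_EXAMPLES = [
    ("I can't believe you did this again!", "angry"),
    ("That is wonderful news, congratulations!", "happy"),
]

def build_prompt(transcript: str) -> str:
    """Compose a classification prompt for one utterance."""
    lines = ["Classify the emotion of the utterance as one of: " + ", ".join(EMOTIONS) + "."]
    for text, label in FEW_SHOT_EXAMPLES:  # an empty list yields a prompt with no in-context examples
        lines.append(f'Utterance: "{text}"\nEmotion: {label}')
    lines.append(f'Utterance: "{transcript}"\nEmotion:')
    return "\n\n".join(lines)

def annotate(transcripts, query_llm):
    """Label unannotated transcripts, keeping only answers inside the target label set."""
    labels = {}
    for t in transcripts:
        answer = query_llm(build_prompt(t)).strip().lower()
        if answer in EMOTIONS:
            labels[t] = answer  # retained samples can augment a human-labelled training set
    return labels

Varying the number of in-context examples mimics the difference between single-shot and few-shot prompting, and the dictionary returned by annotate can be merged with an existing corpus to approximate the augmentation step the abstract describes at a high level.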

Authors (4)
  1. Siddique Latif (38 papers)
  2. Muhammad Usama (40 papers)
  3. Mohammad Ibrahim Malik (1 paper)
  4. Björn W. Schuller (153 papers)
Citations (19)