Advancing Audio Emotion and Intent Recognition with Large Pre-Trained Models and Bayesian Inference (2310.10179v1)
Abstract: Large pre-trained models are essential in paralinguistic systems and have proven effective in tasks such as emotion recognition and stuttering detection. In this paper, we employ large pre-trained models for the ACM Multimedia Computational Paralinguistics Challenge, addressing the Requests and Emotion Share tasks. We explore audio-only solutions and hybrid solutions that combine the audio and text modalities. Our empirical results consistently show that the hybrid approaches outperform the audio-only models. Moreover, we introduce a Bayesian layer as an alternative to the standard linear output layer. The multimodal fusion approach achieves an unweighted average recall (UAR) of 85.4% on HC-Requests and 60.2% on HC-Complaints. The ensemble model for the Emotion Share task yields the best rho value of 0.614. The Bayesian wav2vec2 approach explored in this study makes it easy to build ensembles at the cost of fine-tuning only one model. It also provides usable confidence values instead of the usual overconfident posterior probabilities.
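The idea of replacing a linear output layer with a Bayesian one can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a mean-field Gaussian over the weights of a single output layer, uses hand-picked (not learned) means and standard deviations, and stands in a two-dimensional vector for the wav2vec2 embedding. The names `BayesianLinear`, `sample_forward`, and `predict` are hypothetical. Each forward pass draws a fresh set of weights, so repeated passes behave like an ensemble obtained from one trained model, and the averaged softmax output serves as a confidence estimate.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class BayesianLinear:
    """Output layer whose weights and biases are independent Gaussians.

    Assumption for illustration: the means and standard deviations are
    given directly; in practice they would be learned during fine-tuning.
    """

    def __init__(self, w_mean, w_std, b_mean, b_std, seed=0):
        self.w_mean, self.w_std = w_mean, w_std
        self.b_mean, self.b_std = b_mean, b_std
        self.rng = random.Random(seed)

    def sample_forward(self, x):
        """Draw one set of weights and return the resulting logits."""
        logits = []
        for mu_row, sd_row, mu_b, sd_b in zip(
            self.w_mean, self.w_std, self.b_mean, self.b_std
        ):
            w = [self.rng.gauss(mu, sd) for mu, sd in zip(mu_row, sd_row)]
            b = self.rng.gauss(mu_b, sd_b)
            logits.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return logits


def predict(layer, x, n_samples=50):
    """Average class probabilities over several sampled forward passes."""
    probs = [softmax(layer.sample_forward(x)) for _ in range(n_samples)]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_samples for c in range(n_classes)]


# Toy two-class layer acting on a two-dimensional "embedding".
layer = BayesianLinear(
    w_mean=[[2.0, -1.0], [-2.0, 1.0]],
    w_std=[[0.5, 0.5], [0.5, 0.5]],
    b_mean=[0.0, 0.0],
    b_std=[0.1, 0.1],
)
mean_probs = predict(layer, x=[1.0, 0.2])
print(mean_probs)
```

Larger weight standard deviations spread the sampled predictions further apart, which pulls the averaged probabilities away from 0 and 1. A deterministic linear layer has no comparable mechanism for tempering its output.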