End-to-End Label Uncertainty Modeling in Speech Emotion Recognition using Bayesian Neural Networks and Label Distribution Learning (2209.15449v2)
Abstract: To train machine learning algorithms to predict emotional expressions in terms of arousal and valence, annotated datasets are needed. However, since different people perceive others' emotional expressions differently, their annotations are subjective. To account for this, annotations are typically collected from multiple annotators and averaged to obtain ground-truth labels. However, when trained exclusively on this averaged ground truth, the model is agnostic to the inherent subjectivity of emotional expressions. In this work, we therefore propose an end-to-end Bayesian neural network that can be trained on a distribution of annotations and thereby also captures the subjectivity-based label uncertainty. Instead of a Gaussian, we model the annotation distribution with a Student's t-distribution, which additionally accounts for the number of annotations available. We derive the corresponding Kullback-Leibler divergence loss and use it to train an estimator of the annotation distribution, from which the mean and uncertainty can be inferred. We validate the proposed method on two in-the-wild datasets. We show that the proposed t-distribution-based approach achieves state-of-the-art uncertainty modeling results in speech emotion recognition, as well as consistent results in cross-corpus evaluations. Furthermore, our analyses reveal that the advantage of the t-distribution over a Gaussian grows with increasing inter-annotator correlation and a decreasing number of available annotations.
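The following is an illustrative sketch of the two ingredients named in the abstract, not the paper's exact derivation: (i) summarizing the per-utterance annotation set as a Student's t-distribution whose degrees of freedom reflect the number of annotations (here the textbook choice df = n - 1, location = sample mean, scale = sample std / sqrt(n), which arises as the posterior over the mean rating under a noninformative prior), and (ii) a KL divergence between a predicted and a target t-distribution as a training loss. Since the KL between two t-distributions has no closed form in general, this sketch estimates it by Monte Carlo; the paper derives its own loss for this purpose.

```python
import math
import numpy as np


def t_logpdf(x, df, loc, scale):
    """Log-density of a location-scale Student's t-distribution."""
    z = (np.asarray(x, dtype=float) - loc) / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * np.log1p(z * z / df))


def annotation_t_params(annotations):
    """Summarize one utterance's annotations as t-distribution parameters.

    Illustrative choice: with n annotations, uncertainty about the mean
    rating follows a t-distribution with df = n - 1, location = sample
    mean, scale = sample std / sqrt(n).
    """
    a = np.asarray(annotations, dtype=float)
    n = a.size
    return n - 1, float(a.mean()), float(a.std(ddof=1) / math.sqrt(n))


def mc_kl_t(df_p, loc_p, scale_p, df_q, loc_q, scale_q,
            n_samples=100_000, seed=0):
    """Monte Carlo estimate of KL(p || q) for two t-distributions."""
    rng = np.random.default_rng(seed)
    x = loc_p + scale_p * rng.standard_t(df_p, size=n_samples)
    return float(np.mean(t_logpdf(x, df_p, loc_p, scale_p)
                         - t_logpdf(x, df_q, loc_q, scale_q)))


# Example: five (hypothetical) arousal annotations for one utterance.
df, loc, scale = annotation_t_params([0.2, 0.4, 0.3, 0.5, 0.35])
# KL loss penalizes a prediction whose location is biased by 0.1.
kl = mc_kl_t(df, loc, scale, df, loc + 0.1, scale)
```

With only five annotations the df is small (here 4), so the target distribution has heavy tails; as the number of annotations grows, the t-distribution approaches a Gaussian, which matches the abstract's observation that the t-distribution's advantage shrinks as more annotations become available.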