Acoustic and linguistic representations for speech continuous emotion recognition in call center conversations (2310.04481v1)
Abstract: The goal of our research is to automatically retrieve satisfaction and frustration in real-life call-center conversations. This study focuses on an industrial application in which customer satisfaction is continuously tracked in order to improve customer services. To compensate for the lack of large annotated emotional databases, we explore the use of pre-trained speech representations as a form of transfer learning towards the AlloSat corpus. Moreover, several studies have pointed out that emotion can be detected not only in speech but also in facial expressions, physiological responses, or textual information. In the context of telephone conversations, the audio information can be broken down into acoustic and linguistic components by using the speech signal and its transcription. Our experiments confirm the large gain in performance obtained with pre-trained features. Surprisingly, we found that the linguistic content is clearly the major contributor to the prediction of satisfaction and generalizes best to unseen data. Our experiments conclude that CamemBERT representations offer a definite advantage, whereas the benefit of fusing the acoustic and linguistic modalities is less obvious. With models learnt on individual annotations, we found that fusion approaches are more robust to the subjectivity of the annotation task. This study also tackles the problem of performance variability and estimates this variability from different angles: weight initialization, confidence intervals, and annotation subjectivity. Finally, an in-depth analysis of the linguistic content investigates interpretable factors that explain the high contribution of the linguistic modality to this task.
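As a rough illustration of the pipeline described in the abstract, the sketch below extracts frozen pre-trained acoustic and linguistic representations, mean-pools them per conversation segment, concatenates them (feature-level fusion), and regresses a continuous satisfaction value with a small recurrent head. The specific model names (`facebook/wav2vec2-base`, `camembert-base`), the segment-level pooling, and the GRU head are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (assumptions: segment-level pooling, concatenation fusion,
# GRU regression head). Shows how frozen pre-trained acoustic and linguistic
# features could be fused to predict a continuous satisfaction curve.
import torch
import torch.nn as nn
from transformers import (AutoModel, AutoTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Frozen pre-trained encoders (an English wav2vec 2.0 model is used here only
# for illustration; a French model would be closer to the paper's setting).
txt_tok = AutoTokenizer.from_pretrained("camembert-base")
txt_enc = AutoModel.from_pretrained("camembert-base").eval()
aud_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
aud_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def segment_features(waveform_16k, transcript):
    """Mean-pool frame/token embeddings into one acoustic + one linguistic vector."""
    with torch.no_grad():
        a_in = aud_fe(waveform_16k, sampling_rate=16000, return_tensors="pt")
        a = aud_enc(a_in.input_values).last_hidden_state.mean(dim=1)   # (1, 768)
        t_in = txt_tok(transcript, return_tensors="pt")
        t = txt_enc(**t_in).last_hidden_state.mean(dim=1)              # (1, 768)
    return torch.cat([a, t], dim=-1)                                    # (1, 1536)

class FusionRegressor(nn.Module):
    """GRU over the sequence of segment vectors -> one satisfaction value per segment."""
    def __init__(self, in_dim=1536, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, segments):                  # (batch, n_segments, in_dim)
        h, _ = self.rnn(segments)
        return self.out(h).squeeze(-1)            # (batch, n_segments) satisfaction curve

# Toy usage: two hypothetical segments of one call (random audio, dummy transcripts).
segs = torch.stack(
    [segment_features(torch.randn(16000).numpy(), "bonjour je vous appelle"),
     segment_features(torch.randn(16000).numpy(), "je ne suis pas satisfait")],
    dim=1)                                        # (1, 2, 1536)
pred = FusionRegressor()(segs)                    # untrained here; training would optimize e.g. 1 - CCC
print(pred.shape)
```

Dropping the acoustic branch (keeping only the CamemBERT vector) gives the linguistic-only variant that, according to the abstract, is the main contributor to satisfaction prediction.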
Authors: Manon Macary, Marie Tahon, Yannick Estève, Daniel Luzzati