Exploring Attention Mechanisms for Multimodal Emotion Recognition in an Emergency Call Center Corpus
Abstract: Emotion detection technology that enhances human decision-making is an important research topic for real-world applications, but real-life emotion datasets are relatively rare and small. The experiments in this paper use the CEMO corpus, collected in a French emergency call center. Two pre-trained models, one for speech and one for text, were fine-tuned for speech emotion recognition; using pre-trained Transformer encoders mitigates the limited and sparse nature of our data. This paper explores different strategies for fusing these modality-specific models. In particular, fusions with and without cross-attention mechanisms were tested to gather the most relevant information from both the speech and text encoders. We show that multimodal fusion yields an absolute gain of 4-9% over either single modality, and that a symmetric multi-headed cross-attention mechanism outperforms classical late-fusion approaches. Our experiments also suggest that, for the real-life CEMO corpus, the audio component encodes more emotive information than the textual one.
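The symmetric cross-attention fusion described above can be sketched as follows. This is a minimal, hypothetical illustration (single-head scaled dot-product attention for brevity, not the paper's multi-headed implementation): speech frames query the text tokens while text tokens query the speech frames, and the two pooled context vectors are concatenated for an emotion classifier head. The encoder outputs are simulated with random arrays standing in for wav2vec 2.0 and FlauBERT features.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # queries from one modality attend over the other modality's sequence
    scores = queries @ keys_values.T / np.sqrt(d_k)  # (T_q, T_kv)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ keys_values                     # (T_q, d)

rng = np.random.default_rng(0)
d = 16  # illustrative embedding size (real encoders are much wider)

# stand-ins for frame-level speech features and token-level text features
speech = rng.standard_normal((50, d))  # 50 acoustic frames
text = rng.standard_normal((12, d))    # 12 subword tokens

# symmetric: each modality queries the other
speech_ctx = cross_attention(speech, text, d)  # speech attends to text
text_ctx = cross_attention(text, speech, d)    # text attends to speech

# mean-pool each context sequence and concatenate for the classifier head
fused = np.concatenate([speech_ctx.mean(axis=0), text_ctx.mean(axis=0)])
```

In a late-fusion baseline, by contrast, each encoder would be pooled and classified independently before combining scores; cross-attention instead lets each modality re-weight the other's representation before pooling.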