End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
The paper "End-to-End Multimodal Emotion Recognition Using Deep Neural Networks," authored by Panagiotis Tzirakis et al., explores an integrated approach to emotion recognition by leveraging both auditory and visual modalities. The paper targets the challenging task of affective state prediction, essential for applications spanning multimedia retrieval and human-computer interaction.
The authors propose an end-to-end deep learning framework that forgoes traditional hand-engineered features, instead combining Convolutional Neural Networks (CNNs) for feature extraction with Long Short-Term Memory (LSTM) networks for temporal modeling. A ResNet-50 architecture processes the facial visual stream, while a dedicated CNN operates on the raw speech waveform. This design lets the network learn robust, task-specific features directly from raw data.
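As an illustration of how such a pipeline can be wired together, the sketch below combines a raw-waveform audio CNN, a ResNet-50 visual branch, and a two-layer LSTM in PyTorch. The layer sizes, filter widths, and fusion-by-concatenation choice are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal PyTorch sketch of a CNN + LSTM multimodal pipeline.
# All hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class AudioCNN(nn.Module):
    """1-D CNN operating directly on raw waveform chunks."""

    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(10),
            nn.Conv1d(64, 128, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # collapse the time axis
        )
        self.proj = nn.Linear(128, out_dim)

    def forward(self, wav):                   # wav: (batch, 1, samples)
        feat = self.net(wav).squeeze(-1)      # (batch, 128)
        return self.proj(feat)                # (batch, out_dim)


class MultimodalEmotionModel(nn.Module):
    """Fuses audio and visual features, then models time with an LSTM."""

    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.audio_cnn = AudioCNN(feat_dim)
        vis = resnet50(weights=None)          # visual branch: ResNet-50
        vis.fc = nn.Linear(vis.fc.in_features, feat_dim)
        self.visual_cnn = vis
        self.lstm = nn.LSTM(2 * feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)      # arousal and valence

    def forward(self, wav_seq, img_seq):
        # wav_seq: (batch, T, 1, samples); img_seq: (batch, T, 3, H, W)
        b, t = img_seq.shape[:2]
        a = self.audio_cnn(wav_seq.flatten(0, 1)).view(b, t, -1)
        v = self.visual_cnn(img_seq.flatten(0, 1)).view(b, t, -1)
        fused, _ = self.lstm(torch.cat([a, v], dim=-1))  # concatenation fusion
        return self.head(fused)               # (batch, T, 2)
```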
Key Contributions
- End-to-End Framework: The paper introduces a pipeline in which emotion recognition is treated as an integrated task, aligning audio and visual cues within a single learning paradigm. The system is trained directly on raw inputs, in contrast with conventional methods that rely on preprocessed features such as MFCCs or facial landmarks.
- Model Architecture: The paper pairs a ResNet-50 deep residual network for the visual modality with a tailored CNN for audio that extracts emotion-related characteristics directly from the raw waveform. Both feed into LSTM layers that model the temporal dependencies crucial for understanding emotional context.
- Objective Function: The authors train with an objective based on the concordance correlation coefficient (CCC) rather than the traditional mean squared error (MSE), optimizing the model directly on a metric more reflective of how continuous emotion predictions are judged (a loss sketch follows this list).
- Unimodal and Multimodal Evaluation: Empirical results from the RECOLA database demonstrate the superiority of this methodology over baseline systems. The multimodal approach achieves notable performance on both arousal and valence dimensions, with strong results for each unimodal network individually as well.
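To make the CCC-based objective concrete, the following is a minimal sketch of a 1 − CCC loss in PyTorch. The epsilon term and the 1-D tensor shapes are assumptions for numerical stability and clarity, not details taken from the paper.

```python
# Sketch of a concordance-correlation-coefficient (CCC) loss.
# CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2);
# the quantity minimised is 1 - CCC.
import torch


def ccc_loss(pred: torch.Tensor, gold: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """pred, gold: 1-D tensors of per-frame predictions and annotations."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    ccc = 2 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2 + eps)
    return 1 - ccc


# Toy usage: 0 would mean perfect concordance with the annotations.
pred = torch.tensor([0.1, 0.4, 0.35, 0.8])
gold = torch.tensor([0.0, 0.5, 0.30, 0.9])
print(ccc_loss(pred, gold))
```

Unlike MSE, this objective penalizes mismatches in both the mean and the variance of the predicted trace relative to the annotations, which is why it tracks agreement-style evaluation metrics more closely.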
Results and Implications
The framework delivers superior outcomes compared to traditional methods on the RECOLA database, used within the AVEC 2016 challenge. In particular, the multimodal system shows significant gains in predicting valence, which is traditionally harder to infer than arousal.
The examination of LSTM gate activations reveals correlations with typical prosodic features, underlining the ability of deep learning models to implicitly capture known affective markers within speech, such as loudness and pitch. This insight into feature learning offers potential pathways to refine model interpretability and performance.
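This kind of gate-activation analysis can be reproduced in spirit with a simple post-hoc correlation. The sketch below assumes the gate activations have already been hooked out of the trained LSTM and that frame-aligned loudness values are available; random arrays stand in as placeholders for both.

```python
# Post-hoc correlation between an LSTM gate-activation trace and a prosodic
# contour (e.g. frame-level loudness). The arrays are hypothetical
# placeholders; in practice the gate values would be recorded from the
# trained network and the loudness/pitch contours extracted from the
# corresponding audio frames.
import numpy as np

frames = 500
gate_activation = np.random.rand(frames)    # placeholder: mean input-gate value per frame
loudness_contour = np.random.rand(frames)   # placeholder: per-frame loudness estimate

# Pearson correlation between the two time series.
r = np.corrcoef(gate_activation, loudness_contour)[0, 1]
print(f"correlation between gate activation and loudness: {r:.3f}")
```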
Future Directions
Prospects for further exploration include:
- Integration of Additional Modalities: Expanding inputs to include physiological or behavioral data may enrich the emotional context modeling.
- Application in Real-world Systems: Adapting the framework to different domains could enable deployment in applications such as virtual assistants or therapeutic diagnostics.
- Investigating Generalizability: Cross-dataset validation would assess robustness and applicability across varied cultural and situational contexts.
By advancing the understanding of multimodal deep learning models for emotion recognition, the authors take a significant step forward in affective computing and point future work toward holistic, context-aware interaction systems in artificial intelligence.