- The paper demonstrates that fusing diverse deep learning models significantly improves emotion recognition accuracy in videos.
- It introduces a multimodal architecture combining ConvNets, deep belief networks, relational autoencoders, and bag-of-mouths to effectively capture facial, audio, spatio-temporal, and motion cues.
- The method achieved a test accuracy of 47.67% on the EmotiW dataset, illustrating the potential of ensemble techniques in robust emotion classification.
An Analysis of "EmoNets: Multimodal deep learning approaches for emotion recognition in video"
The paper "EmoNets: Multimodal deep learning approaches for emotion recognition in video" presents a comprehensive paper of various methodologies for emotion classification within the challenging domain of "emotion recognition in the wild." The research involves analyzing video clips from movies with an objective to classify them into seven basic emotions. The intricacies of this task arise from the real-world conditions under which the videos are filmed, including varying lighting and pose, making this a non-trivial problem in computer vision.
Summary of Approach
Multimodal Architecture:
The core of the proposed solution is the integration of multiple specialist models, each designed to focus on a distinct modality of the data. Four major models were used (a minimal code sketch of the facial specialist follows the list):
- Convolutional Neural Networks (ConvNets): Employed to capture facial expressions from video frames; this was the strongest single-modality model.
- Deep Belief Networks: Processed the audio stream, leveraging layer-wise unsupervised feature learning to extract emotional cues from sound.
- Relational Autoencoder: Designed to detect spatio-temporal features, capturing the motion patterns needed to recognize dynamic emotional cues.
- Bag-of-Mouths Model: Applied K-Means clustering to mouth regions, providing additional features useful for emotion classification.
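To make the frame-level facial specialist concrete, here is a minimal sketch assuming PyTorch; `FaceConvNet`, the layer sizes, and the 48x48 input resolution are illustrative choices, not the authors' exact architecture. Per-frame probabilities are averaged into a single clip-level prediction, one simple way to pool frame scores.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 7  # the seven EmotiW emotion classes


class FaceConvNet(nn.Module):
    """Toy frame-level specialist: maps 48x48 grayscale face crops to emotion scores."""

    def __init__(self, num_classes: int = NUM_EMOTIONS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


def clip_prediction(model: nn.Module, frames: torch.Tensor) -> torch.Tensor:
    """Pool per-frame softmax outputs into one clip-level distribution by averaging."""
    with torch.no_grad():
        probs = torch.softmax(model(frames), dim=1)  # (n_frames, NUM_EMOTIONS)
    return probs.mean(dim=0)                         # (NUM_EMOTIONS,)


# Example: a clip of 30 aligned face crops -> one 7-way probability vector.
clip = torch.randn(30, 1, 48, 48)
print(clip_prediction(FaceConvNet().eval(), clip))
```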
Combination Methods:
The paper describes several methods for combining the specialists' predictions into a consensus, and the resulting ensembles significantly outperformed the individual models. The combination strategies included simple averaging of predictions, support vector machines (SVMs) for aggregation, and random search over per-model weights.
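As one illustration of the combination step, the sketch below implements weighted averaging of the specialists' class probabilities, with the weights chosen by random search on validation accuracy. It is a NumPy sketch under assumed array shapes, using a Dirichlet sampler for the weights; it is not the authors' exact procedure, and the SVM-based aggregation is not shown.

```python
import numpy as np


def weighted_average(preds: list[np.ndarray], weights: np.ndarray) -> np.ndarray:
    """preds: one (n_clips, n_classes) probability matrix per specialist model."""
    stacked = np.stack(preds, axis=0)              # (n_models, n_clips, n_classes)
    return np.tensordot(weights, stacked, axes=1)  # (n_clips, n_classes)


def random_search_weights(preds, labels, n_trials=10_000, seed=0):
    """Sample random convex weight vectors; keep the best one by validation accuracy."""
    rng = np.random.default_rng(seed)
    best_w, best_acc = None, -1.0
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(preds)))  # non-negative weights summing to 1
        acc = (weighted_average(preds, w).argmax(axis=1) == labels).mean()
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```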
Numerical Outcomes and Insights
The method achieved a test accuracy of 47.67% on the 2014 EmotiW dataset, surpassing previous baselines. While this does not represent state-of-the-art performance by contemporary standards, it was a significant achievement at the time.
Fusion of Modalities:
One of the key contributions of the work is demonstrating that fusing different modalities within a deep learning architecture enhances emotion classification, suggesting that multimodal approaches are more robust than single-modality methods.
Scalability and Overfitting:
The researchers mitigated overfitting by training the networks on additional, much larger datasets mined from sources such as Google Image Search, underscoring the importance of diverse data sources in model training.
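The underlying recipe, pretraining the face specialist on a pooled, larger corpus before fine-tuning on the small challenge set, might look roughly like this; PyTorch, the dataset sizes, and the random tensors standing in for external data are all assumptions for illustration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Dummy tensors stand in for an external face-expression corpus and the
# (much smaller) challenge training set.
external_faces = TensorDataset(torch.randn(1000, 1, 48, 48), torch.randint(0, 7, (1000,)))
challenge_faces = TensorDataset(torch.randn(200, 1, 48, 48), torch.randint(0, 7, (200,)))

# Pretrain on the pooled, larger corpus; then fine-tune on the challenge data alone.
pretrain_loader = DataLoader(ConcatDataset([external_faces, challenge_faces]),
                             batch_size=128, shuffle=True)
finetune_loader = DataLoader(challenge_faces, batch_size=32, shuffle=True)
```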
Implications and Future Directions
The development of EmoNets represents a meaningful contribution to the fields of computer vision and affective computing, especially in applications where automatic emotion detection can enhance user-interaction systems like virtual assistants or surveillance systems.
Looking forward, further research could explore:
- Integration with Modern Architectures: Leveraging advances such as transformers could drive further performance improvements.
- Innovative Data Augmentation Techniques: Targeted augmentation could address the variability inherent in real-world video data.
- Real-time Processing Improvements: Optimizing the models would allow deployment in less resource-intensive settings.
In conclusion, by effectively combining multiple deep learning models, the paper provides a strategic framework for emotion recognition systems. This work has implications not only for computational fields but also for human-computer interaction, demonstrating that comprehensive multimodal approaches hold the key to understanding complex emotional expressions in multimedia settings.