Temporal Multimodal Fusion for Video Emotion Classification in the Wild (1709.07200v1)

Published 21 Sep 2017 in cs.CV, cs.LG, and cs.MM

Abstract: This paper addresses the question of emotion classification. The task consists in predicting emotion labels (taken among a set of possible labels) best describing the emotions contained in short video clips. Building on a standard framework -- lying in describing videos by audio and visual features used by a supervised classifier to infer the labels -- this paper investigates several novel directions. First of all, improved face descriptors based on 2D and 3D Convolutional Neural Networks are proposed. Second, the paper explores several fusion methods, temporal and multimodal, including a novel hierarchical method combining features and scores. In addition, we carefully reviewed the different stages of the pipeline and designed a CNN architecture adapted to the task; this is important as the size of the training set is small compared to the difficulty of the problem, making generalization difficult. The so-obtained model ranked 4th at the 2017 Emotion in the Wild challenge with an accuracy of 58.8%.

Authors (3)
  1. Valentin Vielzeuf (17 papers)
  2. Stéphane Pateux (17 papers)
  3. Frédéric Jurie (27 papers)
Citations (166)

Summary

Temporal Multimodal Fusion for Video Emotion Classification in the Wild

The paper "Temporal Multimodal Fusion for Video Emotion Classification in the Wild" provides an analysis of advanced methodologies for classifying emotions in videos, a task that entails naming emotions within clips using a set of predefined labels. This paper evolves around combining audio and visual inputs with a supervised classification model. The paper primarily underlines three notable contributions: enhanced facial feature descriptors, distinct fusion mechanisms for temporal and multimodal data, and a refined CNN architecture tailored for emotion recognition despite data size limitations. The proposed approach achieved an accuracy of 58.8% in the 2017 Emotion in the Wild challenge, securing the fourth position.

Methodological Overview

The proposed framework relies on neural network models for emotion recognition. Enhanced facial descriptors are built with 2D and 3D Convolutional Neural Networks (CNNs), which underpin more reliable face description in uncontrolled, in-the-wild conditions. Multiple fusion techniques are then explored to combine the different modalities and their temporal dynamics. Particular emphasis is placed on a hierarchical strategy for combining feature inputs and classifier outputs, which helps the network generalize from a small dataset.
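
As a rough illustration of the per-frame descriptor idea, the sketch below extracts 2D CNN features from face crops and average-pools them over time into a single video-level descriptor. The ResNet-18 backbone and the mean pooling are placeholders chosen for brevity, not the paper's exact models.

```python
# Minimal sketch: per-frame 2D CNN face descriptors pooled over time.
# The backbone and pooling choice are assumptions, not the paper's setup.
import torch
import torch.nn as nn
from torchvision import models

class FrameDescriptor(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)  # would be face-pretrained in practice
        # Drop the final classifier; keep everything up to global average pooling.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))   # (B*T, 512, 1, 1)
        feats = feats.view(b, t, -1)                  # (B, T, 512)
        return feats.mean(dim=1)                      # temporal average pooling
```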

Innovations in Temporal and Multimodal Fusion

The research investigates several fusion techniques that combine audio and visual cues through temporal and multimodal integration. The paper introduces a hierarchical approach that combines features and predictions at several levels, from raw feature attributes to final score integration. The approach aims to preserve unimodal strengths while exploiting cross-modal information.
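
The following PyTorch sketch shows one way such a hierarchical scheme could be wired up, with unimodal score heads, a feature-level branch, and a final layer that fuses both kinds of output. The dimensions and layer choices are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch of hierarchical fusion mixing feature-level and score-level
# information; 512-d visual / 128-d audio features and 7 classes are assumed.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, num_classes=7):
        super().__init__()
        # Unimodal heads produce per-modality class scores.
        self.vis_head = nn.Linear(vis_dim, num_classes)
        self.aud_head = nn.Linear(aud_dim, num_classes)
        # Feature-level branch operates on the concatenated descriptors.
        self.feat_branch = nn.Sequential(
            nn.Linear(vis_dim + aud_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )
        # Final layer fuses unimodal scores with the feature-level scores.
        self.score_fusion = nn.Linear(3 * num_classes, num_classes)

    def forward(self, vis_feat, aud_feat):
        s_vis = self.vis_head(vis_feat)
        s_aud = self.aud_head(aud_feat)
        s_feat = self.feat_branch(torch.cat([vis_feat, aud_feat], dim=1))
        return self.score_fusion(torch.cat([s_vis, s_aud, s_feat], dim=1))
```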

Furthermore, the paper improves temporal coherence within the visual descriptors by combining a 3D Convolutional Network with a Long Short-Term Memory (LSTM) network. The 3D CNN captures short-range spatio-temporal patterns while the LSTM models longer-range temporal dependencies, a combination well suited to time-dependent tasks such as emotion recognition in videos.
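
A minimal sketch of that kind of 3D-CNN-plus-LSTM chain is given below. The tiny convolutional stem and the hidden sizes are invented for illustration and stand in for the paper's pretrained 3D backbone.

```python
# Minimal sketch: a small 3D CNN over face-crop clips feeding an LSTM that
# produces a video-level emotion score; sizes are placeholders.
import torch
import torch.nn as nn

class C3DLSTM(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=128, num_classes=7):
        super().__init__()
        # A tiny 3D-convolutional stem over short clips of face crops.
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the temporal axis
        )
        self.proj = nn.Linear(64, feat_dim)
        # The LSTM models how the clip descriptors evolve over time.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):               # clips: (B, 3, T, H, W)
        x = self.conv3d(clips)              # (B, 64, T, 1, 1)
        x = x.squeeze(-1).squeeze(-1)       # (B, 64, T)
        x = self.proj(x.transpose(1, 2))    # (B, T, feat_dim)
        _, (h, _) = self.lstm(x)            # use the final hidden state
        return self.classifier(h[-1])       # (B, num_classes)
```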

Empirical Findings and Results

Empirical analyses are carried out on the AFEW dataset, used by the Emotion in the Wild 2017 challenge, which covers seven discrete emotion classes under varied real-world conditions. Pretrained models fine-tuned with transfer learning mitigate overfitting, an essential precaution given the limited size of the dataset. This is complemented by an architectural design that keeps model complexity low, which is crucial for generalization.
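
As an example of the transfer-learning recipe described above, the snippet below freezes a pretrained backbone and retrains only a small 7-class head. Torchvision's ImageNet ResNet-18 is used purely as a stand-in for the paper's pretrained face and audio models.

```python
# Minimal sketch: fine-tune only a new classification head on top of a frozen
# pretrained backbone to limit overfitting on a small dataset.
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in backbone.parameters():
    param.requires_grad = False                       # freeze pretrained features
backbone.fc = nn.Linear(backbone.fc.in_features, 7)   # new 7-class emotion head
# Only the parameters of backbone.fc are trainable, keeping capacity small.
```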

The paper's results show that the fusion approach, specifically a weighted mean fusion of model predictions, substantially improves performance, reaching a test set accuracy of 58.81%. The paper also highlights that certain emotions, in particular 'disgust' and 'surprise', remain difficult to recognize.
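
Weighted mean score fusion itself is straightforward, as the sketch below shows: per-model class scores are averaged with weights that would typically be chosen on the validation set (the interface and shapes here are illustrative, not the paper's actual values).

```python
# Minimal sketch: weighted mean fusion of per-model class scores.
import numpy as np

def weighted_mean_fusion(score_list, weights):
    """Fuse a list of (num_clips, 7) score arrays with one weight per model."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                         # normalize the weights
    stacked = np.stack(score_list, axis=0)           # (num_models, num_clips, 7)
    fused = np.tensordot(weights, stacked, axes=1)   # (num_clips, 7)
    return fused.argmax(axis=1)                      # predicted label per clip
```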

Implications and Future Directions

The paper's contributions have implications for both the theory and the practice of emotion recognition in fields such as human-computer interaction, psychological assessment, and multimedia indexing. It illustrates how multimodal fusion techniques can be combined with existing pretrained models while maintaining interpretability and accuracy.

Future research may build upon these results by further refining the fusion strategies, especially by employing large-scale annotated datasets to train more robust models capable of real-time video emotion detection in dynamic environments. Another significant aspect involves integrating additional contextual features, such as scene background and linguistic cues, to form a holistic emotion classification system. These avenues present opportunities for significant advancements in automating nuanced emotional and social understanding within digital ecosystems.