- The paper introduces the STRNN framework that unifies spatial and temporal analysis to improve emotion recognition from EEG and facial data.
- It employs multi-directional spatial RNN and bi-directional temporal RNN layers enhanced by sparse projections to capture salient features across modalities.
- Experiments on the SEED and CK+ datasets report accuracies of 89.50% and 95.4% respectively, surpassing conventional baselines such as SVM and DBN.
Overview of the Spatial-Temporal Recurrent Neural Network for Emotion Recognition
The paper "Spatial-Temporal Recurrent Neural Network for Emotion Recognition" presents a novel approach to emotion recognition by integrating a deep learning framework specifically designed to handle spatial and temporal data. This framework, termed the Spatial-Temporal Recurrent Neural Network (STRNN), addresses the need for effective emotion recognition from electroencephalogram (EEG) signals and facial expressions captured in video sequences. The approach capitalizes on the spatial-temporal characteristics of these signals by employing recurrent neural networks (RNNs) to capture both spatial co-occurrence and temporal dependencies.
Key Contributions
The paper outlines three primary contributions:
- Development of the STRNN Framework: The STRNN framework unifies the spatial-temporal learning of emotion data from EEG and video signals. This is accomplished through a multi-directional spatial RNN layer, which captures spatial co-occurrences along different traversal directions, and a bi-directional temporal RNN layer, which captures temporal dependencies over time.
- Unified Emotion Recognition Framework: The research unifies EEG-based and facial expression-based emotion recognition under one deep network framework by constructing spatial-temporal volumes. This integration allows STRNN to effectively process multi-channel EEG signals and dynamic facial expressions, addressing the challenges of both modalities.
- Introduction of Sparse Projections: To enhance the model's discriminative capability, sparse projections are imposed on the hidden states within the spatial and temporal domains. This selection mechanism identifies the most salient regions of the emotion representation and boosts overall performance; a minimal sketch of the idea appears after this list.
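The sparse-projection idea can be read as an L1 penalty on a learned projection of the hidden states, so that only a few hidden units survive into the final representation. The snippet below is a minimal PyTorch sketch of that reading; the class name `SparseProjection`, the layer sizes, and the penalty weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SparseProjection(nn.Module):
    """Projects hidden states through a matrix whose L1 norm is penalized,
    encouraging only a few salient hidden units to contribute."""
    def __init__(self, hidden_size: int, proj_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, proj_size, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)

    def l1_penalty(self) -> torch.Tensor:
        # Sparsity term to be added to the task loss during training.
        return self.proj.weight.abs().sum()

# Usage: total loss = task loss + lambda * sparsity penalty.
sparse = SparseProjection(hidden_size=128, proj_size=64)
h = torch.randn(32, 128)                 # a batch of hidden states
z = sparse(h)
loss = z.pow(2).mean() + 1e-4 * sparse.l1_penalty()  # placeholder task loss
loss.backward()
```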
Methodology
The method involves two distinct RNN layers:
- Spatial RNN (SRNN) Layer: This layer traverses the spatial domain (e.g., the electrodes in an EEG montage) in multiple directions to capture spatial dependencies. Using several traversal directions within this layer improves robustness against noise and partial occlusions.
- Temporal RNN (TRNN) Layer: Following the SRNN, the bi-directional TRNN layer analyzes the temporal sequence both forwards and backwards. This structure captures long-range temporal dependencies and enriches emotion recognition by considering the full temporal context of the signals. A minimal sketch combining both layers follows this list.
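To make the two-layer structure concrete, the following is a minimal PyTorch sketch, assuming the electrodes of each frame are flattened into a sequence and scanned in a fixed set of traversal orders. The orders, hidden sizes, feature dimensions, and classifier head are illustrative assumptions; the paper's actual model differs in its traversal scheme and feature extraction.

```python
import torch
import torch.nn as nn

class STRNNSketch(nn.Module):
    def __init__(self, in_dim, spat_hidden, temp_hidden, n_classes, orders):
        super().__init__()
        # One spatial RNN per traversal direction over the electrode layout.
        self.orders = orders  # list of electrode index permutations
        self.srnns = nn.ModuleList(
            [nn.RNN(in_dim, spat_hidden, batch_first=True) for _ in orders]
        )
        # Bi-directional temporal RNN over the per-frame spatial summaries.
        self.trnn = nn.RNN(spat_hidden * len(orders), temp_hidden,
                           batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * temp_hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, electrodes, features)
        b, t, e, f = x.shape
        frames = x.reshape(b * t, e, f)
        summaries = []
        for order, srnn in zip(self.orders, self.srnns):
            _, h_last = srnn(frames[:, order, :])  # scan electrodes in order
            summaries.append(h_last.squeeze(0))    # (b*t, spat_hidden)
        spat = torch.cat(summaries, dim=-1).reshape(b, t, -1)
        out, _ = self.trnn(spat)                   # (b, t, 2*temp_hidden)
        return self.classifier(out[:, -1])         # classify from last step

# Two opposite traversal orders over 62 SEED electrodes (illustrative).
orders = [list(range(62)), list(range(61, -1, -1))]
model = STRNNSketch(in_dim=5, spat_hidden=32, temp_hidden=64,
                    n_classes=3, orders=orders)
logits = model(torch.randn(4, 9, 62, 5))  # (batch, time, electrodes, bands)
```

Each traversal direction gets its own spatial RNN, so spatial context accumulates differently per scan order; the bi-directional temporal layer then reads a per-frame summary concatenated across all directions.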
Experimental Results
The STRNN framework demonstrates competitive performance on public emotion datasets, including the SJTU Emotion EEG Dataset (SEED) and the CK+ facial expression dataset. On SEED, the framework achieved an emotion classification accuracy of 89.50%, surpassing several conventional methods, including SVM and DBN. On CK+, STRNN achieved a recognition accuracy of 95.4%, outperforming many state-of-the-art methods while reliably localizing salient facial expression regions.
Implications and Future Directions
The proposed STRNN framework represents a significant methodological advance in the field of emotion recognition, particularly in its ability to jointly consider spatial and temporal dimensions in a unified manner. While the primary focus is on EEG and facial expression data, the framework is theoretically adaptable to other types of spatial-temporal data, suggesting a broad applicability beyond emotion recognition.
Looking forward, potential areas for exploration include deploying STRNN in real-time emotion detection systems for human-computer interaction. Additionally, substituting more sophisticated recurrent units such as LSTM or GRU for the vanilla RNN cells could further enhance the modeling of complex dependencies, as sketched below. As spatial-temporal modeling matures, frameworks like STRNN may extend to other domains where such dynamics are prevalent, contributing to a broader range of AI applications.
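As a small illustration of that substitution, the sketch below (again assuming PyTorch) swaps a vanilla recurrent unit for a GRU. Because the two modules share an interface, the surrounding projections and classifier can remain unchanged.

```python
import torch.nn as nn

# Vanilla bi-directional temporal layer, as in the sketch above:
trnn_vanilla = nn.RNN(64, 64, batch_first=True, bidirectional=True)

# Gated variant: identical constructor and call signature, so the rest
# of the model (projections, classifier) needs no modification.
trnn_gated = nn.GRU(64, 64, batch_first=True, bidirectional=True)
```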