
Enriched Long-term Recurrent Convolutional Network for Facial Micro-Expression Recognition (1805.08417v1)

Published 22 May 2018 in cs.CV

Abstract: Facial micro-expression (ME) recognition has posed a huge challenge to researchers for its subtlety in motion and limited databases. Recently, handcrafted techniques have achieved superior performance in micro-expression recognition but at the cost of domain specificity and cumbersome parametric tunings. In this paper, we propose an Enriched Long-term Recurrent Convolutional Network (ELRCN) that first encodes each micro-expression frame into a feature vector through CNN module(s), then predicts the micro-expression by passing the feature vector through a Long Short-term Memory (LSTM) module. The framework contains two different network variants: (1) Channel-wise stacking of input data for spatial enrichment, (2) Feature-wise stacking of features for temporal enrichment. We demonstrate that the proposed approach is able to achieve reasonably good performance, without data augmentation. In addition, we also present ablation studies conducted on the framework and visualizations of what CNN "sees" when predicting the micro-expression classes.

Authors (4)
  1. Huai-Qian Khor (8 papers)
  2. John See (28 papers)
  3. Raphael C. W. Phan (1 paper)
  4. Weiyao Lin (87 papers)
Citations (161)

Summary

The paper presents a novel Enriched Long-term Recurrent Convolutional Network (ELRCN) framework designed to enhance the recognition of facial micro-expressions, a task characterized by subtle facial movements and limited datasets. This research addresses the limitations of handcrafted techniques, which, despite their superior performance, often suffer from domain specificity and require laborious parameter tuning.

Framework Overview

The authors propose a dual-variant ELRCN framework that combines Convolutional Neural Networks (CNNs) for spatial feature extraction with a Long Short-term Memory (LSTM) network for learning temporal dynamics. Two enrichment strategies are introduced (a sketch of both follows the list):

  1. Spatial Enrichment (SE): Optical flow, optical strain, and grayscale raw images are stacked channel-wise, enriching the spatial input representation that a CNN encodes before passing per-frame features to the LSTM.
  2. Temporal Enrichment (TE): Learned features are stacked feature-wise before they enter the LSTM module, using pre-trained VGG-Face weights for deeper feature extraction.
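
Below is a minimal PyTorch sketch of the two variants. It is a structural illustration under stated assumptions rather than the authors' exact configuration: a plain VGG-16 stands in for VGG-Face, the SE input is assumed to stack two optical-flow channels with one strain and one grayscale channel (four in total), frames are assumed to be 224x224, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

def vgg_encoder(in_channels=3, feat_dim=512):
    # Per-frame encoder. The paper initializes from VGG-Face weights;
    # an untrained VGG-16 is used here for brevity (an assumption).
    net = models.vgg16(weights=None)
    if in_channels != 3:
        # Widen the first conv layer to accept channel-wise enriched input.
        net.features[0] = nn.Conv2d(in_channels, 64, kernel_size=3, padding=1)
    net.classifier[-1] = nn.Linear(4096, feat_dim)
    return net

class ELRCN(nn.Module):
    # "SE": one encoder over channel-wise stacked inputs (flow + strain + gray).
    # "TE": two encoders whose per-frame features are concatenated
    # (feature-wise stacking) before the LSTM.
    def __init__(self, variant="SE", num_classes=5, feat_dim=512):
        super().__init__()
        self.variant = variant
        if variant == "SE":
            # 2 flow + 1 strain + 1 grayscale = 4 input channels (assumed).
            self.enc = vgg_encoder(in_channels=4, feat_dim=feat_dim)
            lstm_in = feat_dim
        else:
            self.enc_a = vgg_encoder(feat_dim=feat_dim)  # e.g. optical-flow stream
            self.enc_b = vgg_encoder(feat_dim=feat_dim)  # e.g. optical-strain stream
            lstm_in = 2 * feat_dim
        self.lstm = nn.LSTM(lstm_in, 512, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x, y=None):  # x (and y for TE): (batch, time, C, 224, 224)
        b, t = x.shape[:2]
        if self.variant == "SE":
            f = self.enc(x.flatten(0, 1))                 # encode every frame
        else:
            f = torch.cat([self.enc_a(x.flatten(0, 1)),
                           self.enc_b(y.flatten(0, 1))], dim=1)
        out, _ = self.lstm(f.view(b, t, -1))              # temporal dynamics
        return self.fc(out[:, -1])                        # classify last step
```

A forward pass such as `ELRCN("SE")(torch.randn(2, 10, 4, 224, 224))` yields per-class logits for a batch of two 10-frame clips.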

Both variants deliberately omit data augmentation, so that the evaluation reflects the inherent capabilities of the ELRCN architecture rather than gains from an artificially expanded dataset.

Dataset and Evaluation Protocol

The evaluation was conducted using two primary datasets: CASME II and SAMM, both recorded at high frame rates and annotated with objective classes based on action unit detection. Single-domain and cross-domain experiments were performed to assess model performance, utilizing metrics such as F1-score, Weighted Average Recall (WAR), and Unweighted Average Recall (UAR).
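
To make the recall-based metrics concrete, here is a small NumPy sketch (generic code, not from the paper): WAR weights every sample equally, so it reduces to overall accuracy and is dominated by frequent classes, while UAR averages per-class recalls so that rare micro-expression classes count equally.

```python
import numpy as np

def war_uar(y_true, y_pred):
    # WAR: overall accuracy (samples weighted equally, frequent classes dominate).
    # UAR: mean of per-class recalls (each class weighted equally).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    war = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return war, float(np.mean(recalls))

# Example: three class-0 samples and one class-1 sample, with one
# class-0 sample misclassified:
# war_uar([0, 0, 0, 1], [0, 0, 1, 1]) -> (0.75, ~0.833)
```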

Significant Findings

Single Domain Experiments: On the CASME II dataset, the TE variant of ELRCN outperformed both the SE variant and the traditional LBP-TOP baseline, demonstrating the effectiveness of feature-wise temporal enrichment in a controlled single-dataset setting.

Cross Domain Experiments: In the Composite Database Evaluation (CDE), the SE variant outperformed the TE variant, revealing the potential of channel-wise input stacking when generalizing across multiple databases. This variant also surpassed baseline methods, indicating its robustness in learning from diverse and larger subject pools.

Notably, the Grad-CAM visualizations validated that the ELRCN's activations aligned with human-annotated facial action units, illustrating the framework’s ability to focus on pertinent facial regions associated with specific micro-expressions.
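
For reference, a minimal Grad-CAM sketch (after Selvaraju et al.) is shown below; the model, the chosen convolutional layer, and the single-image input are hypothetical stand-ins for the paper's visualization pipeline.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, class_idx):
    # Hook the chosen conv layer to capture activations and their gradients.
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model(x)[0, class_idx].backward()      # gradient of the target class score
    h1.remove(); h2.remove()
    # Channel weights: global-average-pooled gradients (Grad-CAM's alpha_k).
    weights = grads[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts[0]).sum(dim=1))   # weighted activation map
    return cam / (cam.max() + 1e-8)                # normalized heatmap, (b, H, W)
```

Upsampling the returned map to the frame size and overlaying it on the input reproduces the kind of heatmaps reported in the paper.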

Theoretical and Practical Implications

From a theoretical perspective, this work enriches the understanding of integrating spatial and temporal encoding for fine-grained emotion detection tasks. Practically, it suggests pathways for improving automated emotion recognition systems, potentially impacting applications in security, psychology, and human-computer interaction domains.

Future Work and Considerations

Considering the challenges of small sample sizes inherent in micro-expression datasets, future work could focus on integrating data augmentation strategies to further enhance model robustness. Additionally, exploring alternative deep learning architectures or hybrid models could provide insights into optimizing micro-expression recognition systems.

In conclusion, the ELRCN framework represents a meaningful advancement in leveraging deep learning for micro-expression recognition, showing promise in both single-domain and cross-domain settings. Further exploration and fine-tuning could yield greater applicability and accuracy, thereby supporting the practical deployment of these systems in real-world scenarios.