- The paper presents an attentional convolutional network that significantly improves facial expression recognition by concentrating on critical facial regions.
- It employs a spatial transformer network within a shallow architecture, achieving 70.02% accuracy on FER-2013 and 99.3% on the FERG dataset.
- The study effectively addresses intra-class variation and partial faces, offering promising advancements for HCI, animation, and security applications.
Deep-Emotion: Facial Expression Recognition Using Attentional Convolutional Network
The paper "Deep-Emotion: Facial Expression Recognition Using Attentional Convolutional Network" by Shervin Minaee and Amirali Abdolrashidi introduces an attentional convolutional network (ACN) approach to facial expression recognition (FER). It targets two challenges common in FER, intra-class variation and partial faces, and reports a significant improvement in accuracy across multiple challenging datasets.
Overview
Traditional FER methodologies depend on hand-crafted features such as SIFT, HOG, and LBP, typically followed by a classifier. While these methods perform adequately under controlled conditions, their efficacy diminishes under more variable, real-world circumstances. Recent trends have shifted towards deep learning, exploiting end-to-end frameworks to overcome these limitations.
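To make that baseline concrete, here is a minimal sketch of such a hand-crafted pipeline using HOG features and a linear SVM. The 48x48 input size, the random stand-in data, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical example: HOG descriptors + linear SVM on stand-in data.
import numpy as np
from skimage.feature import hog
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def extract_hog_features(images):
    """Compute a HOG descriptor for each 48x48 grayscale face crop."""
    return np.array([
        hog(img, orientations=8, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

# Placeholder data standing in for a dataset such as FER-2013 (7 emotion classes).
faces = np.random.rand(200, 48, 48)
labels = np.random.randint(0, 7, size=200)

X = extract_hog_features(faces)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0
)

clf = LinearSVC(C=1.0, max_iter=5000)
clf.fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Pipelines of this kind work reasonably well on posed, well-lit faces but degrade quickly with pose, occlusion, and lighting changes, which is the gap the paper's end-to-end approach aims to close.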
The proposed ACN framework aims to enhance FER by concentrating on the most salient regions of the face. This attention mechanism facilitates more precise emotion recognition by identifying critical facial areas, like the mouth and eyes, that are more indicative of specific emotions, and discounting less significant areas like the ears and hair.
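As a rough illustration of this idea of re-weighting spatial locations (a generic soft-attention sketch, not the paper's exact module, which uses a spatial transformer as described under Methodology), the following snippet shows a small branch that predicts a per-location weight map and uses it to emphasize salient regions of a CNN feature map.

```python
# Generic illustration of spatial attention over CNN feature maps (assumed sketch).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        # A 1x1 conv collapses channels into a single attention logit per location.
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feats):                    # feats: (B, C, H, W)
        attn = torch.sigmoid(self.score(feats))  # (B, 1, H, W), values in [0, 1]
        return feats * attn                      # up-weight salient locations

# Usage: re-weight features from any convolutional backbone.
feats = torch.randn(4, 32, 12, 12)
attended = SpatialAttention(32)(feats)
print(attended.shape)  # torch.Size([4, 32, 12, 12])
```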
Experimental Analysis and Results
The paper reports evaluations on four widely used datasets: FER-2013, CK+, FERG, and JAFFE. Notably, the method reaches 70.02% accuracy on FER-2013, outperforming prior deep learning approaches such as VGG+SVM and GoogleNet-based models. On FERG it reaches 99.3%, showing that the approach remains robust even on stylized, animated characters. The model also performs well on CK+ and JAFFE, with accuracies of 98.0% and 92.8%, respectively.
Methodology
The attention mechanism is implemented with a spatial transformer network (STN), which learns where to look so that the classifier attends to the most informative facial regions. The architecture is a relatively shallow network with fewer than ten layers, running counter to the common trend of ever-deeper models, and it is trained with a simple objective combining a classification loss with weight regularization. This keeps the model robust even when trained from scratch on smaller datasets. A hedged sketch of such an architecture and training objective follows.
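The sketch below pairs a shallow CNN with an STN front end and the described objective (cross-entropy classification loss plus L2 weight regularization via weight decay). Layer sizes, the 48x48 grayscale input, and the 7-class output are illustrative assumptions following the standard PyTorch STN recipe, not the paper's exact configuration.

```python
# Assumed sketch: shallow CNN with a spatial transformer (STN) front end.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepEmotionSketch(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        # Localization network: predicts a 2x3 affine transform of the input.
        self.localization = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 8 * 8, 32), nn.ReLU(), nn.Linear(32, 6)
        )
        # Initialize to the identity transform so training starts from "no warp".
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float)
        )
        # Shallow classification branch (well under ten layers).
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(10, 10, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(10 * 10 * 10, num_classes)

    def stn(self, x):
        xs = self.localization(x)
        theta = self.fc_loc(xs.flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

    def forward(self, x):          # x: (B, 1, 48, 48) grayscale faces
        x = self.stn(x)            # warp attention toward informative regions
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = DeepEmotionSketch()
# Classification loss + weight regularization (L2 via weight_decay).
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

images = torch.randn(4, 1, 48, 48)
labels = torch.randint(0, 7, (4,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

Initializing the localization branch to the identity transform means the network starts by looking at the whole face and only learns to warp toward the informative regions as training progresses, which is the usual way to keep STN training stable.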
Implications and Future Directions
The application of attention mechanisms in FER signifies a step towards more refined models that prioritize facial zones relevant to emotion detection. The paper's findings could advance FER systems used in diverse applications like human-computer interaction, animation, and security. Furthermore, it raises intriguing possibilities for future exploration, such as integrating more complex attention structures or applying this framework to other domains beyond emotion recognition. The incorporation of spatiotemporal dynamics and multi-modal data sources might also enhance recognition accuracy, providing a fertile ground for further investigation.
Conclusion
The attentional convolutional network approach articulated in this paper exemplifies a promising avenue for overcoming the variability and complexity inherent in facial expression recognition. By focusing computational resources on the most expressive regions of the face, this method not only achieves high accuracy across various datasets but also paves the way for future advancements in the field of artificial intelligence and machine vision.