Going Deeper in Facial Expression Recognition using Deep Neural Networks
The paper "Going Deeper in Facial Expression Recognition using Deep Neural Networks" by Ali Mollahosseini, David Chan, and Mohammad H. Mahoor presents a novel deep neural network (DNN) architecture for automated facial expression recognition (FER). By leveraging advancements in convolutional neural networks (CNNs) and specifically the Inception layer architecture, the proposed method addresses key challenges in FER, notably the poor generalizability of traditional approaches when applied to novel, unseen images.
Existing FER methods that rely on hand-engineered features such as HOG, LBPH, and Gabor filters generalize poorly across datasets, particularly those captured in uncontrolled, real-world settings. The DNN architecture proposed in this paper mitigates these limitations with a deeper, more complex network that learns robust features directly from diverse datasets.
Methodology
The proposed architecture consists of several layer groups (a code sketch follows the list):
- Two Convolutional Layers: Each followed by max pooling.
- Four Inception Layers: These layers employ multi-scale convolutions of sizes 1x1, 3x3, and 5x5 in parallel, facilitating improved feature extraction at multiple scales.
- Fully Connected Layers: Two fully connected layers at the top of the network act as the classifier.
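A minimal sketch of this topology is below, written in PyTorch rather than the Caffe toolbox the authors used; the channel counts, kernel sizes, and grayscale input are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Illustrative PyTorch sketch of the described topology (not the authors'
# Caffe implementation; layer sizes are assumptions).
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions, concatenated along channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

class FERNet(nn.Module):
    def __init__(self, num_classes=7):  # six basic expressions + neutral
        super().__init__()
        self.features = nn.Sequential(
            # Two convolutional layers, each followed by max pooling.
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 96, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            # Four Inception layers (each block outputs 3 * out_ch channels).
            InceptionBlock(96, 32), nn.ReLU(),
            InceptionBlock(96, 32), nn.ReLU(),
            InceptionBlock(96, 48), nn.ReLU(),
            InceptionBlock(144, 48), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Two fully connected layers act as the classifier.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(144, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

With 48x48 grayscale inputs, this sketch's feature maps shrink to 5x5 before global average pooling, so the classifier sees a compact 144-dimensional vector.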
The network takes registered facial images as input and classifies them into one of six basic expressions (anger, disgust, fear, happiness, sadness, and surprise) or a neutral expression. The implementation uses the Caffe toolbox, and extensive experiments were conducted on seven publicly available facial expression databases: MultiPIE, MMI, CK+, DISFA, FERA, SFEW, and FER2013.
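Face registration is the input-side prerequisite here. As a simpler stand-in for the paper's registration step, the sketch below crops the largest detected face with OpenCV's bundled Haar cascade and rescales it; the detector choice and the 48x48 target size are assumptions for illustration, not the authors' pipeline.

```python
# Hedged preprocessing sketch: crop and rescale a face before classification.
# The Haar-cascade crop is a stand-in for proper face registration.
import cv2
import numpy as np

def preprocess(path, size=48):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        raise ValueError("no face found")
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detection
    face = cv2.resize(gray[y:y + h, x:x + w], (size, size))
    return face.astype(np.float32) / 255.0  # normalize to [0, 1]
```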
Experimental Results
Experiments were performed in both subject-independent and cross-database settings. In the subject-independent setting, where no subject's images appear in both the training and test sets, the network matched or exceeded state-of-the-art accuracy on multiple databases (a protocol sketch follows the list):
- MultiPIE: 94.7%.
- MMI: 77.6%.
- CK+: 93.2%.
- FER2013: 66.4%.
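To illustrate the subject-independent protocol, the hedged sketch below builds splits with scikit-learn's GroupKFold so that no subject crosses the train/test boundary; the five-fold setting and the random stand-in data are assumptions, not the paper's exact setup.

```python
# Subject-independent split sketch: group samples by subject ID so that no
# subject's images land in both train and test folds.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((100, 48 * 48))              # stand-in image features
y = rng.integers(0, 7, size=100)            # 7 expression labels
subjects = rng.integers(0, 20, size=100)    # subject ID per image

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    # Sanity check: the two folds share no subjects.
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```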
Cross-database evaluations, where the model is trained on all databases except one and tested on the held-out one, reflect the network's generalizability (see the sketch after the list):
- CK+: 64.2% when trained on the other databases.
- FER2013: 34.0%.
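The leave-one-database-out loop below sketches this protocol; the database names come from the paper, while train_model and evaluate are hypothetical stubs standing in for actual training and evaluation.

```python
# Cross-database protocol sketch: hold one database out entirely, train on
# the union of the rest. train_model and evaluate are hypothetical stubs.
def train_model(train_sets):
    return {"trained_on": train_sets}   # placeholder "model"

def evaluate(model, held_out):
    return 0.0                          # placeholder accuracy

datasets = ["MultiPIE", "MMI", "CK+", "DISFA", "FERA", "SFEW", "FER2013"]
for held_out in datasets:
    train_sets = [d for d in datasets if d != held_out]
    model = train_model(train_sets)
    accuracy = evaluate(model, held_out)
    print(f"train on {len(train_sets)} databases, test on {held_out}: {accuracy:.1%}")
```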
These results indicate that the architecture generalizes across datasets better than traditional CNNs and other methods whose classifier parameters are fine-tuned to a specific dataset.
Implications and Future Directions
The Inception layers in the proposed DNN architecture allow a deeper, wider model without a prohibitive increase in computational cost. This helps the network learn features that generalize to new scenarios, supporting the theoretical and practical case for deep sparse networks approximated by constructs such as Inception modules. Gaining depth and breadth without substantial computational overhead, while resisting overfitting, is a valuable property for future FER systems.
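The snippet below works through the standard parameter-count arithmetic behind that efficiency claim, with illustrative channel sizes (192, 16, and 32 are assumptions, not the paper's values): a 1x1 bottleneck before a 5x5 convolution cuts the weight count roughly tenfold.

```python
# Parameter count of a plain 5x5 convolution vs. a 1x1 bottleneck + 5x5,
# ignoring biases. Channel sizes are illustrative assumptions.
c_in, c_mid, c_out = 192, 16, 32
direct = c_in * c_out * 5 * 5                        # 153,600 weights
bottleneck = c_in * c_mid + c_mid * c_out * 5 * 5    # 3,072 + 12,800 = 15,872
print(direct, bottleneck)                            # roughly 9.7x fewer weights
```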
Future developments in this domain may explore:
- Enhanced face registration techniques to significantly improve the initial preprocessing step.
- Adopting unsupervised learning methods for FER in unconstrained, in-the-wild settings where labeled data may be scarce or inconsistent.
- Integration of multimodal data (e.g., audio-visual) to further bolster FER accuracies in complex real-world scenarios.
In conclusion, the proposed deep neural network architecture represents a substantial advance in the FER field, offering promising accuracy and generalizability. The combination of conventional CNN layers with Inception modules sets a benchmark for effective and efficient FER suitable for a range of applications in human-computer interaction and beyond.