FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition (1609.06591v2)

Published 21 Sep 2016 in cs.CV

Abstract: Relatively small data sets available for expression recognition research make the training of deep networks for expression recognition very challenging. Although fine-tuning can partially alleviate the issue, the performance is still below acceptable levels as the deep features probably contain redundant information from the pre-trained domain. In this paper, we present FaceNet2ExpNet, a novel idea to train an expression recognition network based on static images. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed. In the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully-connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics. Evaluations on four public expression databases, CK+, Oulu-CASIA, TFD, and SFEW demonstrate that our method achieves better results than state-of-the-art.

Citations (355)

Summary

  • The paper introduces a two-stage framework that uses pre-trained face recognition models to regularize and enhance facial expression recognition on small datasets.
  • It employs a novel regression loss on deep convolutional features, achieving up to 98.6% accuracy on CK+ and outperforming existing methods on other benchmarks.
  • The approach highlights the effective use of transfer learning to mitigate data scarcity, paving the way for applications in surveillance, robotics, and user interfaces.

FaceNet2ExpNet: Enhancing Facial Expression Recognition with Deep Learning

The paper "FaceNet2ExpNet: Regularizing a Deep Face Recognition Net for Expression Recognition" proposes a methodology for facial expression recognition from small datasets. The authors introduce FaceNet2ExpNet, a framework that harnesses pre-trained face recognition networks to improve expression recognition performance. The approach tackles two significant issues: the small number of expression training images, and the overfitting that results from the size discrepancy between face recognition and expression datasets.

Methodology and Approach

FaceNet2ExpNet encompasses a two-stage training process designed to capitalize on the robust feature extraction capabilities of deep convolutional neural networks (DCNNs). The first stage involves pre-training the convolutional layers of an expression recognition network, guided by a face recognition model. This is achieved by utilizing the face recognition model as a regularizer, a novel concept in which the distribution of high-level neurons in the expression network is modeled using information derived from the face model. This leads to a regression loss that aligns the expression network's features with those from the face net.
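The feature-regression idea behind the pre-training stage can be sketched as follows. The function name, tensor shapes, and the exact normalization are illustrative assumptions; the paper's loss may weight or average the terms differently:

```python
import numpy as np

def feature_regression_loss(exp_features, face_features):
    """Squared-L2 regression of expression-net conv features onto the
    (frozen) face-net features for the same mini-batch of images.

    Shapes are assumed to be (batch, channels, h, w). The face features
    serve as a fixed regression target, so in a real framework no
    gradient would flow into the face net.
    """
    diff = exp_features - face_features
    # per-sample squared L2 distance over all feature dimensions,
    # averaged over the batch
    per_sample = np.sum(diff.reshape(diff.shape[0], -1) ** 2, axis=1)
    return float(np.mean(per_sample))

# sanity check: identical features incur zero loss
feats = np.random.randn(4, 8, 7, 7)
assert feature_regression_loss(feats, feats) == 0.0
```

Because the target is a fixed tensor rather than a class label, this stage needs no expression annotations at all; the face net alone shapes the convolutional features.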

In the second stage, fully connected layers are appended to the pre-trained convolutional layers, and the entire network is trained jointly under label supervision. This refining stage enhances the network's ability to capture discriminative expression features. The authors find that features extracted from a late convolutional stage, such as pool5 in VGG-16, offer the best balance between richness of supervision and feature discriminativeness.
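A minimal sketch of the two-stage schedule is given below. Single matrices stand in for the convolutional stack and the fully connected head; the sizes, learning rates, step counts, and the six-class setup are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_FEAT, N_CLASSES = 32, 16, 6
face_W = rng.standard_normal((D_FEAT, D_IN))         # frozen face-net "conv stack"
conv_W = 0.01 * rng.standard_normal((D_FEAT, D_IN))  # expression-net conv layers
fc_W   = 0.01 * rng.standard_normal((N_CLASSES, D_FEAT))  # appended head

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal(D_IN)  # one toy "image"
label = 2                      # its toy expression label

# Stage 1: regress conv features onto the face net's features (no labels used).
for _ in range(200):
    err = conv_W @ x - face_W @ x
    conv_W -= 0.01 * np.outer(err, x) / D_IN

# Stage 2: append the head and train the whole network with label supervision.
for _ in range(200):
    feat = conv_W @ x
    probs = softmax(fc_W @ feat)
    g = probs.copy()
    g[label] -= 1.0                        # dL/dlogits for cross-entropy
    fc_grad = np.outer(g, feat)
    conv_grad = np.outer(fc_W.T @ g, x)    # gradient also reaches the conv layers
    fc_W   -= 0.1 * fc_grad
    conv_W -= 0.01 * conv_grad

# after joint training, the toy network classifies its training example
assert np.argmax(fc_W @ (conv_W @ x)) == label
```

The key structural point the sketch preserves is that stage 1 updates only the convolutional weights against the frozen face net, while stage 2 lets the classification gradient flow through both the new head and the pre-trained layers.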

Experimental Results

The approach is evaluated on four prominent public expression databases, CK+, Oulu-CASIA, TFD, and SFEW, and outperforms contemporary methods on all of them. On CK+, FaceNet2ExpNet achieves 98.6% accuracy for six expression classes and 96.8% for eight classes. On Oulu-CASIA it reaches 87.71%, surpassing existing methods. On TFD the accuracy is 88.9%, and on the more challenging SFEW dataset the method attains 55.15% when supplemented with additional training data. These results highlight the strength of the two-stage approach not only in constrained scenarios but also in more diverse, uncontrolled environments.

Implications and Future Directions

The primary contribution of the paper lies in its capability to improve learning from small datasets, a common predicament in expression recognition. By leveraging features from a network pre-trained on a much larger face recognition dataset, the method promises gains in domains where limited labeled data are available. The approach could potentially be extended beyond facial expressions to other computer vision and machine learning tasks where data scarcity is an issue.

Theoretically, this work advances the understanding of how pre-trained models in one domain can assist in learning tasks in another related domain. Practically, FaceNet2ExpNet could be instrumental in creating more effective and efficient facial expression recognition systems suitable for real-world applications in surveillance, robotics, and user-interface technologies.

Looking forward, future research may extend this model to video-based expression recognition or adapt it for real-time applications by investigating more lightweight architectures. Examining its applicability to other modalities, such as audio or text, could also offer broader insights into multi-modal learning frameworks. Overall, FaceNet2ExpNet represents a substantive step toward more intelligent reuse of pre-trained models, reinforcing the utility of transfer learning and fine-tuning techniques.