- The paper demonstrates that a hybrid approach combining deep CNN and handcrafted features yields state-of-the-art results.
- It employs a local learning framework with k-NN based SVM re-training to capture nuanced spatial and expression-specific details.
- The method outperforms baselines on FER 2013, FER+, and AffectNet, significantly enhancing classification accuracy.
Local Learning with Deep and Handcrafted Features for Facial Expression Recognition
The paper "Local Learning with Deep and Handcrafted Features for Facial Expression Recognition" presents a hybrid approach that combines features learned automatically by deep networks with handcrafted features for enhanced facial expression recognition, producing state-of-the-art results across several benchmark datasets. The work stands out for its effective integration of convolutional neural network (CNN) features with handcrafted representations and for its use of a local learning framework. This strategy improves classification accuracy by capturing spatial structure and subtle, expression-specific cues more effectively than conventional single-model approaches.
Convolutional neural networks have consistently demonstrated their potential for extracting automatic features in various image recognition tasks. The paper explores and compares several CNN architectures, namely VGG-face, VGG-f, and VGG-13, which are either used with pre-trained weights or fine-tuned specifically for the task at hand. A noteworthy training framework, Dense-Sparse-Dense (DSD), is applied to enhance the CNN models: it prunes low-magnitude weights, retrains the sparse network, and then restores density, refining the deep features that are extracted.
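The sparse phase of DSD can be sketched as masking out the smallest-magnitude weights of a layer before retraining. The sketch below is illustrative only; the sparsity ratio and layer shape are placeholders, not the paper's settings.

```python
import numpy as np

# Hedged sketch of the sparse step in Dense-Sparse-Dense (DSD) training:
# keep the largest-magnitude weights, zero the rest, retrain with the
# mask applied, then restore density. Sparsity value is illustrative.
def dsd_sparse_mask(weights, sparsity=0.3):
    """Return a 0/1 mask that prunes the `sparsity` fraction of
    smallest-magnitude entries in `weights`."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    # k-th smallest absolute value acts as the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))              # stand-in layer weight matrix
mask = dsd_sparse_mask(W, sparsity=0.3)
W_sparse = W * mask                        # weights used in the sparse phase
print(round(1 - mask.mean(), 2))           # fraction pruned, about 0.3
```

In the final dense phase, training resumes from `W_sparse` with the mask removed, so the pruned connections can recover.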
Handcrafted features, generated through the Bag-of-Visual-Words (BOVW) model, introduce an additional layer of discriminative power. By computing dense SIFT descriptors and encoding spatial information through a spatial pyramid, the approach captures detailed spatial visual cues absent from some automatic feature representations.
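The BOVW pipeline can be sketched as: cluster descriptors into a visual vocabulary, quantize each face's descriptors to the nearest visual word, and histogram the words per spatial cell. The sketch below substitutes random vectors for dense SIFT descriptors, and the vocabulary size and 2x2 cell layout are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
n_words = 16
# Learn a small visual vocabulary from descriptors pooled over training faces
# (random stand-ins here; the paper uses dense SIFT descriptors).
vocab = MiniBatchKMeans(n_clusters=n_words, n_init=3, random_state=0)
vocab.fit(rng.normal(size=(500, 128)))

def bovw_pyramid(descriptors, positions, img_size=48, cells=2):
    """Quantize descriptors into visual words and histogram them
    per spatial-pyramid cell, then L1-normalize."""
    words = vocab.predict(descriptors)
    hist = np.zeros((cells, cells, n_words))
    cell = img_size / cells
    for w, (x, y) in zip(words, positions):
        hist[int(y // cell), int(x // cell), w] += 1
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)

desc = rng.normal(size=(60, 128))          # descriptors for one face
pos = rng.uniform(0, 48, size=(60, 2))     # their (x, y) grid locations
feat = bovw_pyramid(desc, pos)
print(feat.shape)                          # 2 * 2 cells * 16 words = (64,)
```

The per-cell histograms are what give BOVW its spatial sensitivity: the same visual word contributes to different histogram bins depending on where it occurs on the face.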
The fusion of deep and handcrafted features is achieved through concatenation, followed by normalization, forming a comprehensive feature vector. The feature vectors are then classified using Support Vector Machines (SVMs), both globally and in a locality-sensitive paradigm; notably, the fusion itself already yields significant accuracy gains. The local learning framework, which applies a k-nearest neighbors (k-NN) criterion to select a neighborhood of training samples for dynamic SVM training, is central to the improved performance. The result is a non-linear decision boundary tailored to each test image, addressing variations that a standard global classifier might overlook.
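The fusion and local learning steps can be sketched as follows. This is a minimal illustration, assuming stand-in random features in place of the real CNN and BOVW representations; the neighborhood size k, the linear kernel, and the shortcut when all neighbors share a label are illustrative choices, not the paper's exact protocol.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def local_svm_predict(X_train, y_train, X_test, k=25):
    """For each test vector, train an SVM on its k nearest training
    samples only, yielding a classifier tailored to that test point."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    preds = []
    for x in X_test:
        _, idx = nn.kneighbors(x.reshape(1, -1))
        neighbors, labels = X_train[idx[0]], y_train[idx[0]]
        if len(set(labels)) == 1:          # all neighbors agree: no SVM needed
            preds.append(labels[0])
            continue
        clf = SVC(kernel="linear", C=1.0).fit(neighbors, labels)
        preds.append(clf.predict(x.reshape(1, -1))[0])
    return np.array(preds)

rng = np.random.default_rng(0)
deep = rng.normal(size=(200, 8))           # stand-in deep CNN features
hand = rng.normal(size=(200, 4))           # stand-in BOVW features
X = normalize(np.hstack([deep, hand]))     # concatenate, then L2-normalize
y = (X[:, 0] > 0).astype(int)              # toy two-class labels
preds = local_svm_predict(X[:150], y[:150], X[150:], k=25)
print(preds.shape)                         # one prediction per test face
```

Because each SVM sees only a local neighborhood, the union of these local linear boundaries behaves like a single non-linear decision surface over the whole feature space.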
The empirical evaluation of this hybrid methodology on three benchmark datasets (FER 2013, FER+, and AffectNet) highlights its effectiveness. With the local SVM method, the approach achieves 75.42% accuracy on FER 2013, 87.76% on FER+, and 59.58% on eight-way classification on AffectNet, outperforming the baseline models. Improvements of more than one percentage point over the leading state-of-the-art results in all cases make a compelling case for the utility of the combined approach.
These results not only underline the method's success but also highlight its practical implications for automated systems that analyze human emotions in sectors such as health informatics, human-computer interaction, and the behavioral sciences. Combining deep and handcrafted features with a local learning strategy addresses several challenges at once, improving model adaptability and generalization in real-world settings.
Future research may extend the method to video-based emotion recognition, thereby addressing the temporal dynamics of expressions. Moreover, adapting the model to domain-specific tasks, such as distinguishing voluntary from involuntary facial expressions, is an intriguing direction for strengthening its applicability and robustness in more nuanced contexts of affective computing.