Do Deep Neural Networks Learn Facial Action Units When Doing Expression Recognition? (1510.02969v3)

Published 10 Oct 2015 in cs.CV, cs.LG, and cs.NE

Abstract: Despite being the appearance-based classifier of choice in recent years, relatively few works have examined how much convolutional neural networks (CNNs) can improve performance on accepted expression recognition benchmarks and, more importantly, examine what it is they actually learn. In this work, not only do we show that CNNs can achieve strong performance, but we also introduce an approach to decipher which portions of the face influence the CNN's predictions. First, we train a zero-bias CNN on facial expression data and achieve, to our knowledge, state-of-the-art performance on two expression recognition benchmarks: the extended Cohn-Kanade (CK+) dataset and the Toronto Face Dataset (TFD). We then qualitatively analyze the network by visualizing the spatial patterns that maximally excite different neurons in the convolutional layers and show how they resemble Facial Action Units (FAUs). Finally, we use the FAU labels provided in the CK+ dataset to verify that the FAUs observed in our filter visualizations indeed align with the subject's facial movements.

Citations (263)

Summary

  • The paper shows that zero-bias CNNs attain 95.1% and 88.6% accuracies on CK+ and TFD, respectively, by learning features corresponding to facial action units.
  • Using visualization techniques, the study reveals that specific CNN neurons consistently activate in response to facial regions linked to predefined action units.
  • Quantitative KL divergence analysis confirms a strong correlation between CNN filter activations and human-coded facial action unit labels.

Deep Neural Networks and Facial Action Units in Expression Recognition

The paper "Do Deep Neural Networks Learn Facial Action Units When Doing Expression Recognition?" by Khorrami et al. presents a comprehensive paper examining the capability of convolutional neural networks (CNNs) to discern Facial Action Units (FAUs) within the context of facial expression recognition. The authors utilize CNNs as an appearance-based classifier to investigate whether these networks can match and potentially surpass existing benchmarks in expression recognition while simultaneously uncovering the specific facial components that guide their predictions.

The research leverages a zero-bias CNN architecture that reportedly achieves state-of-the-art performance on two prominent facial expression datasets: the Extended Cohn-Kanade (CK+) dataset and the Toronto Face Dataset (TFD). The CK+ dataset contains 1308 images labeled with eight expression categories, while TFD comprises 4178 images labeled with seven.
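The paper's defining architectural choice is a CNN whose layers carry no bias terms. The sketch below shows one way such a zero-bias classifier could be laid out in PyTorch; the layer widths, kernel sizes, 96x96 input resolution, and the placement of dropout are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of a zero-bias CNN expression classifier (PyTorch).
# The defining property is that convolutional and hidden linear layers omit
# bias terms; all layer sizes below are assumptions chosen for illustration.
import torch
import torch.nn as nn

class ZeroBiasCNN(nn.Module):
    def __init__(self, num_classes: int = 8):  # e.g. the eight CK+ expression labels
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, bias=False),    # no bias term
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=5, bias=False),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=5, bias=False),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),                        # pool to a fixed 4x4 grid
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                              # dropout used during training
            nn.Linear(256 * 4 * 4, 300, bias=False),
            nn.ReLU(),
            nn.Linear(300, num_classes),                    # softmax applied via the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One grayscale face crop -> expression logits.
logits = ZeroBiasCNN()(torch.randn(1, 1, 96, 96))
```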

Significant results from the paper include:

  • Performance Enhancement: The CNN architecture improves upon prior performance benchmarks on both datasets. When trained with data augmentation and dropout, the zero-bias CNN reaches 95.1% accuracy on the CK+ dataset and 88.6% on the TFD dataset.
  • Visualization of Learned Features: Leveraging visualization techniques inspired by Zeiler and Fergus, as well as Springenberg et al., the paper qualitatively shows that certain neurons in the CNN consistently respond to specific regions of the face that align with predefined FAUs (a minimal sketch of this procedure appears after this list). This is significant because it ties neural network responses to human-coded facial anatomy and muscle movements.
  • FAU and CNN Feature Correspondence: The paper shows that CNN filters learn features analogous to FAUs through a quantitative analysis using the FAU labels of the CK+ dataset. The correlation between filter activation and FAU presence is measured with KL divergences, providing a quantifiable alignment between network responses and traditional action unit codes (a sketch of one such analysis follows the visualization example after this list).
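To make the visualization bullet concrete, the following sketch implements a guided-backprop style saliency map in the spirit of Springenberg et al., applied to a single convolutional filter. It assumes the hypothetical ZeroBiasCNN defined earlier (any ReLU-based CNN would do) and is an illustrative procedure rather than the paper's exact pipeline: it computes which input pixels most increase one filter's activation, so a filter that behaves like an action-unit detector yields a saliency map concentrated on the corresponding facial region (e.g. mouth corners or brows).

```python
# Sketch of a guided-backprop style visualization for one conv filter.
# Assumes the ZeroBiasCNN from the earlier sketch (or any ReLU CNN).
import torch
import torch.nn as nn

def guided_backprop_saliency(model: nn.Module, image: torch.Tensor,
                             conv_layer: nn.Module, filter_idx: int) -> torch.Tensor:
    """Gradient of one filter's mean activation w.r.t. the input image,
    with negative gradients suppressed at every ReLU (guided backprop)."""
    hooks, activations = [], {}

    def relu_hook(module, grad_in, grad_out):
        return (torch.clamp(grad_in[0], min=0.0),)   # pass only positive gradients back

    def conv_hook(module, inputs, output):
        activations["target"] = output               # keep the chosen layer's response

    for m in model.modules():
        if isinstance(m, nn.ReLU):
            hooks.append(m.register_full_backward_hook(relu_hook))
    hooks.append(conv_layer.register_forward_hook(conv_hook))

    image = image.clone().requires_grad_(True)
    model.eval()
    model(image)
    activations["target"][0, filter_idx].mean().backward()

    for h in hooks:
        h.remove()
    return image.grad.detach().abs()                 # saliency map over input pixels

# Usage (model and layer index come from the earlier illustrative sketch).
model = ZeroBiasCNN()
saliency = guided_backprop_saliency(model, torch.randn(1, 1, 96, 96),
                                    model.features[6], filter_idx=0)
```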
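For the correspondence bullet, here is one plausible way to quantify filter/FAU alignment with KL divergence. This is a sketch under stated assumptions rather than the paper's exact analysis: per-image filter activations are histogrammed separately for images where a given action unit is coded present versus absent, and the divergence between the two histograms measures how selectively that filter responds to the action unit.

```python
# Sketch: KL-divergence-based alignment between CNN filters and FAU labels.
# Illustrative procedure, not necessarily the paper's exact methodology.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-8) -> float:
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def filter_fau_alignment(activations: np.ndarray, fau_labels: np.ndarray,
                         n_bins: int = 20) -> np.ndarray:
    """activations: (n_images, n_filters) per-image filter responses
       fau_labels:  (n_images, n_faus) binary FAU annotations (e.g. from CK+)
       returns:     (n_filters, n_faus) KL(present || absent) scores"""
    n_filters, n_faus = activations.shape[1], fau_labels.shape[1]
    scores = np.zeros((n_filters, n_faus))
    for f in range(n_filters):
        bins = np.histogram_bin_edges(activations[:, f], bins=n_bins)
        for a in range(n_faus):
            present = activations[fau_labels[:, a] == 1, f]
            absent = activations[fau_labels[:, a] == 0, f]
            p, _ = np.histogram(present, bins=bins)
            q, _ = np.histogram(absent, bins=bins)
            scores[f, a] = kl_divergence(p.astype(float), q.astype(float))
    return scores

# Toy usage with random data standing in for real CK+ activations and AU codes.
rng = np.random.default_rng(0)
scores = filter_fau_alignment(rng.random((500, 256)), rng.integers(0, 2, (500, 17)))
```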

These findings suggest an expansion of the traditional scope of expression recognition: CNNs not only improve classification accuracy but also make model decisions more interpretable through their correlation with action units. This dual utility broadens the applicability of these models in interactive AI systems, where precise interpretation of human emotion is critical.

Potential future developments in artificial intelligence could extend emotion recognition beyond static images toward dynamic facial expressions in video, possibly integrating temporal memory mechanisms such as LSTMs or GRUs to capture how expressions evolve over time. Moreover, expanding the diversity of training data to better represent varied demographic groups could further improve the models' robustness across applications and populations.

The paper by Khorrami et al. therefore represents an important step in understanding and leveraging CNNs, not just for identifying high-level categorical emotions but also for illuminating the underlying facial action units from which those emotions are composed, providing both a practical tool for emotion-aware AI systems and a framework for future exploration.