An In-Depth Analysis of Joint Learning in Facial Behavior Recognition
The paper presents a carefully constructed case for integrating and jointly training facial behavior analysis tasks within a single unified network, named FaceBehaviorNet. In this framework, the tasks of recognizing basic facial expressions, detecting facial action units (AUs), and estimating continuous affect dimensions such as valence and arousal are learned together rather than in isolation, with the stated aim of improving performance on each task.
The FaceBehaviorNet Framework
FaceBehaviorNet is presented as an end-to-end multi-task deep learning architecture. It combines three tasks that are usually studied independently in facial behavior research: expression classification, AU detection, and estimation of the affective dimensions. The architecture uses the convolutional layers of VGG-FACE, followed by fully connected layers whose outputs form the predictions for all three tasks. Because the tasks share a common feature space, the network learns representations that are robust and suited to concurrent task processing.
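To make the shared-backbone design concrete, the following is a minimal PyTorch sketch of a network with one shared feature extractor and three task-specific output heads. The layer sizes, the number of AUs, and the use of torchvision's VGG-16 as a stand-in for the VGG-FACE weights are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FaceBehaviorNetSketch(nn.Module):
    """Shared backbone with three task heads: expressions, AUs, valence/arousal."""
    def __init__(self, num_expressions=7, num_aus=17):
        super().__init__()
        # Stand-in backbone: torchvision's VGG-16 features instead of VGG-FACE weights.
        self.backbone = vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.shared_fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 2048), nn.ReLU(inplace=True),
        )
        # Task-specific output layers operating on the shared representation.
        self.expr_head = nn.Linear(2048, num_expressions)  # categorical expression logits
        self.au_head = nn.Linear(2048, num_aus)            # per-AU activation logits
        self.va_head = nn.Linear(2048, 2)                  # valence and arousal in [-1, 1]

    def forward(self, x):
        feats = self.shared_fc(self.pool(self.backbone(x)))
        return {
            "expression": self.expr_head(feats),
            "aus": self.au_head(feats),
            "valence_arousal": torch.tanh(self.va_head(feats)),
        }
```

The key design point reflected here is that all three predictions are read off the same learned representation, which is what allows supervision from one task to shape the features used by the others.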
Methodologies
A novel aspect of this research is the introduction of two coupling strategies, co-annotation and distribution matching, that explicitly interlink the tasks. Co-annotation uses the labels of one task to infer supervisory signals for another, promoting synergy during learning. Distribution matching instead aligns the predicted output distributions of the different tasks, enforcing consistency in the model's overall interpretation of facial behavior.
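The sketch below illustrates how these two coupling ideas could be expressed as auxiliary losses. The expression-to-AU prior table, the choice of binary cross-entropy for co-annotation, and the mean-squared distribution-matching penalty are all hypothetical simplifications; the paper's actual relationship table and loss formulations are not reproduced here.

```python
import torch
import torch.nn.functional as F

# Hypothetical prior linking expressions to AUs (rows: expressions, cols: AUs).
# Values are illustrative "probability that the AU is active given the expression";
# the relationship table used in the paper is not reproduced here.
EXPR_AU_PRIOR = torch.tensor([
    # AU1  AU2  AU4  AU6  AU12
    [0.0, 0.0, 0.0, 0.9, 0.9],   # e.g., happiness
    [0.9, 0.0, 0.8, 0.0, 0.0],   # e.g., sadness
    # ... one row per expression category
])

def co_annotation_loss(au_logits, expr_labels):
    """Co-annotation sketch: infer soft AU targets from the expression label
    and supervise the AU head with them."""
    soft_au_targets = EXPR_AU_PRIOR[expr_labels]          # (batch, num_aus)
    return F.binary_cross_entropy_with_logits(au_logits, soft_au_targets)

def distribution_matching_loss(au_logits, expr_logits):
    """Distribution-matching sketch: encourage the AU predictions to agree with
    the AU activations implied by the predicted expression posterior."""
    expr_probs = F.softmax(expr_logits, dim=-1)           # (batch, num_expr)
    implied_au = expr_probs @ EXPR_AU_PRIOR               # expected AU activations
    au_probs = torch.sigmoid(au_logits)
    return F.mse_loss(au_probs, implied_au)
```

Both terms would be added to the ordinary per-task supervised losses, so images annotated for only one task still contribute gradient signal to the others.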
The researchers trained on 5 million images drawn from publicly available in-the-wild databases, spanning diverse capture conditions and demographics. Training was performed end to end, with each batch deliberately split across annotation sources so that all tasks received balanced exposure throughout the iterations (see the sketch below).
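One simple way to realize such balanced exposure is to compose every batch from fixed proportions of the differently annotated sources. The sketch below assumes hypothetical source names, proportions, and a batch size of 256; none of these are values reported in the paper.

```python
import random

def make_mixed_batch(datasets, proportions, batch_size=256):
    """Compose one training batch from several annotation sources so that every
    iteration covers all three tasks. `datasets` maps a source name to a list of
    samples; `proportions` maps the same names to fractions summing to 1."""
    batch = []
    for name, frac in proportions.items():
        k = max(1, int(round(frac * batch_size)))
        batch.extend(random.sample(datasets[name], k))
    random.shuffle(batch)
    return batch[:batch_size]

# Usage sketch with three hypothetical sources, one per annotation type:
# proportions = {"va_annotated": 0.4, "expr_annotated": 0.3, "au_annotated": 0.3}
# batch = make_mixed_batch(datasets, proportions)
```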
Results and Evaluation
The paper presents compelling evidence that FaceBehaviorNet, when trained with the co-annotation and distribution-matching schemes, outperforms not only independently trained single-task networks but also state-of-the-art methods on multiple benchmark datasets. On datasets such as Aff-Wild, AffectNet, and DISFA, the network achieves notable gains in the Concordance Correlation Coefficient (CCC) for valence-arousal estimation and in F1 score for AU detection. These results underscore the benefit of the proposed multi-task approach over training each task separately.
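For reference, the CCC used to score valence-arousal estimation rewards both correlation with the ground truth and agreement in mean and variance. A straightforward NumPy implementation of the standard formula follows.

```python
import numpy as np

def concordance_correlation_coefficient(predictions, targets):
    """Concordance Correlation Coefficient (CCC):
    2*cov(p, t) / (var(p) + var(t) + (mean(p) - mean(t))**2)."""
    predictions = np.asarray(predictions, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    mean_p, mean_t = predictions.mean(), targets.mean()
    var_p, var_t = predictions.var(), targets.var()
    covariance = np.mean((predictions - mean_p) * (targets - mean_t))
    return 2.0 * covariance / (var_p + var_t + (mean_p - mean_t) ** 2)
```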
Implications and Future Directions
The implications of this research are manifold. Practically, it suggests that jointly training on diverse facial behavior recognition tasks yields generalized features that transfer to zero-shot and few-shot settings, such as compound emotion recognition. Theoretically, it offers a view in which complex, interdependent behaviors are modeled as the collective outcome of shared learning rather than as disjoint processes.
Future work in facial behavior analysis could further explore the latent relationships among emotional states, experiment with other holistic task-integration frameworks, or directly address more complex compound expressions. Examining task relatedness from different psychological perspectives also remains a rich area for continued investigation.
Overall, the paper both broadens the computational understanding of facial behavior and sets a notable precedent for integrative approaches in affective computing.