Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition (1905.04075v2)

Published 10 May 2019 in cs.CV

Abstract: Occlusion and pose variations, which can change facial appearance significantly, are two major obstacles for automatic Facial Expression Recognition (FER). Though automatic FER has made substantial progress in the past few decades, occlusion-robust and pose-invariant issues of FER have received relatively less attention, especially in real-world scenarios. This paper addresses the real-world pose and occlusion robust FER problem with three-fold contributions. First, to stimulate the research of FER under real-world occlusions and variant poses, we build several in-the-wild facial expression datasets with manual annotations for the community. Second, we propose a novel Region Attention Network (RAN), to adaptively capture the importance of facial regions for occlusion and pose variant FER. The RAN aggregates and embeds a varied number of region features produced by a backbone convolutional neural network into a compact fixed-length representation. Last, inspired by the fact that facial expressions are mainly defined by facial action units, we propose a region biased loss to encourage high attention weights for the most important regions. We validate our RAN and region biased loss on both our built test datasets and four popular datasets: FERPlus, AffectNet, RAF-DB, and SFEW. Extensive experiments show that our RAN and region biased loss largely improve the performance of FER with occlusion and variant pose. Our method also achieves state-of-the-art results on FERPlus, AffectNet, RAF-DB, and SFEW. Code and the collected test data will be publicly available.

Summary

The paper presents a Region Attention Network (RAN) that dynamically weights facial regions to improve recognition amid occlusions and pose variations.
It introduces a region-biased loss inspired by facial action units to direct focus towards crucial facial features.
Experimental validations on FERPlus, AffectNet, RAF-DB, and SFEW demonstrate state-of-the-art performance, achieving up to 89.16% accuracy on FERPlus.

Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition

The paper "Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition" presents an attention-based approach to facial expression recognition (FER) that targets faces with varying poses and occlusions.

Problem Statement and Contributions

Facial expression recognition often encounters difficulties due to variations in facial pose and occlusions, such as sunglasses or masks, which significantly alter facial appearance. Despite advances in FER, these issues remain underexplored in real-world scenarios. This paper contributes to the field in several ways:

Dataset Annotation: The authors annotate multiple FER datasets with pose and occlusion attributes to aid research in real-world conditions. This improves the availability of challenging benchmark datasets.
Proposed Method - Region Attention Network (RAN): RAN adaptively weights facial regions by their importance, which improves recognition under occlusions and pose variations. This contrasts with conventional CNN pipelines that process the whole face image without accounting for region-specific distortions.
Region Biased Loss: Inspired by facial action units, the authors propose a region biased loss function. This loss is specifically designed to prioritize significant facial regions in the attention mechanism, further boosting recognition accuracy.
Evaluation on Robust Datasets: RAN and the region biased loss were validated across several datasets, including FERPlus, AffectNet, RAF-DB, and SFEW. The experimental results confirm significant improvements, achieving state-of-the-art performance in FER under challenging conditions.

Methodological Insights

The RAN architecture aggregates features from multiple facial regions. A backbone CNN first extracts a feature vector for each region. A self-attention module then computes initial attention weights, and a relation-attention module refines these weights by considering interactions between regions. Aggregating the weighted region features yields a compact fixed-length representation that is robust to occluded and variably posed faces. This two-stage refinement of attention distinguishes RAN from existing approaches.
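The two-stage weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes each attention module is a single linear scorer followed by a sigmoid, and that the relation stage scores each region feature concatenated with the coarse global feature. The names `region_attention`, `w_self`, and `w_rel` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_mean(vectors, weights):
    """Attention-weighted average of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dim)
    ]


def region_attention(features, w_self, w_rel):
    """Pool a variable number of region features into one fixed-length vector.

    features: list of per-region feature vectors, each of length D
    w_self:   assumed weights of the self-attention scorer, length D
    w_rel:    assumed weights of the relation-attention scorer, length 2 * D
    Returns (representation of length 2 * D, coarse weights, refined weights).
    """
    # Stage 1: score every region from its own feature alone.
    mu = [sigmoid(dot(f, w_self)) for f in features]
    global_feat = weighted_mean(features, mu)

    # Stage 2: rescore every region in the context of the coarse global feature.
    paired = [list(f) + global_feat for f in features]
    nu = [sigmoid(dot(p, w_rel)) for p in paired]

    # Final pooling uses the product of the coarse and refined weights.
    combined = [m * n for m, n in zip(mu, nu)]
    return weighted_mean(paired, combined), mu, nu
```

Calling `region_attention` with three regions or with six regions returns a representation of the same length (2 * D), which is the fixed-length property the abstract refers to. Because both stages normalize by the sum of the weights, a region with a weight near zero contributes almost nothing to the pooled vector.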

The region biased loss (RB-Loss) complements the attention modules. By requiring that the most important facial region receive a higher attention weight, the RB-Loss steers training toward crucial features and improves accuracy without additional computational overhead.
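One way to express such a constraint is a hinge-style margin penalty. The sketch below is an assumption about the form of the loss rather than a transcription of the paper: it compares the largest region weight with the weight given to the full face image, and the names `region_biased_loss`, `mu_full_face`, `mu_regions`, and the default `margin` value are hypothetical.

```python
def region_biased_loss(mu_full_face, mu_regions, margin=0.02):
    """Hinge penalty on attention weights (illustrative form, assumed).

    mu_full_face: attention weight assigned to the uncropped face image
    mu_regions:   attention weights assigned to the cropped regions
    margin:       hypothetical gap the best region should exceed
    The loss is zero once the best region beats the full face by `margin`.
    """
    best_region = max(mu_regions)
    return max(0.0, margin - (best_region - mu_full_face))
```

A penalty of this shape only compares attention weights that the network already computes, which is consistent with the claim that the loss adds no meaningful computational cost.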

Experimental Results and Implications

Extensive experiments demonstrate RAN's effectiveness, with significant accuracy improvements, particularly on occluded and large-pose faces. For instance, accuracy reached 89.16% on FERPlus and 59.5% on AffectNet, outperforming existing methods. These results suggest that RAN is practical for FER systems in real-world applications, such as driver fatigue monitoring or human-computer interaction where the face is partly obscured.

Future Directions

The paper's insights open avenues for further exploration in AI-driven facial analysis systems. Future research could explore:

Real-time Implementation: Optimizing RAN for deployment in real-time applications.
Augmenting Diverse Data Sources: Expanding datasets to include more diverse occlusions and pose variations.
Integration with Other Modalities: Combining RAN with other data modalities, like audio, to enhance multimodal emotion recognition accuracy.

In summary, this work marks a significant advance in making FER robust to real-world challenges posed by occlusions and pose variability. The proposed Region Attention Network and associated methodologies offer a promising framework for future developments in artificial intelligence and human-computer interaction.