Multi-Task Learning of Lightweight Neural Networks for Facial Expression and Attribute Recognition
This paper presents a multi-task learning approach to facial expression and attribute recognition based on lightweight convolutional neural networks (CNNs). The research targets efficient models suitable for mobile and edge devices, addressing face identification and the classification of attributes such as age, gender, and ethnicity. The paper emphasizes that careful fine-tuning of these networks is necessary for accurate yet efficient facial expression prediction.
Methodology
The paper examines several lightweight architectures, specifically MobileNet, EfficientNet, and RexNet. The networks are trained sequentially, starting with face identification on the large VGGFace2 dataset, followed by fine-tuning for specific facial attributes and emotion recognition tasks. These models utilize multi-task learning to improve accuracy across different facial analytics tasks without requiring extensive retraining for each individual task.
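The shared-backbone idea behind this multi-task setup can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dense "backbone", layer sizes, and head names are all illustrative stand-ins for the pre-trained CNN feature extractor and its task-specific output layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "backbone": one dense layer standing in for the CNN feature
# extractor that is pre-trained once (e.g. on face identification).
D_IN, D_FEAT = 128, 64
W_backbone = rng.standard_normal((D_IN, D_FEAT)) * 0.05

# Task-specific heads attached to the shared features; class counts
# here are illustrative, not taken from the paper.
heads = {
    "emotion":   rng.standard_normal((D_FEAT, 8)) * 0.05,
    "gender":    rng.standard_normal((D_FEAT, 2)) * 0.05,
    "ethnicity": rng.standard_normal((D_FEAT, 5)) * 0.05,
}

def forward(x):
    """Run the shared backbone once, then every lightweight head."""
    feat = np.maximum(x @ W_backbone, 0.0)  # ReLU features
    return {task: feat @ W for task, W in heads.items()}

x = rng.standard_normal((1, D_IN))
outputs = forward(x)
for task, logits in outputs.items():
    print(task, logits.shape)
```

The key property is that the expensive backbone pass happens once per image, while each additional task adds only a small output layer, which is what makes multi-task inference cheap on-device.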
Face images in the datasets are cropped based on precise regions obtained from face detectors, avoiding additional margins. This method reduces unnecessary background noise and enhances the focus on facial features. Training involves using various datasets, including AffectNet for emotion recognition, UTKFace for age, gender, and ethnicity classification, and AFEW and VGAF for video-based emotion recognition.
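Cropping strictly to the detector's bounding box, with no extra margin, amounts to simple array slicing. A minimal sketch (the function name and clamping behavior are this example's assumptions, not the paper's code):

```python
import numpy as np

def crop_face(image, box):
    """Crop exactly to the detector's bounding box, adding no margin.

    image: H x W x 3 array; box: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    # Clamp to image bounds so a partially out-of-frame box stays valid.
    h, w = image.shape[:2]
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return image[y1:y2, x1:x2]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
face = crop_face(frame, (200, 100, 360, 300))
print(face.shape)  # (200, 160, 3)
```

Because no background margin is included, the network's receptive field is spent entirely on facial features rather than surrounding context.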
Experimental Results
The experiments demonstrate several notable results:
- Facial Expression Recognition: Models such as EfficientNet-B2 achieve state-of-the-art accuracy on AffectNet, with improvements in both 7- and 8-class emotion recognition.
- Video-Based Emotion Classification: The proposed models achieve superior accuracy in video-based tasks on the AFEW and VGAF datasets. Specifically, EfficientNet-B0 outperforms existing methods for single video models on AFEW.
- Facial Attributes Recognition: The MobileNet-based model trained on the UTKFace dataset substantially improves gender and age prediction, outperforming models such as DEX in classification accuracy and age-estimation mean absolute error (MAE).
- Performance Efficiency: The trained models have low computational and memory requirements, demonstrating their suitability for mobile applications. For instance, the MobileNet-based model uses only a few million parameters, underscoring its potential for real-time use.
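The efficiency of MobileNet-class backbones largely comes from replacing standard convolutions with depthwise separable ones. A small sketch of the parameter arithmetic (channel counts are illustrative; biases are omitted):

```python
# Parameter count of a standard 3x3 convolution vs. a MobileNet-style
# depthwise separable convolution (depthwise 3x3 + pointwise 1x1).
def standard_conv_params(c_in, c_out, k=3):
    # One k x k filter over all input channels, per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1x1 conv mixing channels.
    return k * k * c_in + c_in * c_out

c_in, c_out = 128, 256
std = standard_conv_params(c_in, c_out)        # 294,912
sep = depthwise_separable_params(c_in, c_out)  # 33,920
print(std, sep, round(std / sep, 1))           # roughly 8.7x fewer parameters
```

This roughly k^2-fold reduction per layer is what keeps total parameter counts and on-device latency low without a proportional loss in accuracy.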
Implications and Future Directions
The paper's findings suggest significant implications for deploying efficient neural networks in environments where computational resources are limited. The lightweight models maintain accuracy while reducing complexity, facilitating real-time facial analytics on mobile devices. This work emphasizes the importance of appropriate pre-training and fine-tuning to achieve robust facial recognition and attribute estimation across varied settings.
Future work can explore integrating more advanced classification techniques such as graph convolutional networks and transformers to further enhance accuracy. Investigating alternative data augmentation methods and incorporating multi-modal inputs, such as audio-visual data, might yield additional performance gains.
The research extends the repertoire of multi-task learning approaches in computer vision, particularly in facial analytics, by providing an effective balance between performance and efficiency. This positions the paper as a valuable resource for developing practical, scalable solutions in intelligent systems.