Multi-Task Learning of Lightweight Neural Networks for Facial Expression and Attribute Recognition
This paper presents a multi-task learning approach to facial expression and attribute recognition based on lightweight convolutional neural networks (CNNs). The research targets efficient models suitable for mobile and edge devices, addressing face identification and the classification of attributes such as age, gender, and ethnicity. The paper emphasizes that careful fine-tuning of these networks is necessary for accurate yet efficient facial expression prediction.
Methodology
The paper examines several lightweight architectures, specifically MobileNet, EfficientNet, and RexNet. The networks are trained sequentially, starting with face identification on the large VGGFace2 dataset, followed by fine-tuning for specific facial attributes and emotion recognition tasks. These models utilize multi-task learning to improve accuracy across different facial analytics tasks without requiring extensive retraining for each individual task.
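The shared-backbone idea behind this multi-task setup can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dense "backbone", layer sizes, and head names are all illustrative stand-ins for the pre-trained CNN feature extractor and its task-specific output layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "backbone": one dense layer standing in for the CNN feature
# extractor that is pre-trained once (e.g. on face identification).
D_IN, D_FEAT = 128, 64
W_backbone = rng.standard_normal((D_IN, D_FEAT)) * 0.05

# Task-specific heads attached to the shared features; class counts
# here are illustrative, not taken from the paper.
heads = {
    "emotion":   rng.standard_normal((D_FEAT, 8)) * 0.05,
    "gender":    rng.standard_normal((D_FEAT, 2)) * 0.05,
    "ethnicity": rng.standard_normal((D_FEAT, 5)) * 0.05,
}

def forward(x):
    """Run the shared backbone once, then every lightweight head."""
    feat = np.maximum(x @ W_backbone, 0.0)  # ReLU features
    return {task: feat @ W for task, W in heads.items()}

x = rng.standard_normal((1, D_IN))
outputs = forward(x)
for task, logits in outputs.items():
    print(task, logits.shape)
```

The key property is that the expensive backbone pass happens once per image, while each additional task adds only a small output layer, which is what makes multi-task inference cheap on-device.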
Face images in the datasets are cropped based on precise regions obtained from face detectors, avoiding additional margins. This method reduces unnecessary background noise and enhances the focus on facial features. Training involves using various datasets, including AffectNet for emotion recognition, UTKFace for age, gender, and ethnicity classification, and AFEW and VGAF for video-based emotion recognition.
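Cropping strictly to the detector's bounding box, with no extra margin, amounts to simple array slicing. A minimal sketch (the function name and clamping behavior are this example's assumptions, not the paper's code):

```python
import numpy as np

def crop_face(image, box):
    """Crop exactly to the detector's bounding box, adding no margin.

    image: H x W x 3 array; box: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    # Clamp to image bounds so a partially out-of-frame box stays valid.
    h, w = image.shape[:2]
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return image[y1:y2, x1:x2]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
face = crop_face(frame, (200, 100, 360, 300))
print(face.shape)  # (200, 160, 3)
```

Because no background margin is included, the network's receptive field is spent entirely on facial features rather than surrounding context.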
Experimental Results
The experiments demonstrate several notable results:
- Facial Expression Recognition: Models such as EfficientNet-B2 achieve state-of-the-art accuracy on AffectNet, with improvements in both 7- and 8-class emotion recognition.
- Video-Based Emotion Classification: The proposed models achieve superior accuracy in video-based tasks on the AFEW and VGAF datasets. Specifically, EfficientNet-B0 outperforms existing methods for single video models on AFEW.
- Facial Attributes Recognition: The MobileNet-based model trained on the UTKFace dataset substantially improves gender and age prediction, outperforming models such as DEX in classification accuracy and age-estimation mean absolute error (MAE).
- Performance Efficiency: The trained models have low computational and memory requirements, demonstrating their suitability for mobile applications. For instance, the MobileNet-based model uses only a few million parameters, underscoring its potential for real-time use.
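The efficiency of MobileNet-class backbones largely comes from replacing standard convolutions with depthwise separable ones. A small sketch of the parameter arithmetic (channel counts are illustrative; biases are omitted):

```python
# Parameter count of a standard 3x3 convolution vs. a MobileNet-style
# depthwise separable convolution (depthwise 3x3 + pointwise 1x1).
def standard_conv_params(c_in, c_out, k=3):
    # One k x k filter over all input channels, per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1x1 conv mixing channels.
    return k * k * c_in + c_in * c_out

c_in, c_out = 128, 256
std = standard_conv_params(c_in, c_out)        # 294,912
sep = depthwise_separable_params(c_in, c_out)  # 33,920
print(std, sep, round(std / sep, 1))           # roughly 8.7x fewer parameters
```

This roughly k^2-fold reduction per layer is what keeps total parameter counts and on-device latency low without a proportional loss in accuracy.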
Implications and Future Directions
The paper's findings suggest significant implications for deploying efficient neural networks in environments where computational resources are limited. The lightweight models maintain accuracy while reducing complexity, facilitating real-time facial analytics on mobile devices. This work emphasizes the importance of appropriate pre-training and fine-tuning to achieve robust facial recognition and attribute estimation across varied settings.
Future work can explore integrating more advanced classification techniques such as graph convolutional networks and transformers to further enhance accuracy. Investigating alternative data augmentation methods and incorporating multi-modal inputs, such as audio-visual data, might yield additional performance gains.
The research extends the repertoire of multi-task learning approaches in computer vision, particularly in facial analytics, by providing an effective balance between performance and efficiency. This positions the paper as a valuable resource for developing practical, scalable solutions in intelligent systems.