Facial Emotion Recognition: State of the Art Performance on FER2013 (2105.03588v1)

Published 8 May 2021 in cs.CV, cs.AI, and cs.LG

Abstract: Facial emotion recognition (FER) is significant for human-computer interaction such as clinical practice and behavioral description. Accurate and robust FER by computer models remains challenging due to the heterogeneity of human faces and variations in images such as different facial pose and lighting. Among all techniques for FER, deep learning models, especially Convolutional Neural Networks (CNNs) have shown great potential due to their powerful automatic feature extraction and computational efficiency. In this work, we achieve the highest single-network classification accuracy on the FER2013 dataset. We adopt the VGGNet architecture, rigorously fine-tune its hyperparameters, and experiment with various optimization methods. To our best knowledge, our model achieves state-of-the-art single-network accuracy of 73.28 % on FER2013 without using extra training data.

Authors: Yousif Khaireddin and Zhuofa Chen

Citations (129)

Summary

The paper achieved state-of-the-art 73.28% accuracy on FER2013 by refining a single CNN architecture with extensive data augmentation and hyperparameter tuning.
It finds that SGD with Nesterov momentum outperforms the other optimizers tested, reaching the highest validation accuracy (73.5%).
The study compares learning rate schedulers to improve convergence and uses saliency maps to visualize which facial features drive the model's predictions.

Facial Emotion Recognition: Single-Network Performance on FER2013

The paper presents a rigorous study of facial emotion recognition (FER) with convolutional neural networks (CNNs), specifically the VGGNet architecture. It targets known challenges in FER, such as variations in facial pose and lighting, and tunes the model's hyperparameters and training procedure to reach state-of-the-art single-network performance on the FER2013 dataset.

Research Focus

The research focuses on improving classification accuracy on the FER2013 dataset using CNNs. The dataset, introduced at ICML 2013, comprises 35,888 images labeled with seven emotions: anger, neutrality, disgust, fear, happiness, sadness, and surprise. Because these images were captured under naturalistic conditions with high intra-class variation, the authors refine the VGGNet architecture, compare several optimization algorithms, and explore multiple learning rate schedulers.
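
The summary does not describe how FER2013 is stored. As a minimal sketch, assuming the widely circulated Kaggle `fer2013.csv` layout (columns `emotion`, `pixels`, and `Usage`, with each 48×48 grayscale face stored as 2,304 space-separated values and emotion codes 0 to 6 in the order angry, disgust, fear, happy, sad, surprise, neutral), a loader could look like this. The function name `parse_fer2013` and the label list are illustrative assumptions, not code from the paper.

```python
import csv
import io

# Assumed label order of the Kaggle fer2013.csv release (codes 0..6).
FER2013_LABELS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]
IMAGE_SIDE = 48  # FER2013 faces are 48x48 grayscale images


def parse_fer2013(csv_text):
    """Hypothetical loader: yield (label_name, usage_split, image) per CSV row,
    where image is a list of 48 rows, each holding 48 pixel values (0-255)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        pixels = [int(v) for v in row["pixels"].split()]
        if len(pixels) != IMAGE_SIDE * IMAGE_SIDE:
            raise ValueError(
                f"expected {IMAGE_SIDE * IMAGE_SIDE} pixels, got {len(pixels)}"
            )
        # Reshape the flat pixel list into 48 rows of 48 values.
        image = [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]
        yield FER2013_LABELS[int(row["emotion"])], row["Usage"], image
```

For example, `collections.Counter(usage for _, usage, _ in parse_fer2013(text))` would tally how many images fall into each split (Training, PublicTest, PrivateTest) under the assumed layout.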

Methodology

The authors combine several techniques to improve model performance:

Data Augmentation: Applied transformations such as rescaling, shifting, and rotating images to increase the variability of the training set.
Optimization Techniques: Evaluated six optimization algorithms, including SGD with Nesterov momentum, Adam, and Adagrad, and compared their effect on validation accuracy.
Learning Rate Schedulers: Tested schedulers such as Reduce Learning Rate on Plateau and Cosine Annealing to identify which gives the best convergence.
Model Tuning: Performed a grid search over hyperparameters, then fine-tuned the resulting model to further improve its accuracy.
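
The data augmentation step names rescaling, shifting, and rotation but not their ranges. The sketch below applies all three as a single random affine transform about the image centre, using inverse mapping with nearest-neighbour sampling on a plain list-of-lists image. The function names `warp` and `random_affine` and the default ranges (±20% scale and shift, ±10° rotation) are hypothetical choices for illustration, not values taken from the paper.

```python
import math
import random


def warp(img, scale, theta, tx, ty):
    """Scale, rotate (theta in radians) and translate (tx, ty in pixels) an
    image about its centre. Each output pixel is looked up in the source via
    the inverse transform (nearest neighbour); pixels that map outside the
    source are filled with 0."""
    h, w = len(img), len(img[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Undo the translation, then the rotation, then the scaling.
            dx, dy = x - cx - tx, y - cy - ty
            sx = (cos_t * dx + sin_t * dy) / scale + cx
            sy = (-sin_t * dx + cos_t * dy) / scale + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def random_affine(img, max_scale=0.2, max_shift=0.2, max_degrees=10.0, rng=None):
    """Randomly rescale, shift and rotate one training image.
    The default ranges are hypothetical, chosen only for illustration."""
    if rng is None:
        rng = random.Random()
    h, w = len(img), len(img[0])
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    tx = rng.uniform(-max_shift, max_shift) * w
    ty = rng.uniform(-max_shift, max_shift) * h
    theta = math.radians(rng.uniform(-max_degrees, max_degrees))
    return warp(img, scale, theta, tx, ty)


# Usage: augment a synthetic 48x48 "face" with a seeded generator.
face = [[(r * 48 + c) % 256 for c in range(48)] for r in range(48)]
augmented = random_affine(face, rng=random.Random(0))
```

In a PyTorch pipeline the same idea is usually expressed with `torchvision.transforms.RandomAffine(degrees=..., translate=..., scale=...)`; the plain-Python version only makes the geometry explicit. Because a fresh random transform is drawn every time an image is loaded, the network rarely sees exactly the same pixels twice.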

Results

The paper achieved a single-network classification accuracy of 73.28% on the FER2013 dataset, surpassing previously reported single-network results. Key results include:

Optimizer Performance: SGD with Nesterov momentum performed best, reaching the highest validation accuracy of 73.5%.
Learning Rate Schedule: Reduce Learning Rate on Plateau outperformed the other schedulers; it lowers the learning rate only when the monitored validation metric stops improving.
Saliency Maps: Visualized the image regions that contribute most to each classification, showing that the model concentrates on key facial features and largely ignores irrelevant information such as the background.
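
The results name the winning optimizer and scheduler without showing what they do. The following framework-free sketch implements SGD with Nesterov momentum in the form PyTorch documents for `torch.optim.SGD(..., nesterov=True)`, plus a reduce-on-plateau rule similar in spirit to `torch.optim.lr_scheduler.ReduceLROnPlateau` with `mode='max'`. The class names `NesterovSGD` and `PlateauLRScheduler`, the toy quadratic objective, the `factor`/`patience` values, and the accuracy sequence are hypothetical illustrations, not the paper's settings or results.

```python
class NesterovSGD:
    """Hypothetical plain-Python SGD with Nesterov momentum:
    v <- momentum * v + grad
    param <- param - lr * (grad + momentum * v)
    """

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None  # one entry per parameter, created on the first step

    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = [0.0] * len(params)
        updated = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.velocity[i] = self.momentum * self.velocity[i] + g
            # Nesterov look-ahead: step along the gradient plus the scaled new velocity.
            updated.append(p - self.lr * (g + self.momentum * self.velocity[i]))
        return updated


class PlateauLRScheduler:
    """Hypothetical reduce-on-plateau scheduler for a metric where higher is
    better (e.g. validation accuracy): multiply optimizer.lr by `factor` once
    the metric has failed to improve for more than `patience` consecutive epochs."""

    def __init__(self, optimizer, factor=0.5, patience=2, min_lr=1e-6):
        self.optimizer = optimizer
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.optimizer.lr = max(self.optimizer.lr * self.factor, self.min_lr)
                self.bad_epochs = 0  # restart the count after each reduction


# Usage 1: minimise f(x) = sum((x_i - t_i)^2), whose gradient is 2 * (x_i - t_i).
target = [3.0, -1.0]
params = [0.0, 0.0]
opt = NesterovSGD(lr=0.01, momentum=0.9)
for _ in range(200):
    grads = [2.0 * (p - t) for p, t in zip(params, target)]
    params = opt.step(params, grads)

# Usage 2: made-up validation accuracies that stop improving after the second epoch.
sched = PlateauLRScheduler(opt, factor=0.5, patience=2)
for val_acc in [0.60, 0.65, 0.65, 0.64, 0.65]:
    sched.step(val_acc)
# opt.lr has now been halved once, from 0.01 to 0.005.
```

Keying the learning rate to the validation metric, rather than to a fixed timetable as cosine annealing does, means the step size is only reduced once progress has actually stalled.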

Implications and Future Directions

This research shows that a single, carefully tuned CNN can make significant progress on FER under challenging conditions. The results indicate that thorough hyperparameter tuning, combined with well-chosen optimizers and learning rate schedules, substantially improves FER models. Such systems could benefit applications that depend on understanding human emotion in human-computer interaction, notably in clinical and behavioral settings.

Future work could delve into advanced image processing techniques and explore ensemble models combining varied architectures for further accuracy improvements. Additionally, refining model interpretability via enhanced visualization methods, such as improved saliency maps, could provide deeper insights into model decision processes.

Conclusion

The paper shows how meticulous optimization and systematic exploration of training choices let a single CNN set a new benchmark for FER2013 accuracy. Its methodology and results offer a solid foundation for further FER research and for more reliable human-computer interaction technologies.
