Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age (2410.24148v1)

Published 31 Oct 2024 in cs.CV

Abstract: Technologies for recognizing facial attributes such as race, gender, age, and emotion have many applications, including surveillance, advertising, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing demographic characteristics and facial expressions from images is challenging because of the complexity of human facial attributes. Traditional approaches have employed CNNs and other deep learning techniques trained on large collections of labeled images. While these methods perform well, there remains room for improvement. In this paper, we propose to use vision language models (VLMs) such as the generative pre-trained transformer (GPT), Gemini, the large language and vision assistant (LLaVA), PaliGemma, and Microsoft Florence-2 to recognize facial attributes such as race, gender, age, and emotion from images containing human faces. Datasets including FairFace, AffectNet, and UTKFace are used to evaluate the solutions. The results show that VLMs are competitive with, if not superior to, traditional techniques. Additionally, we propose "FaceScanPaliGemma", a fine-tuned PaliGemma model, for race, gender, age, and emotion recognition. It achieves accuracies of 81.1%, 95.8%, 80%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming the pre-trained version of PaliGemma, other VLMs, and SotA methods. Finally, we propose "FaceScanGPT", a GPT-4o-based model that recognizes the above attributes when several individuals are present in an image, using a prompt engineered to target a person with specific facial and/or physical attributes. The results underscore the superior multitasking capability of FaceScanGPT, which can also detect attributes such as haircut, clothing color, and posture, using only a prompt to drive the detection and recognition tasks.
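
The prompt-driven FaceScanGPT setup described in the abstract can be illustrated with a minimal sketch. The paper's exact prompts, image handling, and model settings are not reproduced here; the snippet below only shows the general pattern of querying GPT-4o through the OpenAI API for the attributes of one person in a multi-person image. The prompt wording, the `encode_image` helper, and the file name are illustrative assumptions.

```python
# Illustrative sketch only: shows prompt-driven facial attribute recognition with a
# vision-language model (GPT-4o via the OpenAI API), in the spirit of FaceScanGPT.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read an image file and return a base64 string for the API payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Hypothetical prompt targeting one individual by a distinguishing physical attribute,
# as in the multi-person setting described in the abstract.
prompt = (
    "For the person wearing the red shirt, report their apparent race, gender, "
    "age group, and emotion as a JSON object with exactly those four keys."
)

image_b64 = encode_image("group_photo.jpg")  # hypothetical input image

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```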
