Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation (2403.02302v3)
Abstract: Multimodal LLMs (MLLMs) have recently gained immense popularity. Powerful commercial models like ChatGPT-4V and Gemini, as well as open-source ones such as LLaVA, are essentially general-purpose models and are applied to a wide variety of tasks, including those in computer vision. These neural networks possess such strong general knowledge and reasoning abilities that they have proven capable of working even on tasks for which they were not specifically trained. We compared the most powerful MLLMs to date (ShareGPT4V, ChatGPT, and LLaVA-Next) against our state-of-the-art specialized model, MiVOLO, on the specialized task of age and gender estimation. We also updated MiVOLO and provide details and new metrics in this article. This comparison yielded some interesting results and insights into the strengths and weaknesses of the participating models. Furthermore, we tried various ways of fine-tuning the ShareGPT4V model for this specific task, aiming to achieve state-of-the-art results. Although such a model would not be practical in production, since it is far more expensive than a specialized model like MiVOLO, it could be very useful for some tasks, such as data annotation.