- The paper presents a comprehensive hierarchical dataset with 15,842 facial images and 49,919 VQA pairs to evaluate MLLMs in facial perception.

- The paper employs rigorous manual annotation from 200 trained annotators across diverse views and question types to ensure high-quality labels.

- The paper's evaluation shows that while models like GPT-4o perform robustly overall, significant gaps remain in recognizing nuanced facial attributes.
FaceBench: A Comprehensive Benchmark for Multi-View Multi-Level Facial Attribute Analysis
FaceBench provides a groundbreaking framework for evaluating Multimodal LLMs (MLLMs) in facial perception. It goes beyond existing evaluations by presenting a multi-view, multi-level benchmark built on a hierarchical facial attribute structure, broadening the scope of analysis possible in face perception tasks.
Hierarchical Facial Attribute Structure
FaceBench introduces a hierarchical structure that categorizes facial attributes across five distinct views: Appearance, Accessories, Surrounding, Psychology, and Identity. This categorization facilitates an in-depth analysis of facial perception by enabling the evaluation of both fundamental and nuanced facial features. Each view comprises multiple levels: coarse-grained Level 1 attributes such as "eyes" and "hair," intermediate Level 2 components like "pupil" or "earlobe," and fine-grained Level 3 distinctions based on size, color, shape, or type.
Figure 1: Hierarchical organization of facial attributes.
This detailed taxonomy allows for a robust analysis that parallels human perception, capturing the complexity and granularity necessary for comprehensive facial recognition tasks. By integrating over 210 attributes and 700 attribute values, the structure forms the backbone of the FaceBench dataset.
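To make the three-level organization concrete, the sketch below models a small slice of the hierarchy as nested Python mappings. The specific attribute names and values are illustrative placeholders, not the paper's actual taxonomy or data schema.

```python
# Minimal sketch of the view -> Level 1 -> Level 2 -> Level 3 hierarchy.
# The attribute names and values below are illustrative examples only,
# not the paper's full taxonomy of 210+ attributes and 700+ values.
FACE_TAXONOMY = {
    "Appearance": {
        "eyes": {                              # Level 1 attribute
            "pupil": ["color", "size"],        # Level 2 component -> Level 3 aspects
            "eyelid": ["single", "double"],
        },
        "hair": {
            "hairstyle": ["straight", "wavy", "curly"],
            "hair color": ["black", "brown", "blonde"],
        },
    },
    "Accessories": {
        "glasses": {"frame": ["shape", "color"]},
    },
    # ... the "Surrounding", "Psychology", and "Identity" views follow the same pattern.
}

def iter_leaf_attributes(taxonomy):
    """Yield (view, level1, level2, level3) tuples for every fine-grained leaf."""
    for view, level1_attrs in taxonomy.items():
        for level1, level2_attrs in level1_attrs.items():
            for level2, level3_values in level2_attrs.items():
                for level3 in level3_values:
                    yield view, level1, level2, level3

# Example: count the fine-grained leaves in this toy taxonomy.
print(sum(1 for _ in iter_leaf_attributes(FACE_TAXONOMY)))
```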
Dataset Collection and Annotation
FaceBench comprises 15,842 facial images, meticulously gathered from diverse source datasets and paired with a broad range of VQA (Visual Question Answering) questions. The images cover the five view perspectives of Appearance, Accessories, Surrounding, Psychology, and Identity. The data are split into a test set of 49,919 VQA pairs for evaluation and a dedicated training set of 23,841 pairs.
The dataset utilizes a variety of question types—True/False, Single-Choice, Multiple-Choice, and Open-Ended—developed from predefined templates to ensure comprehensive coverage of the hierarchical structure.
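As a rough illustration of template-based question construction, the following sketch instantiates True/False and Single-Choice questions from attribute-value annotations. The template wording, field names, and helper functions are assumptions made for this example; they are not the paper's actual templates or generation pipeline.

```python
import random

# Hypothetical templates keyed by question type; the wording is illustrative
# and not taken from the paper.
TEMPLATES = {
    "true_false": "Is the person's {attribute} {value}? Answer True or False.",
    "single_choice": "What is the person's {attribute}? Options: {options}.",
}

def make_true_false(attribute, true_value, distractors):
    """Build a True/False VQA pair, substituting a distractor value half the time."""
    use_true = random.random() < 0.5
    value = true_value if use_true else random.choice(distractors)
    question = TEMPLATES["true_false"].format(attribute=attribute, value=value)
    return {"question": question, "answer": "True" if use_true else "False"}

def make_single_choice(attribute, true_value, distractors, n_options=4):
    """Build a Single-Choice VQA pair with shuffled answer options."""
    options = random.sample(distractors, n_options - 1) + [true_value]
    random.shuffle(options)
    question = TEMPLATES["single_choice"].format(
        attribute=attribute, options=", ".join(options))
    return {"question": question, "answer": true_value}

print(make_true_false("hair color", "black", ["brown", "blonde", "red"]))
print(make_single_choice("hair color", "black", ["brown", "blonde", "red", "gray"]))
```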
Figure 2: Question types and human annotation workflow for building our dataset.
Annotation is carried out manually by 200 trained annotators, backed by a multi-layered quality control process. This extensive effort reduces labeling errors and keeps the annotations closely aligned with human perception.
Evaluation of Multimodal Models
FaceBench extensively evaluates a range of cutting-edge MLLMs, measuring their ability to process and interpret facial attributes across the specified views and levels. Baseline tests on prominent models like GPT-4o and Face-LLaVA—a fine-tuned version of LLaVA—demonstrate considerable variability in their efficacy at recognizing different facial characteristics.
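A minimal sketch of how such an evaluation can be scored is shown below: predicted answers are compared against ground-truth answers and accuracy is aggregated per question type. The record layout and the exact-match criterion are simplifying assumptions (open-ended answers in particular would need a more forgiving matching scheme), not FaceBench's official protocol.

```python
from collections import defaultdict

def score_predictions(records):
    """Compute per-question-type accuracy.

    `records` is assumed to be a list of dicts with keys "qtype", "prediction",
    and "answer"; this field layout is an illustration, not FaceBench's format.
    Exact string matching is a simplification that suits closed-form questions.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["qtype"]] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[rec["qtype"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

# Toy usage example with two fabricated records.
demo = [
    {"qtype": "true_false", "prediction": "True", "answer": "True"},
    {"qtype": "single_choice", "prediction": "black", "answer": "brown"},
]
print(score_predictions(demo))
```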
Figure 3: Samples from our FaceBench dataset.
Results and Insights
The results show that while commercial models such as GPT-4o deliver robust overall performance, significant gaps remain, particularly in recognizing nuanced and complex facial attributes. Face-LLaVA, although trained on a relatively small dataset, improves markedly over existing open-source models and attains results competitive with commercial ones.
The result tables reveal wide variation in performance across the attribute views, offering critical insights into the strengths and weaknesses of current MLLMs. These findings highlight the ongoing challenges in facial perception and underscore the need for more advanced datasets and model architectures to reach human-level understanding.
Conclusion
FaceBench represents a substantial advance in the evaluation of MLLMs for facial perception tasks. By incorporating a hierarchical, multi-view, multi-level approach to facial attributes, it provides a comprehensive platform for testing and enhancing model performance in this domain. Future enhancements based on the insights from FaceBench could steer the development of more sophisticated and human-like perceptual models, thereby pushing the boundaries of what is achievable in AI-driven facial recognition and analysis.