
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation (2505.11454v3)

Published 16 May 2025 in cs.CV and cs.AI

Abstract: Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality, through diverse open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens for societal impact. Benchmarking results on different LMMs show that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. Techniques like Chain-of-Thought prompting and test-time scaling improve alignment. As the first benchmark tailored for HC alignment, HumaniBench offers a rigorous testbed to diagnose limitations and promote responsible LMM development. All data and code are publicly available for reproducibility. Keywords: HumaniBench, vision-language models, responsible AI benchmark, AI alignment evaluation, AI ethics assessment, fairness in AI models, visual question answering (VQA) benchmark, image captioning evaluation, visual grounding tasks, trustworthy AI models, Chain-of-Thought prompting, test-time scaling, ethical AI development tools.
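
The abstract notes that Chain-of-Thought (CoT) prompting and test-time scaling improve alignment. As a purely illustrative sketch, the snippet below shows how a CoT-style VQA prompt might be assembled; the prompt wording and the query_lmm helper are assumptions, not the prompts or API used by the HumaniBench authors.

```python
# Hypothetical sketch: assembling a Chain-of-Thought (CoT) VQA prompt.
# The wording and the query_lmm() placeholder are illustrative assumptions.

def build_cot_prompt(question: str) -> str:
    """Wrap a VQA question with step-by-step reasoning instructions."""
    return (
        "You are answering a question about the attached image.\n"
        "Think step by step: describe the relevant visual evidence first, "
        "then state your final answer on a new line prefixed with 'Answer:'.\n"
        f"Question: {question}"
    )

def query_lmm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to any vision-language model client."""
    raise NotImplementedError("Plug in your own LMM client here.")

if __name__ == "__main__":
    prompt = build_cot_prompt("What activity is the person in the image performing?")
    print(prompt)  # The assembled CoT prompt, ready to send alongside the image.
```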


Summary

  • The paper introduces the HumaniBench benchmark, a novel framework evaluating large multimodal models on fairness, ethics, and empathy.
  • It employs a dataset of 32,000 real-world image-question pairs, leveraging AI-assisted annotation with expert verification to ensure quality.
  • Results reveal that while proprietary models lead in reasoning, fairness, and multilinguality, no single model dominates all metrics, highlighting systemic alignment gaps in LMMs.

A Human-Centric Framework for Large Multimodal Models Evaluation

The paper introduces HumaniBench, a novel benchmark aimed at providing a comprehensive evaluation framework for Large Multimodal Models (LMMs) through a human-centric lens. It focuses on critical alignment principles such as fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality, particularly in tasks like visual question answering (VQA), image captioning, and grounding.

Overview of HumaniBench

HumaniBench is built on a dataset of 32,000 real-world image-question pairs and an evaluation suite that leverages AI-assisted annotation complemented by expert verification.

Figure 1: Performance breakdown of different LMMs across various tasks and social attributes.

This pipeline keeps the benchmark aligned with real-world values by addressing dimensions such as discrimination (fairness), ethical content handling, logical soundness, and cultural-linguistic inclusion. The overall workflow is summarized in Figure 2: the top panel shows the AI-assisted annotation pipeline followed by domain-expert verification, and the bottom panel depicts the evaluation workflow.

Figure 2: HumaniBench Overview. The top panel illustrates our AI-assisted annotation pipeline, followed by domain-expert verification.
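
As a minimal sketch of that two-panel workflow, the skeleton below strings together annotation, expert verification, and evaluation. All names here (Item, annotate_with_lmm, verify, evaluate) are hypothetical stand-ins, not the authors' released code.

```python
# Minimal sketch of the workflow described above: AI-assisted annotation,
# domain-expert verification, then model evaluation. Hypothetical stand-ins only.

from dataclasses import dataclass

@dataclass
class Item:
    image_path: str
    question: str
    reference_answer: str
    expert_approved: bool = False  # set during the verification pass

def annotate_with_lmm(image_path: str) -> Item:
    """Stage 1 (top panel): draft a question/answer pair with an LMM."""
    # Placeholder: a real pipeline would call a vision-language model here.
    return Item(image_path, "What is happening in this image?", "draft answer")

def verify(items: list[Item]) -> list[Item]:
    """Stage 2 (top panel): keep only items approved by a domain expert."""
    return [it for it in items if it.expert_approved]

def evaluate(model, items: list[Item]) -> float:
    """Stage 3 (bottom panel): score a model on the verified items."""
    correct = sum(model(it.image_path, it.question) == it.reference_answer for it in items)
    return correct / max(len(items), 1)
```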

Methodology and Evaluation Setup

Key Evaluation Principles

HumaniBench evaluates LMMs on seven principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness, each grounded in normative guidelines such as the EU Trustworthy AI framework (a minimal scoring sketch follows the list below).

  • Fairness involves bias minimization and equal treatment across demographic groups.
  • Ethics evaluates alignment with moral norms, focusing on non-maleficence and autonomy.
  • Understanding captures the fidelity of model outputs to visual and textual inputs, critically assessing hallucinations.
  • Reasoning appraises contextual coherence and logical consistency.
  • Language Inclusivity measures cross-linguistic performance and cultural sensitivity.
  • Empathy assesses affective resonance in responses, crucial in human-centric interactions.
  • Robustness tests resilience under adversarial conditions and data perturbations.
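
To make the seven principles concrete, one simple way to record per-model results is a per-principle score structure with an unweighted average. The field names mirror the principles above, but the schema itself is an illustrative assumption, not the benchmark's released format.

```python
# Hypothetical per-principle score record for one model.

from dataclasses import dataclass, fields

@dataclass
class PrincipleScores:
    fairness: float
    ethics: float
    understanding: float
    reasoning: float
    language_inclusivity: float
    empathy: float
    robustness: float

    def mean(self) -> float:
        """Unweighted average across the seven principles."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

# Example usage with made-up numbers:
scores = PrincipleScores(0.81, 0.74, 0.88, 0.79, 0.65, 0.70, 0.62)
print(f"Mean alignment score: {scores.mean():.3f}")
```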

Dataset Curation and Task Design

HumaniBench’s dataset consists of images and metadata sourced from diverse news outlets. Using GPT-4o, images are annotated with captions and social attributes (age, gender, race, occupation, and sport) that guide question and answer generation.

Figure 3: The AI-assisted pipeline illustrating the curation process from news websites, with social-attribute verification by experts.
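
The attribute-guided annotation step could be represented along these lines. The record fields follow the attributes named above (age, gender, race, occupation, sport), while the instruction text and the overall schema are assumptions for illustration only.

```python
# Sketch of one annotated image record and the kind of instruction an
# AI-assisted annotator might receive. The prompt wording is an assumption.

from dataclasses import dataclass, field

@dataclass
class AnnotatedImage:
    image_url: str
    caption: str
    social_attributes: dict[str, str] = field(default_factory=dict)  # e.g. {"gender": "woman"}
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)    # (question, answer)

ANNOTATION_INSTRUCTION = (
    "Describe the image in one caption, then tag the social attributes "
    "visible in it (age, gender, race, occupation, sport) and propose "
    "question-answer pairs grounded in those attributes."
)
```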

Validation and Annotation

Expert-verified annotations ensure consistency and ethical alignment. The dataset underwent rigorous manual reviews, prioritizing high-quality annotations over quantity, ensuring reliable results in model evaluations.

Benchmarking Results

HumaniBench evaluates 15 state-of-the-art LMMs across the defined principles. Two proprietary models, GPT-4o and Gemini 2.0 Flash, generally lead in reasoning, fairness, and multilinguality, whereas open-source models perform better in robustness and grounding.

Performance and Insights

  • The comprehensive task evaluation yields a stark insight: although proprietary models generally lead in reasoning, fairness, and multilinguality, no single model dominates every dimension, revealing systemic alignment gaps in LMMs.
  • Significant variance is observed across social attributes: race- and sport-related attributes yield higher performance, while occupation-related cues lag behind, indicating room for improvement in fairness and representation (see Figure 4 and the breakdown sketch after this list).

    Figure 4: Comprehensive performance evaluation across different tasks, highlighting model capabilities in scene understanding and multiple-choice question answering.
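
The per-attribute variance described above can be inspected with a simple grouped-accuracy breakdown. The pandas sketch below assumes a flat per-item results table; the column names are assumptions about how results might be stored, not the released format.

```python
# Hypothetical breakdown of accuracy by social attribute.

import pandas as pd

results = pd.DataFrame(
    {
        "attribute": ["race", "sport", "occupation", "race", "occupation"],
        "correct":   [1,      1,       0,            1,      0],
    }
)

# Mean accuracy per social attribute, highest first.
per_attribute = results.groupby("attribute")["correct"].mean().sort_values(ascending=False)
print(per_attribute)
```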

Discussion

Multilingual and Robustness Challenges

Multilingual tasks reveal persistent performance gaps: high-resource languages score higher than low-resource ones, highlighting the need for better cross-linguistic parity in LMMs. Robustness evaluation shows considerable declines in accuracy under perturbations, calling for enhanced model resilience (Figure 5).

Figure 5: Multilingual and robustness performance across various languages and visual grounding tasks.
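
Robustness drops of this kind are typically measured by re-running the same items on corrupted images and comparing accuracy. The sketch below uses Gaussian pixel noise via Pillow and NumPy as one illustrative perturbation; the paper's actual perturbation suite and evaluation loop are not specified here, so evaluate() is passed in as a user-supplied callable.

```python
# Illustrative robustness check: accuracy on clean vs. perturbed images.
# Gaussian noise is just one example perturbation, an assumption for this sketch.

import numpy as np
from PIL import Image

def add_gaussian_noise(img: Image.Image, sigma: float = 15.0) -> Image.Image:
    """Return a copy of `img` with zero-mean Gaussian pixel noise added."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def robustness_gap(evaluate, items) -> float:
    """Accuracy drop when every image is perturbed before evaluation.

    `evaluate` is any callable mapping a list of (image, question, answer)
    tuples to an accuracy score; `items` is that list.
    """
    clean = evaluate(items)
    perturbed = evaluate([(add_gaussian_noise(img), q, a) for img, q, a in items])
    return clean - perturbed
```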

Conclusion

HumaniBench establishes a robust framework for human-centric LMM evaluation, emphasizing the importance of aligning models with ethical and inclusive values. Future work will explore expanded datasets and refined multilingual and empathetic capabilities to enhance this alignment.

The benchmark’s public release, including data and code, aims to foster collaborative research toward more responsible LMM development, shaping a positive societal impact.

