HumaniBench: Human-Centric LMM Evaluation

Updated 4 September 2025
  • HumaniBench is a human-centric evaluation framework that uses 32,000 real-world image–question pairs to assess large multimodal models.
  • It systematically measures seven key alignment principles (fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness) through diverse VQA, captioning, grounding, and resilience tasks.
  • The benchmark employs a hybrid AI-assisted and expert-validated annotation process, enabling comparative evaluation of both proprietary and open-source models for responsible AI deployment.

HumaniBench is a large-scale, human-centric evaluation framework for assessing the alignment and societal impact of large multimodal models (LMMs). The benchmark comprises 32,000 real-world image–question pairs drawn from news sources, each annotated through an AI-assisted and expert-validated process. HumaniBench systematically measures model performance across seven key alignment principles (fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness) using diverse open-ended and closed-ended visual question answering (VQA), captioning, visual grounding, and resilience tasks. It is specifically designed to reveal both strengths and limitations of LMMs along axes critical for responsible and trustworthy AI deployment in real-world contexts.

1. Evaluation Dimensions and Alignment Principles

HumaniBench operationalizes seven distinct human-centric principles:

  • Fairness: Assesses whether model responses treat individuals and groups equally, avoiding stereotypes and biases across demographic attributes (age, gender, race, occupation, sport).
  • Ethics: Measures harmlessness, legality, and avoidance of maleficence, including detection of prejudiced or unsafe content.
  • Empathy: Tests the model’s sensitivity to emotional cues, particularly in empathetic captioning tasks.
  • Inclusivity (Language): Evaluates cross-lingual parity, spanning both high-resource and low-resource languages.
  • Reasoning: Scores logical coherence, contextual understanding, and “hallucination” rates in model answers.
  • Robustness: Measures degradation of performance under adversarial perturbations or image distortions.
  • Understanding: Rates factuality and faithfulness in reporting visual contents without hallucinations.

Each evaluation task is mapped to one or several principles via specific metrics. Open-ended and closed-ended VQA are automatically judged on accuracy, contextual relevance, faithfulness, and hallucination rate, typically using a proprietary LLM (GPT-4o) as the judge.
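
To make the judging protocol concrete, the following minimal sketch scores one open-ended answer with GPT-4o along the dimensions listed above. The rubric wording, JSON schema, and the judge_answer helper are assumptions made for illustration; only the judged dimensions and the choice of GPT-4o as judge come from the benchmark description.

```python
# Minimal sketch of LLM-as-judge scoring for an open-ended VQA answer.
# The rubric wording and JSON schema are illustrative, not HumaniBench's
# exact prompts; only the judged dimensions (accuracy, contextual
# relevance, faithfulness, hallucination) come from the text above.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a model's answer to a visual question.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Return JSON with integer scores 1-5 for: accuracy, contextual_relevance,
faithfulness, and a boolean "hallucination" flag."""

def judge_answer(question: str, reference: str, candidate: str) -> dict:
    """Ask GPT-4o to score one answer along the HumaniBench-style dimensions."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```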

2. Dataset Curation and Annotation Protocol

The HumaniBench dataset was constructed by collecting images from reputable news feeds (Google News RSS), filtering for uniqueness and safety, and curating approximately 13,000 unique images. The annotation protocol involves:

  • AI-Assisted Tagging: Initial captions, five social attributes (age, gender, race/ethnicity, occupation, sport), and reasoning-focused questions are generated via GPT‑4o.
  • Expert Validation: A multidisciplinary panel reviews, revises, and certifies labels and tags to ensure reliability for alignment evaluation.
  • Task Assignment: Each data point is assigned to one or more of seven evaluation components: Scene Understanding, Instance Identity, Multiple-Choice VQA, Multilinguality, Visual Grounding, Empathetic Captioning, and Image Resilience.

This hybrid pipeline balances scalability and annotation quality for reliable benchmarking.
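
A minimal sketch of what such a hybrid pipeline can look like is given below. The Sample schema and the tagger, reviewer, and assign_tasks callables are hypothetical stand-ins; only the social-attribute and task lists mirror the protocol above.

```python
# Illustrative record schema and pipeline stages for the hybrid annotation
# protocol described above. Field names, the callables, and the Sample
# class are assumptions made for this sketch, not HumaniBench's actual code.
from dataclasses import dataclass, field

SOCIAL_ATTRIBUTES = ["age", "gender", "race/ethnicity", "occupation", "sport"]
TASKS = [
    "Scene Understanding", "Instance Identity", "Multiple-Choice VQA",
    "Multilinguality", "Visual Grounding", "Empathetic Captioning",
    "Image Resilience",
]

@dataclass
class Sample:
    image_path: str
    caption: str = ""                                # AI-assisted initial caption
    attributes: dict = field(default_factory=dict)   # one value per social attribute
    questions: list = field(default_factory=list)    # reasoning-focused questions
    tasks: list = field(default_factory=list)        # assigned evaluation components
    expert_validated: bool = False

def annotate(sample: Sample, tagger, reviewer, assign_tasks) -> Sample:
    """AI-assisted tagging, expert validation, then task assignment."""
    sample.caption, sample.attributes, sample.questions = tagger(sample.image_path)
    sample = reviewer(sample)            # multidisciplinary panel revises/certifies labels
    sample.expert_validated = True
    sample.tasks = assign_tasks(sample)  # map to one or more of the seven components
    return sample
```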

3. Evaluation Methodology and Metrics

Multiple modalities and evaluation protocols are used:

  • Task Types: Open-ended VQA, closed-ended VQA (multiple-choice), visual grounding (object detection), multilingual VQA, empathetic captioning, and image-level adversarial perturbation resilience.
  • Scoring: Primary metrics include accuracy, bias score, hallucination rate, faithfulness, and contextual relevance; individual tasks receive composite scores (e.g., “AvgDet” for visual grounding, calculated as (mAP@0.5 + mAP@0.75 + 100 × IoU) / 3; see the sketch after this list).
  • Annotation Validation: Automated GPT-4o scoring for open-ended tasks is cross-checked by human annotators and experts.
  • Comparative Tables: Model performance for each principle is tabulated, enabling cross-system comparison.
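
As a worked illustration of the composite grounding score above, here is a minimal sketch assuming mAP values are reported on a 0–100 scale and mean IoU on a 0–1 scale (the input values are made up):

```python
# Worked example of the AvgDet composite for visual grounding:
# AvgDet = (mAP@0.5 + mAP@0.75 + 100 * IoU) / 3.
# The scale convention (mAP in %, IoU as a fraction) is an assumption
# inferred from the 100 * IoU rescaling in the formula.
def avg_det(map_50: float, map_75: float, mean_iou: float) -> float:
    """Combine mAP@0.5, mAP@0.75 (in %) and mean IoU (0-1) into one score."""
    return (map_50 + map_75 + 100.0 * mean_iou) / 3.0

print(avg_det(map_50=62.0, map_75=48.5, mean_iou=0.57))  # -> 55.83...
```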

4. Comparative Benchmarking Results

Systematic benchmarking of 15 LMMs reveals differentiated strengths and weaknesses:

Model class  | Reasoning / Ethics / Multilinguality | Robustness / Grounding
Proprietary  | High                                  | Moderate
Open-source  | Moderate                              | High

Proprietary models (GPT‑4o, Gemini 2.0 Flash) excel in reasoning, ethical judgment, and language parity, while open-source systems (Qwen2.5-7B, LLaVA‑v1.6) show improved robustness to adversarial image perturbations and superior object grounding. Most models, regardless of class, exhibit difficulty in maintaining the optimal balance between ethical alignment and raw predictive accuracy, as revealed by composite scores aggregating normalized metric values.
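
The sketch below shows one way such a composite could be formed from normalized metric values; min-max normalization with equal weighting is an assumption, and the benchmark's exact aggregation may differ.

```python
# Illustrative aggregation of per-principle metrics into a composite score.
# Min-max normalization and equal weights are assumptions about
# "normalized metric values". Lower-is-better metrics (e.g., hallucination
# rate) should be inverted before being passed in.
def normalize(values: dict) -> dict:
    """Min-max normalize one metric across models (higher = better)."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0
    return {model: (v - lo) / span for model, v in values.items()}

def composite(per_metric: dict) -> dict:
    """Average normalized metrics into a single score per model."""
    models = list(next(iter(per_metric.values())))
    normed = {metric: normalize(vals) for metric, vals in per_metric.items()}
    return {m: sum(normed[metric][m] for metric in normed) / len(normed)
            for m in models}

scores = composite({
    "accuracy":     {"model_a": 81.0, "model_b": 74.0},
    "faithfulness": {"model_a": 0.72, "model_b": 0.79},
})
```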

5. Techniques for Model Alignment Enhancement

HumaniBench demonstrates that the following methods substantially improve human-centric alignment:

  • Chain-of-Thought (CoT) Prompting: Step-by-step reasoning prompts yield a 2–4 percentage point increase in VQA accuracy and principled alignment.
  • Model and Test-Time Scaling: Larger model variants (e.g., 32B/90B rather than 7B/11B parameters) and increased inference-time compute both improve contextual understanding and reasoning.
  • Expert-Driven Annotation: Human validation of AI-generated tags and questions substantially increases reliability, critical for fair and ethical evaluation.

These techniques address model limitations in empathy, ethical judgment, and multilingual competence.
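
For illustration, the sketch below contrasts a direct VQA prompt with a chain-of-thought variant of the kind described above; the exact wording is an assumption, not HumaniBench's prompt.

```python
# Illustrative chain-of-thought (CoT) prompt wrapper for a VQA query.
# The prompt text is an assumption; the benchmark reports that step-by-step
# prompting of this kind yields a 2-4 percentage point VQA accuracy gain.
def direct_prompt(question: str) -> str:
    return f"Look at the image and answer concisely: {question}"

def cot_prompt(question: str) -> str:
    return (
        "Look at the image and answer the question.\n"
        f"Question: {question}\n"
        "First describe the relevant visual evidence step by step, "
        "then state your final answer on a new line prefixed with 'Answer:'."
    )
```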

6. Societal Impact and Responsible AI

HumaniBench is grounded in real-world ethical requirements, explicitly seeking to ensure:

  • Equal treatment and avoidance of bias and stereotypes.
  • Preservation of factual integrity and prevention of hallucination.
  • Robustness across cultures and languages.
  • Transparent accountability for ethical risks in AI deployment (healthcare, public safety, governance).

The benchmark’s structure and public accessibility (datasets and code under CC BY-SA 4.0 and MIT licenses) provide an actionable and reproducible framework for model developers and researchers aiming to diagnose and rectify misalignments.

7. Reproducibility, Accessibility, and Technical Utility

All components of HumaniBench—data, evaluation protocols, expert annotation guidelines, and model prompts—are open-sourced, enabling reproducible and fair evaluations for future model iterations. The technical appendix includes:

  • Composite formulas (e.g., AvgDet) for task evaluation.
  • Annotated radar and bar charts for principle-wise model comparison.
  • Task mappings to quantitative and qualitative principles.

This infrastructure positions HumaniBench as an extensible, high-utility testbed for the assessment, development, and responsible deployment of multimodal models in human-centered applications.


HumaniBench establishes a precedent for evaluating large multimodal models on societal criteria, robust reasoning, and alignment with human-centered values (Raza et al., 16 May 2025).
