MMHal-Bench: Multimodal Hallucination Benchmark
- The paper introduces MMHal-Bench, which systematically evaluates hallucinations in large multimodal models by aligning textual outputs with visual evidence.
- MMHal-Bench comprises 96 image-question pairs spanning 12 object topics and 8 error categories, enabling granular analysis of modality misalignment.
- The benchmark combines automated GPT-4 and human evaluations to deliver overall scores, hallucination rates, and category-specific performance metrics.
MMHal-Bench is an evaluation benchmark explicitly designed to assess and penalize hallucinations produced by large multimodal models (LMMs) in high-stakes vision-language tasks. Unlike conventional benchmarks, which tend to evaluate helpfulness or correctness broadly, MMHal-Bench targets the core challenge of grounding textual outputs in visual evidence. Hallucination, in this context, refers to model outputs that introduce information not present in, or contradictory to, the corresponding image, often undermining the reliability of multimodal AI systems.
1. Purpose and Rationale
MMHal-Bench was developed to systematically measure the ability of LMMs to produce factually grounded, non-hallucinatory content. Existing datasets and evaluation schemes rarely focus on the specific failure mode of hallucination, especially when it arises from modality misalignment. MMHal-Bench addresses this gap by concentrating the evaluation on scenarios where models are likely to produce factually unsupported assertions, thus providing a robust metric for real-world reliability.
A distinctive feature is the adversarial construction and filtering of questions so that established LMMs (such as the original LLaVA model) are predisposed to hallucinate on these queries. This approach confronts models with cases that are known to be empirically challenging and operationally consequential for applications such as visual question answering or medical image annotation.
2. Dataset Construction and Question Typology
MMHal-Bench consists of 96 image-question pairs, carefully devised to span 12 common object topics drawn from the COCO object meta-categories. The images are taken exclusively from the validation and test splits of OpenImages to minimize overlap with data commonly used to train LMMs. Each question was authored to be novel and to target a specific hallucination category, maximizing the likelihood of exposing ungrounded model behaviors.
The benchmark encompasses eight distinct error typologies:
| Category | Description |
|---|---|
| Attribute | Errors about object properties (e.g., color, size) |
| Adversarial Object | Spurious or nonexistent objects introduced |
| Comparison | Misstatements in comparative contexts |
| Counting | Numerosity errors |
| Spatial Relation | Mislabeling of object locations or orientations |
| Environmental | Incorrect inferences about weather, location, etc. |
| Holistic | Global scene misinterpretations |
| Other | Miscellaneous errors (e.g., text/icon reading) |
This structure enables granular breakdowns of model performance across specific hallucination phenomena, rather than reducing output quality to a single aggregate score.
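As a concrete illustration, a single benchmark instance can be thought of as a small record pairing the image, the adversarially authored question, the ground-truth answer and image annotations, and the targeted error category. The sketch below is a hypothetical schema for such a record; the field names and example values are assumptions for illustration, not the released file format.

```python
from dataclasses import dataclass


@dataclass
class MMHalInstance:
    """Hypothetical representation of one MMHal-Bench image-question pair."""
    image_id: str             # OpenImages image identifier
    image_url: str            # link to the validation/test-split image
    question: str             # adversarially authored question
    gt_answer: str            # ground-truth, image-grounded answer
    image_content: list[str]  # annotated objects/facts visible in the image
    category: str             # one of the eight error typologies
    topic: str                # one of the 12 COCO-derived object meta-categories


example = MMHalInstance(
    image_id="0001",
    image_url="https://example.org/openimages/0001.jpg",  # placeholder URL
    question="What color is the umbrella the man on the left is holding?",
    gt_answer="There is no umbrella in the image; the man is holding a cane.",
    image_content=["man", "cane", "park bench"],
    category="Adversarial Object",
    topic="accessory",
)
```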
3. Evaluation Protocol
MMHal-Bench uses a dual-evaluator protocol integrating both automated and human assessment. For each instance, ground-truth answers and detailed image annotations are supplied alongside the model's response. GPT-4 serves as the primary automated evaluator by comparing model outputs to factual references and adjudicating the presence, type, and severity of hallucinations.
In parallel, human judgments are collected through Amazon Mechanical Turk, guided by detailed instructions and reference examples to ensure consistent rating criteria. The human ratings are combined with the outputs of the automated system to further calibrate and validate the hallucination metrics.
The evaluation returns the following metrics for each model (a sketch of how they can be computed from per-instance ratings follows the list):
- Overall Score: Summary performance indicator where higher values indicate superior alignment and factuality.
- Hallucination Rate: Proportion of outputs marked as hallucinated; lower values are favorable.
- Category Scores: Granular ratings across the eight error types.
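A minimal sketch of how these three metrics could be computed from per-instance ratings is shown below. It assumes the evaluator (GPT-4 or a human rater) returns an integer rating on a 0-6 scale for each response and that ratings below a fixed threshold count as hallucinated; the threshold value, prompt wording, and function names are illustrative assumptions rather than the benchmark's released implementation.

```python
from collections import defaultdict
from statistics import mean

# Assumed cutoff: ratings below this value count as hallucinated.
HALLUCINATION_THRESHOLD = 3


def build_judge_prompt(question: str, gt_answer: str,
                       image_content: list[str], model_response: str) -> str:
    """Illustrative judge prompt: the evaluator sees the question, the
    reference answer, the image annotations, and the model's response,
    and is asked to rate factual grounding on a 0-6 scale."""
    facts = "; ".join(image_content)
    return (
        f"Image content: {facts}\n"
        f"Question: {question}\n"
        f"Reference answer: {gt_answer}\n"
        f"Model response: {model_response}\n"
        "Rate the response from 0 (severe hallucination) to 6 (fully grounded). "
        "Reply with a single integer."
    )


def aggregate_scores(results: list[dict]) -> dict:
    """Compute overall score, hallucination rate, and per-category scores.

    Each result dict is expected to look like:
        {"category": "Attribute", "rating": 4}
    """
    overall = mean(r["rating"] for r in results)
    halluc_rate = mean(r["rating"] < HALLUCINATION_THRESHOLD for r in results)
    per_category = defaultdict(list)
    for r in results:
        per_category[r["category"]].append(r["rating"])
    category_scores = {cat: mean(vals) for cat, vals in per_category.items()}
    return {
        "overall_score": overall,
        "hallucination_rate": halluc_rate,
        "category_scores": category_scores,
    }
```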
4. Benchmark Results and Comparative Analysis
Empirical results on MMHal-Bench underscore the importance of targeted alignment strategies. For instance, LLaVA-RLHF at the 13B parameter scale (RLHF-aligned) reaches an overall score of 2.53, outperforming several baselines; it scores 3.33 in the “Attribute” category and 2.67 in “Adversarial Object,” with a competitive hallucination rate.
When compared against established baselines such as InstructBLIP, IDEFICS, and Kosmos-2, models trained with Reinforcement Learning from Human Feedback (RLHF), and in particular the factually augmented variant, demonstrate markedly lower hallucination rates while preserving response quality. This evidence substantiates the benchmark's ability to distinguish models on the axis of factual grounding.
5. Factually Augmented RLHF Integration
A critical component of the associated research is the integration of Factually Augmented RLHF (Fact-RLHF). Unlike standard RLHF, Fact-RLHF enriches the reward model with additional factual features, such as ground-truth image captions or rationales, mitigating the risk of “reward hacking”—where superficial but verbose outputs may achieve high reward despite factual inaccuracy.
The reward model is trained with a pairwise preference loss incorporating human preference signals and factual augmentation:

$$
\mathcal{L}(r_\theta) = -\,\mathbb{E}_{(\mathcal{I},\, x,\, y_0,\, y_1,\, i)\,\sim\,\mathcal{D}}\Big[\log \sigma\big(r_\theta(\mathcal{I}, x, y_i) - r_\theta(\mathcal{I}, x, y_{1-i})\big)\Big]
$$

where $r_\theta$ is the reward function parameterized by $\theta$, $\mathcal{I}$ denotes the image context, $x$ the prompt, $y_0$ and $y_1$ the candidate responses, and $y_i$ the preferred response. The key innovation is the explicit use of factual context (e.g., ground-truth captions or rationales) as additional input to the reward model during training, which penalizes hallucinated content and aligns model outputs with ground-truth evidence.
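To make the objective concrete, below is a minimal PyTorch sketch of this pairwise loss with a fact-augmented reward model. The `reward_model` interface, the way facts are concatenated with the prompt tokens, and the tensor layout are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F


def fact_rlhf_reward_loss(reward_model, image_feats, prompt_ids, facts_ids,
                          preferred_ids, rejected_ids):
    """Pairwise preference loss for a factually augmented reward model.

    Assumes reward_model maps (image features, token ids) -> scalar reward
    per example. `facts_ids` are tokenized ground-truth captions/annotations
    prepended to the prompt so the reward model can check factual grounding.
    """
    # Augment the prompt with factual context (the core Fact-RLHF idea).
    augmented_prompt = torch.cat([facts_ids, prompt_ids], dim=-1)

    # Score the preferred and rejected responses under the same context.
    r_preferred = reward_model(
        image_feats, torch.cat([augmented_prompt, preferred_ids], dim=-1))
    r_rejected = reward_model(
        image_feats, torch.cat([augmented_prompt, rejected_ids], dim=-1))

    # Standard log-sigmoid (Bradley-Terry) preference objective:
    # minimize -log sigma(r(preferred) - r(rejected)).
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```

Because the reward model sees the ground-truth facts, verbose but unsupported responses can no longer earn high reward on fluency alone, which is exactly the reward-hacking failure mode that Fact-RLHF is designed to mitigate.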
6. Real-World Implications
MMHal-Bench directly addresses the operational risk posed by hallucinations in multimodal AI. In application domains such as automated reporting, medical triage, assistive technologies, educational tools, and surveillance, factual inaccuracy can have significant consequences. The specificity of MMHal-Bench in evaluating and penalizing hallucinations establishes a higher standard of reliability for deployed systems.
Furthermore, by exposing the limitations of current alignment strategies and providing fine-grained, adversarial test cases, MMHal-Bench advances the iterative development of training techniques (e.g., Fact-RLHF) that explicitly target factual consistency. This facilitates the design of next-generation LMMs that are more transparent, accountable, and aligned with human expectations for truthfulness and robustness.
7. Broader Impact and Future Directions
By offering a rigorous, category-specific evaluation suite, MMHal-Bench expands the methodological foundation for multimodal model assessment. Its integration of automated (e.g., GPT-4) and human-in-the-loop evaluation pipelines supports both scalable benchmarking and metrics grounded in human judgment.
A plausible implication is that widespread adoption of benchmarks like MMHal-Bench will increasingly influence both model development and deployment criteria, especially as multidisciplinary applications demand ever-greater factual reliability. The synergy between MMHal-Bench and training paradigms such as Fact-RLHF exemplifies the trend toward end-to-end frameworks that seamlessly connect evaluation and alignment, ultimately shaping the evolution of trustworthy multimodal AI systems.