AMBER: An LLM-free Multi-dimensional Benchmark for MLLM Hallucination Evaluation
The paper presents a comprehensive examination of the hallucination problem in Multi-modal LLMs (MLLMs) and introduces a new benchmark named AMBER. Its primary motivation is the observed tendency of MLLMs to produce hallucinatory content: statements that appear plausible but are unfaithful to the input image. This propensity poses significant challenges for deploying these models in practical applications where accuracy and truthfulness are paramount.
Hallucination Challenges in MLLMs
MLLMs have significantly advanced complex vision-language tasks by integrating visual encoders with traditional LLMs. Despite these advances, they often generate content that does not accurately reflect the visual inputs they process, a phenomenon commonly referred to as "hallucination." Existing methods for evaluating hallucinations suffer from several drawbacks: they are costly, because they frequently rely on human judgment or other LLMs for validation, and they cover only a limited set of hallucination types and evaluation dimensions.
The AMBER Benchmark
To address these limitations, the authors propose AMBER, an LLM-free multi-dimensional benchmark designed to evaluate both generative and discriminative tasks for existence, attribute, and relation hallucinations.
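For concreteness, the two task formats can be contrasted with a minimal sketch. The prompt templates and the `evaluate_discriminative` helper below are illustrative assumptions rather than the benchmark's verbatim wording; they are only meant to show how an open-ended generative query differs from yes/no discriminative queries across the three hallucination dimensions.

```python
# Illustrative sketch of the two AMBER-style task formats (the templates
# below are assumptions, not the benchmark's verbatim prompts).

# Generative task: a single open-ended instruction; the full response is
# later checked against the image's annotations.
GENERATIVE_PROMPT = "Describe this image."

# Discriminative tasks: yes/no questions that each probe one fact, one per
# hallucination dimension (existence, attribute, relation).
DISCRIMINATIVE_PROMPTS = {
    "existence": "Is there a dog in this image?",
    "attribute": "Is the dog in this image black?",
    "relation": "Is the dog lying on the sofa?",
}

def evaluate_discriminative(model_answer: str, ground_truth: bool) -> bool:
    """Map a free-form yes/no answer to a boolean and compare with ground truth."""
    predicted = model_answer.strip().lower().startswith("yes")
    return predicted == ground_truth

# Example: if the image indeed shows a black dog, a "Yes" answer to the
# attribute question counts as correct.
print(evaluate_discriminative("Yes, the dog is black.", ground_truth=True))  # True
```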
- Data Collection and Annotation: The construction of AMBER involves sourcing a diverse set of high-quality images and providing detailed annotations covering objects, attributes, relations, and potential hallucinatory targets. This robust dataset enables comprehensive evaluations across different hallucination types.
- Evaluation Metrics and Pipeline: AMBER is equipped with a low-cost and efficient evaluation pipeline. It uses a combination of metrics such as CHAIR, Cover, Hal, and Cog for generative tasks, and standard classification metrics (Accuracy, Precision, Recall, and F1) for discriminative tasks. These evaluations are made without relying on external LLMs, thus ensuring scalability and cost-effectiveness.
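To make the generative-task metrics concrete, the sketch below computes per-image CHAIR, Cover, Hal, and Cog scores from the set of objects mentioned in a response and the image's annotations. The metric definitions follow the paper's high-level descriptions, but the data representation (plain sets of object names) and the `generative_metrics` function are assumptions made for illustration; the real pipeline also requires an extractor that maps a free-form description to the objects it mentions.

```python
from typing import Dict, Set

def generative_metrics(
    response_objects: Set[str],      # objects mentioned in the model's description
    annotated_objects: Set[str],     # ground-truth objects annotated for the image
    hallucination_targets: Set[str], # annotated likely-hallucination target objects
) -> Dict[str, float]:
    """Per-image generative metrics in the spirit of AMBER.

    CHAIR: fraction of mentioned objects that are not in the image.
    Cover: fraction of annotated objects that the response mentions.
    Hal:   1.0 if the response contains any hallucinated object, else 0.0.
    Cog:   fraction of mentioned objects that coincide with the annotated
           hallucinatory targets (human-like hallucinations).
    Benchmark-level scores average these values over all images.
    """
    if not response_objects:
        return {"CHAIR": 0.0, "Cover": 0.0, "Hal": 0.0, "Cog": 0.0}

    hits = response_objects & annotated_objects
    chair = 1.0 - len(hits) / len(response_objects)
    cover = len(hits) / len(annotated_objects) if annotated_objects else 0.0
    hal = 1.0 if chair > 0.0 else 0.0
    cog = len(response_objects & hallucination_targets) / len(response_objects)
    return {"CHAIR": chair, "Cover": cover, "Hal": hal, "Cog": cog}

# Example: the model mentions a "cat" that is not in the image and happens to
# be one of the annotated hallucinatory targets.
print(generative_metrics({"dog", "sofa", "cat"},
                         {"dog", "sofa", "person"},
                         {"cat"}))
# -> {'CHAIR': 0.33, 'Cover': 0.67, 'Hal': 1.0, 'Cog': 0.33} (approximately)
```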
Key Findings
The paper conducts a detailed evaluation of nine mainstream MLLMs, including GPT-4V(ision), using the AMBER benchmark to assess their susceptibility to hallucinations. The results show that hallucinations remain prevalent among current MLLMs, with varying degrees of severity across models. Notably, GPT-4V performed best among the tested models, producing the fewest hallucinations while maintaining coverage of the actual image content.
- Generative vs. Discriminative Tasks: The analysis reveals that MLLMs exhibit more frequent hallucinations in open-ended generative tasks compared to discriminative tasks, suggesting that structured task queries may help reduce hallucination incidence.
- Model-specific Observations: The paper highlights variations among models, indicating that both the architecture and the training data significantly impact the incidence of hallucinations. Upgraded models, enhanced with more robust visual and language components, show improved performance, suggesting potential strategies for mitigating hallucinations.
Implications and Future Directions
The establishment of the AMBER benchmark marks a significant step towards a standardized and rigorous framework for evaluating hallucinations in MLLMs. The insights from this paper carry important implications for both model development and practical deployment:
- Enhancing Training Protocols: The findings underscore the necessity of incorporating diverse and exhaustive datasets that capture a wide array of possible attributes and relations to mitigate hallucinations.
- Improved Model Design: Building upon the insights related to hallucination biases, future work should focus on refining model architectures and training regimes, possibly integrating more sophisticated error correction mechanisms.
- Broadening Application Scenarios: By effectively identifying and addressing hallucinations, MLLMs can be made more reliable for real-world applications, spanning domains where accurate vision-language interpretation is crucial.
In conclusion, the AMBER benchmark provides a critical tool for the AI research community to assess and improve the fidelity of MLLMs, thereby contributing to the development of models that are both technically robust and practically applicable.