Unveiling the Judging Capabilities of Multimodal LLMs
Multimodal LLMs (MLLMs) combine visual comprehension with linguistic processing, a significant stride toward artificial general intelligence. Building on this step, our paper introduces MLLM-as-a-Judge, a benchmark designed to systematically evaluate how well MLLMs perform as autonomous evaluators across a variety of multimodal tasks, rigorously scrutinizing their ability to deliver judgments that mirror human preferences and discernment.
Benchmark Development and Key Findings
Our benchmark is built around three core tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. It comprises a meticulously curated selection of 3,300 image-instruction pairs drawn from a wide range of fields, including image captioning, math reasoning, and infographic interpretation. Using four prominent MLLMs (GPT-4V, Gemini, LLaVA, and CogVLM), we conduct an extensive evaluation of their judgment consistency, bias, and susceptibility to hallucination against human-labeled standards.
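To make the three tasks concrete, the sketch below shows one way to measure agreement between a judge model's outputs and human labels for each task. The metric choices here (Pearson correlation for scoring, exact-match accuracy for pair comparison, Spearman rank correlation for batch ranking) are illustrative assumptions, not necessarily the paper's exact protocol.

```python
# Illustrative agreement metrics for the three judging tasks.
# Metric choices are assumptions for illustration only.
from scipy.stats import pearsonr, spearmanr


def scoring_agreement(model_scores: list[float], human_scores: list[float]) -> float:
    """Correlation between model-assigned and human-assigned scores."""
    return pearsonr(model_scores, human_scores)[0]


def pair_agreement(model_choices: list[str], human_choices: list[str]) -> float:
    """Fraction of pair comparisons where the judge picks the human-preferred response."""
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)


def ranking_agreement(model_rank: list[int], human_rank: list[int]) -> float:
    """Rank correlation between the judge's and the human ordering of a response batch."""
    return spearmanr(model_rank, human_rank)[0]


# Toy example: one scoring set, one pair-comparison set, one batch ranking.
print(scoring_agreement([4.0, 2.5, 5.0], [4.5, 2.0, 5.0]))
print(pair_agreement(["A", "B", "A"], ["A", "A", "A"]))
print(ranking_agreement([1, 3, 2, 4], [1, 2, 3, 4]))
```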
A notable finding is that MLLM judgments align substantially with human preferences in the Pair Comparison task. However, significant discrepancies emerge in Scoring Evaluation and Batch Ranking, particularly in areas requiring complex reasoning. These results reveal a crucial gap between MLLM-generated judgments and human expectations, highlighting where these models falter.
Challenges and Implications
Our analysis further sheds light on persistent challenges faced by MLLMs. These include a propensity for egocentric, position, and length biases, and a tendency to generate hallucinatory responses. Interestingly, applying Chain-of-Thought reasoning and integrating a vision expert system show potential for mitigating some of these biases.
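As one example of how such biases can be surfaced, the sketch below probes position bias in a pair-comparison judge by swapping the order of the two candidate responses and checking whether the verdict follows position rather than content. The `judge` callable is hypothetical, and swap-and-compare is a common probing technique rather than the paper's exact procedure.

```python
# Illustrative position-bias probe for a pair-comparison judge.
# `judge(image, instruction, resp_a, resp_b)` is a hypothetical callable
# returning "A" or "B" for whichever response it prefers.
def position_bias_rate(judge, examples) -> float:
    """Fraction of examples where the judge prefers the same *slot* after the
    two responses are swapped, i.e. the verdict is driven by position."""
    biased = 0
    for image, instruction, resp_a, resp_b in examples:
        original = judge(image, instruction, resp_a, resp_b)
        swapped = judge(image, instruction, resp_b, resp_a)
        # A content-consistent judge flips its letter when the responses swap;
        # picking the same letter both times means it followed the position.
        if original == swapped:
            biased += 1
    return biased / len(examples)
```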
Importantly, our work presents two novel datasets: MLLM-as-a-Judge-HQ, comprising responses highly aligned with human judgments, and MLLM-as-a-Judge-HARD, featuring responses marked by inconsistencies and hallucinations. These datasets are envisioned as a rigorous testing ground for advancing MLLMs.
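The split between the two subsets can be pictured as a simple filter over judged samples, as in the minimal sketch below. The `agreement` and `hallucinated` fields and the 0.8 threshold are hypothetical placeholders; the paper's actual curation criteria may differ.

```python
# Illustrative sketch of partitioning judged samples into HQ and HARD subsets.
# Field names and threshold are assumptions, not the paper's curation rules.
def split_hq_hard(samples: list[dict], threshold: float = 0.8):
    hq, hard = [], []
    for s in samples:
        if s["hallucinated"] or s["agreement"] < threshold:
            hard.append(s)   # inconsistent or hallucinatory judgments
        else:
            hq.append(s)     # judgments closely aligned with human annotators
    return hq, hard
```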
Contributions and Future Directions
By introducing the MLLM-as-a-Judge benchmark, our research paves the way for systematic assessment of MLLMs' judging abilities in multimodal tasks. The discrepancies uncovered between MLLM judgments and human preferences raise critical questions about the need for greater accuracy, fairness, and interpretability in AI-based evaluation.
As MLLM research moves forward, it is imperative to address the identified limitations, biases, and hallucinations in order to develop MLLMs that can reliably perform judgment tasks across diverse modalities. Our benchmark and datasets support this effort, urging the AI community to pursue solutions that bridge the gap between machine-generated judgments and human expectations.