Unveiling the Judging Capabilities of Multimodal LLMs
Multimodal LLMs (MLLMs) combine visual comprehension with linguistic processing, a significant stride toward artificial general intelligence. Building on this step, our paper introduces MLLM-as-a-Judge, a benchmark designed to systematically evaluate how well MLLMs perform as autonomous evaluators across a variety of multimodal tasks, rigorously scrutinizing their ability to deliver judgments that mirror human preferences and discernment.
Benchmark Development and Key Findings
Our benchmark is built around three core tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. It comprises a meticulously curated selection of 3,300 image-instruction pairs drawn from a wide range of fields, including image captioning, math reasoning, and infographic interpretation. Using four prominent MLLMs (GPT-4V, Gemini, LLaVA, and CogVLM), we conduct an extensive evaluation of their judgment consistency, bias, and susceptibility to hallucination against human-labeled standards.
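To make the three tasks concrete, the sketch below shows one way to measure agreement between a judge model's outputs and human labels for each task. The metric choices here (Pearson correlation for scoring, exact-match accuracy for pair comparison, Spearman rank correlation for batch ranking) are illustrative assumptions, not necessarily the paper's exact protocol.

```python
# Illustrative agreement metrics for the three judging tasks.
# Metric choices are assumptions for illustration only.
from scipy.stats import pearsonr, spearmanr


def scoring_agreement(model_scores: list[float], human_scores: list[float]) -> float:
    """Correlation between model-assigned and human-assigned scores."""
    return pearsonr(model_scores, human_scores)[0]


def pair_agreement(model_choices: list[str], human_choices: list[str]) -> float:
    """Fraction of pair comparisons where the judge picks the human-preferred response."""
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(human_choices)


def ranking_agreement(model_rank: list[int], human_rank: list[int]) -> float:
    """Rank correlation between the judge's and the human ordering of a response batch."""
    return spearmanr(model_rank, human_rank)[0]


# Toy example: one scoring set, one pair-comparison set, one batch ranking.
print(scoring_agreement([4.0, 2.5, 5.0], [4.5, 2.0, 5.0]))
print(pair_agreement(["A", "B", "A"], ["A", "A", "A"]))
print(ranking_agreement([1, 3, 2, 4], [1, 2, 3, 4]))
```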
A notable finding is that MLLM judgments align substantially with human preferences in the Pair Comparison task. However, significant discrepancies emerge in Scoring Evaluation and Batch Ranking, particularly in areas requiring complex reasoning. These results reveal a crucial gap between MLLM-generated judgments and human expectations, highlighting where these models falter.
Challenges and Implications
Our analysis further sheds light on persistent challenges faced by MLLMs. These include a propensity for egocentric, position, and length biases, and a tendency to generate hallucinatory responses. Interestingly, applying Chain-of-Thought reasoning and integrating a vision expert system show potential for mitigating some of these biases.
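As one example of how such biases can be surfaced, the sketch below probes position bias in a pair-comparison judge by swapping the order of the two candidate responses and checking whether the verdict follows position rather than content. The `judge` callable is hypothetical, and swap-and-compare is a common probing technique rather than the paper's exact procedure.

```python
# Illustrative position-bias probe for a pair-comparison judge.
# `judge(image, instruction, resp_a, resp_b)` is a hypothetical callable
# returning "A" or "B" for whichever response it prefers.
def position_bias_rate(judge, examples) -> float:
    """Fraction of examples where the judge prefers the same *slot* after the
    two responses are swapped, i.e. the verdict is driven by position."""
    biased = 0
    for image, instruction, resp_a, resp_b in examples:
        original = judge(image, instruction, resp_a, resp_b)
        swapped = judge(image, instruction, resp_b, resp_a)
        # A content-consistent judge flips its letter when the responses swap;
        # picking the same letter both times means it followed the position.
        if original == swapped:
            biased += 1
    return biased / len(examples)
```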
Importantly, our work presents two novel datasets: MLLM-as-a-Judge-HQ, comprising responses highly aligned with human judgments, and MLLM-as-a-Judge-HARD, featuring responses marked by inconsistencies and hallucinations. These datasets are envisioned as a rigorous testing ground for advancing MLLMs.
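The split between the two subsets can be pictured as a simple filter over judged samples, as in the minimal sketch below. The `agreement` and `hallucinated` fields and the 0.8 threshold are hypothetical placeholders; the paper's actual curation criteria may differ.

```python
# Illustrative sketch of partitioning judged samples into HQ and HARD subsets.
# Field names and threshold are assumptions, not the paper's curation rules.
def split_hq_hard(samples: list[dict], threshold: float = 0.8):
    hq, hard = [], []
    for s in samples:
        if s["hallucinated"] or s["agreement"] < threshold:
            hard.append(s)   # inconsistent or hallucinatory judgments
        else:
            hq.append(s)     # judgments closely aligned with human annotators
    return hq, hard
```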
Contributions and Future Directions
By introducing the MLLM-as-a-Judge benchmark, our research paves the way for systematic assessment of MLLMs' judging abilities in multimodal tasks. The discrepancies uncovered between MLLM judgments and human preferences raise critical questions about the need for greater accuracy, fairness, and interpretability in AI-based evaluation.
As MLLM research moves forward, it is imperative to address the identified limitations, biases, and hallucinations in order to develop MLLMs that can reliably perform judgment tasks across diverse modalities. Our benchmark and datasets support this effort, urging the AI community to pursue solutions that bridge the gap between machine-generated judgments and human expectations.