MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
The paper "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?" addresses a pressing need in evaluating the effectiveness of multimodal foundation models (FMs) serving as reward models for text-to-image generative tasks. Motivated by the challenges that text-to-image models like DALLE-3 and Stable Diffusion face, such as hallucination, biased generation, and production of unsafe or low-quality outputs, the authors propose MJ-Bench, a novel benchmarking framework tailored to evaluate multimodal judges across four critical dimensions: alignment, safety, image quality, and bias.
Benchmark Overview
MJ-Bench synthesizes a comprehensive preference dataset to systematically evaluate a broad spectrum of multimodal judges: smaller CLIP-based scoring models, open-source vision-language models (VLMs) such as LLaVA, and several proprietary VLMs, notably GPT-4o, Gemini Ultra, and Claude 3. The benchmark rigorously tests each judge on the decomposed subcategories within the four perspectives above.
Key Findings:
- Alignment: Closed-source VLMs, particularly GPT-4o, offered the most accurate feedback overall; among the remaining judges, smaller CLIP-based scoring models outperformed open-source VLMs on text-image alignment and image quality.
- Safety: VLMs provided more accurate feedback regarding safety and bias due to their advanced reasoning capabilities.
- Image Quality: Under numerical-scale assessment, smaller CLIP-based scoring models showed a pronounced accuracy advantage in feedback on alignment and image quality.
- Bias: Incorporating a novel combination of evaluation metrics, including the Gini-based Equality Score and Normalized Dispersion Score, MJ-Bench highlights how different judges grapple with demographic biases, revealing significant disparities in their assessments (see the sketch after this list).
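As a rough illustration of the bias metrics, the sketch below computes a Gini-based equality score and a normalized dispersion score over the scores a judge assigns to demographic variants of the same prompt. The formulas used here (GES = 1 − Gini, NDS = 1 − std/mean) are plausible readings of the metric names rather than the paper's exact definitions, and the example scores are hypothetical.

```python
import numpy as np

def gini(scores: np.ndarray) -> float:
    """Gini coefficient of non-negative scores (0 = perfect equality)."""
    s = np.sort(scores.astype(float))
    n = s.size
    cum = np.cumsum(s)
    # Standard discrete Gini formula over the sorted cumulative sums.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def gini_equality_score(scores: np.ndarray) -> float:
    """Higher = more equal treatment of groups (assumed GES = 1 - Gini)."""
    return 1.0 - gini(scores)

def normalized_dispersion_score(scores: np.ndarray) -> float:
    """Higher = less spread across groups (assumed NDS = 1 - std/mean)."""
    mu = scores.mean()
    return 1.0 - scores.std() / mu if mu > 0 else 0.0

# Hypothetical judge scores for the same occupation prompt rendered with
# different demographic attributes (e.g., age x gender combinations).
group_scores = np.array([0.82, 0.79, 0.85, 0.80, 0.78, 0.83])
print(gini_equality_score(group_scores), normalized_dispersion_score(group_scores))
```

A perfectly unbiased judge would assign identical scores to every variant, driving both metrics to 1.0.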
MJ-Bench's dataset is meticulously structured to ensure that evaluation is multifaceted and captures subtle nuances in image generation tasks. Each perspective is subdivided into several subcategories, enabling a nuanced and comprehensive assessment of reward models.
Dataset Composition and Metrics
Alignment: The dataset evaluates text-to-image alignment against five verifiable sub-objectives: object presence, attribute accuracy, action depiction, spatial relationships, and object counting. This allows a detailed view of how well judges can verify coherence between a textual prompt and the generated image. For example, the attribute subcategory checks that elements such as color and texture are rendered as specified in the prompt.
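At evaluation time, each alignment item boils down to a (prompt, chosen image, rejected image) triple and the question of whether the judge ranks the chosen image higher. The sketch below shows this for a generic CLIP-style scoring judge; the checkpoint and file names are stand-ins, since MJ-Bench's actual scoring judges (PickScore-v1, HPS-v2.1, and similar) are fine-tuned variants with their own preprocessing.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint as a stand-in for a fine-tuned scoring judge.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    """Scaled image-text similarity used as a scalar reward."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        return model(**inputs).logits_per_image.item()

def prefers_chosen(prompt: str, chosen: Image.Image, rejected: Image.Image) -> bool:
    """True if the judge ranks the ground-truth preferred image higher."""
    return clip_score(prompt, chosen) > clip_score(prompt, rejected)

# Hypothetical files for one alignment preference pair:
# ok = prefers_chosen("a red cube on top of a blue sphere",
#                     Image.open("chosen.png"), Image.open("rejected.png"))
```

Aggregating `prefers_chosen` over all pairs in a subcategory gives that judge's feedback accuracy for the subcategory.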
Safety: Divided into toxicity (crime, shocking, and disgust) and NSFW (evident, subtle, and evasive) subcategories, this dataset tests the models' ability to filter out harmful or inappropriate content. This is particularly crucial in preventing the generation of offensive images, and the findings underline the importance of advanced reasoning capabilities in contemporary VLMs.
Image Quality: Using distortions (e.g., deformed human faces and limbs) and blurring (such as defocus and motion blur) applied to synthetic and real-world images, this category assesses whether judges reliably prefer the clean image over its degraded counterpart. Such feedback is what steers generators toward sharper, artifact-free images, which is integral to applications demanding high fidelity.
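As a rough illustration, the blur side of such quality pairs can be approximated with standard image filters; the distortion side (malformed faces or hands) requires generative edits and is not reproduced here. A minimal sketch using PIL, with filter radii chosen arbitrarily:

```python
from PIL import Image, ImageFilter

def make_blur_negatives(image: Image.Image) -> dict[str, Image.Image]:
    """Degraded counterparts of a clean image for quality preference pairs."""
    return {
        # Defocus-style blur.
        "defocused": image.filter(ImageFilter.GaussianBlur(radius=4)),
        # Crude stand-in for motion blur (PIL has no directional blur built in).
        "motion_like": image.filter(ImageFilter.BoxBlur(radius=6)),
    }

# original = Image.open("clean_sample.png")   # hypothetical path
# negatives = make_blur_negatives(original)   # a good judge prefers `original`
```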
Bias: MJ-Bench's bias assessment involves a comprehensive evaluation across various demographic dimensions (age, race, gender, nationality, religion). This is crucial in ensuring the models do not reinforce stereotypes and maintain fairness in their generative outputs. The dataset's structured pairwise combinations of demographic characteristics allow for a detailed understanding of potential biases.
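To make the structured pairwise setup concrete, the sketch below enumerates demographic variants of a single occupation prompt and the pairs an unbiased judge should score (near-)equally. The attribute values and prompt template are illustrative placeholders, not the paper's exact lists.

```python
from itertools import combinations, product

ages = ["young", "middle-aged", "elderly"]                 # illustrative values
genders = ["female", "male", "non-binary"]                 # illustrative values
template = "a portrait photo of a {age} {gender} doctor"   # hypothetical template

# Every demographic variant of the same occupation prompt ...
variants = [template.format(age=a, gender=g) for a, g in product(ages, genders)]

# ... and every pair of variants an unbiased judge should score equally.
pairs = list(combinations(variants, 2))
print(f"{len(variants)} variants -> {len(pairs)} comparison pairs")
```

The judge's scores over each variant set feed directly into the GES and NDS metrics sketched earlier.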
Evaluation Findings
- Alignment Feedback: VLMs performed well, with GPT-4o standing out in accurately assessing text-image alignment. This suggests that sophisticated, closed-source VLMs can deliver more effective feedback on complex generative tasks than their open-source counterparts.
- Safety Feedback: The benchmark's findings reveal that models like GPT-4o and Gemini Ultra offer markedly more reliable safety assessments than other judges. This highlights the need for continued improvement in open-source models to match the safety-evaluation capability of proprietary solutions.
- Image Quality Feedback: Scoring models demonstrated variance in feedback accuracy, reflecting their different pretraining regimes. HPS-v2.1 and PickScore-v1 led in providing higher fidelity image quality assessments.
- Bias Assessment: The paper's use of multiple metrics (ACC, GES, NDS) provided a nuanced understanding of demographic bias within the judges' feedback. Closed-source VLMs generally yielded less biased assessments, likely reflecting their more extensive pretraining on diverse data (the scoring-to-preference step underlying these accuracies is sketched after this list).
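All of the accuracy numbers above rest on turning a judge's per-image scores into a pairwise decision. The sketch below shows one common way to do this with a tie margin; the margin value and the half-credit treatment of ties are assumptions for illustration, not MJ-Bench's exact protocol.

```python
def pairwise_preference(score_chosen: float, score_rejected: float,
                        tie_margin: float = 0.0) -> str:
    """Map two per-image judge scores to a preference label.

    A positive tie_margin treats near-equal scores as a tie, which matters
    for VLM judges queried on coarse numeric scales (e.g., 0-5 vs 0-10).
    """
    if score_chosen - score_rejected > tie_margin:
        return "chosen"
    if score_rejected - score_chosen > tie_margin:
        return "rejected"
    return "tie"

def accuracy(decisions: list[str], count_tie_as_half: bool = True) -> float:
    """Fraction of pairs where the judge prefers the ground-truth chosen image."""
    correct = sum(d == "chosen" for d in decisions)
    if count_tie_as_half:
        correct += 0.5 * sum(d == "tie" for d in decisions)  # assumed tie handling
    return correct / len(decisions)
```

Because the tie margin interacts with the rating scale, the same judge can look better or worse depending on how its numeric outputs are discretized, which is one reason the paper probes judges under different scales.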
Implications and Future Prospects
The findings from MJ-Bench indicate several critical implications for the development and deployment of text-to-image generative models:
- The necessity of using sophisticated, closed-source VLMs like GPT-4o for nuanced and safe generative tasks.
- The requirement for continuous advancement in open-source models to achieve comparable efficacy in multimodal assessments.
- The need for balanced pretraining data to mitigate inherent biases and ensure fair generative outputs.
Looking forward, as AI continues to evolve, MJ-Bench presents a robust framework for guiding improvements in multimodal reward models. This benchmark not only highlights current capabilities and limitations but also sets the stage for iterative enhancements to foster the generation of aligned, safe, high-quality, and unbiased images.
In conclusion, MJ-Bench provides a meticulous, multi-faceted evaluation protocol that serves as a crucial tool for advancing the field of text-to-image generative models. Its contributions lie in offering a comprehensive perspective necessary to refine and align such systems with human expectations and ethical standards.