Multimodal AI as Human-Aligned Annotators: Optimal Strategies in Text-to-Image Generation
The paper "Multimodal LLM is a Human-Aligned Annotator for Text-to-Image Generation" presents an analytical approach to leveraging multimodal LLMs (MLLMs) such as GPT-4 Vision, as annotators to optimize text-to-image generative models. Given the recent advancements yet existing challenges in the field—such as generating images that adhere to human preferences while maintaining safety and quality—this paper proposes a cost-effective and scalable solution: VisionPrefer.
VisionPrefer is a preference dataset sourced from MLLMs, capturing four aspects of image quality: prompt-following, aesthetics, fidelity, and harmlessness. The intriguing hypothesis is that AI models, trained extensively on text-image pairs, can act as quasi-human annotators, offering an alternative to the labor-intensive and often biased human annotation processes that currently dominate the field.
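To make the annotation setup concrete, the sketch below shows how one might ask a vision-capable LLM to compare two generated images for a prompt along the four aspects above. It is a minimal illustration, not the paper's pipeline: the model name (`gpt-4o`), the rubric wording, and the JSON schema are all assumptions made for demonstration.

```python
import json
from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are an image-quality annotator. Rate each of the two images on a "
    "1-10 scale for: prompt_following, aesthetics, fidelity, harmlessness. "
    'Reply with JSON: {"image_1": {...}, "image_2": {...}}.'
)

def annotate_pair(prompt: str, url_1: str, url_2: str) -> dict:
    """Ask the MLLM to score two candidate images for one text prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the paper uses GPT-4 Vision
        response_format={"type": "json_object"},  # nudge toward parseable output
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{RUBRIC}\nPrompt: {prompt}"},
                {"type": "image_url", "image_url": {"url": url_1}},
                {"type": "image_url", "image_url": {"url": url_2}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Run at scale over many prompt-image pairs, this kind of query yields pairwise preference data analogous to what human annotators would produce.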
The authors validate VisionPrefer by training a reward model, VP-Score, on it and applying reinforcement learning from AI feedback (RLAIF) to improve generative models' outputs. Impressively, VP-Score not only achieves accuracy comparable to preference models trained on human annotations, such as HPS v2, but also consistently improves aesthetic alignment and reduces harmful content in generated images.
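Reward models of this kind are typically trained with a Bradley-Terry pairwise loss: the score of the preferred image should exceed the score of the rejected one. The PyTorch sketch below illustrates that standard objective on placeholder features; it is not VP-Score's actual architecture, and the feature extractor and dimensions here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scores a (prompt, image) pair from a joint embedding.

    Stand-in for a model like VP-Score: a real reward model would sit on
    top of an image-text encoder; here the embedding is assumed given.
    """
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb).squeeze(-1)  # scalar reward per pair

model = RewardHead()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Placeholder batch: embeddings of the preferred and rejected images
# for the same prompt (in practice, produced by a frozen encoder).
emb_preferred = torch.randn(32, 768)
emb_rejected = torch.randn(32, 768)

# Bradley-Terry / pairwise logistic loss: push the preferred image's
# reward above the rejected image's reward.
loss = -F.logsigmoid(model(emb_preferred) - model(emb_rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The trained scalar reward then serves as the optimization signal in the RLAIF stage, for example as the return in a policy-gradient fine-tuning loop over the generative model.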
One of the paper's key quantitative results concerns VP-Score's preference prediction accuracy. Tested on human-preference benchmarks such as ImageRewardDB and HPD v2, it reaches a competitive average accuracy of 70.46%, indicating that AI-generated preference data can approach, and in some respects match, human judgment. These findings are pivotal because they suggest AI has matured enough to produce feedback signals that align closely with human aesthetics and expectations.
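The accuracy metric itself is straightforward: over a set of human-labeled pairs, count how often the reward model scores the human-preferred image higher. A minimal sketch, assuming scores have already been computed for each pair (tie handling is an assumption here and may differ across benchmarks):

```python
def preference_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred image scores higher.

    Each tuple is (score_of_preferred, score_of_rejected); ties count
    as misses in this sketch.
    """
    hits = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return hits / len(pairs)

# Toy usage: 3 of 4 pairs ranked correctly -> 0.75.
print(preference_accuracy([(0.9, 0.2), (0.4, 0.7), (0.8, 0.1), (0.6, 0.5)]))
```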
The implications of using AI as annotators are manifold. Practically, it offers a data annotation process that is scalable and less resource-intensive. Theoretically, it advances the discussion of whether, and to what extent, AI systems can emulate human judgment and preference in creative content generation. Such developments point to shifts in how training datasets are conceptualized and constructed, offering a sustainable model for building the extensive datasets that advancing AI research requires.
Future work could refine these MLLMs to improve fidelity in difficult areas, such as subtle artistic interpretations or cultural artifacts, which may still require human intuition. Additionally, fine-tuning such systems to track evolving aesthetic trends could deepen our understanding of the dynamic nature of human preferences.
In conclusion, the paper convincingly demonstrates the capability and practicality of using MLLMs as human-aligned annotators, offering a promising approach to systematically improving text-to-image generative models. The success of VP-Score, underpinned by the VisionPrefer dataset, encourages further exploration of aligning AI outputs more closely with human values, potentially yielding robust frameworks for future AI-driven creative applications.