
VisionPrefer: Multimodal Preference Dataset

Updated 15 December 2025
  • VisionPrefer is a large-scale, fine-grained multimodal dataset providing 716,000 prompt-image pairs with detailed scores and rationales.
  • It supports robust reward modeling and policy optimization using RL techniques such as PPO and DPO with aspect-specific evaluations.
  • The dataset leverages AI annotators like GPT-4 Vision to ensure reliable, human-aligned image synthesis and explainable preference analysis.

VisionPrefer is a large-scale, fine-grained multimodal preference dataset developed to enable robust RL from AI Feedback (RLAIF) and instruction tuning for text-to-image generative models. Leveraging AI annotators, primarily GPT-4 Vision, VisionPrefer offers comprehensive preference judgments across several key image quality dimensions, facilitating both supervised and reinforcement learning for controllable, human-aligned image synthesis (Wu et al., 2024).

1. Dataset Scope and Structure

VisionPrefer contains 179,000 unique polished text prompts sourced from DiffusionDB. For every prompt, four images are generated using four state-of-the-art diffusion models (Stable Diffusion v1.5, Stable Diffusion 2.1, Dreamlike Photoreal 2.0, and Stable Diffusion XL), resulting in 716,000 prompt-image pairs. Each image is annotated with four scalar quality scores on a 1–5 scale, one for each of the following aspects: prompt-following, aesthetic, fidelity, and harmlessness.

In addition, a concise textual rationale (2–3 sentences) is recorded for each image and aspect. Pairwise comparisons are then inferred between all possible image pairs (six pairs per prompt, for each aspect), yielding 4.3 million initial pairwise preferences, of which 1.2 million are retained after filtering out ties.
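
To make the derivation concrete, the following sketch shows how per-aspect pairwise preferences can be computed from the four scalar scores of a single prompt, dropping ties; the scores and helper are illustrative, not part of any released tooling.

```python
from itertools import combinations

# Illustrative per-aspect 1-5 scores for the four images of one prompt.
aspect_scores = {
    "prompt_following": [4, 2, 5, 4],
    "aesthetic":        [3, 3, 5, 2],
}

pairwise = []
for aspect, scores in aspect_scores.items():
    # All C(4, 2) = 6 unordered image pairs per prompt and aspect.
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue  # tied pairs are filtered out (~4.3M -> ~1.2M overall)
        pref = i if scores[i] > scores[j] else j
        pairwise.append({"i": i, "j": j, "aspect": aspect, "pref": pref})

print(pairwise)  # the non-tied comparisons for this prompt
```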

Dataset entries conform to a structured JSONL schema:

| Field | Content | Notes |
| --- | --- | --- |
| prompt_id | Unique integer ID | |
| prompt | Polished text (see §2) | |
| images | List of 4 dicts: {image_id, model, guidance, path} | One entry per image of the prompt |
| scores | Dict of 4 lists of int (each of length 4) | 1–5 rating per image, per aspect |
| rationales | Dict of 4 lists of str (each of length 4) | Per-image, per-aspect rationales |
| pairwise | List of dicts {"i", "j", "aspect", "pref"} | Which image is preferred for a given aspect |

This schema enables multi-aspect evaluation and facilitates downstream tasks including reward modeling, policy optimization, and explainable preference analysis.
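
A hypothetical record following this schema might look as follows; every identifier, value, and key spelling below is made up for illustration, and the released files may use different field names.

```python
import json

# Hypothetical VisionPrefer-style record; all values are illustrative.
record = {
    "prompt_id": 12345,
    "prompt": "a red fox resting in a snowy birch forest at dawn",
    "images": [
        {"image_id": "12345_0", "model": "sd-v1-5", "guidance": 7.5, "path": "images/12345_0.png"},
        {"image_id": "12345_1", "model": "sd-2-1", "guidance": 4.0, "path": "images/12345_1.png"},
        {"image_id": "12345_2", "model": "dreamlike-photoreal-2.0", "guidance": 11.0, "path": "images/12345_2.png"},
        {"image_id": "12345_3", "model": "sdxl", "guidance": 6.0, "path": "images/12345_3.png"},
    ],
    # One list of four per-image scores (1-5) for each of the four aspects.
    "scores": {
        "prompt_following": [4, 2, 5, 4],
        "aesthetic": [3, 4, 5, 3],
        "fidelity": [5, 4, 5, 4],
        "harmlessness": [5, 5, 5, 5],
    },
    # 2-3 sentence rationales per image and aspect (abbreviated here).
    "rationales": {
        "prompt_following": ["Fox, snow, and dawn light are all present.", "...", "...", "..."],
        "aesthetic": ["...", "...", "...", "..."],
        "fidelity": ["...", "...", "...", "..."],
        "harmlessness": ["...", "...", "...", "..."],
    },
    # Non-tied comparisons derived from the scores above.
    "pairwise": [
        {"i": 0, "j": 1, "aspect": "prompt_following", "pref": 0},
        # ... remaining comparisons
    ],
}

print(json.dumps(record))  # one line of the JSONL file
```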

2. Data Collection and Annotation Pipeline

VisionPrefer's construction comprises several sequential steps:

  • Prompt De-biasing: Raw prompts from DiffusionDB are first "polished" via a GPT-4 text-editing pipeline that removes platform/artist tags and resolution/style modifiers, resolves conflicting instructions, and normalizes each prompt to a single concise instruction. NSFW content is screened with Detoxify, and high-risk prompts are excluded.
  • Image Generation: Four diverse diffusion models are applied to each prompt, with guidance scales sampled randomly in the 3–12 range to maximize visual diversity and coverage of model-specific behavior (see the generation sketch after this list).
  • AI-based Preference Annotation: For every prompt, GPT-4 Vision, using detailed per-aspect prompt templates, assigns 1–5 scalar scores and concise textual rationales to each image. Pairwise preferences are computed automatically from the scalar scores, producing six comparisons per prompt for each aspect.
  • Human Validation: VisionPrefer's annotation reliability is assessed against HPD and ImageRewardDB benchmarks. GPT-4 Vision achieves pairwise agreement rates of 68–72%, comparable to human expert raters (65–78%). Ablation studies confirm GPT-4 Vision's superiority over Gemini Pro Vision and LLaVA 1.6 34B (accuracy 55–65%).
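
The generation step referenced above can be sketched with the Hugging Face diffusers library as follows; the checkpoint ID, step count, and sampler settings are assumptions rather than the authors' exact configuration, and the other three models are handled analogously.

```python
import random
import torch
from diffusers import StableDiffusionPipeline

# One of the four generators; checkpoint and settings are assumed, not official.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate(prompt: str) -> dict:
    # Guidance scale sampled at random in the 3-12 range to diversify outputs.
    guidance = random.uniform(3.0, 12.0)
    image = pipe(prompt, guidance_scale=guidance, num_inference_steps=30).images[0]
    return {"model": "sd-v1-5", "guidance": guidance, "image": image}

sample = generate("a red fox resting in a snowy birch forest at dawn")
sample["image"].save(f"fox_sd15_g{sample['guidance']:.1f}.png")
```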

This pipeline ensures the resulting dataset captures subtle aspect-level preferences at scale with a degree of reliability directly comparable to human-annotated corpora.

3. Annotation Aspects and Statistical Properties

Annotations are performed over four orthogonal aspects:

  • Prompt-Following: Consistency with textual instructions
  • Aesthetic: Visual attributes including color, exposure, and composition
  • Fidelity: Geometric and semantic correctness (e.g., anatomical accuracy, object count)
  • Harmlessness: Absence of NSFW, violent, hateful, or privacy-violating elements

Quantitative statistics for per-image scalar scores:

| Aspect | Mean | Std. Dev. | Distribution Shape |
| --- | --- | --- | --- |
| Prompt-Following | 3.1 | 1.4 | Almost uniform (1–5) |
| Aesthetic | 3.9 | 1.0 | Skewed high |
| Fidelity | 4.3 | 0.9 | Strongly peaked at 5 |
| Harmlessness | 4.7 | 0.6 | Strongly peaked at 5 |

Pairwise comparisons cover 1.2 million non-tied image pairs, approximately 0.3 million per aspect.

Diversity metrics indicate that polished prompts halve the frequency of style-specific tokens relative to raw DiffusionDB, and images are distributed comparably across the four generation models (23–27% each). Guidance scale significantly affects preference win rates, with higher guidance scales preferred more often.
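
As an illustration of how such an analysis can be run from the pairwise records, the sketch below estimates how often the higher-guidance image of each compared pair wins; it assumes the hypothetical field names from the schema example in Section 1.

```python
def higher_guidance_win_rate(records):
    """Fraction of non-tied comparisons won by the higher-guidance image.

    `records` is assumed to follow the hypothetical schema sketched in
    Section 1 (fields `images`, `guidance`, `pairwise`, `pref`).
    """
    wins, total = 0, 0
    for rec in records:
        guidance = [img["guidance"] for img in rec["images"]]
        for cmp in rec["pairwise"]:
            i, j = cmp["i"], cmp["j"]
            if guidance[i] == guidance[j]:
                continue
            higher = i if guidance[i] > guidance[j] else j
            wins += int(cmp["pref"] == higher)
            total += 1
    return wins / max(total, 1)

# Usage (assuming a JSONL file of such records):
#   import json
#   rate = higher_guidance_win_rate(json.loads(l) for l in open("visionprefer.jsonl"))
```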

4. Reward Model and Learning Objectives

The dataset is designed to enable construction of aspect-sensitive reward models and subsequent RL from AI feedback. The principal reward model, termed VP-Score, is based on the BLIP architecture (ViT-L image encoder, 12-layer text transformer). Its loss over a prompt $T$ and an image pair $(x_i, x_j)$, where $x_i$ is the preferred image, is

$$\mathcal{L}(\theta) = -\mathbb{E}_{(T, x_i, x_j) \sim D}\,\log \sigma\big(f_\theta(T, x_i) - f_\theta(T, x_j)\big)$$

where $\sigma$ denotes the logistic sigmoid and $f_\theta$ maps a (prompt, image) pair to a scalar preference score.

Regularization strategies include freezing 70% of transformer layers and initializing the MLP head according to $N(0, 1/(d+1))$. The training/validation/test split ratio is 80/10/10, with 1.2 million comparisons used. The final model achieves a mean preference prediction accuracy of 70.46% (comparable to HPS v2 at 71.32%).
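
A minimal PyTorch sketch of this objective is shown below, assuming a backbone (such as BLIP) that produces a joint prompt-image embedding; the embedding dimension, head shape, and the reading of $N(0, 1/(d+1))$ as a variance are assumptions, not the released training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL = 768  # assumed joint-embedding size from the (e.g. BLIP) backbone

class PreferenceHead(nn.Module):
    """Scalar head f_theta on top of a partially frozen backbone (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Linear(dim, 1)
        # N(0, 1/(d+1)) initialization, here treating 1/(d+1) as the variance.
        nn.init.normal_(self.mlp.weight, mean=0.0, std=(dim + 1) ** -0.5)
        nn.init.zeros_(self.mlp.bias)

    def forward(self, joint_embedding: torch.Tensor) -> torch.Tensor:
        return self.mlp(joint_embedding).squeeze(-1)  # f_theta(T, x)

head = PreferenceHead(D_MODEL)

def pairwise_loss(emb_preferred: torch.Tensor, emb_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigma(f(T, x_i) - f(T, x_j)) with x_i the preferred image;
    # softplus(-diff) is a numerically stable form of -log(sigmoid(diff)).
    diff = head(emb_preferred) - head(emb_rejected)
    return F.softplus(-diff).mean()

# Toy batch of joint embeddings standing in for real backbone features.
emb_i, emb_j = torch.randn(8, D_MODEL), torch.randn(8, D_MODEL)
loss = pairwise_loss(emb_i, emb_j)
loss.backward()
```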

5. Reinforcement Learning and Downstream Applications

VisionPrefer directly supports reinforcement learning with AI feedback:

  • PPO ("ReFL") and Direct Preference Optimization ("D3PO") are used to fine-tune diffusion model checkpoints (e.g., Stable Diffusion XL) on reward signals derived from VP-Score and other baselines.
  • During PPO, a two-step process alternates between reward modeling and policy updates over 20,000 DiffusionDB prompts and 10,000 ReFL prompts. Default settings include a learning rate of 1e-5 and a batch size of 64.
  • With DPO, optimization is performed directly over the preference tuples (a minimal sketch of the preference objective follows this list). Compared against models tuned on human-annotated preference datasets (ImageRewardDB, HPD v2, Pick-a-Pic), models fine-tuned with VisionPrefer achieve win rates exceeding 50% in head-to-head human-evaluation studies.
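
The preference objective used in the DPO-style setting can be sketched generically as follows; in the diffusion case (as in D3PO) the log-probability terms are accumulated over denoising trajectories, which this sketch omits, and beta is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Generic DPO objective over (winner, loser) preference pairs.

    In the diffusion setting the log-probabilities would be estimated over
    denoising steps; that estimation is omitted here.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return F.softplus(-margin).mean()  # equals -log sigmoid(margin)

# Toy example with made-up per-sample log-probabilities.
logp_w = torch.randn(16, requires_grad=True)
logp_l = torch.randn(16, requires_grad=True)
ref_w, ref_l = torch.randn(16), torch.randn(16)
dpo_loss(logp_w, logp_l, ref_w, ref_l).backward()
```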

Out-of-distribution testing on ReFL and HPD v2 prompts shows that models trained on VisionPrefer generalize robustly, particularly under novel prompt distributions and guidance-scale mixing (see Figures 3b and 6 in Wu et al., 2024).

6. Practical Recommendations and Observed Impact

Recommended practices emerging from VisionPrefer include using GPT-4 Vision for prompt polishing and aspect-specific annotation, and leveraging both scalar scores and textual rationales to train explainable, multi-task reward models. Reinforcement learning with VisionPrefer, via PPO or DPO with integrated harmlessness scoring, reduces the prevalence of NSFW generations to 4.4%, compared with 20–22% in ablation settings.

A plausible implication is that VisionPrefer, by leveraging multimodal LLMs as scalable and reliable annotators, demonstrates that synthetic, fine-grained preference data can substitute for or augment costly human annotation while matching or surpassing state-of-the-art benchmarks in downstream text-to-image alignment and reward modeling (Wu et al., 2024).
