
VisionPrefer: Multimodal Preference Dataset

Updated 15 December 2025
  • VisionPrefer is a large-scale, fine-grained multimodal dataset providing 716,000 prompt-image pairs with detailed scores and rationales.
  • It supports robust reward modeling and policy optimization using RL techniques such as PPO and DPO with aspect-specific evaluations.
  • The dataset leverages AI annotators like GPT-4 Vision to ensure reliable, human-aligned image synthesis and explainable preference analysis.

VisionPrefer is a large-scale, fine-grained multimodal preference dataset developed to enable robust RL from AI Feedback (RLAIF) and instruction tuning for text-to-image generative models. Leveraging AI annotators, primarily GPT-4 Vision, VisionPrefer offers comprehensive preference judgments across several key image quality dimensions, facilitating both supervised and reinforcement learning for controllable, human-aligned image synthesis (Wu et al., 2024).

1. Dataset Scope and Structure

VisionPrefer contains 179,000 unique polished text prompts sourced from DiffusionDB. For every prompt, four images are generated using four state-of-the-art diffusion models (Stable Diffusion v1.5, Stable Diffusion 2.1, Dreamlike Photoreal 2.0, and Stable Diffusion XL), resulting in 716,000 prompt-image pairs. Each image is annotated with four scalar quality scores on a 1–5 scale, one for each of the following aspects: prompt-following, aesthetic, fidelity, and harmlessness.

In addition, a concise textual rationale (2–3 sentences) is recorded for each image and aspect. Pairwise comparisons are then inferred between all possible image pairs (six pairs per prompt, for each aspect), yielding 4.3 million initial pairwise preferences, of which 1.2 million are retained after filtering out ties.
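
To make the derivation concrete, the following sketch shows how per-aspect pairwise preferences can be computed from the four scalar scores of a single prompt, dropping ties; the scores and helper are illustrative, not part of any released tooling.

```python
from itertools import combinations

# Illustrative per-aspect 1-5 scores for the four images of one prompt.
aspect_scores = {
    "prompt_following": [4, 2, 5, 4],
    "aesthetic":        [3, 3, 5, 2],
}

pairwise = []
for aspect, scores in aspect_scores.items():
    # All C(4, 2) = 6 unordered image pairs per prompt and aspect.
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue  # tied pairs are filtered out (~4.3M -> ~1.2M overall)
        pref = i if scores[i] > scores[j] else j
        pairwise.append({"i": i, "j": j, "aspect": aspect, "pref": pref})

print(pairwise)  # the non-tied comparisons for this prompt
```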

Dataset entries conform to a structured JSONL schema:

| Field | Content | Notes |
| --- | --- | --- |
| prompt_id | Unique integer ID | |
| prompt | Polished text (see §2) | |
| images | List of 4 dicts: {image_id, model, guidance, path} | One entry per image of the prompt |
| scores | Dict of 4 lists of int (each of length 4) | 1–5 rating per image, per aspect |
| rationales | Dict of 4 lists of str (each of length 4) | Per-image, per-aspect rationales |
| pairwise | List of dicts {"i", "j", "aspect", "pref"} | Which image is preferred for a given aspect |

This schema enables multi-aspect evaluation and facilitates downstream tasks including reward modeling, policy optimization, and explainable preference analysis.
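
A hypothetical record following this schema might look as follows; every identifier, value, and key spelling below is made up for illustration, and the released files may use different field names.

```python
import json

# Hypothetical VisionPrefer-style record; all values are illustrative.
record = {
    "prompt_id": 12345,
    "prompt": "a red fox resting in a snowy birch forest at dawn",
    "images": [
        {"image_id": "12345_0", "model": "sd-v1-5", "guidance": 7.5, "path": "images/12345_0.png"},
        {"image_id": "12345_1", "model": "sd-2-1", "guidance": 4.0, "path": "images/12345_1.png"},
        {"image_id": "12345_2", "model": "dreamlike-photoreal-2.0", "guidance": 11.0, "path": "images/12345_2.png"},
        {"image_id": "12345_3", "model": "sdxl", "guidance": 6.0, "path": "images/12345_3.png"},
    ],
    # One list of four per-image scores (1-5) for each of the four aspects.
    "scores": {
        "prompt_following": [4, 2, 5, 4],
        "aesthetic": [3, 4, 5, 3],
        "fidelity": [5, 4, 5, 4],
        "harmlessness": [5, 5, 5, 5],
    },
    # 2-3 sentence rationales per image and aspect (abbreviated here).
    "rationales": {
        "prompt_following": ["Fox, snow, and dawn light are all present.", "...", "...", "..."],
        "aesthetic": ["...", "...", "...", "..."],
        "fidelity": ["...", "...", "...", "..."],
        "harmlessness": ["...", "...", "...", "..."],
    },
    # Non-tied comparisons derived from the scores above.
    "pairwise": [
        {"i": 0, "j": 1, "aspect": "prompt_following", "pref": 0},
        # ... remaining comparisons
    ],
}

print(json.dumps(record))  # one line of the JSONL file
```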

2. Data Collection and Annotation Pipeline

VisionPrefer's construction comprises several sequential steps:

  • Prompt De-biasing: Raw prompts from DiffusionDB are first "polished" via a GPT-4 text-editing pipeline that removes platform/artist tags and resolution/style modifiers, resolves conflicting instructions, and normalizes each prompt to a single concise instruction. NSFW content is screened with Detoxify, and high-risk prompts are excluded.
  • Image Generation: Four diverse diffusion models are applied to each prompt, with guidance scales sampled randomly in the 3–12 range to maximize visual diversity and coverage of model-specific behavior (see the generation sketch after this list).
  • AI-based Preference Annotation: For every prompt, GPT-4 Vision, using detailed per-aspect prompt templates, assigns 1–5 scalar scores and concise textual rationales to each image. Pairwise preferences are computed automatically from the scalar scores, producing six comparisons per prompt for each aspect.
  • Human Validation: VisionPrefer's annotation reliability is assessed against HPD and ImageRewardDB benchmarks. GPT-4 Vision achieves pairwise agreement rates of 68–72%, comparable to human expert raters (65–78%). Ablation studies confirm GPT-4 Vision's superiority over Gemini Pro Vision and LLaVA 1.6 34B (accuracy 55–65%).
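
The generation step referenced above can be sketched with the Hugging Face diffusers library as follows; the checkpoint ID, step count, and sampler settings are assumptions rather than the authors' exact configuration, and the other three models are handled analogously.

```python
import random
import torch
from diffusers import StableDiffusionPipeline

# One of the four generators; checkpoint and settings are assumed, not official.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate(prompt: str) -> dict:
    # Guidance scale sampled at random in the 3-12 range to diversify outputs.
    guidance = random.uniform(3.0, 12.0)
    image = pipe(prompt, guidance_scale=guidance, num_inference_steps=30).images[0]
    return {"model": "sd-v1-5", "guidance": guidance, "image": image}

sample = generate("a red fox resting in a snowy birch forest at dawn")
sample["image"].save(f"fox_sd15_g{sample['guidance']:.1f}.png")
```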

This pipeline ensures the resulting dataset captures subtle aspect-level preferences at scale with a degree of reliability directly comparable to human-annotated corpora.

3. Annotation Aspects and Statistical Properties

Annotations are performed over four orthogonal aspects:

  • Prompt-Following: Consistency with textual instructions
  • Aesthetic: Visual attributes including color, exposure, and composition
  • Fidelity: Geometric and semantic correctness (e.g., anatomical accuracy, object count)
  • Harmlessness: Absence of NSFW, violent, hateful, or privacy-violating elements

Quantitative statistics for per-image scalar scores:

| Aspect | Mean | Std. Dev. | Distribution Shape |
| --- | --- | --- | --- |
| Prompt-Following | 3.1 | 1.4 | Almost uniform (1–5) |
| Aesthetic | 3.9 | 1.0 | Skewed high |
| Fidelity | 4.3 | 0.9 | Strongly peaked at 5 |
| Harmlessness | 4.7 | 0.6 | Strongly peaked at 5 |

Pairwise comparisons cover 1.2 million non-tied image pairs, approximately 0.3 million per aspect.

Diversity metrics indicate that polished prompts halve the frequency of style-specific tokens relative to raw DiffusionDB, and images are distributed comparably across the four generation models (23–27% each). Guidance scale significantly affects preference win rates, with higher guidance scales preferred more often.
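
As an illustration of how such an analysis can be run from the pairwise records, the sketch below estimates how often the higher-guidance image of each compared pair wins; it assumes the hypothetical field names from the schema example in Section 1.

```python
def higher_guidance_win_rate(records):
    """Fraction of non-tied comparisons won by the higher-guidance image.

    `records` is assumed to follow the hypothetical schema sketched in
    Section 1 (fields `images`, `guidance`, `pairwise`, `pref`).
    """
    wins, total = 0, 0
    for rec in records:
        guidance = [img["guidance"] for img in rec["images"]]
        for cmp in rec["pairwise"]:
            i, j = cmp["i"], cmp["j"]
            if guidance[i] == guidance[j]:
                continue
            higher = i if guidance[i] > guidance[j] else j
            wins += int(cmp["pref"] == higher)
            total += 1
    return wins / max(total, 1)

# Usage (assuming a JSONL file of such records):
#   import json
#   rate = higher_guidance_win_rate(json.loads(l) for l in open("visionprefer.jsonl"))
```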

4. Reward Model and Learning Objectives

The dataset is designed to enable construction of aspect-sensitive reward models and subsequent RL from AI feedback. The principal reward model, termed VP-Score, is based on the BLIP architecture (ViT-L image encoder, 12-layer text transformer). Its loss over a prompt $T$ and an image pair $(x_i, x_j)$, where $x_i$ is the preferred image, is

$$\mathcal{L}(\theta) = -\mathbb{E}_{(T, x_i, x_j) \sim D}\,\log \sigma\big(f_\theta(T, x_i) - f_\theta(T, x_j)\big)$$

where $\sigma$ denotes the logistic sigmoid and $f_\theta$ maps a (prompt, image) pair to a scalar preference score.

Regularization strategies include freezing 70% of transformer layers and initializing the MLP head according to $N(0, 1/(d+1))$. The training/validation/test split ratio is 80/10/10, with 1.2 million comparisons used. The final model achieves a mean preference prediction accuracy of 70.46% (comparable to HPS v2 at 71.32%).
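
A minimal PyTorch sketch of this objective is shown below, assuming a backbone (such as BLIP) that produces a joint prompt-image embedding; the embedding dimension, head shape, and the reading of $N(0, 1/(d+1))$ as a variance are assumptions, not the released training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL = 768  # assumed joint-embedding size from the (e.g. BLIP) backbone

class PreferenceHead(nn.Module):
    """Scalar head f_theta on top of a partially frozen backbone (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Linear(dim, 1)
        # N(0, 1/(d+1)) initialization, here treating 1/(d+1) as the variance.
        nn.init.normal_(self.mlp.weight, mean=0.0, std=(dim + 1) ** -0.5)
        nn.init.zeros_(self.mlp.bias)

    def forward(self, joint_embedding: torch.Tensor) -> torch.Tensor:
        return self.mlp(joint_embedding).squeeze(-1)  # f_theta(T, x)

head = PreferenceHead(D_MODEL)

def pairwise_loss(emb_preferred: torch.Tensor, emb_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigma(f(T, x_i) - f(T, x_j)) with x_i the preferred image;
    # softplus(-diff) is a numerically stable form of -log(sigmoid(diff)).
    diff = head(emb_preferred) - head(emb_rejected)
    return F.softplus(-diff).mean()

# Toy batch of joint embeddings standing in for real backbone features.
emb_i, emb_j = torch.randn(8, D_MODEL), torch.randn(8, D_MODEL)
loss = pairwise_loss(emb_i, emb_j)
loss.backward()
```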

5. Reinforcement Learning and Downstream Applications

VisionPrefer directly supports reinforcement learning with AI feedback:

  • PPO ("ReFL") and Direct Preference Optimization ("D3PO") are used to fine-tune diffusion model checkpoints (e.g., Stable Diffusion XL) on reward signals derived from VP-Score and other baselines.
  • During PPO, a two-step process alternates between reward modeling and policy updates over 20,000 DiffusionDB prompts and 10,000 ReFL prompts. Default settings include a learning rate of 1e-5 and a batch size of 64.
  • With DPO, optimization is performed directly over the preference tuples (a minimal sketch of the preference objective follows this list). Compared against models tuned on human-annotated preference datasets (ImageRewardDB, HPD v2, Pick-a-Pic), models fine-tuned with VisionPrefer achieve win rates exceeding 50% in head-to-head human-evaluation studies.
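
The preference objective used in the DPO-style setting can be sketched generically as follows; in the diffusion case (as in D3PO) the log-probability terms are accumulated over denoising trajectories, which this sketch omits, and beta is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Generic DPO objective over (winner, loser) preference pairs.

    In the diffusion setting the log-probabilities would be estimated over
    denoising steps; that estimation is omitted here.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return F.softplus(-margin).mean()  # equals -log sigmoid(margin)

# Toy example with made-up per-sample log-probabilities.
logp_w = torch.randn(16, requires_grad=True)
logp_l = torch.randn(16, requires_grad=True)
ref_w, ref_l = torch.randn(16), torch.randn(16)
dpo_loss(logp_w, logp_l, ref_w, ref_l).backward()
```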

Out-of-distribution testing on ReFL and HPD v2 prompts shows that models trained on VisionPrefer generalize robustly, particularly under novel prompt distributions and guidance-scale mixing (see Figures 3b and 6 in Wu et al., 2024).

6. Practical Recommendations and Observed Impact

Recommended practices emerging from VisionPrefer include using GPT-4 Vision for prompt polishing and aspect-specific annotation, and leveraging both scalar scores and textual rationales to train explainable, multi-task reward models. Reinforcement learning with VisionPrefer, via PPO or DPO with integrated harmlessness scoring, reduces the prevalence of NSFW generations to 4.4%, compared with 20–22% in ablation settings.

A plausible implication is that VisionPrefer, by leveraging multimodal LLMs as scalable and reliable annotators, demonstrates that synthetic, fine-grained preference data can substitute for or augment costly human annotation while matching or surpassing state-of-the-art benchmarks in downstream text-to-image alignment and reward modeling (Wu et al., 2024).
