
HPSv3: Human Preference Score v3

Updated 10 August 2025
  • HPSv3 is a metric that quantifies the alignment of generative outputs with human preferences through large-scale, diverse annotated datasets.
  • It employs an uncertainty-aware vision-language model that maps image-text pairs to probabilistic scores, reducing noise in preference judgments.
  • The Chain-of-Human-Preference mechanism iteratively refines outputs, guiding model selection and improving visual quality.

Human Preference Score v3 (HPSv3) is a metric and model for assessing and optimizing the alignment of generative outputs—primarily in text-to-image synthesis—with human preferences, using a scalable, uncertainty-aware architecture and large, diverse human-annotated datasets. The framework aims to overcome prior limitations of human-centric evaluation: restricted data spectrum, limited feature representations, and simplistic loss functions, thereby facilitating reliable automatic evaluation and iterative refinement through automated, human-aligned selection.

1. Wide-Spectrum Human Preference Dataset: HPDv3

HPDv3 is the basis for HPSv3, establishing coverage across diverse image qualities, prompt types, and generative models. It consists of:

  • 1.08 million text-image pairs.
  • 1.17 million pairwise human preference annotations.

Image sources include outputs from state-of-the-art autoregressive, diffusion, and DiT-based generative models, real photographs (serving as a quality upper bound), and crowdsourced images from platforms such as Midjourney. Each prompt typically has 9–19 separate annotators performing comparative judgments, which ensures high agreement reliability and robust ground truth. This dataset enables training and validation of preference models that are representative of actual user distributions over a spectrum of quality and prompt content.
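
The sketch below shows one plausible way such a pairwise annotation record could be represented and collapsed into a single preference label; the field names and vote encoding are hypothetical, not the released HPDv3 schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    image_a: str            # path or URL of candidate A
    image_b: str            # path or URL of candidate B
    votes: list[str]        # one choice per annotator, e.g. ["a", "b", "a", ...]

def majority_label(pair: PreferencePair) -> str:
    """Collapse the 9-19 per-pair annotator judgments into one preference label."""
    counts = Counter(pair.votes)
    return max(counts, key=counts.get)   # "a" or "b"
```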

2. Vision-Language Preference Model and Uncertainty-Aware Ranking

The preference model underpinning HPSv3 utilizes a vision-language model (VLM) backbone (e.g., Qwen2-VL) to obtain joint high-dimensional representations of image–text pairs. These representations are mapped via multilayer perceptron (MLP) layers to a probabilistic score:

  • For each image $x$ and prompt $c$, the feature representation $E_{\theta}(x, c)$ is mapped by $f_{\phi}$ to two values $(\mu, \sigma)$, the mean and standard deviation of a 1D Gaussian: $r \sim \mathcal{N}(\mu, \sigma)$.
  • Given two images with predicted Gaussians $\mathcal{N}(\mu_1, \sigma_1)$ and $\mathcal{N}(\mu_2, \sigma_2)$, the probability of preference is computed as:

$$P(x_1 \succ x_2 \mid c) = \iint \mathrm{sigmoid}(r_1 - r_2)\, \mathcal{N}(r_1 \mid \mu_1, \sigma_1)\, \mathcal{N}(r_2 \mid \mu_2, \sigma_2)\, dr_1\, dr_2$$

The model is trained by minimizing the negative log-likelihood (or logistic ranking loss):

$$L = -\log P(x_h \succ x_l \mid c)$$

where $x_h$ and $x_l$ are the human-preferred and less-preferred images, respectively. This probabilistic approach hedges against annotation noise and ambiguous judgments, producing reliable fine-grained rankings.
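
The double integral above has no simple closed form, but because $r_1 - r_2$ is itself Gaussian it can be estimated cheaply by Monte Carlo. The following sketch (assuming PyTorch; function names are illustrative, not the authors' implementation) shows one way to compute the preference probability and the corresponding negative log-likelihood loss.

```python
import torch

def preference_probability(mu1, sigma1, mu2, sigma2, n_samples=1024):
    """Monte Carlo estimate of P(x1 > x2 | c) = E[sigmoid(r1 - r2)],
    with r1 ~ N(mu1, sigma1) and r2 ~ N(mu2, sigma2)."""
    # r1 - r2 is Gaussian with mean mu1 - mu2 and std sqrt(sigma1^2 + sigma2^2),
    # so a single set of samples of the difference suffices.
    diff_mu = mu1 - mu2
    diff_sigma = torch.sqrt(sigma1 ** 2 + sigma2 ** 2)
    eps = torch.randn(n_samples, *diff_mu.shape)
    samples = diff_mu + diff_sigma * eps        # samples of r1 - r2
    return torch.sigmoid(samples).mean(dim=0)   # E[sigmoid(r1 - r2)]

def ranking_nll(mu_h, sigma_h, mu_l, sigma_l):
    """L = -log P(x_h > x_l | c) for a human-preferred / less-preferred pair."""
    p = preference_probability(mu_h, sigma_h, mu_l, sigma_l)
    return -torch.log(p.clamp_min(1e-8)).mean()

# Example: a confident win (large mean gap, small sigmas) yields a small loss.
loss = ranking_nll(torch.tensor(2.0), torch.tensor(0.3),
                   torch.tensor(0.5), torch.tensor(0.3))
```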

3. Chain-of-Human-Preference (CoHP)

CoHP is an iterative selection and refinement mechanism that leverages HPSv3 as a reward model to improve image generation in two stages:

a. Model-wise Preference Selection:

Among various candidate generative models for a prompt, HPSv3 is used to score outputs. The “golden model” $m^*$ is selected as:

$$m^* = \arg\max_{i} \left( \frac{1}{N} \sum_j r_{i,j} \right)$$

where $r_{i,j}$ is the preference score of the $j$-th image generated by model $m_i$, and $N$ is the number of candidates per model.

b. Sample-wise Iterative Refinement:

With the golden model identified, candidate images are generated in successive rounds. At round $k$, the highest-scoring image is:

$$I_k^* = \arg\max_n r_{n,k}$$

This image may be used as conditioning for the next round, and refinement proceeds until a final output is chosen. CoHP thus enables step-wise improvement guided by the human-aligned HPSv3 reward.
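
The following sketch illustrates the two CoHP stages as a plain selection loop; `generate_fn` and `score_fn` are hypothetical placeholders for a generative model call and the HPSv3 scorer, and the candidate and round counts are arbitrary.

```python
def cohp(prompt, models, generate_fn, score_fn, n_candidates=4, n_rounds=3):
    """Chain-of-Human-Preference as a plain selection loop.
    generate_fn(model, prompt, cond) -> list of images   (placeholder)
    score_fn(image, prompt)          -> float HPSv3 score (placeholder)
    """
    # a. Model-wise preference selection: average score over candidates per model.
    def mean_score(model):
        images = generate_fn(model, prompt, None)[:n_candidates]
        return sum(score_fn(img, prompt) for img in images) / len(images)

    golden_model = max(models, key=mean_score)

    # b. Sample-wise iterative refinement with the golden model: the best image
    #    of each round conditions the next round's generation.
    best = None
    for _ in range(n_rounds):
        candidates = generate_fn(golden_model, prompt, best)[:n_candidates]
        best = max(candidates, key=lambda img: score_fn(img, prompt))
    return best
```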

4. Experimental Evaluation and Human Alignment

HPSv3 demonstrates superior alignment with human preference relative to prior automated metrics (e.g., CLIP, ImageReward, HPSv2):

  • Spearman’s $r$ reaches 0.94; Kendall’s $\tau$ exceeds 0.82.
  • Pairwise accuracy improvements are observed on the PickScore, HPDv2, and HPDv3 benchmarks.
  • In reinforcement learning from human feedback (RLHF) setups (see algorithms such as DanceGRPO), using HPSv3 as the reward signal results in better visual fidelity: improved color saturation, lighting, and reduction of reward hacking artifacts.

This underscores HPSv3’s reliability as both a metric and reward function for generative image evaluation and improvement.
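
For illustration, rank correlations of this kind can be computed with SciPy; the scores below are made-up values, not results from the paper.

```python
from scipy.stats import spearmanr, kendalltau

# Made-up human preference scores and metric scores for five images of one prompt.
human_scores  = [0.91, 0.34, 0.75, 0.12, 0.58]
metric_scores = [8.4,  2.1,  6.9,  1.3,  5.0]

rho, _ = spearmanr(human_scores, metric_scores)
tau, _ = kendalltau(human_scores, metric_scores)
print(f"Spearman's r = {rho:.2f}, Kendall's tau = {tau:.2f}")
```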

5. Technical Architecture and Mathematical Formulation

The scoring function is operationalized as:

  • For each pair $(x_1, x_2)$ under prompt $c$:

$$\begin{aligned} (\mu_1, \sigma_1) &= f_{\phi}(E_{\theta}(x_1, c)), \quad (\mu_2, \sigma_2) = f_{\phi}(E_{\theta}(x_2, c)), \\ r_1 &\sim \mathcal{N}(\mu_1, \sigma_1), \quad r_2 \sim \mathcal{N}(\mu_2, \sigma_2) \end{aligned}$$

  • Pairwise preference probability:

$$P(x_1 \succ x_2 \mid c) = \iint \mathrm{sigmoid}(r_1 - r_2)\, \mathcal{N}(r_1 \mid \mu_1, \sigma_1)\, \mathcal{N}(r_2 \mid \mu_2, \sigma_2)\, dr_1\, dr_2$$

  • The learning objective (loss function):

$$L = \log\left(1 + \exp(r_l - r_h)\right)$$

which is equivalent to minimizing the KL divergence between the empirical preference distribution and the model's predicted preference distribution.
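
A compact sketch of how such a scoring head and ranking loss might look in PyTorch follows; the hidden dimensions are illustrative, and applying the loss to the mean scores is a simplification of the uncertainty-aware objective described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyScoreHead(nn.Module):
    """Sketch of an MLP head f_phi mapping VLM features E_theta(x, c)
    to (mu, sigma) of a 1D Gaussian score. Dimensions are illustrative."""
    def __init__(self, feat_dim=3584, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, features):
        mu, log_sigma = self.net(features).unbind(-1)
        return mu, log_sigma.exp()   # keep sigma positive

def logistic_ranking_loss(mu_h, mu_l):
    """L = log(1 + exp(r_l - r_h)), here applied to the mean scores."""
    return F.softplus(mu_l - mu_h).mean()
```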

6. Dataset and Code Availability

All resources are released under the CC BY-NC-SA license at https://mizzenai.github.io/HPSv3.project/. This supports further research and non-commercial development in human-centric evaluation and model alignment.

7. Contextual Significance and Future Prospects

HPSv3 advances automated evaluation in generative modeling by integrating wide-spectrum annotated datasets and an uncertainty-aware ranking paradigm. It explicitly addresses previous shortcomings: narrow data coverage, basic feature extractors, and inefficiencies in ranking loss. Its probabilistic, model-agnostic design makes it adaptable to new domains and evolving generative architectures.

A plausible implication is that techniques integrating uncertainty-aware ranking and large diverse human-annotated datasets, as embodied by HPSv3, will become a foundational standard for automatic preference assessment. In addition, iterative refinement mechanisms such as CoHP can enable continual improvement through scalable, human-aligned selection without requiring new annotation rounds.

This positions HPSv3 as a benchmark for future developments in human preference modeling for generative systems (Ma et al., 5 Aug 2025).
