DivGenBench: Measuring Generative Diversity

Updated 6 January 2026
  • DivGenBench is a benchmark that quantifies Preference Mode Collapse by evaluating diversity across identity, style, layout, and tonal attributes in generated images.
  • It employs mathematically rigorous metrics (IDS, ASC, SDI, PVS) on 3,200 curated prompts to measure diversity and detect output homogenization in diffusion models.
  • Empirical results show that methods like D²-Align effectively mitigate collapse, preserving diversity and offering actionable insights for improving RLHF alignment strategies.

DivGenBench is a benchmark specifically developed to quantify and diagnose Preference Mode Collapse (PMC) in text-to-image diffusion models aligned via reinforcement learning from human feedback (RLHF). PMC is characterized by generative models converging toward narrow, high-scoring outputs that lack diversity across key semantic and perceptual dimensions. DivGenBench operationalizes PMC by probing model responses to a large set of attribute-driven prompts, utilizing mathematically rigorous metrics for diversity across identity, style, layout, and tonal properties. The benchmark provides both qualitative and quantitative assessment, facilitating direct comparison between different alignment frameworks and explicitly measuring the extent to which generative breadth is preserved or compromised (Chen et al., 30 Dec 2025).

1. Motivation and Phenomenological Background

PMC arises in diffusion-RL settings where reward optimization inadvertently drives models toward homogeneous output distributions, often manifesting as visually indistinguishable faces, overexposed artistic renderings, frozen spatial layouts, or failure to comply with tonal instructions. Prevailing benchmarks have primarily measured fidelity and default-mode variance, without targeted diagnosis of diversity collapse. DivGenBench was introduced to address this deficiency by systematically measuring a model's capacity to generate broad, non-collapsed outputs in response to explicit and diverse instructions. The approach is grounded in the hypothesis that PMC results from over-optimization along axes aligned with reward model biases, inducing convergence onto specific high-reward modes and degrading output diversity even when prompts vary.

2. Benchmark Design and Prompt Taxonomy

DivGenBench comprises 3,200 prompts, partitioned equally into four diversity dimensions: identity (ID), artistic style (Style), spatial layout (Layout), and photographic tone (Tonal). Each dimension leverages expert-curated data templates and taxonomies:

  • Identity: Templates from FairFace, varied over 3 age groups, 6 ethnicities, 2 genders, and 40 facial features (CelebA). Prompts direct the model to produce photorealistic portraits with specified attributes.
  • Style: Prompts specifying artistic style and depicted object, spanning 27 classical styles (WikiArt) and base object classes (Parti).
  • Layout: Arrangements of 2–5 objects (COCO class taxonomy) on controlled backgrounds, designed to stress spatial dispersion.
  • Tonal: Image generation with prescribed saturation, contrast, and brightness levels, employing 18 descriptors to probe compliance and variance in photographic tone.

The selection rationale ensures each prompt dimension isolates a distinct axis of generative diversity, minimizing confounding factors and facilitating precise scoring.
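
As a toy illustration of the template-driven expansion behind these dimensions, the sketch below crosses hypothetical identity attributes through a portrait template; the wording and attribute lists are illustrative placeholders rather than the benchmark's actual FairFace/CelebA taxonomy files:

```python
# Illustrative attribute-grid expansion for the Identity dimension.
# Attribute lists and template wording are hypothetical placeholders.
from itertools import product

AGE_GROUPS = ["young", "middle-aged", "elderly"]          # 3 age groups
ETHNICITIES = ["East Asian", "White", "Black",
               "Indian", "Latino", "Middle Eastern"]      # 6 ethnicities
GENDERS = ["man", "woman"]                                # 2 genders

TEMPLATE = "photorealistic portrait, {age} {ethnicity} {gender}"

def identity_prompts():
    """Expand the attribute grid into concrete prompts."""
    for age, eth, gen in product(AGE_GROUPS, ETHNICITIES, GENDERS):
        yield TEMPLATE.format(age=age, ethnicity=eth, gender=gen)

prompts = list(identity_prompts())
print(len(prompts))  # 36 base prompts before facial-feature variants
```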

3. Quantitative Metrics and Mathematical Formulation

Four metrics are introduced to evaluate output diversity:

  • Identity Divergence Score (IDS): ArcFace embeddings in $\mathbb{R}^{512}$ are employed to measure pairwise cosine similarity of generated faces. Lower IDS indicates greater identity variance (less collapse).

$$\mathrm{IDS} = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \frac{v_i \cdot v_j}{\|v_i\|\;\|v_j\|}$$

with $N = M \times P$ total faces.
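
A minimal NumPy sketch of this computation, assuming the ArcFace embeddings have already been extracted into an $(N, 512)$ array (the extraction step is omitted):

```python
import numpy as np

def identity_divergence_score(embeddings: np.ndarray) -> float:
    """IDS: mean cosine similarity over all unordered pairs (lower = more diverse)."""
    v = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = v @ v.T                     # (N, N) cosine similarity matrix
    iu = np.triu_indices(len(v), k=1) # strictly upper triangle: pairs with i < j
    # Averaging over the N(N-1)/2 pairs equals the 2/(N(N-1)) * sum formula above.
    return float(sim[iu].mean())
```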

  • Artistic Style Coverage (ASC): Quantifies the retrievability of real-world artistic styles from generated samples, using CSD features and a retrieval/test paradigm over WikiArt data. The metric normalizes coverage relative to a reference set:

$$\mathrm{ASC} = \frac{\mathrm{IRS}_\infty(\mathcal{X}_{\mathrm{synth}})}{\mathrm{IRS}_\infty(\mathcal{X}_{\mathrm{test}})}$$

where higher ASC denotes greater style breadth.
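
The precise $\mathrm{IRS}_\infty$ statistic is defined in the paper; as a rough sketch under a simplified reading, one can count a style class as retrieved when some query image's nearest reference neighbor in CSD feature space belongs to that class, and take ASC as the resulting coverage ratio. The helper names below are hypothetical, and CSD feature extraction is assumed done upstream:

```python
import numpy as np

def retrieved_styles(query_feats, ref_feats, ref_labels):
    """Style classes hit by at least one nearest-neighbor retrieval."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    nn = (q @ r.T).argmax(axis=1)  # nearest reference image per query
    return set(ref_labels[nn])

def artistic_style_coverage(synth_feats, test_feats, ref_feats, ref_labels):
    """ASC as coverage of synthetic queries normalized by real test queries."""
    synth = retrieved_styles(synth_feats, ref_feats, ref_labels)
    test = retrieved_styles(test_feats, ref_feats, ref_labels)
    return len(synth) / max(len(test), 1)
```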

  • Spatial Dispersion Index (SDI): Measures layout variance via bounding-box overlap, with objects detected by GroundingDINO and matched across images by optimal assignment (Hungarian algorithm). High SDI signals greater object-configuration diversity:

$$\mathrm{SDI} = \frac{1}{P} \sum_{r=1}^{P} \bigl(1 - \overline{\mathrm{Sim}}^{(r)}\bigr)$$

where $\overline{\mathrm{Sim}}^{(r)}$ is the average pairwise layout similarity among the images generated for prompt $r$.
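
One way to realize the per-prompt similarity term is sketched below, under the assumption that two layouts are compared by Hungarian-matching their detected boxes on IoU and averaging the matched overlaps; the GroundingDINO detection step is omitted:

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def layout_similarity(boxes_a, boxes_b):
    """Hungarian-matched mean IoU between two nonempty detected layouts."""
    cost = np.array([[-iou(a, b) for b in boxes_b] for a in boxes_a])
    rows, cols = linear_sum_assignment(cost)  # maximizes total matched IoU
    return float(-cost[rows, cols].mean())

def sdi_term_for_prompt(layouts):
    """1 - mean pairwise layout similarity over the M images of one prompt."""
    sims = [layout_similarity(a, b) for a, b in combinations(layouts, 2)]
    return 1.0 - float(np.mean(sims))
```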

  • Photographic Variance Score (PVS): Aggregates standard deviations of saturation, brightness, and contrast across samples in HSV and grayscale space:

$$\mathrm{PVS} = \mathrm{std}(\mathbf{s}) + \mathrm{std}(\mathbf{v}) + \mathrm{std}(\mathbf{c})$$

Large PVS values correspond to greater tonal diversity.
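
A short OpenCV/NumPy sketch, assuming each image's saturation and brightness are summarized by HSV channel means and its contrast by the grayscale standard deviation; these per-image summaries are an assumption about the aggregation:

```python
import cv2
import numpy as np

def photographic_variance_score(images):
    """images: list of uint8 BGR arrays. Returns std(s) + std(v) + std(c)."""
    sat, val, con = [], [], []
    for img in images:
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        sat.append(hsv[..., 1].mean() / 255.0)  # mean saturation per image
        val.append(hsv[..., 2].mean() / 255.0)  # mean brightness (value)
        con.append(gray.std() / 255.0)          # contrast as grayscale spread
    # Standard deviations across the sample set, summed per the PVS formula.
    return float(np.std(sat) + np.std(val) + np.std(con))
```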

All metrics are aggregated over $M = 4$ samples per prompt and $P = 800$ prompts per dimension, yielding the DivGenBench score vector:

$$DB(\mathcal{G}) = [M_{\mathrm{ID}}, M_{\mathrm{Style}}, M_{\mathrm{Layout}}, M_{\mathrm{Tonal}}]$$

4. Evaluation Protocol and Data Extraction

For each alignment method, the protocol generates images at $720 \times 720$ resolution (25 sampling steps, guidance = 1.0), extracts features using pretrained inference systems (ArcFace for IDS, CSD for ASC, GroundingDINO for SDI, HSV conversion for PVS), and computes the metrics over the batch of generated outputs. The ASC metric uses a reference WikiArt corpus ($N_{\mathrm{train}} \approx 50{,}000$); SDI is evaluated over $M = 4$ images per prompt, and PVS over all $N = 3{,}200$ images of the tonal dimension.
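
For concreteness, a generation call with these settings might look as follows in Hugging Face diffusers; the FLUX.1-dev checkpoint and pipeline class are illustrative assumptions rather than the paper's confirmed setup:

```python
import torch
from diffusers import FluxPipeline

# Load an open FLUX checkpoint (illustrative choice of base model).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Benchmark settings: 720x720, 25 sampling steps, guidance 1.0, M = 4 samples.
images = pipe(
    prompt="photorealistic portrait, elderly East Asian woman",
    height=720,
    width=720,
    num_inference_steps=25,
    guidance_scale=1.0,
    num_images_per_prompt=4,
).images
```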

5. Empirical Findings and Model Comparisons

Qualitative results demonstrate that existing RLHF alignment baselines (DanceGRPO, Flow-GRPO, SRPO) exhibit pronounced PMC, evidenced by homogeneous facial outputs, style reversion to reward hacks (e.g., default overexposed aesthetics), static object arrangements, and tonal monotony. Directional Decoupling Alignment (D²-Align), which applies a learned directional correction in the reward embedding space, effectively mitigates PMC, preserving diversity without sacrificing alignment fidelity.

Quantitative findings (main paper Tab. 3, model setting: HPS-v2.1) indicate that D²-Align achieves superior diversity metrics:

| Method    | IDS (↓) | ASC (↑) | SDI (↑) | PVS (↑) |
|-----------|---------|---------|---------|---------|
| FLUX      | 0.280   | 0.179   | 0.563   | 0.408   |
| DanceGRPO | 0.348   | 0.130   | ...     | ...     |
| Flow-GRPO | 0.391   | 0.044   | ...     | ...     |
| SRPO      | 0.259   | 0.234   | 0.580   | 0.352   |
| D²-Align  | 0.251   | 0.253   | 0.636   | 0.412   |

D²-Align achieves an 8.2% increase in ASC, 9.7% in SDI, and 17.0% in PVS over SRPO, while reducing IDS by 3.1%. Compared to unaligned FLUX, the ASC, SDI, and PVS improvements are 41%, 13%, and 1% respectively, with IDS likewise reduced.

Human preference studies corroborate these results: diversity in the Identity, Style, Layout, and Tonal dimensions favored D²-Align (e.g., 35.2% vs. FLUX's 26.7% for Identity).

6. Significance and Implications

DivGenBench provides a comprehensive, multi-dimensional suite for rigorous evaluation of generative diversity, moving beyond standard fidelity metrics. Its explicit focus on diversity axes makes it a distinctive diagnostic tool for PMC, enabling precise attribution of collapse phenomena and comparative analysis of RLHF alignment strategies. The ability of D2^2-Align to restore diversity while maintaining reward-alignment represents a key advance in the mitigation of reward hacking in diffusion models. A plausible implication is wider adoption of targeted diversity benchmarks in generative modeling, influencing future methodological developments for reliable RLHF alignment (Chen et al., 30 Dec 2025).
