
Preference Mode Collapse in Generative Models

Updated 6 January 2026
  • Preference Mode Collapse is characterized by a model’s output concentrating on a narrow set of modes that maximize reward, sacrificing diversity.
  • It is quantified using metrics such as support-size collapse, entropy drop, and spatial dispersion, highlighting its impact in both diffusion models and LLMs.
  • Mitigation strategies like D²-Align for diffusion models and Verbalized Sampling for LLMs show promise in restoring diversity while maintaining alignment quality.

Preference Mode Collapse (PMC) refers to a systemic failure mode in preference-driven training or reinforcement learning from human feedback (RLHF), in which an aligned generative model—such as a diffusion model or an LLM—collapses its output distribution onto a narrow, reward-exploiting subset of modes. This phenomenon severely degrades the diversity of generated samples, producing outputs that optimize alignment metrics but fail to capture the full spectrum of human preferences or creative variety (Chen et al., 30 Dec 2025, Zhang et al., 1 Oct 2025).

1. Formal Definition and Manifestation

PMC is characterized by the concentration of a model’s conditional output law $p_\theta(x|c)$ (for a generative model $G_\theta$ and condition $c$) onto a “high-score” manifold $M \subset X$, where $R(x, c)$, a scalar reward function or learned preference model, is maximized. Formally,

$$\text{KL}\big(p_\theta(x|c) \parallel \text{Unif}(M)\big) \to 0,$$

even as $M$ fails to capture the true diversity or plurality of acceptable outputs per human judgment. In text-to-image diffusion, this can manifest as all “Cubism” outputs being trivially overexposed or as the generator learning a monolithic “glossy, highly-lit” style favored by the reward model’s biases, despite human preference for varied artistic interpretations (Chen et al., 30 Dec 2025).

Analogously, in LLMs, preference mode collapse is evidenced when the aligned model $\pi_\theta(y|x)$ assigns most probability mass to a small subset of completions, reducing entropy and the support size of generations. Metrics include the following (a minimal estimation sketch follows the list):

  • Support-size collapse: $|\{y : \pi_\theta(y|x) > \tau\}|$ for small $\tau$.
  • Entropy collapse: $H(\pi_\theta(\cdot|x)) = -\sum_{y} \pi_\theta(y|x)\log\pi_\theta(y|x)$ (Zhang et al., 1 Oct 2025).
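
A minimal sketch of how these two quantities might be estimated from repeated samples for a single prompt; the sample data and threshold below are illustrative assumptions, not values from the cited papers.

```python
import math
from collections import Counter

def collapse_metrics(completions, tau=0.05):
    """Estimate support size and entropy of an aligned model's output
    distribution from repeated generations for one prompt."""
    counts = Counter(completions)
    n = len(completions)
    probs = [c / n for c in counts.values()]

    # Support-size collapse: distinct completions with empirical probability > tau.
    support_size = sum(1 for p in probs if p > tau)

    # Entropy collapse: Shannon entropy of the empirical distribution.
    entropy = -sum(p * math.log(p) for p in probs)
    return support_size, entropy

# Illustrative example: a collapsed model repeats one completion most of the time.
samples = ["joke A"] * 85 + ["joke B"] * 10 + ["joke C"] * 5
print(collapse_metrics(samples))  # small support size, low entropy
```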

2. Benchmarking and Quantification

PMC is measured systematically with multi-axis diversity benchmarks. For diffusion models, DivGenBench comprises 3200 “keyword-driven” prompts spanning four orthogonal axes: Identity, Artistic Style, Layout, and Tonal/photographic properties. Key metrics include the following (an IDS-style computation is sketched after the table):

| Metric | Diversity Axis | Formula/Description |
| --- | --- | --- |
| IDS | Identity | Pairwise cosine similarity of ArcFace embeddings; lower is better |
| ASC | Artistic Style | Retrieval fraction of real styles; higher means closer to real |
| SDI | Layout | Spatial box IoU dispersion; higher denotes more layout diversity |
| PVS | Tonal/Photographic | Std. dev. of HSV and contrast; higher is more diverse |
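
As a rough illustration of the IDS column, the sketch below computes mean pairwise cosine similarity over identity embeddings; it assumes ArcFace-style embeddings are already extracted and is not the DivGenBench implementation.

```python
import numpy as np

def identity_similarity(embeddings):
    """IDS-style score: mean pairwise cosine similarity of identity embeddings
    for images generated from the same prompt. Lower values mean the generated
    identities are more distinct, i.e., higher identity diversity."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T                              # cosine similarity matrix
    n = len(x)
    return float(sims[~np.eye(n, dtype=bool)].mean())

# Illustrative usage with random vectors standing in for ArcFace features.
rng = np.random.default_rng(0)
print(identity_similarity(rng.normal(size=(16, 512))))
```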

For LLMs, quantification draws on metrics such as Distinct-n (unique n-grams), semantic embedding diversity, and entropy. Additionally, Coverage-N measures the breadth of responses in tasks such as dialogue or open-ended QA (Zhang et al., 1 Oct 2025).
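
A minimal sketch of Distinct-n, one of the LLM-side metrics listed above; tokenization is simplified to whitespace splitting for illustration.

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across a set of
    generations. Values near 1 indicate high lexical diversity; values near 0
    indicate heavy repetition (a symptom of preference mode collapse)."""
    ngrams = []
    for text in texts:
        tokens = text.split()                   # simplified tokenization
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

print(distinct_n(["the cat sat", "the cat sat", "a dog ran fast"], n=2))
```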

3. Theoretical Drivers of PMC

PMC arises due to inherent biases in reward modeling and preference data:

  • Reward Model Bias: In diffusion, reward functions $R(x, c)$ develop intrinsic affinities (e.g., toward specific color palettes). RLHF over-optimizes along these directions:

$$\frac{\text{Cov}\big(R(x, c), \phi_{\text{style}}(x)\big)}{\sigma_R \sigma_\phi} \text{ increases with RLHF iterations.}$$

This correlation leads to $p_\theta(x|c)$ collapsing onto a regime with maximal $\phi_{\text{style}}$ (Chen et al., 30 Dec 2025). A toy sketch of tracking this reward–style correlation appears below.
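
The sketch assumes per-sample reward scores and a scalar style feature are available; the checkpoint data are synthetic and only illustrate how the correlation would be monitored across RLHF training.

```python
import numpy as np

def reward_style_correlation(rewards, style_feature):
    """Normalized covariance Cov(R, phi_style) / (sigma_R * sigma_phi) over a
    batch of generations; growth across RLHF checkpoints signals collapse
    toward the reward model's preferred style direction."""
    r = np.asarray(rewards, dtype=float)
    s = np.asarray(style_feature, dtype=float)
    cov = np.cov(r, s, ddof=0)[0, 1]
    return float(cov / (r.std() * s.std()))

# Synthetic illustration: correlation drifting upward with training steps.
rng = np.random.default_rng(1)
for step in (0, 100, 300):
    style = rng.normal(size=256)
    reward = 0.003 * step * style + rng.normal(size=256)
    print(step, round(reward_style_correlation(reward, style), 3))
```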

  • Typicality Bias in Annotation: In LLM alignment, human annotators systematically favor “typical” completions, as captured by the base model’s (pre-alignment) likelihood. The learned preference function decomposes as

$$r(x, y) = r_{\text{true}}(x, y) + \alpha \log \pi_{\text{ref}}(y|x) + \varepsilon(x, y),$$

where $\alpha > 0$ quantifies typicality bias. The resulting aligned model sharpens the pretraining distribution:

$$\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)^{1+\alpha/\beta}\exp\big(r_{\text{true}}(x, y)/\beta\big),$$

so diversity collapses when many outputs are tied in true utility (Zhang et al., 1 Oct 2025).
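
This sharpened form follows directly if one assumes the standard KL-regularized RLHF optimum $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp(r(x, y)/\beta)$ and drops the annotation-noise term $\varepsilon$; substituting the biased reward gives

$$\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp\!\left(\frac{r_{\text{true}}(x, y) + \alpha\log\pi_{\text{ref}}(y|x)}{\beta}\right) = \pi_{\text{ref}}(y|x)^{1+\alpha/\beta}\exp\big(r_{\text{true}}(x, y)/\beta\big).$$

When $r_{\text{true}}$ is essentially tied across many acceptable completions, the exponent $1+\alpha/\beta > 1$ makes $\pi^*$ a strictly sharpened copy of $\pi_{\text{ref}}$, which is the entropy collapse described in Section 1.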

4. Mitigation Strategies

4.1. Directional Decoupling Alignment (D²-Align) for Diffusion

D²-Align mitigates PMC by introducing a learned correction direction $b_v \in \mathbb{R}^d$ in the reward CLIP embedding space:

  • Stage 1: With $G_\theta$ frozen, $b_v$ is optimized by constructing guided text and image embeddings and adjusting the reward function

$$R_{\text{guided}}(x_0, c; b_v) = \langle e_{\text{img}}, \widetilde{e}_{\text{text}} \rangle$$

to decouple the main bias direction.

  • Stage 2: With $b_v^*$ fixed, align $G_\theta$ using

$$\min_\theta \mathcal{L}_{\text{stage2}}(\theta) = -\,\mathbb{E}_{c,\, x_0 \sim G_\theta(c)}\big[R_{\text{guided}}(x_0, c; b_v^*)\big].$$

This yields improved sample diversity while maintaining reward quality. Only ∼20 RL steps are needed in Stage 2 compared to 300+ for baselines (Chen et al., 30 Dec 2025).
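
The following is a minimal toy sketch of the two-stage loop on random features, not the paper's implementation: the CLIP encoders are stand-ins, the guided text embedding $\widetilde{e}_{\text{text}}$ is assumed to be an additive shift of $e_{\text{text}}$ along $b_v$, and Stage 1 is assumed to decorrelate the guided reward from a known style feature.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64

def encode_image(x):
    # Stand-in for a frozen CLIP image encoder (assumption).
    return F.normalize(x, dim=-1)

def encode_text(c):
    # Stand-in for a frozen CLIP text encoder (assumption).
    return F.normalize(c, dim=-1)

def guided_reward(e_img, e_text, b_v):
    # Assumed construction: shift the text embedding along b_v, renormalize,
    # then take the inner product with the image embedding.
    e_text_tilde = F.normalize(e_text + b_v, dim=-1)
    return (e_img * e_text_tilde).sum(-1)

# Toy "generations" from a frozen generator: prompt features plus noise, with a
# bias injected along the first coordinate to mimic a reward-favored style.
prompts = torch.randn(256, d)
images = prompts + 0.5 * torch.randn(256, d)
images[:, 0] += 1.5 * torch.rand(256)

style_axis = torch.zeros(d)
style_axis[0] = 1.0
style_score = images @ style_axis

# Stage 1: generator frozen, fit b_v so the guided reward decorrelates from the
# style feature (the decorrelation objective itself is an assumption).
b_v = torch.zeros(d, requires_grad=True)
opt1 = torch.optim.Adam([b_v], lr=1e-2)
e_img, e_txt = encode_image(images), encode_text(prompts)
for _ in range(200):
    r = guided_reward(e_img, e_txt, b_v)
    loss = torch.corrcoef(torch.stack([r, style_score]))[0, 1].abs()
    opt1.zero_grad()
    loss.backward()
    opt1.step()
b_v_star = b_v.detach()

# Stage 2: b_v* fixed, adjust the "generator" (a learnable offset on the toy
# features) to maximize the guided reward: L_stage2 = -E[R_guided].
delta = torch.zeros(d, requires_grad=True)
opt2 = torch.optim.Adam([delta], lr=1e-2)
for _ in range(20):  # the paper reports ~20 RL steps suffice in Stage 2
    r = guided_reward(encode_image(images + delta), e_txt, b_v_star)
    loss = -r.mean()
    opt2.zero_grad()
    loss.backward()
    opt2.step()
print("final mean guided reward:", round(loss.neg().item(), 4))
```

The structural point the sketch preserves is that only the low-dimensional direction $b_v$ is fitted while the generator stays frozen, after which the usual reward-maximization step operates on the corrected reward.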

4.2. Verbalized Sampling (VS) for LLMs

VS is a training-free, inference-time remedy exploiting the model’s pretraining distribution. The central mechanism is to prompt the model to output several completions and their probabilities (“verbalized distribution”):

  • Instance-level: generate one sample,
  • List-level: generate $k$ samples,
  • Distribution-level (VS-Standard): generate $k$ samples with their explicit probabilities, summing to 1.

Repeated VS calls recover most of the pre-alignment model’s diversity, mitigating the effect of collapse imposed by typicality bias (Zhang et al., 1 Oct 2025).
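
A minimal sketch of a distribution-level (VS-Standard) style prompt and a parser for the returned verbalized distribution; the wording and JSON schema are illustrative assumptions, not the paper's exact templates.

```python
import json

def vs_standard_prompt(task, k=5):
    """Distribution-level prompt: ask for k candidate responses together with
    explicit probabilities summing to 1, instead of one 'typical' completion."""
    return (
        f"{task}\n"
        f"Generate {k} responses with their corresponding probabilities, "
        "sampled from the full distribution. Return a JSON list of objects "
        'with "text" and "probability" fields; probabilities must sum to 1.'
    )

def parse_verbalized_distribution(model_output):
    """Parse the verbalized distribution and renormalize the probabilities,
    since models rarely return weights that sum exactly to 1."""
    items = json.loads(model_output)
    total = sum(item["probability"] for item in items) or 1.0
    return [(item["text"], item["probability"] / total) for item in items]

print(vs_standard_prompt("Tell me a joke about coffee.", k=5))
```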

5. Empirical Evidence

5.1. Diffusion Models (DivGenBench; Reward: HPS-v2.1)

D²-Align achieves the best diversity-quality trade-off among evaluated RL-based approaches:

| Method | IDS (↓) | ASC (↑) | SDI (↑) | PVS (↑) |
| --- | --- | --- | --- | --- |
| FLUX | 0.280 | 0.179 | 0.563 | 0.408 |
| DanceGRPO | 0.348 | 0.130 | 0.488 | 0.259 |
| FlowGRPO | 0.391 | 0.044 | 0.389 | 0.168 |
| SRPO | 0.259 | 0.234 | 0.580 | 0.352 |
| D²-Align | 0.251 | 0.253 | 0.636 | 0.412 |

  • D²-Align combines near-best aesthetic and PickScore quality with maximal diversity.
  • Qualitatively, it generates distinct identities, authentic stylistic coverage, varied layouts, and correct tonal execution (Chen et al., 30 Dec 2025).

5.2. LLMs

Across creative writing, dialogue, QA, and synthetic math data generation:

  • VS improves semantic diversity by 1.6–2.1× in poems, 1.9–2.4× in stories, and up to 3× in jokes. Larger models benefit more (+12 percentage points diversity gain in GPT-4.1 vs +6 in smaller variants).
  • VS achieves high-quality generations (equaling or exceeding direct prompting), does not erode factual accuracy or safety (>97% refusal on harmful prompts), and permits user-tunable diversity-quality tradeoff via explicit probability thresholds (Zhang et al., 1 Oct 2025).
  • In math data, VS-based synthetic training boosts downstream accuracy on benchmarks by up to +4.7 pp over direct generation.

6. Comparative Analysis and Limitations

PMC is a distinct form of reward hacking driven by intrinsic reward model or data biases, rather than optimization failures alone. The D²-Align approach is plug-and-play (applicable to other RLHF pipelines), requires roughly an order of magnitude fewer RL steps for alignment, and empirically breaks the fidelity–diversity tradeoff in text-to-image tasks. However, it currently relies on a single, frozen reward model; future improvements may require ensemble or adaptive reward models and higher-order embedding corrections.

In LLMs, typicality bias is both empirically pervasive and theoretically guaranteed to cause diversity collapse for any positive $\alpha$. VS is model-agnostic and requires only API access, but it increases inference cost and offers less benefit for weaker models. Data-centric alignment, such as pluralistic reward modeling, and more exploration-oriented RL objectives remain open research directions (Zhang et al., 1 Oct 2025).

7. Broader Implications and Future Directions

Preference Mode Collapse highlights a fundamental failure of current alignment paradigms—high automated preference scores do not guarantee genuine diversity or human-like creativity. Methodological proposals such as D²-Align and Verbalized Sampling supply practical mitigation, but further progress depends on more robust preference modeling, ensemble-based or continual adaptation to shifting annotation biases, and explicit modeling of diversity objectives. Extending benchmarks like DivGenBench to richer axes (e.g., object interaction, scene complexity) and cross-domain tasks is an ongoing challenge for the field (Chen et al., 30 Dec 2025, Zhang et al., 1 Oct 2025).
