Preference Mode Collapse in Generative Models
- Preference Mode Collapse is characterized by a model’s output concentrating on a narrow set of modes that maximize reward, sacrificing diversity.
- It is quantified using metrics such as support-size collapse, entropy drop, and spatial dispersion, highlighting its impact in both diffusion models and LLMs.
- Mitigation strategies like D²-Align for diffusion models and Verbalized Sampling for LLMs show promise in restoring diversity while maintaining alignment quality.
Preference Mode Collapse (PMC) refers to a systemic failure mode in preference-driven training or reinforcement learning from human feedback (RLHF), in which an aligned generative model—such as a diffusion model or an LLM—collapses its output distribution onto a narrow, reward-exploiting subset of modes. This phenomenon severely degrades the diversity of generated samples, producing outputs that optimize alignment metrics but fail to capture the full spectrum of human preferences or creative variety (Chen et al., 30 Dec 2025, Zhang et al., 1 Oct 2025).
1. Formal Definition and Manifestation
PMC is characterized by the concentration of a model’s conditional output law $p_\theta(\cdot \mid c)$ (for a generative model $p_\theta$ and condition $c$) onto a “high-score” manifold $\mathcal{M}^\star = \{\, y : r(y, c) \approx \max_{y'} r(y', c) \,\}$, where $r(y, c)$, a scalar reward function or learned preference model, is maximized. Formally,

$$\Pr_{y \sim p_\theta(\cdot \mid c)}\bigl[\, y \in \mathcal{M}^\star \,\bigr] \to 1,$$

even as $p_\theta(\cdot \mid c)$ fails to capture the true diversity or plurality of acceptable outputs per human judgment. In text-to-image diffusion, this can manifest as all “Cubism” outputs being trivially overexposed or as the generator learning a monolithic “glossy, highly-lit” style favored by the reward model’s biases, despite human preference for varied artistic interpretations (Chen et al., 30 Dec 2025).
Analogously, in LLMs, preference mode collapse is evidenced when the aligned model assigns most probability mass to a small subset of completions, reducing entropy and the support size of generations. Metrics include (a minimal computation is sketched below):
- Support-size collapse: $\bigl|\{\, y : \pi_{\text{aligned}}(y \mid x) > \epsilon \,\}\bigr| \ll \bigl|\{\, y : \pi_{\text{base}}(y \mid x) > \epsilon \,\}\bigr|$ for small $\epsilon > 0$.
- Entropy collapse: $H\bigl(\pi_{\text{aligned}}(\cdot \mid x)\bigr) \ll H\bigl(\pi_{\text{base}}(\cdot \mid x)\bigr)$ (Zhang et al., 1 Oct 2025).
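A minimal sketch of these two statistics, assuming explicit per-completion probabilities are available; the dictionaries `p_base` and `p_aligned` below are hypothetical placeholders for distributions estimated from the base and aligned models:

```python
import math

def support_size(p, eps=1e-3):
    """Number of completions receiving more than eps probability mass."""
    return sum(1 for prob in p.values() if prob > eps)

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution over completions."""
    return -sum(prob * math.log(prob) for prob in p.values() if prob > 0)

# Hypothetical per-completion probabilities for the same prompt.
p_base    = {"a": 0.30, "b": 0.25, "c": 0.20, "d": 0.15, "e": 0.10}
p_aligned = {"a": 0.97, "b": 0.02, "c": 0.01}  # mass piled onto one mode

print("support:", support_size(p_base), "->", support_size(p_aligned))
print("entropy:", round(entropy(p_base), 3), "->", round(entropy(p_aligned), 3))
```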
2. Benchmarking and Quantification
PMC is systematically measured using multi-dimensional diversity benchmarks. In diffusion, DivGenBench is employed, comprising 3200 “keyword-driven” prompts across four orthogonal axes: Identity, Artistic Style, Layout, and Tonal/Photographic properties. Key metrics are listed below; an IDS-style computation is sketched after the table:
| Metric | Diversity Axis | Formula/Description |
|---|---|---|
| IDS | Identity | Mean pairwise cosine similarity of ArcFace embeddings; lower indicates more identity diversity |
| ASC | Artistic Style | Retrieval fraction matching real artistic styles; higher indicates closer coverage of real styles |
| SDI | Layout | Dispersion of spatial bounding-box IoU; higher indicates more layout diversity |
| PVS | Tonal/Photographic | Std. dev. of HSV and contrast statistics; higher indicates more tonal/photographic diversity |
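As an illustration of the IDS-style measurement, the sketch below averages pairwise cosine similarities over a set of identity embeddings; the random `embeddings` array stands in for precomputed ArcFace features, and the exact aggregation used by DivGenBench may differ.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of row vectors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                  # full cosine-similarity matrix
    iu = np.triu_indices(len(embeddings), k=1)  # upper triangle, excluding diagonal
    return float(sims[iu].mean())

# Hypothetical identity embeddings for generations from one prompt.
embeddings = np.random.default_rng(0).normal(size=(16, 512))
print("IDS-like score (lower = more identity diversity):",
      round(mean_pairwise_cosine(embeddings), 3))
```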
For LLMs, quantification draws on metrics such as Distinct-n (unique n-grams), semantic embedding diversity, and entropy. Additionally, Coverage-N measures the breadth of responses in tasks such as dialogue or open-ended QA (Zhang et al., 1 Oct 2025).
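A minimal Distinct-n computation over a handful of placeholder completions; the ratio of unique to total n-grams falls as samples become repetitive:

```python
def distinct_n(samples, n=2):
    """Ratio of unique n-grams to total n-grams across all samples."""
    ngrams = []
    for text in samples:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

samples = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the sleepy cat",
    "a slow green turtle crawls under the old fence",
]
print("Distinct-2:", round(distinct_n(samples, n=2), 3))
```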
3. Theoretical Drivers of PMC
PMC arises due to inherent biases in reward modeling and preference data:
- Reward Model Bias: In diffusion, reward functions develop intrinsic affinities (e.g., toward specific color palettes). RLHF over-optimizes along these directions: because the learned reward is strongly correlated with a particular bias direction in its embedding space, the policy collapses onto the regime where the component along that direction is maximal (Chen et al., 30 Dec 2025).
- Typicality Bias in Annotation: In LLM alignment, human annotators systematically favor “typical” completions, as captured by the base model’s (pre-alignment) likelihood. The learned preference function decomposes as
$$r(x, y) = r_{\text{true}}(x, y) + \alpha \log \pi_{\text{base}}(y \mid x),$$
where $\alpha > 0$ quantifies typicality bias. The resulting aligned model sharpens the pretraining distribution: at the RLHF optimum with KL weight $\beta$,
$$\pi_{\text{aligned}}(y \mid x) \propto \pi_{\text{base}}(y \mid x)^{\,1 + \alpha/\beta} \exp\bigl(r_{\text{true}}(x, y)/\beta\bigr),$$
so diversity collapses when many outputs are tied in true utility (Zhang et al., 1 Oct 2025); a toy numeric illustration of this sharpening follows below.
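A toy numeric illustration of the sharpening effect, assuming the closed-form optimum reconstructed above: when candidates are tied in true utility, the aligned distribution is the base distribution raised to a power greater than one, so its entropy strictly drops for any positive $\alpha$.

```python
import numpy as np

def sharpen(p_base, alpha, beta):
    """Aligned distribution when all candidates have equal true utility."""
    q = p_base ** (1.0 + alpha / beta)
    return q / q.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

p_base = np.array([0.4, 0.3, 0.2, 0.1])          # pretraining distribution over tied outputs
p_aligned = sharpen(p_base, alpha=1.0, beta=0.5)  # positive typicality-bias weight

print("base entropy   :", round(entropy(p_base), 3))
print("aligned entropy:", round(entropy(p_aligned), 3))  # strictly smaller for alpha > 0
```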
4. Mitigation Strategies
4.1. Directional Decoupling Alignment (D²-Align) for Diffusion
D²-Align mitigates PMC by introducing a learned correction direction $\Delta$ in the reward model’s CLIP embedding space:
- Stage 1: With the generator frozen, $\Delta$ is optimized by constructing guided text and image embeddings and adjusting the reward function so that the dominant bias direction is decoupled from the alignment signal.
- Stage 2: With $\Delta$ fixed, the generator is aligned via RL against the decoupled reward.
This yields improved sample diversity while maintaining reward quality. Only ∼20 RL steps are needed in Stage 2, compared to 300+ for baselines (Chen et al., 30 Dec 2025).
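The following is a deliberately schematic sketch of the two-stage decoupling idea on synthetic embeddings, not the published D²-Align procedure: the toy reward model, the reward-weighted estimate of $\Delta$, the regression-based correction, and the top-k selection used as a stand-in for RL alignment are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 32

# Toy stand-ins: a hidden bias direction and a biased scalar reward on embeddings.
hidden_bias = rng.normal(size=DIM)
hidden_bias /= np.linalg.norm(hidden_bias)

def true_quality(e):                 # hypothetical "real" quality the reward should track
    return -np.abs(e[:, 0])

def reward(e):                       # biased reward: quality plus a bias-direction term
    return true_quality(e) + 2.0 * (e @ hidden_bias)

# Stage 1 (schematic): with the generator frozen, estimate a correction direction
# Delta as the reward-weighted mean embedding direction.
probe = rng.normal(size=(2000, DIM))
w = reward(probe) - reward(probe).mean()
delta = (w[:, None] * probe).mean(axis=0)
delta /= np.linalg.norm(delta)

# Stage 2 (schematic): align against a decoupled reward that regresses out
# the component of the score explained by Delta.
proj = probe @ delta
slope = float(np.dot(reward(probe), proj) / np.dot(proj, proj))

def decoupled_reward(e):
    return reward(e) - slope * (e @ delta)

# Proxy for "where the policy concentrates": the top-scoring candidates.
cands = rng.normal(size=(5000, DIM))
top_raw = cands[np.argsort(-reward(cands))[:100]]
top_dec = cands[np.argsort(-decoupled_reward(cands))[:100]]

print("mean projection onto hidden bias (raw reward):      ",
      round(float((top_raw @ hidden_bias).mean()), 3))
print("mean projection onto hidden bias (decoupled reward):",
      round(float((top_dec @ hidden_bias).mean()), 3))
```

The sketch only reproduces the qualitative effect: removing the bias component before optimization keeps high-scoring samples from clustering along the reward model’s preferred direction.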
4.2. Verbalized Sampling (VS) for LLMs
VS is a training-free, inference-time remedy exploiting the model’s pretraining distribution. The central mechanism is to prompt the model to output several completions and their probabilities (“verbalized distribution”):
- Instance-level: generate one sample per call.
- List-level: generate a list of $k$ samples per call.
- Distribution-level (VS-Standard): generate $k$ samples per call together with their explicit probabilities, which sum to 1.
Repeated VS calls recover most of the pre-alignment model’s diversity, mitigating the effect of collapse imposed by typicality bias (Zhang et al., 1 Oct 2025).
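A minimal sketch of a distribution-level (VS-Standard) call and of sampling from the verbalized distribution it returns; the prompt wording, the JSON schema, and the `call_llm` stub are illustrative assumptions rather than the paper’s exact template.

```python
import json
import random

VS_PROMPT = (
    "Generate 5 different responses to the task below. "
    "Return JSON: a list of objects with fields 'text' and 'probability', "
    "where the probabilities reflect how likely each response is and sum to 1.\n\n"
    "Task: Write a one-line opening for a short story about the sea."
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever chat-completion API is in use."""
    # In practice this would call the deployed model; here we return a canned reply.
    return json.dumps([
        {"text": "The tide kept the village's secrets better than its people did.", "probability": 0.4},
        {"text": "Salt had crept into everything, even the letters she never sent.", "probability": 0.3},
        {"text": "The lighthouse blinked twice, then went dark for good.", "probability": 0.2},
        {"text": "Nobody remembered when the gulls stopped coming back.", "probability": 0.1},
    ])

candidates = json.loads(call_llm(VS_PROMPT))
texts = [c["text"] for c in candidates]
probs = [c["probability"] for c in candidates]

# Sample from the verbalized distribution instead of taking the single modal reply.
print(random.choices(texts, weights=probs, k=1)[0])
```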
5. Empirical Evidence
5.1. Diffusion Models (DivGenBench; Reward: HPS-v2.1)
D²-Align achieves the best diversity-quality trade-off among evaluated RL-based approaches:
| Method | IDS (↓) | ASC (↑) | SDI (↑) | PVS (↑) |
|---|---|---|---|---|
| FLUX | 0.280 | 0.179 | 0.563 | 0.408 |
| DanceGRPO | 0.348 | 0.130 | 0.488 | 0.259 |
| FlowGRPO | 0.391 | 0.044 | 0.389 | 0.168 |
| SRPO | 0.259 | 0.234 | 0.580 | 0.352 |
| D²-Align | 0.251 | 0.253 | 0.636 | 0.412 |
- D²-Align combines near-best aesthetic and PickScore results with the strongest diversity on all four DivGenBench axes.
- Qualitatively, it generates distinct identities, authentic stylistic coverage, varied layouts, and correct tonal execution (Chen et al., 30 Dec 2025).
5.2. LLMs
Across creative writing, dialogue, QA, and synthetic math data generation:
- VS improves semantic diversity by 1.6–2.1× in poems, 1.9–2.4× in stories, and up to 3× in jokes. Larger models benefit more (+12 percentage points diversity gain in GPT-4.1 vs +6 in smaller variants).
- VS achieves high-quality generations (equaling or exceeding direct prompting), does not erode factual accuracy or safety (>97% refusal on harmful prompts), and permits a user-tunable diversity-quality tradeoff via explicit probability thresholds, as sketched after this list (Zhang et al., 1 Oct 2025).
- In math data, VS-based synthetic training boosts downstream accuracy on benchmarks by up to +4.7 pp over direct generation.
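One plausible reading of the probability-threshold knob, sketched below under the assumption that candidates whose verbalized probability falls under a tunable floor are simply discarded: a high threshold approaches direct prompting, while a low threshold retains the long tail.

```python
def filter_by_threshold(candidates, tau):
    """Keep verbalized candidates whose stated probability is at least tau."""
    kept = [c for c in candidates if c["probability"] >= tau]
    return kept or candidates[:1]           # fall back to the single modal response

candidates = [                              # e.g. parsed from a VS-Standard reply
    {"text": "modal answer",   "probability": 0.55},
    {"text": "common variant", "probability": 0.25},
    {"text": "rare variant",   "probability": 0.15},
    {"text": "long-tail idea", "probability": 0.05},
]

print(len(filter_by_threshold(candidates, tau=0.5)))   # ~direct prompting: 1 kept
print(len(filter_by_threshold(candidates, tau=0.05)))  # maximal diversity: all 4 kept
```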
6. Comparative Analysis and Limitations
PMC is a distinct form of reward hacking driven by intrinsic reward-model or data biases, rather than by optimization failures alone. The D²-Align approach is plug-and-play (applicable to other RLHF pipelines), requires over an order of magnitude fewer RL steps for alignment, and empirically breaks the fidelity–diversity tradeoff in text-to-image tasks. However, it currently relies on a single, frozen reward model; future improvements may require ensemble or adaptive reward models and higher-order embedding corrections.
In LLMs, typicality bias is both empirically pervasive and theoretically guaranteed to cause diversity collapse for any positive bias weight $\alpha$. VS is model-agnostic and requires only black-box API access, but it increases inference cost and offers less benefit for weaker models. Data-centric alignment, such as pluralistic reward modeling, and more exploration-oriented RL objectives remain open research directions (Zhang et al., 1 Oct 2025).
7. Broader Implications and Future Directions
Preference Mode Collapse highlights a fundamental failure of current alignment paradigms—high automated preference scores do not guarantee genuine diversity or human-like creativity. Methodological proposals such as D²-Align and Verbalized Sampling supply practical mitigation, but further progress depends on more robust preference modeling, ensemble-based or continual adaptation to shifting annotation biases, and explicit modeling of diversity objectives. Extending benchmarks like DivGenBench to richer axes (e.g., object interaction, scene complexity) and cross-domain tasks is an ongoing challenge for the field (Chen et al., 30 Dec 2025, Zhang et al., 1 Oct 2025).