D²-Align: Bias Correction in Diffusion Models
- D²-Align is a two-stage framework designed to mitigate Preference Mode Collapse by correcting reward embedding biases in text-to-image diffusion models.
- It introduces a learned directional bias correction vector, which realigns textual embeddings and enhances output diversity without sacrificing human preference.
- Experimental results demonstrate that D²-Align outperforms baseline RL fine-tuning methods across key quality and diversity metrics.
Directional Decoupling Alignment (D²-Align) is a two-stage alignment framework developed to mitigate Preference Mode Collapse (PMC) in text-to-image diffusion models fine-tuned by Reinforcement Learning from Human Feedback (RLHF). PMC arises when the model exploits idiosyncrasies in reward models, producing high-reward but low-diversity outputs. D²-Align introduces a principled correction in the reward signal's embedding space, resulting in improved diversity without sacrificing alignment to human preferences (Chen et al., 30 Dec 2025).
1. Preference Mode Collapse: Characterization and Metrics
PMC refers to the collapse of a diffusion model's output distribution onto narrow, high-reward modes when optimizing against RLHF-derived rewards. For a generator conditioned on prompts drawn from a prompt distribution, PMC manifests as an output distribution that is narrowly concentrated, often with overexposed or monolithic styles.
To quantify PMC, the DivGenBench benchmark was introduced. It comprises 3,200 prompts that systematically span four atomic axes: Identity, Style, Layout, and Tonal. Four custom diversity metrics measure the degree of collapse:
| Metric | Assesses | Score Direction |
|---|---|---|
| Identity Divergence Score (IDS) | Embedding similarity of generated faces (ArcFace) | Lower = More diverse |
| Artistic Style Coverage (ASC) | Coverage across learned style feature space (CSD, IRS) | Higher = More diverse |
| Spatial Dispersion Index (SDI) | Layout variation (IoU/Hungarian object alignment) | Higher = More diverse |
| Photographic Variance Score (PVS) | Tonal/histogram spread (HSV, grayscale) | Higher = More diverse |
These metrics collectively capture diversity loss along axes of semantic identity, stylistic rendering, compositional arrangement, and photographic treatment.
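As a rough illustration of how an embedding-similarity score in the spirit of IDS could be computed, the sketch below averages the cosine similarity over all distinct pairs of embeddings. The exact IDS definition is not given here, so the function `mean_pairwise_cosine` and the toy 2-D vectors are assumptions standing in for ArcFace face embeddings; only the "lower = more diverse" reading is taken from the table.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all distinct pairs of rows (hypothetical IDS-like score)."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # unit-normalize each row
    sim = e @ e.T                                      # full cosine-similarity matrix
    rows, cols = np.triu_indices(e.shape[0], k=1)      # indices of distinct pairs only
    return float(sim[rows, cols].mean())

# Toy stand-ins for face embeddings: identical vectors vs. spread-out vectors.
collapsed = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
varied = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

collapsed_score = mean_pairwise_cosine(collapsed)
varied_score = mean_pairwise_cosine(varied)
print(collapsed_score > varied_score)
```

A set of near-identical outputs scores close to 1, while a spread-out set scores lower, which matches the score direction listed for IDS.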
2. Reward Model Bias and Induction of PMC
RLHF reward models such as HPS-v2.1 are built on CLIP-like embedding models and compute scores from the similarity between an image embedding and a text embedding. However, biases in the pretraining data, such as a preference for glossy, overexposed, or saturated visuals, cause these reward models to systematically favor certain appearances regardless of true semantic match. Fine-tuning a diffusion model solely to maximize this reward induces overoptimization toward these bias-aligned regions of embedding space, reducing both output heterogeneity and alignment with nuanced human preference.
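The effect can be illustrated with a toy cosine-similarity reward. This is a minimal sketch, not the HPS-v2.1 implementation: the two-dimensional embeddings, the `cosine_reward` helper, and the idea that axis 1 encodes a "glossy" bias are all assumptions made for illustration.

```python
import numpy as np

def cosine_reward(img_emb, txt_emb):
    """Toy CLIP-style reward: cosine similarity between image and text embeddings."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(img @ txt)

# Axis 0: prompt semantics. Axis 1: hypothetical "glossy" bias direction
# that leaked into the text embedding through pretraining data.
txt = np.array([1.0, 0.5])
faithful_image = np.array([1.0, 0.0])  # matches semantics, no gloss
glossy_image = np.array([1.0, 0.5])    # same semantics plus the biased look

faithful_reward = cosine_reward(faithful_image, txt)
glossy_reward = cosine_reward(glossy_image, txt)
print(glossy_reward > faithful_reward)
```

Both toy images carry the same semantic component, yet the one aligned with the bias direction earns the higher reward, which is the kind of signal a generator can overoptimize.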
3. Directional Correction in Embedding Space
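D²-Align introduces a learned correction in the text embedding space to counteract intrinsic reward model biases. Starting from the reward model's image and text embeddings, a correction vector is iteratively learned to shift the text embedding away from biased reward directions:
- Perturbed embeddings: the text embedding displaced by the correction vector.
- Guided embedding (with its accompanying weighting parameter): the embedding, built from the perturbed embeddings, that replaces the original text embedding when scoring.
- The guided reward: the reward model's score evaluated with the guided embedding.
Learning involves minimizing the negative guided reward under a frozen generator, effectively estimating the bias correction direction.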
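The learning step can be sketched numerically under explicit assumptions. The formulas below are not taken from the source: the perturbed embeddings are assumed to be the normalized text embedding plus and minus a vector `b`, the guided embedding is assumed to be a guidance-style extrapolation with weight `w`, and the reward is assumed to be a cosine similarity. The synthetic embeddings and the finite-difference gradient are likewise stand-ins for real reward-model features and automatic differentiation.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def guided_reward(img, txt, b, w=1.0):
    """Assumed guided reward: cosine(img, guided text embedding)."""
    e_plus = unit(txt + b)                        # assumed perturbed embedding (+b)
    e_minus = unit(txt - b)                       # assumed perturbed embedding (-b)
    e_guided = unit(e_plus + w * (e_plus - e_minus))  # assumed guidance-style extrapolation
    return np.sum(unit(img) * e_guided, axis=-1)

def loss(b, img, txt):
    # Objective stated in the text: minimize the negative guided reward.
    return -float(np.mean(guided_reward(img, txt, b)))

def numerical_grad(f, x, eps=1e-5):
    """Central finite differences; a stand-in for autodiff."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
txt = unit(rng.normal(size=(8, 4)))               # synthetic text embeddings
shared_shift = np.array([1.0, 0.0, 0.0, 0.0])     # hypothetical shared offset in image features
img = unit(txt + 0.5 * shared_shift)              # stand-in for frozen-generator outputs

b = np.full(4, 1e-3)                              # correction vector, near-zero start
initial_loss = loss(b, img, txt)
for _ in range(300):
    b = b - 0.02 * numerical_grad(lambda v: loss(v, img, txt), b)
final_loss = loss(b, img, txt)
print(initial_loss, final_loss)
```

Only `b` is updated; the image embeddings are computed once and never change, mirroring the frozen-generator condition.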
4. The D²-Align Two-Stage Algorithm
D²-Align operationalizes alignment via two sequential training phases:
Stage 1: Bias Correction Learning
- The generator is held frozen.
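- Iteratively, for sampled prompts, generate images, extract the image and text embeddings, compute the guided reward, and update the correction vector via gradient descent to minimize the negative guided reward.
- After a fixed number of steps, the resulting vector is retained as the learned bias direction.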
Stage 2: Generator Alignment to Guided Reward
- The generator is unfrozen.
- For each prompt, synthesize an image and compute the guided reward using the learned correction vector.
- The generator is updated via RL to maximize the guided reward, aligning outputs under the corrected reward signal.
Both phases maintain the originally trained reward model and generator architectures, ensuring compatibility with standard RLHF workflows.
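The control flow of the two stages can be sketched with a toy setup. Everything concrete here is an assumption for illustration: the "generator" is a single offset vector `theta` added to the prompt embedding, the guided reward uses an assumed guidance-style form, Stage 1 uses finite-difference gradients, and Stage 2 uses a simple score-function (REINFORCE-style) update as a stand-in for whichever RL algorithm is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def guided_reward(img, txt, b, w=1.0):
    # Assumed guided-reward form (hypothetical): cosine against a guided text embedding.
    e_plus, e_minus = unit(txt + b), unit(txt - b)
    return np.sum(unit(img) * unit(e_plus + w * (e_plus - e_minus)), axis=-1)

def generate(theta, txt, noise_scale=0.1):
    # Toy "generator": image embedding = prompt embedding + offset theta + Gaussian noise.
    noise = noise_scale * rng.normal(size=txt.shape)
    return txt + theta + noise, noise

def stage1_learn_bias(theta, prompts, steps=50, lr=0.02, eps=1e-5):
    """Stage 1: theta is only read (frozen); only b is updated."""
    b = np.full(prompts.shape[1], 1e-2)
    for _ in range(steps):
        img, _ = generate(theta, prompts)
        f = lambda v: -np.mean(guided_reward(img, prompts, v))
        grad = np.array([(f(b + eps * e) - f(b - eps * e)) / (2 * eps)
                         for e in np.eye(b.size)])
        b = b - lr * grad                      # gradient descent on negative guided reward
    return b

def stage2_align_generator(theta, b, prompts, steps=50, lr=0.05, noise_scale=0.1):
    """Stage 2: b is fixed; theta is updated to increase the guided reward."""
    theta = theta.copy()
    for _ in range(steps):
        img, noise = generate(theta, prompts, noise_scale)
        reward = guided_reward(img, prompts, b)
        advantage = reward - reward.mean()     # mean baseline to reduce variance
        # Score-function gradient for a Gaussian with mean txt + theta.
        grad = np.mean(advantage[:, None] * noise, axis=0) / noise_scale**2
        theta = theta + lr * grad              # ascent on expected guided reward
    return theta

prompts = unit(rng.normal(size=(16, 4)))       # synthetic prompt embeddings
theta0 = np.zeros(4)
b = stage1_learn_bias(theta0, prompts)
theta1 = stage2_align_generator(theta0, b, prompts)
print(b, theta1)
```

The point of the sketch is the separation of roles: the first function never writes to `theta`, and the second never writes to `b`.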
5. Experimental Protocol and Baseline Comparison
Experiments used FLUX.1 [dev] as the backbone diffusion model. Two reward settings were evaluated: HPS-v2.1 alone and HPS-v2.1 + CLIP score. Quality was assessed on HPDv2 (human-aligned) prompts using metrics including Aesthetic Score, Q-Align, ImageReward, PickScore, CLIP-Score, DeQA, and GenEval. Diversity was evaluated on DivGenBench along all four axes, using IDS, ASC, SDI, and PVS.
Baseline methods included DanceGRPO, Flow-GRPO, SRPO, and the unaligned FLUX backbone. All models ran on NVIDIA H20 GPUs, with 3,000 steps for Stage 1 and 20 for Stage 2.
6. Quantitative and Qualitative Outcomes
On HPDv2, D²-Align matched or outperformed the RL baselines on every reported metric except Aesthetic, where SRPO scores higher; the unaligned FLUX backbone retains the highest GenEval score:
| Method | Aesthetic↑ | ImageReward↑ | PickScore↑ | Q-Align↑ | DeQA↑ | GenEval↑ |
|---|---|---|---|---|---|---|
| D²-Align | 6.450 | 1.771 | 0.246 | 4.969 | 4.484 | 0.636 |
| FLUX | 6.417 | 1.670 | 0.240 | 4.922 | 4.456 | 0.663 |
| DanceGRPO | 6.068 | 1.664 | 0.241 | 4.930 | 4.400 | 0.522 |
| Flow-GRPO | 5.888 | 1.703 | 0.239 | 4.969 | 4.432 | 0.517 |
| SRPO | 6.614 | 1.533 | 0.241 | 4.866 | 4.357 | 0.623 |
For diversity (DivGenBench):
| Method | IDS↓ | ASC↑ | SDI↑ | PVS↑ |
|---|---|---|---|---|
| D²-Align | 0.251 | 0.253 | 0.636 | 0.412 |
| FLUX | 0.280 | 0.179 | 0.563 | 0.408 |
| DanceGRPO | 0.348 | 0.130 | 0.488 | 0.259 |
| Flow-GRPO | 0.391 | 0.044 | 0.389 | 0.168 |
| SRPO | 0.259 | 0.234 | 0.580 | 0.352 |
D²-Align attained the best score on all four diversity axes, mitigating the PMC observed in other RL-based fine-tuning regimes.
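The ranking can be checked directly from the reported numbers. The short script below copies the DivGenBench scores from the table above and picks the best method per metric, treating IDS as lower-is-better and the other three as higher-is-better; the variable and function names are illustrative only.

```python
# Diversity scores copied from the DivGenBench table above.
scores = {
    "D²-Align":  {"IDS": 0.251, "ASC": 0.253, "SDI": 0.636, "PVS": 0.412},
    "FLUX":      {"IDS": 0.280, "ASC": 0.179, "SDI": 0.563, "PVS": 0.408},
    "DanceGRPO": {"IDS": 0.348, "ASC": 0.130, "SDI": 0.488, "PVS": 0.259},
    "Flow-GRPO": {"IDS": 0.391, "ASC": 0.044, "SDI": 0.389, "PVS": 0.168},
    "SRPO":      {"IDS": 0.259, "ASC": 0.234, "SDI": 0.580, "PVS": 0.352},
}
lower_is_better = {"IDS"}

def best_method(metric):
    """Return the method with the best score for the given metric."""
    pick = min if metric in lower_is_better else max
    return pick(scores, key=lambda method: scores[method][metric])

best = {metric: best_method(metric) for metric in ["IDS", "ASC", "SDI", "PVS"]}
for metric, method in best.items():
    print(f"{metric}: {method}")
```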
Qualitatively, D²-Align preserved key subject identities, rendered diverse artistic styles faithfully, and respected prompt-specified layouts and tonal instructions, in contrast to the mode-collapsed baseline outputs. In user studies on HPDv2, D²-Align obtained the highest win rates for detail preservation (61.7%), image-text alignment (52.2%), and overall preference (48.2%); on DivGenBench it was also selected most often for diversity (e.g., 37.3% win rate in Style, 35.2% in Identity).
7. Implications and Context
D²-Align demonstrates that PMC, induced via biased reward maximization in RLHF for diffusion, can be substantially mitigated by explicit bias-correcting interventions in the reward embedding space. The approach is decoupled from generator and reward model architectures, instead requiring only an additional directional alignment vector and associated two-stage optimization. This suggests D²-Align is adaptable to future reward models exhibiting different or evolving biases. A plausible implication is broader applicability to other generative domains utilizing embedding-based reward signals, where mode collapse remains a practical limitation (Chen et al., 30 Dec 2025).