D²-Align: Bias Correction in Diffusion Models

Updated 19 March 2026

D²-Align is a two-stage framework designed to mitigate Preference Mode Collapse by correcting reward embedding biases in text-to-image diffusion models.
It introduces a learned directional bias correction vector, which realigns textual embeddings and enhances output diversity without sacrificing human preference.
Experimental results demonstrate that D²-Align outperforms baseline RL fine-tuning methods across key quality and diversity metrics.

Directional Decoupling Alignment (D²-Align) is a two-stage alignment framework developed to mitigate Preference Mode Collapse (PMC) in text-to-image diffusion models fine-tuned by Reinforcement Learning from Human Feedback (RLHF). PMC arises when the model exploits idiosyncrasies in reward models, producing high-reward but low-diversity outputs. D²-Align introduces a principled correction in the reward signal's embedding space, resulting in improved diversity without sacrificing alignment to human preferences (Chen et al., 30 Dec 2025).

1. Preference Mode Collapse: Characterization and Metrics

PMC refers to the collapse of a diffusion model's output distribution onto narrow, high-reward modes when optimizing against RLHF-derived rewards. For a generator $G_\theta(c)$ conditioned on prompt $c$ with prompt distribution $\mathcal{D}$, PMC manifests as $p_\theta(x_0 \mid c) \rightarrow \delta(x_0 \approx x^*)$, where the outputs $x^*$ are narrowly concentrated, often with overexposed or monolithic styles.

To quantify PMC, the DivGenBench benchmark was introduced. It comprises 3,200 prompts that systematically span four atomic axes: Identity, Style, Layout, and Tonal. Four custom diversity metrics, one per axis, measure the degree of collapse:

Metric	Assesses	Score Direction
Identity Divergence Score (IDS)	Embedding similarity of generated faces (ArcFace)	Lower = More diverse
Artistic Style Coverage (ASC)	Coverage across learned style feature space (CSD, IRS$_\infty$)	Higher = More diverse
Spatial Dispersion Index (SDI)	Layout variation (IoU/Hungarian object alignment)	Higher = More diverse
Photographic Variance Score (PVS)	Tonal/histogram spread (HSV, grayscale)	Higher = More diverse

These metrics collectively capture diversity loss along axes of semantic identity, stylistic rendering, compositional arrangement, and photographic treatment.
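As a rough illustration of how an identity-style score can behave, the sketch below computes the mean pairwise cosine similarity over a set of embeddings. This is an assumption made for illustration: the exact IDS formula is not given here, the function name `mean_pairwise_similarity` is hypothetical, and the 2-d vectors are made-up stand-ins for real face embeddings such as ArcFace features.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs (lower = more diverse).

    Assumed stand-in for an identity-divergence style score, not the
    benchmark's official definition.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings (made-up numbers): a collapsed set and a spread-out set.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_pairwise_similarity(collapsed))
print(mean_pairwise_similarity(diverse))
```

Under this reading, a collapsed set of near-identical faces scores close to 1, while a spread-out set scores lower, which matches the "Lower = More diverse" direction in the table.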

2. Reward Model Bias and Induction of PMC

RLHF reward models such as HPS-v2.1 are constructed atop CLIP-like embedding models, computing scores as $R(x_0, c) = \mathrm{score}(\Phi_{\text{img}}(x_0), \Phi_{\text{text}}(c))$, with $\mathrm{score}(u, v) = \cos(u, v)$. However, pretraining data biases, such as a preference for glossy, overexposed, or saturated visuals, cause these reward models to systematically favor certain appearances regardless of true semantic match. Fine-tuning a diffusion model solely to maximize $R$ induces overoptimization toward these bias-aligned regions of embedding space, reducing both output heterogeneity and alignment with nuanced human preference.

3. Directional Correction in Embedding Space

D²-Align introduces a learned correction in the text embedding space to counteract intrinsic reward model biases. Define $\mathbf{e}_{\text{text}} = \Phi_{\text{text}}(c) \in \mathbb{R}^d$ and $\mathbf{e}_{\text{img}} = \Phi_{\text{img}}(x_0)$. A correction vector $\mathbf{b}_v \in \mathbb{R}^d$ is iteratively learned to shift the text embedding away from biased reward directions:

Perturbed embeddings:
- $\mathbf{e}_{+} = \mathrm{normalize}(\mathbf{e}_{\text{text}} + \mathbf{b}_v)$
- $\mathbf{e}_{-} = \mathrm{normalize}(\mathbf{e}_{\text{text}} - \mathbf{b}_v)$

Guided embedding (with $\omega > 1$):

$\tilde{\mathbf{e}}_{\text{text}} = \mathbf{e}_{-} + \omega (\mathbf{e}_{+} - \mathbf{e}_{-})$
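
Regrouping the terms of the guided embedding makes the role of $\omega$ explicit. This is only an algebraic restatement of the definition, not an additional result:

$\tilde{\mathbf{e}}_{\text{text}} = \omega\,\mathbf{e}_{+} + (1 - \omega)\,\mathbf{e}_{-}$

With $\omega = 1$ this reduces to $\mathbf{e}_{+}$. With $\omega > 1$ the weight on $\mathbf{e}_{-}$ is negative, so the guided embedding extrapolates past $\mathbf{e}_{+}$ along the direction $\mathbf{e}_{+} - \mathbf{e}_{-}$, the same algebraic form used in classifier-free guidance.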

The guided reward: $R_{\text{guided}}(x_0, c; \mathbf{b}_v) = \mathrm{score}(\mathbf{e}_{\text{img}}, \tilde{\mathbf{e}}_{\text{text}})$

Learning $\mathbf{b}_v$ involves minimizing $-\mathbb{E}_{c, x_0} [R_{\text{guided}}(x_0, c; \mathbf{b}_v)]$ under a frozen generator, effectively estimating the bias correction direction.
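
The plain and guided rewards can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions: the embeddings are made-up 3-d vectors rather than outputs of a real encoder, the function names are hypothetical, and $\omega = 2$ is an arbitrary choice satisfying $\omega > 1$.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity, i.e. score(u, v) in the text."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def reward(e_img, e_text):
    """Plain reward R(x_0, c) computed on precomputed embeddings."""
    return cosine(e_img, e_text)

def guided_reward(e_img, e_text, b_v, omega):
    """Guided reward: score against the extrapolated text embedding."""
    e_plus = normalize([t + b for t, b in zip(e_text, b_v)])
    e_minus = normalize([t - b for t, b in zip(e_text, b_v)])
    e_tilde = [m + omega * (p - m) for p, m in zip(e_plus, e_minus)]
    return cosine(e_img, e_tilde)

# Toy embeddings (made-up numbers, not from any real encoder).
e_text = [1.0, 0.0, 0.0]
e_img = [1.0, 1.0, 0.0]
zero = [0.0, 0.0, 0.0]
b_v = [0.0, 1.0, 0.0]  # hypothetical correction direction

print(reward(e_img, e_text))
print(guided_reward(e_img, e_text, zero, omega=2.0))
print(guided_reward(e_img, e_text, b_v, omega=2.0))
```

With a zero correction vector both perturbed embeddings coincide with the normalized text embedding, so the guided reward equals the plain reward; a nonzero $\mathbf{b}_v$ changes the direction against which the image is scored.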

4. The D²-Align Two-Stage Algorithm

D²-Align operationalizes alignment via two sequential training phases:

Stage 1: Bias Correction Learning

The generator $G_\theta$ is held frozen.
Iteratively, for sampled $c$, generate $x_0$, extract embeddings, compute $R_{\text{guided}}$, and update $\mathbf{b}_v$ via gradient descent to minimize the negative guided reward.
After $T_1$ steps, $\mathbf{b}_v^*$ is retained as the learned bias direction.
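
The Stage 1 steps above can be sketched as a toy optimization loop. Everything here is an assumption made for illustration: the "frozen generator" is replaced by a fixed set of synthetic (image, text) embedding pairs, gradients are taken by central finite differences rather than automatic differentiation, and the initialization, learning rate, step count, and $\omega$ are arbitrary. All function names are hypothetical.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def guided_reward(e_img, e_text, b_v, omega):
    e_plus = normalize([t + b for t, b in zip(e_text, b_v)])
    e_minus = normalize([t - b for t, b in zip(e_text, b_v)])
    e_tilde = [m + omega * (p - m) for p, m in zip(e_plus, e_minus)]
    return cosine(e_img, e_tilde)

def stage1_loss(pairs, b_v, omega):
    """Negative mean guided reward over fixed (e_img, e_text) pairs."""
    return -sum(guided_reward(i, t, b_v, omega) for i, t in pairs) / len(pairs)

def make_toy_pairs(n=8, dim=4, seed=1):
    """Synthetic stand-in for frozen-generator samples (assumption):
    every image embedding is its text embedding plus a shared offset."""
    rng = random.Random(seed)
    offset = [0.0] * (dim - 1) + [0.5]
    pairs = []
    for _ in range(n):
        e_text = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
        e_img = normalize([t + o for t, o in zip(e_text, offset)])
        pairs.append((e_img, e_text))
    return pairs

def learn_bias_vector(pairs, dim, omega=2.0, lr=0.02, steps=150, eps=1e-5, seed=0):
    """Gradient descent on b_v; finite differences stand in for autodiff."""
    rng = random.Random(seed)
    b_v = [rng.gauss(0.0, 0.01) for _ in range(dim)]  # assumed small random init
    for _ in range(steps):
        grad = []
        for k in range(dim):
            up = b_v[:k] + [b_v[k] + eps] + b_v[k + 1:]
            down = b_v[:k] + [b_v[k] - eps] + b_v[k + 1:]
            diff = stage1_loss(pairs, up, omega) - stage1_loss(pairs, down, omega)
            grad.append(diff / (2 * eps))
        b_v = [b - lr * g for b, g in zip(b_v, grad)]
    return b_v

pairs = make_toy_pairs()
b_star = learn_bias_vector(pairs, dim=4)
print(stage1_loss(pairs, [0.0] * 4, 2.0), stage1_loss(pairs, b_star, 2.0))
```

In a real setup the pairs would be re-sampled from the frozen generator at each step and the gradient would come from backpropagation through the reward model's text branch; the loop structure is otherwise the same.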

Stage 2: Generator Alignment to Guided Reward

$G_\theta$ is unfrozen.
For each prompt $c$, synthesize $x_0$ and compute $R_{\text{guided}}$ using $\mathbf{b}_v^*$.
The generator is updated via RL to maximize the guided reward, aligning outputs under the corrected reward signal.

Both phases maintain the originally trained reward model and generator architectures, ensuring compatibility with standard RLHF workflows.

5. Experimental Protocol and Baseline Comparison

Experiments used FLUX.1 [dev] as the backbone diffusion model. Two reward settings were evaluated: HPS-v2.1 alone and HPS-v2.1 combined with CLIP score. Quality was assessed on HPDv2 (human-aligned) prompts via metrics such as Aesthetic Score, ImageReward, PickScore, Q-Align, CLIP-Score, DeQA, and GenEval. Diversity was evaluated on DivGenBench along all four axes, using IDS, ASC, SDI, and PVS.

Baseline methods included DanceGRPO, Flow-GRPO, SRPO, and the unaligned FLUX backbone. All models ran on NVIDIA H20 GPUs, with 3,000 steps for Stage 1 and 20 for Stage 2.

6. Quantitative and Qualitative Outcomes

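On HPDv2, D²-Align matched or outperformed the RL baselines on ImageReward, PickScore, Q-Align, DeQA, and GenEval. The exceptions are Aesthetic, where SRPO scored higher (6.614 vs. 6.450), and GenEval, where the unaligned FLUX backbone remained highest (0.663 vs. 0.636):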

Method	Aesthetic↑	ImageR↑	Pick↑	Q-Align↑	DeQA↑	GenEval↑
D²-Align	6.450	1.771	0.246	4.969	4.484	0.636
FLUX	6.417	1.670	0.240	4.922	4.456	0.663
DanceGRPO	6.068	1.664	0.241	4.930	4.400	0.522
Flow-GRPO	5.888	1.703	0.239	4.969	4.432	0.517
SRPO	6.614	1.533	0.241	4.866	4.357	0.623

For diversity (DivGenBench):

Method	IDS↓	ASC↑	SDI↑	PVS↑
D²-Align	0.251	0.253	0.636	0.412
FLUX	0.280	0.179	0.563	0.408
DanceGRPO	0.348	0.130	0.488	0.259
Flow-GRPO	0.391	0.044	0.389	0.168
SRPO	0.259	0.234	0.580	0.352

D²-Align attained the best score on all four diversity metrics, mitigating the PMC observed under the other RL-based fine-tuning regimes.
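
As a quick check on the DivGenBench table above, the snippet below picks the best method per metric while respecting each metric's score direction. The scores are copied from the table; the dictionary layout and the helper name `best_method` are illustrative choices.

```python
# Scores copied from the DivGenBench table (IDS: lower is better; others: higher).
scores = {
    "D²-Align": {"IDS": 0.251, "ASC": 0.253, "SDI": 0.636, "PVS": 0.412},
    "FLUX": {"IDS": 0.280, "ASC": 0.179, "SDI": 0.563, "PVS": 0.408},
    "DanceGRPO": {"IDS": 0.348, "ASC": 0.130, "SDI": 0.488, "PVS": 0.259},
    "Flow-GRPO": {"IDS": 0.391, "ASC": 0.044, "SDI": 0.389, "PVS": 0.168},
    "SRPO": {"IDS": 0.259, "ASC": 0.234, "SDI": 0.580, "PVS": 0.352},
}
lower_is_better = {"IDS"}

def best_method(metric):
    """Return the method with the best score for the given metric."""
    pick = min if metric in lower_is_better else max
    return pick(scores, key=lambda method: scores[method][metric])

for metric in ("IDS", "ASC", "SDI", "PVS"):
    print(metric, best_method(metric))
```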

Qualitatively, D²-Align preserved key subject identities, faithfully delivered diverse artistic styles, and respected prompt-specified layouts and tonal instructions in contrast to the mode-collapsed baseline outputs. User studies on HPDv2 yielded the highest win rates for D²-Align regarding detail preservation (61.7%), image-text alignment (52.2%), and overall preference (48.2%), as well as dominant diversity selections on DivGenBench (e.g., 37.3% win rate in Style, 35.2% in Identity).

7. Implications and Context

D²-Align demonstrates that PMC, induced via biased reward maximization in RLHF for diffusion, can be substantially mitigated by explicit bias-correcting interventions in the reward embedding space. The approach is decoupled from generator and reward model architectures, instead requiring only an additional directional alignment vector and associated two-stage optimization. This suggests D²-Align is adaptable to future reward models exhibiting different or evolving biases. A plausible implication is broader applicability to other generative domains utilizing embedding-based reward signals, where mode collapse remains a practical limitation (Chen et al., 30 Dec 2025).


References

Chen et al. (2025). Taming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning.
