Directional Decoupling Alignment (D²-Align)
- The paper introduces D²-Align, a novel framework that controls Preference Mode Collapse by learning a directional correction vector to adjust the reward signal.
- D²-Align decouples the generator’s behavior from intrinsic reward model biases through a two-stage training process, ensuring both human preference alignment and output diversity.
- Empirical results on DivGenBench and human evaluations demonstrate that D²-Align outperforms existing RL baselines in both alignment metrics and diversity scores.
Directional Decoupling Alignment (D²-Align, hereafter D-Align) is a framework for controlling Preference Mode Collapse (PMC) in text-to-image (T2I) diffusion reinforcement learning (RL). PMC arises from the over-optimization of reward models with intrinsic biases, causing models to produce a narrow set of high-reward but low-diversity outputs. D-Align addresses this by learning a continuous, prompt-embedding–space correction vector applied directionally to the reward signal, decoupling the generator’s behavior from biases in the reward model and preserving both alignment to human preferences and diversity in generated samples (Chen et al., 30 Dec 2025).
1. Preference Mode Collapse and Its Quantification
Preference Mode Collapse (PMC) is a particular manifestation of reward hacking in which RL-fine-tuned T2I diffusion models converge on narrow, reward-favored output modes, such as a single highly stylized “over-exposed” image style, at the expense of diversity. This phenomenon is driven by inherent “favorite” modes in the reward model; naive maximization overfits to these biases, leading to catastrophic loss in generative spread.
Quantification of PMC is provided by DivGenBench, a benchmark of 3,200 prompts that probe four orthogonal diversity axes:
- Identity (ID): Age, ethnicity, gender, facial features, sourced from CelebA.
- Artistic Style (Style): Referenced from painting styles in WikiArt.
- Layout: Object count and spatial arrangement, using COCO-style metadata.
- Tonal Properties (Tonal): Saturation, brightness, and contrast levels.
For each dimension, bespoke metrics are defined:
| Dimension | Metric | Formula/Process | Direction |
|---|---|---|---|
| Identity | Identity Divergence Score (IDS) | Crowding of ArcFace face embeddings in identity space | Lower is better |
| Style | Artistic Style Coverage (ASC) | Style retrieval against WikiArt reference styles | Higher is better |
| Layout | Spatial Dispersion Index (SDI) | Average of 1 minus pairwise box-layout similarity (Grounding DINO detections) | Higher is better |
| Tonal | Photographic Variance Score (PVS) | Sum of variances of saturation, brightness, and contrast | Higher is better |
Here, IDS employs face embeddings (ArcFace) to quantify crowding in the identity space, ASC uses a style retrieval process against WikiArt, SDI leverages Grounding DINO for object layout, and PVS is a sum of variances in basic tone statistics.
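As a concrete, deliberately simplified illustration of the tonal axis, the sketch below computes a PVS-like score over a batch of generated images. The specific statistics, the luma weights, and the absence of any normalization are assumptions for illustration, not the DivGenBench reference implementation.

```python
import numpy as np

def tonal_stats(img_rgb: np.ndarray) -> np.ndarray:
    """Per-image tonal statistics: saturation, brightness, contrast.

    img_rgb: float array in [0, 1] with shape (H, W, 3).
    """
    mx = img_rgb.max(axis=-1)
    mn = img_rgb.min(axis=-1)
    saturation = ((mx - mn) / (mx + 1e-8)).mean()
    luminance = img_rgb @ np.array([0.299, 0.587, 0.114])  # simple luma proxy
    brightness = luminance.mean()
    contrast = luminance.std()
    return np.array([saturation, brightness, contrast])

def pvs_like_score(images: list) -> float:
    """Sum of across-image variances of the tonal statistics.

    Higher values indicate more tonal diversity within the batch.
    """
    stats = np.stack([tonal_stats(im) for im in images])  # shape (N, 3)
    return float(stats.var(axis=0).sum())

# Example: a batch of random "images" yields a nonzero tonal variance.
batch = [np.random.rand(64, 64, 3) for _ in range(8)]
print(pvs_like_score(batch))
```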
2. Mathematical Framework of D-Align
The framework consists of two main stages, formalized as follows.
Let $G_\theta$ denote the T2I generator (diffusion model) and
$R(x, c) = \mathrm{score}\big(\Phi_{\text{img}}(x),\, \Phi_{\text{text}}(c)\big)$
be the reward model, where $\Phi_{\text{img}}$, $\Phi_{\text{text}}$ are frozen encoders and $\mathrm{score}(\cdot,\cdot)$ is cosine similarity between image and prompt embeddings.
One-step denoising produces the clean image $\hat{x}_0$ by sampling $\varepsilon_{\text{gt}} \sim \mathcal{N}(0, I)$, forming $x_t = \alpha_t x_0 + \sigma_t \varepsilon_{\text{gt}}$, predicting noise $\varepsilon_\theta(x_t, t)$, and using $\hat{x}_0 = \big(x_t - \sigma_t\, \varepsilon_\theta(x_t, t)\big) / \alpha_t$.
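A minimal PyTorch sketch of this one-step denoising, where `eps_model` and the schedule coefficients `alpha_t`, `sigma_t` are placeholders standing in for the diffusion model's noise predictor and noise schedule:

```python
import torch

def one_step_denoise(x0, eps_model, alpha_t, sigma_t, t):
    """Noise a clean image and recover an estimate x0_hat in a single step.

    x0:        clean image batch, shape (B, C, H, W)
    eps_model: callable (x_t, t) -> predicted noise, same shape as x_t
    alpha_t, sigma_t: scalars from the noise schedule at timestep t
    """
    eps_gt = torch.randn_like(x0)                  # ε_gt ~ N(0, I)
    x_t = alpha_t * x0 + sigma_t * eps_gt          # forward noising
    eps_pred = eps_model(x_t, t)                   # ε_θ(x_t, t)
    x0_hat = (x_t - sigma_t * eps_pred) / alpha_t  # invert the noising step
    return x0_hat
```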
2.1 Stage 1: Learning the Directional Correction Vector
- Introduce a learnable correction vector $b_v$ in the prompt-embedding space; freeze the generator $G_\theta$.
- For each prompt $c$, define prompt embeddings $e_{\pm} = \mathrm{normalize}\big(\Phi_{\text{text}}(c) \pm b_v\big)$.
- Construct the guided embedding with classifier-free–style extrapolation and scale $\omega$: $\hat{e}_{\text{text}} = e_{-} + \omega\,(e_{+} - e_{-})$.
- Compute the guided reward: $R_{\text{guided}} = \mathrm{score}\big(\Phi_{\text{img}}(\hat{x}_0),\, \hat{e}_{\text{text}}\big)$.
- Train $b_v$ to maximize the expected guided reward: $b_v^{*} = \arg\max_{b_v}\, \mathbb{E}_{c \sim \mathcal{C}}\big[R_{\text{guided}}\big]$.
- Optimize for $T_1$ steps, resulting in $b_v^{*}$ (a minimal PyTorch sketch follows this list).
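The Stage-1 loop can be sketched in PyTorch as below. Here `generate`, `phi_img`, and `phi_text` are assumed frozen callables (the one-step-denoised generator and the reward model's image/text encoders), and the embedding dimension, guidance scale, learning rate, and initialization are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def guided_reward(e_img, e_text, b_v, omega):
    """Directionally corrected reward: cosine similarity against a
    classifier-free-style extrapolation of the prompt embedding."""
    e_plus = F.normalize(e_text + b_v, dim=-1)
    e_minus = F.normalize(e_text - b_v, dim=-1)
    e_hat = e_minus + omega * (e_plus - e_minus)
    return F.cosine_similarity(e_img, e_hat, dim=-1)

def learn_correction_vector(prompts, generate, phi_img, phi_text,
                            dim=768, omega=2.0, steps=2000, lr=1e-3):
    """Stage 1 sketch: optimize b_v with the generator and encoders frozen."""
    b_v = (0.01 * torch.randn(dim)).requires_grad_(True)  # random init
    opt = torch.optim.Adam([b_v], lr=lr)
    for step in range(steps):
        c = prompts[step % len(prompts)]
        with torch.no_grad():
            x0_hat = generate(c)          # frozen G_θ + one-step denoise
            e_img = phi_img(x0_hat)       # frozen image encoder
            e_text = phi_text(c)          # frozen text encoder
        loss = -guided_reward(e_img, e_text, b_v, omega).mean()
        opt.zero_grad()
        loss.backward()                   # gradient reaches only b_v
        opt.step()
    return b_v.detach()
```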
2.2 Stage 2: Guided Generator Alignment
- Freeze $b_v^{*}$ and update the generator $G_\theta$ via RL with the guided reward.
- The net reward function becomes $R_{\text{guided}}(x_0, c) = \mathrm{score}\big(\Phi_{\text{img}}(x_0),\, \hat{e}_{\text{text}}\big)$,
where $\hat{e}_{\text{text}} = e_{-} + \omega\,(e_{+} - e_{-})$ and $e_{\pm} = \mathrm{normalize}\big(\Phi_{\text{text}}(c) \pm b_v^{*}\big)$. This shapes the reward directionally.
No regularization term is used other than normalization of embedding vectors and alternation of freeze states.
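Continuing the Stage-1 sketch above (and reusing its `guided_reward` helper), Stage 2 can be pictured as the mirror-image loop below: the correction vector is detached and held fixed while gradients flow only into the generator, with no auxiliary regularizer beyond the embedding normalization already inside the reward. The direct reward backpropagation shown here is a simplification of the paper's actual RL update.

```python
def align_generator(prompts, generator, phi_img, phi_text, b_v_star,
                    omega=2.0, steps=2000, lr=1e-5):
    """Stage 2 sketch: finetune the generator under the fixed guided reward.

    `generator(c)` is assumed to return a differentiable one-step-denoised
    image x̂₀ for prompt c; `phi_img`/`phi_text` are the frozen encoders.
    """
    b_v_star = b_v_star.detach()                  # correction vector frozen
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for step in range(steps):
        c = prompts[step % len(prompts)]
        x0_hat = generator(c)                     # gradients flow through θ
        e_img = phi_img(x0_hat)                   # frozen encoder; grads pass to x̂₀
        with torch.no_grad():
            e_text = phi_text(c)
        loss = -guided_reward(e_img, e_text, b_v_star, omega).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator
```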
3. Algorithmic Workflow
The two-stage process is concisely formalized in the following workflow:
```
Initialize b_v ← random
for t in 1…T₁:
    c ← sample(𝒞)
    x₀ ← G_θ(c)
    ε_gt ∼ N(0, I);  x_t = α_t x₀ + σ_t ε_gt
    ε_pred = ε_θ(x_t, t);  x̂₀ via one-step denoise
    e_img = Φ_img(x̂₀);  e_text = Φ_text(c)
    e_± = normalize(e_text ± b_v)
    ê_text = e_- + ω*(e_+ − e_-)
    R_guided = score(e_img, ê_text)
    update b_v ← b_v − η ∇_{b_v}[−R_guided]
b_v^* ← b_v

for t in 1…T₂:
    c ← sample(𝒞)
    x₀ ← G_θ(c)
    ε_gt ∼ N(0, I);  x_t = α_t x₀ + σ_t ε_gt
    ε_pred = ε_θ(x_t, t);  x̂₀ via one-step denoise
    e_img = Φ_img(x̂₀);  e_text = Φ_text(c)
    e_± = normalize(e_text ± b_v^*)
    ê_text = e_- + ω*(e_+ − e_-)
    R_guided = score(e_img, ê_text)
    update θ ← θ − η ∇_θ[−R_guided]
```
This two-stage alternation—learning the correction direction on a frozen generator, then applying it while training the generator—distinguishes D-Align from previous approaches.
4. Empirical Evaluation
Alignment and diversity were quantified using both standard and newly proposed metrics. D-Align consistently matched or exceeded the FLUX base model and the RL baselines (DanceGRPO, Flow-GRPO, and SRPO) in both reward alignment and diversity on DivGenBench.
4.1 Automated Reward Scores
- Under the HPS-v2.1 reward, D-Align achieved or tied for the best scores across the reported automated metrics.
- Under HPS-v2.1+CLIP, D-Align was best on all metrics, including Aesthetic (6.671), ImageReward (1.762), PickScore (0.246), Q-Align (4.970), CLIP Score (0.328), DeQA (4.498), GenEval (0.660).
4.2 Diversity Results (DivGenBench)
- Identity Divergence Score (IDS): D-Align 0.251 (HPS-v2.1), 0.237 (HPS-v2.1+CLIP) — lowest/best in both cases.
- Artistic Style Coverage (ASC): 0.253 (HPS-v2.1), 0.247 (HPS-v2.1+CLIP) — highest/best.
- Spatial Dispersion Index (SDI): 0.636 and 0.631.
- Photographic Variance Score (PVS): 0.412 and 0.418.
Compared to prior baselines, D-Align achieved uniformly better diversity and did not trade off preference alignment to obtain it.
4.3 Ablations and Human Evaluation
- Convergence of the correction vector occurred within approximately 2,000 steps in Stage 1.
- An optimal setting of the guidance scale $\omega$ was identified in the ablations.
- A continuous, learned correction vector $b_v$ outperformed discrete token-based alternatives.
- Incorporating the learned correction vector as a plug-in to DanceGRPO improved both alignment and diversity metrics (see the sketch after this list).
- Human preference studies revealed D-Align was selected in ~48.2% of overall HPDv2 cases and was preferred on every DivGenBench diversity axis (Identity, Style, Layout, Tonal).
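The DanceGRPO plug-in ablation can be pictured as swapping $R_{\text{guided}}$ into a GRPO-style group-normalized advantage. The snippet below is a generic sketch of that normalization, not DanceGRPO's actual implementation.

```python
import torch

def group_normalized_advantages(guided_rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages for a group of samples drawn from the same
    prompt: center and scale the guided rewards within the group."""
    mean = guided_rewards.mean()
    std = guided_rewards.std()
    return (guided_rewards - mean) / (std + eps)

# Example: guided rewards for 8 samples of one prompt.
rewards = torch.tensor([0.31, 0.28, 0.35, 0.30, 0.27, 0.33, 0.29, 0.32])
print(group_normalized_advantages(rewards))
```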
5. Mechanistic Insights and Applicability
D-Align operates by shifting the direction of the reward gradient in prompt embedding space, rather than scaling its magnitude. This directional shaping distinguishes it from conventional penalty or regularization schemes and directly decouples the generator's optimization trajectory from reward-model–favored modes that drive PMC.
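A toy illustration of this point, under assumed random embeddings: the guided embedding rotates the reward gradient seen by the image embedding rather than merely rescaling it, so the two gradients have cosine similarity below 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
e_img = torch.randn(768, requires_grad=True)    # stand-in image embedding
e_text = F.normalize(torch.randn(768), dim=-1)  # stand-in prompt embedding
b_v = 0.1 * torch.randn(768)                    # stand-in correction vector
omega = 2.0

def reward_grad(target):
    """Gradient of the cosine reward w.r.t. the image embedding."""
    if e_img.grad is not None:
        e_img.grad = None
    F.cosine_similarity(e_img, target, dim=-1).backward()
    return e_img.grad.clone()

e_plus = F.normalize(e_text + b_v, dim=-1)
e_minus = F.normalize(e_text - b_v, dim=-1)
e_hat = e_minus + omega * (e_plus - e_minus)          # guided embedding

g_plain = reward_grad(e_text)                         # gradient toward raw prompt
g_guided = reward_grad(e_hat)                         # gradient toward guided prompt
print(F.cosine_similarity(g_plain, g_guided, dim=0))  # < 1: the direction shifts
```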
Applying D-Align to other diffusion RL tasks involves the same general workflow:
- Freeze the generative policy and learn a prompt-embedding correction vector on ground-truth or human-labeled data.
- Freeze the correction vector and RL-finetune the generator under the guided reward.
- Evaluate outputs for both alignment (automated/human) and diversity (DivGenBench or analogous metrics); a minimal evaluation sketch follows this list.
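For the last step, a minimal evaluation harness might look like the sketch below; `alignment_score` and `diversity_score` are hypothetical callables standing in for a preference reward model and a DivGenBench-style diversity metric, respectively.

```python
def evaluate(prompts, generator, alignment_score, diversity_score, n_samples=8):
    """Report both axes: mean per-image alignment and mean per-prompt diversity.

    alignment_score(image, prompt) -> float   (e.g. an HPS/CLIP-style reward)
    diversity_score(list_of_images) -> float  (e.g. a DivGenBench-style metric)
    """
    align, div = [], []
    for c in prompts:
        images = [generator(c) for _ in range(n_samples)]
        align.extend(alignment_score(img, c) for img in images)
        div.append(diversity_score(images))
    return {
        "alignment": sum(align) / len(align),
        "diversity": sum(div) / len(div),
    }
```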
A plausible implication is that directional decoupling constitutes a general countermeasure against reward-model “mode biases” across a range of alignment domains, not limited to text-to-image diffusion RL.
6. Significance and Perspectives
Directional Decoupling Alignment provides an operational methodology for preserving the diversity of generative models while maintaining high human preference scores, explicitly breaking the quality–diversity trade-off that commonly afflicts RL from human feedback in diffusion models. By leveraging a learned, continuous correction vector in embedding space, D-Align remains free of hand-designed regularizers or constraints.
Overall, the approach demonstrates that shaping the reward in direction (not just in value) is a viable and practical strategy for addressing preference-induced collapse (Chen et al., 30 Dec 2025).