Directional Decoupling Alignment (D²-Align)
- The paper introduces D²-Align, a novel framework that controls Preference Mode Collapse by learning a directional correction vector to adjust the reward signal.
- D²-Align decouples the generator’s behavior from intrinsic reward model biases through a two-stage training process, ensuring both human preference alignment and output diversity.
- Empirical results on DivGenBench and human evaluations demonstrate that D²-Align outperforms existing RL baselines in both alignment metrics and diversity scores.
Directional Decoupling Alignment (D²-Align, hereafter D-Align) is a framework for controlling Preference Mode Collapse (PMC) in text-to-image (T2I) diffusion reinforcement learning (RL). PMC arises from the over-optimization of reward models with intrinsic biases, causing models to produce a narrow set of high-reward but low-diversity outputs. D-Align addresses this by learning a continuous, prompt-embedding–space correction vector applied directionally to the reward signal, decoupling the generator’s behavior from biases in the reward model and preserving both alignment to human preferences and diversity in generated samples (Chen et al., 30 Dec 2025).
1. Preference Mode Collapse and Its Quantification
Preference Mode Collapse (PMC) is a particular manifestation of reward hacking in which RL-fine-tuned T2I diffusion models converge on narrow, reward-favored output modes, such as a single highly stylized “over-exposed” image style, at the expense of diversity. This phenomenon is driven by inherent “favorite” modes in the reward model; naive maximization overfits to these biases, leading to catastrophic loss in generative spread.
Quantification of PMC is provided by DivGenBench, a benchmark of 3,200 prompts that probe four orthogonal diversity axes:
- Identity (ID): Age, ethnicity, gender, facial features, sourced from CelebA.
- Artistic Style (Style): Referenced from painting styles in WikiArt.
- Layout: Object count and spatial arrangement, using COCO-style metadata.
- Tonal Properties (Tonal): Saturation, brightness, and contrast levels.
For each dimension, bespoke metrics are defined:
| Dimension | Metric | Formula/Process | Direction |
|---|---|---|---|
| Identity | Identity Divergence Score (IDS) | Crowding of ArcFace face embeddings in identity space | Lower is better |
| Style | Artistic Style Coverage (ASC) | Style retrieval against WikiArt reference styles | Higher is better |
| Layout | Spatial Dispersion Index (SDI) | Average of 1 minus pairwise box-layout similarity (Grounding DINO detections) | Higher is better |
| Tonal | Photographic Variance Score (PVS) | Sum of variances of saturation, brightness, and contrast | Higher is better |
Here, IDS employs face embeddings (ArcFace) to quantify crowding in the identity space, ASC uses a style retrieval process against WikiArt, SDI leverages Grounding DINO for object layout, and PVS is a sum of variances in basic tone statistics.
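As a concrete, deliberately simplified illustration of the tonal axis, the sketch below computes a PVS-like score over a batch of generated images. The specific statistics, the luma weights, and the absence of any normalization are assumptions for illustration, not the DivGenBench reference implementation.

```python
import numpy as np

def tonal_stats(img_rgb: np.ndarray) -> np.ndarray:
    """Per-image tonal statistics: saturation, brightness, contrast.

    img_rgb: float array in [0, 1] with shape (H, W, 3).
    """
    mx = img_rgb.max(axis=-1)
    mn = img_rgb.min(axis=-1)
    saturation = ((mx - mn) / (mx + 1e-8)).mean()
    luminance = img_rgb @ np.array([0.299, 0.587, 0.114])  # simple luma proxy
    brightness = luminance.mean()
    contrast = luminance.std()
    return np.array([saturation, brightness, contrast])

def pvs_like_score(images: list) -> float:
    """Sum of across-image variances of the tonal statistics.

    Higher values indicate more tonal diversity within the batch.
    """
    stats = np.stack([tonal_stats(im) for im in images])  # shape (N, 3)
    return float(stats.var(axis=0).sum())

# Example: a batch of random "images" yields a nonzero tonal variance.
batch = [np.random.rand(64, 64, 3) for _ in range(8)]
print(pvs_like_score(batch))
```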
2. Mathematical Framework of D-Align
The framework consists of two main stages, formalized as follows.
Let $G_\theta$ denote the T2I generator (diffusion model) and
$R(x, c) = \mathrm{score}\big(\Phi_{\text{img}}(x),\, \Phi_{\text{text}}(c)\big)$
be the reward model, where $\Phi_{\text{img}}$, $\Phi_{\text{text}}$ are frozen encoders and $\mathrm{score}(\cdot,\cdot)$ is cosine similarity between image and prompt embeddings.
One-step denoising produces the clean image $\hat{x}_0$ by sampling $\varepsilon_{\text{gt}} \sim \mathcal{N}(0, I)$, forming $x_t = \alpha_t x_0 + \sigma_t \varepsilon_{\text{gt}}$, predicting noise $\varepsilon_\theta(x_t, t)$, and using $\hat{x}_0 = \big(x_t - \sigma_t\, \varepsilon_\theta(x_t, t)\big) / \alpha_t$.
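A minimal PyTorch sketch of this one-step denoising, where `eps_model` and the schedule coefficients `alpha_t`, `sigma_t` are placeholders standing in for the diffusion model's noise predictor and noise schedule:

```python
import torch

def one_step_denoise(x0, eps_model, alpha_t, sigma_t, t):
    """Noise a clean image and recover an estimate x0_hat in a single step.

    x0:        clean image batch, shape (B, C, H, W)
    eps_model: callable (x_t, t) -> predicted noise, same shape as x_t
    alpha_t, sigma_t: scalars from the noise schedule at timestep t
    """
    eps_gt = torch.randn_like(x0)                  # ε_gt ~ N(0, I)
    x_t = alpha_t * x0 + sigma_t * eps_gt          # forward noising
    eps_pred = eps_model(x_t, t)                   # ε_θ(x_t, t)
    x0_hat = (x_t - sigma_t * eps_pred) / alpha_t  # invert the noising step
    return x0_hat
```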
2.1 Stage 1: Learning the Directional Correction Vector
- Introduce a learnable correction vector $b_v$ in the prompt-embedding space; freeze the generator $G_\theta$.
- For each prompt $c$, define prompt embeddings $e_{\pm} = \mathrm{normalize}\big(\Phi_{\text{text}}(c) \pm b_v\big)$.
- Construct the guided embedding with classifier-free–style extrapolation and scale $\omega$: $\hat{e}_{\text{text}} = e_{-} + \omega\,(e_{+} - e_{-})$.
- Compute the guided reward: $R_{\text{guided}} = \mathrm{score}\big(\Phi_{\text{img}}(\hat{x}_0),\, \hat{e}_{\text{text}}\big)$.
- Train $b_v$ to maximize the expected guided reward: $b_v^{*} = \arg\max_{b_v}\, \mathbb{E}_{c \sim \mathcal{C}}\big[R_{\text{guided}}\big]$.
- Optimize for $T_1$ steps, resulting in $b_v^{*}$ (a minimal PyTorch sketch follows this list).
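The Stage-1 loop can be sketched in PyTorch as below. Here `generate`, `phi_img`, and `phi_text` are assumed frozen callables (the one-step-denoised generator and the reward model's image/text encoders), and the embedding dimension, guidance scale, learning rate, and initialization are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def guided_reward(e_img, e_text, b_v, omega):
    """Directionally corrected reward: cosine similarity against a
    classifier-free-style extrapolation of the prompt embedding."""
    e_plus = F.normalize(e_text + b_v, dim=-1)
    e_minus = F.normalize(e_text - b_v, dim=-1)
    e_hat = e_minus + omega * (e_plus - e_minus)
    return F.cosine_similarity(e_img, e_hat, dim=-1)

def learn_correction_vector(prompts, generate, phi_img, phi_text,
                            dim=768, omega=2.0, steps=2000, lr=1e-3):
    """Stage 1 sketch: optimize b_v with the generator and encoders frozen."""
    b_v = (0.01 * torch.randn(dim)).requires_grad_(True)  # random init
    opt = torch.optim.Adam([b_v], lr=lr)
    for step in range(steps):
        c = prompts[step % len(prompts)]
        with torch.no_grad():
            x0_hat = generate(c)          # frozen G_θ + one-step denoise
            e_img = phi_img(x0_hat)       # frozen image encoder
            e_text = phi_text(c)          # frozen text encoder
        loss = -guided_reward(e_img, e_text, b_v, omega).mean()
        opt.zero_grad()
        loss.backward()                   # gradient reaches only b_v
        opt.step()
    return b_v.detach()
```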
2.2 Stage 2: Guided Generator Alignment
- Freeze $b_v^{*}$ and update the generator $G_\theta$ via RL with the guided reward.
- The net reward function becomes $R_{\text{guided}}(x_0, c) = \mathrm{score}\big(\Phi_{\text{img}}(x_0),\, \hat{e}_{\text{text}}\big)$,
where $\hat{e}_{\text{text}} = e_{-} + \omega\,(e_{+} - e_{-})$ and $e_{\pm} = \mathrm{normalize}\big(\Phi_{\text{text}}(c) \pm b_v^{*}\big)$. This shapes the reward directionally.
No regularization term is used other than normalization of embedding vectors and alternation of freeze states.
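Continuing the Stage-1 sketch above (and reusing its `guided_reward` helper), Stage 2 can be pictured as the mirror-image loop below: the correction vector is detached and held fixed while gradients flow only into the generator, with no auxiliary regularizer beyond the embedding normalization already inside the reward. The direct reward backpropagation shown here is a simplification of the paper's actual RL update.

```python
def align_generator(prompts, generator, phi_img, phi_text, b_v_star,
                    omega=2.0, steps=2000, lr=1e-5):
    """Stage 2 sketch: finetune the generator under the fixed guided reward.

    `generator(c)` is assumed to return a differentiable one-step-denoised
    image x̂₀ for prompt c; `phi_img`/`phi_text` are the frozen encoders.
    """
    b_v_star = b_v_star.detach()                  # correction vector frozen
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for step in range(steps):
        c = prompts[step % len(prompts)]
        x0_hat = generator(c)                     # gradients flow through θ
        e_img = phi_img(x0_hat)                   # frozen encoder; grads pass to x̂₀
        with torch.no_grad():
            e_text = phi_text(c)
        loss = -guided_reward(e_img, e_text, b_v_star, omega).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator
```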
3. Algorithmic Workflow
The two-stage process is concisely formalized in the following workflow:
```
Initialize b_v ← random
for t in 1…T₁:
    c ← sample(𝒞)
    x₀ ← G_θ(c)
    ε_gt ∼ N(0, I);  x_t = α_t x₀ + σ_t ε_gt
    ε_pred = ε_θ(x_t, t);  x̂₀ via one-step denoise
    e_img = Φ_img(x̂₀);  e_text = Φ_text(c)
    e_± = normalize(e_text ± b_v)
    ê_text = e_- + ω*(e_+ − e_-)
    R_guided = score(e_img, ê_text)
    update b_v ← b_v − η ∇_{b_v}[−R_guided]
b_v^* ← b_v

for t in 1…T₂:
    c ← sample(𝒞)
    x₀ ← G_θ(c)
    ε_gt ∼ N(0, I);  x_t = α_t x₀ + σ_t ε_gt
    ε_pred = ε_θ(x_t, t);  x̂₀ via one-step denoise
    e_img = Φ_img(x̂₀);  e_text = Φ_text(c)
    e_± = normalize(e_text ± b_v^*)
    ê_text = e_- + ω*(e_+ − e_-)
    R_guided = score(e_img, ê_text)
    update θ ← θ − η ∇_θ[−R_guided]
```
This two-stage alternation—learning the correction direction on a frozen generator, then applying it while training the generator—distinguishes D-Align from previous approaches.
4. Empirical Evaluation
Alignment and diversity were quantified using both standard and newly proposed metrics. D-Align consistently matched or exceeded the FLUX base model and the RL baselines (DanceGRPO, Flow-GRPO, and SRPO) in both reward alignment and diversity on DivGenBench.
4.1 Automated Reward Scores
- Under the HPS-v2.1 reward, D-Align achieved or tied for the best scores across the reported automated metrics.
- Under HPS-v2.1+CLIP, D-Align was best on all metrics, including Aesthetic (6.671), ImageReward (1.762), PickScore (0.246), Q-Align (4.970), CLIP Score (0.328), DeQA (4.498), GenEval (0.660).
4.2 Diversity Results (DivGenBench)
- Identity Divergence Score (IDS): D-Align 0.251 (HPS-v2.1), 0.237 (HPS-v2.1+CLIP) — lowest/best in both cases.
- Artistic Style Coverage (ASC): 0.253 (HPS-v2.1), 0.247 (HPS-v2.1+CLIP) — highest/best.
- Spatial Dispersion Index (SDI): 0.636 and 0.631.
- Photographic Variance Score (PVS): 0.412 and 0.418.
Compared to prior baselines, D-Align achieved uniformly better diversity and did not trade off preference alignment to obtain it.
4.3 Ablations and Human Evaluation
- Convergence of the correction vector occurred within approximately 2,000 steps in Stage 1.
- An optimal setting of the guidance scale $\omega$ was identified in the ablations.
- A continuous, learned correction vector $b_v$ outperformed discrete token-based alternatives.
- Incorporating the learned correction vector as a plug-in to DanceGRPO improved both alignment and diversity metrics (see the sketch after this list).
- Human preference studies revealed D-Align was selected in ~48.2% of overall HPDv2 cases and was preferred on every DivGenBench diversity axis (Identity, Style, Layout, Tonal).
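The DanceGRPO plug-in ablation can be pictured as swapping $R_{\text{guided}}$ into a GRPO-style group-normalized advantage. The snippet below is a generic sketch of that normalization, not DanceGRPO's actual implementation.

```python
import torch

def group_normalized_advantages(guided_rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages for a group of samples drawn from the same
    prompt: center and scale the guided rewards within the group."""
    mean = guided_rewards.mean()
    std = guided_rewards.std()
    return (guided_rewards - mean) / (std + eps)

# Example: guided rewards for 8 samples of one prompt.
rewards = torch.tensor([0.31, 0.28, 0.35, 0.30, 0.27, 0.33, 0.29, 0.32])
print(group_normalized_advantages(rewards))
```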
5. Mechanistic Insights and Applicability
D-Align operates by shifting the direction of the reward gradient in prompt embedding space, rather than scaling its magnitude. This directional shaping distinguishes it from conventional penalty or regularization schemes and directly decouples the generator's optimization trajectory from reward-model–favored modes that drive PMC.
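A toy illustration of this point, under assumed random embeddings: the guided embedding rotates the reward gradient seen by the image embedding rather than merely rescaling it, so the two gradients have cosine similarity below 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
e_img = torch.randn(768, requires_grad=True)    # stand-in image embedding
e_text = F.normalize(torch.randn(768), dim=-1)  # stand-in prompt embedding
b_v = 0.1 * torch.randn(768)                    # stand-in correction vector
omega = 2.0

def reward_grad(target):
    """Gradient of the cosine reward w.r.t. the image embedding."""
    if e_img.grad is not None:
        e_img.grad = None
    F.cosine_similarity(e_img, target, dim=-1).backward()
    return e_img.grad.clone()

e_plus = F.normalize(e_text + b_v, dim=-1)
e_minus = F.normalize(e_text - b_v, dim=-1)
e_hat = e_minus + omega * (e_plus - e_minus)          # guided embedding

g_plain = reward_grad(e_text)                         # gradient toward raw prompt
g_guided = reward_grad(e_hat)                         # gradient toward guided prompt
print(F.cosine_similarity(g_plain, g_guided, dim=0))  # < 1: the direction shifts
```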
Applying D-Align to other diffusion RL tasks involves the same general workflow:
- Freeze the generative policy and learn a prompt-embedding correction vector on ground-truth or human-labeled data.
- Freeze the correction vector and RL-finetune the generator under the guided reward.
- Evaluate outputs for both alignment (automated/human) and diversity (DivGenBench or analogous metrics); a minimal evaluation sketch follows this list.
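For the last step, a minimal evaluation harness might look like the sketch below; `alignment_score` and `diversity_score` are hypothetical callables standing in for a preference reward model and a DivGenBench-style diversity metric, respectively.

```python
def evaluate(prompts, generator, alignment_score, diversity_score, n_samples=8):
    """Report both axes: mean per-image alignment and mean per-prompt diversity.

    alignment_score(image, prompt) -> float   (e.g. an HPS/CLIP-style reward)
    diversity_score(list_of_images) -> float  (e.g. a DivGenBench-style metric)
    """
    align, div = [], []
    for c in prompts:
        images = [generator(c) for _ in range(n_samples)]
        align.extend(alignment_score(img, c) for img in images)
        div.append(diversity_score(images))
    return {
        "alignment": sum(align) / len(align),
        "diversity": sum(div) / len(div),
    }
```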
A plausible implication is that directional decoupling constitutes a general countermeasure against reward-model “mode biases” across a range of alignment domains, not limited to text-to-image diffusion RL.
6. Significance and Perspectives
Directional Decoupling Alignment provides an operational methodology for preserving the diversity of generative models while maintaining high human preference scores, explicitly breaking the quality–diversity trade-off that commonly afflicts RL from human feedback in diffusion models. By leveraging a learned, continuous correction vector in embedding space, D-Align remains free of hand-designed regularizers or constraints.
Overall, the approach demonstrates that shaping the reward in direction (not just in value) is a viable and practical strategy for addressing preference-induced collapse (Chen et al., 30 Dec 2025).