Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

Published 26 May 2026 in cs.LG and cs.CV | (2605.26491v1)

Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.


Authors (4)

Summary

The paper introduces Diffusion LAIR, a listwise reward-aware framework that leverages full reward-labeled candidate sets to optimize model alignment beyond pairwise comparisons.
It employs a temperature-controlled softmax to convert reward scores into centered advantage weights, so that the learning signal is spread across all candidates in a group while training remains stable.
Empirical evaluations show that LAIR outperforms state-of-the-art methods in text-to-image, compositionality, and image editing tasks while reducing GPU usage by up to 5x.

Listwise Reward-Aware Alignment for Diffusion Models: An Authoritative Analysis

Motivation and Context

Preference optimization is a widely adopted post-training methodology for aligning text-to-image diffusion models with human preferences in domains such as aesthetics, semantic fidelity, and instruction-following. While large-scale pretraining achieves remarkable sample quality, it does not guarantee the model follows nuanced human intent. Traditional offline preference optimization methods such as DPO and DSPO rely on pairwise preference data (binary comparison between two images given a prompt), thereby discarding richer supervision available in practical datasets that feature multiple candidate images and continuous reward scores per prompt.

This paper introduces Diffusion LAIR, a listwise, reward-aware optimization framework that leverages the full structure of reward-labeled candidate sets for each prompt. LAIR generalizes preference alignment from pairwise to listwise supervision, explicitly exploiting reward magnitudes and rankings to refine learning signals. This approach addresses the information loss in pairwise reduction and aims to provide both theoretical and empirical improvements in diffusion model alignment.

Methodology

The core innovation in LAIR lies in its objective formulation. Rather than reducing supervision to isolated winner-loser pairs, LAIR converts reward scores across a candidate set into centered advantage weights using a temperature-controlled softmax. These weights indicate the relative quality of each candidate, are zero-sum within each group, and robustly distinguish high- and low-reward samples.
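As a rough illustration of the weighting step described above, the following sketch applies a softmax to temperature-scaled rewards and subtracts the uniform weight so that the result sums to zero. The exact centering rule is an assumption made here for illustration (the summary only states that the weights are centered and zero-sum), and the function name `centered_advantage_weights` and the reward values are hypothetical.

```python
import math


def centered_advantage_weights(rewards, tau=1.0):
    """Softmax over rewards / tau, minus the uniform weight 1/N.

    Assumption: centering subtracts 1/N from each softmax probability,
    which makes the weights sum to zero within the group.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    n = len(rewards)
    scaled = [r / tau for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total - 1.0 / n for e in exps]


# Hypothetical reward scores for four candidates of one prompt.
rewards = [0.9, 0.5, 0.1, 0.5]
weights = centered_advantage_weights(rewards, tau=0.5)
sharper = centered_advantage_weights(rewards, tau=0.1)
```

Under this assumed form, above-average candidates receive positive weight and below-average ones negative weight, and lowering `tau` concentrates the positive weight on the top-scoring candidate.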

The objective consists of an advantage-weighted regression on the implicit reward, defined as the denoising-loss improvement of the current model over a reference model, with a quadratic regularization term that conservatively penalizes deviation:

$\mathcal{L}_{\text{LAIR}}(\theta) = \mathbb{E}_{\text{prompt},\, t,\, \{\epsilon_i\}}\left[-\sum_{i=1}^{N_c} w_i\,s_\theta^{(i)} + \frac{\lambda}{N_c}\sum_{i=1}^{N_c}(s_\theta^{(i)})^2\right]$

where $w_i$ are the centered advantage weights, $s_\theta^{(i)}$ is the implicit reward for candidate $i$, $N_c$ is the number of candidates in the group, and $\lambda$ is the regularization strength. This formulation distributes the learning signal among all candidates, reflecting both their ordinal and cardinal reward structure, while the quadratic term discourages aggressive shifts away from the reference and the resulting loss of sample diversity.
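A minimal sketch of the bracketed term for a single candidate group may make the objective more concrete. It assumes the implicit reward is the reference model's mean squared denoising error minus the current model's, with no additional scale factor or timestep weighting; the paper may define it with extra constants. The helper names and toy vectors are hypothetical.

```python
def implicit_reward(noise, pred_model, pred_ref):
    """Reference denoising error minus current-model error.

    Positive values mean the current model denoises this candidate
    better than the reference does (assumed form, no scale factor).
    """
    err_model = sum((p - e) ** 2 for p, e in zip(pred_model, noise)) / len(noise)
    err_ref = sum((p - e) ** 2 for p, e in zip(pred_ref, noise)) / len(noise)
    return err_ref - err_model


def lair_group_loss(weights, implicit_rewards, lam):
    """Advantage-weighted linear term plus quadratic penalty for one group."""
    n_c = len(weights)
    linear = -sum(w * s for w, s in zip(weights, implicit_rewards))
    penalty = (lam / n_c) * sum(s * s for s in implicit_rewards)
    return linear + penalty


# Toy two-candidate group with hand-picked zero-sum weights.
noise = [1.0, 0.0]
s_good = implicit_reward(noise, pred_model=[1.0, 0.0], pred_ref=[0.0, 0.0])
s_same = implicit_reward(noise, pred_model=[0.0, 0.0], pred_ref=[0.0, 0.0])
loss = lair_group_loss([0.5, -0.5], [s_good, s_same], lam=1.0)
```

In an actual training loop the predictions would come from the current and frozen reference denoisers at a sampled timestep, and the loss would be averaged over prompts, timesteps, and noise draws.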

Theoretical analysis shows that the objective admits a closed-form optimum in implicit-reward space, proportional to the advantage weights and inversely scaled by the regularization strength. The range of optimal values is bounded, which gives insight (via a surrogate KL bound) into how the regularization parameter controls the effective shift of the learned distribution relative to the reference.
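The closed form can be sketched directly from the objective as written in this summary, under the simplifying assumption that the implicit rewards $s_i$ of one group are treated as free variables (in the actual model they are coupled through $\theta$). The constants below follow from that written form and may differ from the paper's own statement. For each candidate, the per-group objective contributes

$f(s_i) = -w_i\,s_i + \frac{\lambda}{N_c}\,s_i^2, \qquad \lambda > 0,$

which is strictly convex in $s_i$. Setting the derivative to zero gives

$f'(s_i) = -w_i + \frac{2\lambda}{N_c}\,s_i = 0 \quad\Longrightarrow\quad s_i^\star = \frac{N_c\,w_i}{2\lambda}.$

Substituting back, the minimum value of the group objective is

$\sum_{i=1}^{N_c} f(s_i^\star) = -\frac{N_c}{4\lambda}\sum_{i=1}^{N_c} w_i^2,$

and because $\sum_i w_i = 0$, the optimal implicit rewards also sum to zero. Their magnitude is bounded by $|s_i^\star| \le \frac{N_c}{2\lambda}\max_j |w_j|$, so doubling $\lambda$ halves the largest allowed implicit reward.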

LAIR does not require full denoising trajectories or online RL, enabling efficient, highly parallelizable training directly on offline groups of reward-scored images.

Empirical Results

Diffusion LAIR is evaluated by fine-tuning both SD1.5 and SDXL models on the Pick-a-Pic v2 dataset using vectorized listwise supervision. Evaluation spans three domains: text-to-image generation, compositional generation (GenEval), and instruction-based image editing (InstructPix2Pix). Metrics include PickScore, HPS v2, LAION Aesthetics, CLIP, and ImageReward.

The reported results favor LAIR across the evaluated settings:

T2I Alignment: LAIR consistently outperforms state-of-the-art baselines (Diffusion DPO, DSPO, Diffusion KTO, MaPO, InPO, CRAFT, SmPO) across reward metrics, including on reward models not used during training, which suggests that the gains are not tied to the specific reward model used in training.
GenEval Compositionality: LAIR achieves higher overall and category scores (color, count, object position, attribute, multi-object) than all baselines for both SD1.5 and SDXL, indicating improved compositional reasoning and visual adherence to the prompt.
Image Editing: InstructPix2Pix win rates against SDXL are highest for LAIR across most reward categories.
Efficiency: LAIR’s offline approach reduces required GPU hours by a factor of up to 5 compared to baselines such as Diffusion DPO and DSPO, enabling rapid, scalable fine-tuning on large models.

Ablation studies indicate that larger candidate groups ($N_c$) and lower softmax temperature ($\tau$) provide incremental benefits, and that LAIR is not unduly sensitive to these hyperparameters.

Theoretical Implications

The paper presents a rigorous theoretical analysis centered on the convexity and closed-form optimality of the LAIR objective. Key findings include:

Bounded Implicit Reward: Unlike pairwise objectives, LAIR’s finite closed-form solution prevents extreme probability mass redistribution, preserving ranking structure and mitigating noise sensitivity.
Surrogate KL Bound: Under standard log-ratio approximation assumptions, the regularization strength $\lambda$ offers direct control on the induced KL divergence between the learned and reference distributions, enhancing interpretability and stability.
Zero-Sum Reward Redistribution: LAIR’s objective does not attempt to raise preference across all samples indiscriminately, but instead reallocates probability mass within the candidate set according to their relative reward—focusing alignment on meaningful quality improvements.

These properties provide a principled justification for listwise, reward-aware preference optimization, marking a departure from the limitations of pairwise structures and heuristic pair selection.
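The bounded-optimum and zero-sum properties listed above can be checked numerically on a toy group. This sketch assumes the per-group objective exactly as written in this summary and treats the implicit rewards as free variables; the weights, the value of `lam`, and the function names are hypothetical.

```python
def group_objective(weights, s, lam):
    """Per-group objective: -sum(w_i * s_i) + (lam / N_c) * sum(s_i ** 2)."""
    n_c = len(weights)
    linear = -sum(w * x for w, x in zip(weights, s))
    return linear + (lam / n_c) * sum(x * x for x in s)


def closed_form_optimum(weights, lam):
    """Stationary point of the objective above: s_i = N_c * w_i / (2 * lam)."""
    n_c = len(weights)
    return [n_c * w / (2.0 * lam) for w in weights]


weights = [0.3, 0.1, -0.1, -0.3]  # hypothetical zero-sum advantage weights
lam = 2.0
s_star = closed_form_optimum(weights, lam)
best = group_objective(weights, s_star, lam)

# Nudging any single coordinate away from s_star should raise the objective.
perturbed = [
    group_objective(
        weights,
        [x + d if j == i else x for j, x in enumerate(s_star)],
        lam,
    )
    for i in range(len(weights))
    for d in (-0.05, 0.05)
]

# Doubling lam halves the largest optimal implicit reward.
max_small_lam = max(abs(x) for x in s_star)
max_large_lam = max(abs(x) for x in closed_form_optimum(weights, 2 * lam))
```

Every perturbed value exceeds `best`, the entries of `s_star` sum to zero, and the largest magnitude shrinks in proportion to `1 / lam`, which matches the role of $\lambda$ as the knob that limits how far the model moves from the reference.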

Practical Significance and Future Directions

Practically, Diffusion LAIR’s efficiency, empirical superiority, and robustness recommend it as a viable standard for diffusion model preference alignment. The framework is compatible with diverse scales (from SD1.5 to SDXL), reward models, and application domains (T2I, compositionality, image editing).

Theoretically, LAIR’s advantage-weighted, listwise structure aligns with modern understanding of ranking and reward in supervised preference learning, paving the way for integration with richer reward signals and multitask preference frameworks. Its robust distributional control suggests potential for further development in conservative alignment settings and safety-constrained generative modeling.

Future avenues include:

Extending listwise preference optimization to multimodal or multi-turn settings
Joint learning of reward models and alignment objectives in end-to-end fashion
Incorporating adversarial or oracle-based reward signals to reduce reliance on potentially misaligned reward models
Exploration of scalable training on web-scale datasets with massive candidate sets per prompt
Application to other diffusion architectures beyond text-to-image synthesis

Conclusion

Diffusion LAIR introduces a listwise, reward-aware objective for offline preference optimization in diffusion models, leveraging the full structure of candidate group rewards to distribute learning signal efficiently and conservatively. The reported empirical and theoretical results support its advantage over traditional pairwise methods in both quality and efficiency. The approach offers a robust, interpretable, and scalable way to align generative models with complex human preferences, and it provides a foundation for future work in preference learning and model alignment.

Reference: "Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models" (2605.26491).
