Dual Caption Preference Optimization for Diffusion Models (2502.06023v1)

Published 9 Feb 2025 in cs.CV

Abstract: Recent advancements in human preference optimization, originally developed for LLMs, have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a conflict distribution. Additionally, we identified that input prompts contain irrelevant information for less preferred images, limiting the denoising network's ability to accurately predict noise in preference optimization methods, known as the irrelevant prompt issue. To address these challenges, we propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle conflict distribution, we introduce the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, fine-tuned on SD 2.1 as the backbone.

Dual Caption Preference Optimization for Diffusion Models

The paper "Dual Caption Preference Optimization for Diffusion Models" introduces a methodology for better aligning text-to-image diffusion models with human preferences. Dual Caption Preference Optimization (DCPO) targets two problems the authors identify in existing preference datasets, conflict distribution and the irrelevant prompt issue, both of which limit how effectively preference optimization improves diffusion models.

Diffusion models generate high-quality, realistic images by modeling complex data distributions, and recent work shows that human preference optimization techniques originally developed for LLMs can be adapted to them. These methods learn to distinguish preferred samples from less preferred ones. However, existing preference datasets exhibit two weaknesses. First, the distributions of preferred and less-preferred samples often overlap, which the authors term conflict distribution and which weakens the training signal. Second, because both images in a pair share a single prompt, the less-preferred image is frequently paired with text that does not accurately describe it; this irrelevant information hampers the denoising network's noise prediction during preference optimization, which the authors call the irrelevant prompt issue.

To address these challenges, the authors propose DCPO, which conditions each image on its own caption rather than sharing one prompt across the pair. They construct the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 in which the preferred and less-preferred image of each pair carry separate captions, and propose three strategies for generating those captions: a direct captioning method, a perturbation method that subtly alters the semantic meaning of the original prompt, and a hybrid method combining both.
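
To make the role of the second caption concrete, the following is a minimal sketch of a dual-caption, Diffusion-DPO-style pairwise objective. It is not the authors' implementation: the function and argument names (dcpo_style_loss, unet, unet_ref, emb_w, emb_l, alphas_cumprod) are illustrative, and the exact weighting used in the paper may differ. The only structural change from a single-caption setup is that each branch is denoised under the embedding of its own caption.

    import torch
    import torch.nn.functional as F

    def dcpo_style_loss(unet, unet_ref, x_w, x_l, emb_w, emb_l,
                        alphas_cumprod, beta=5000.0):
        """Pairwise preference loss where each image is denoised under its own caption.

        unet, unet_ref : callables eps = f(x_t, t, text_emb); unet_ref is frozen.
        x_w, x_l       : latents of the preferred / less-preferred images.
        emb_w, emb_l   : embeddings of their separate captions (the dual-caption idea).
        alphas_cumprod : 1-D tensor of cumulative noise-schedule products.
        """
        b = x_w.shape[0]
        t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x_w.device)
        noise = torch.randn_like(x_w)
        a = alphas_cumprod.to(x_w.device)[t].view(b, 1, 1, 1)
        xt_w = a.sqrt() * x_w + (1 - a).sqrt() * noise  # forward-noise both images
        xt_l = a.sqrt() * x_l + (1 - a).sqrt() * noise  # with the same noise and timestep

        def eps_err(model, x_t, emb):
            # Per-sample epsilon-prediction error under a given caption embedding.
            return (model(x_t, t, emb) - noise).pow(2).mean(dim=(1, 2, 3))

        err_w, err_l = eps_err(unet, xt_w, emb_w), eps_err(unet, xt_l, emb_l)
        with torch.no_grad():
            ref_w, ref_l = eps_err(unet_ref, xt_w, emb_w), eps_err(unet_ref, xt_l, emb_l)

        # Reward improving (relative to the frozen reference) more on the
        # preferred branch than on the less-preferred one.
        margin = (err_w - ref_w) - (err_l - ref_l)
        return -F.logsigmoid(-beta * margin).mean()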

Empirical evaluation shows that DCPO markedly improves image quality and fidelity to prompts, outperforming SD 2.1, SFT_Chosen, Diffusion-DPO, and MaPO on metrics including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, with SD 2.1 as the fine-tuning backbone. In constructing the captions, the method must keep each caption closely correlated with its image while making the two captions distinct enough to separate the preferred and less-preferred distributions.
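
For context on what a prompt-relevance metric such as CLIPscore measures, here is a rough sketch of one common way to compute it: the cosine similarity between CLIP image and text embeddings, clipped at zero and scaled by 100. This illustrates the general form of the metric, not the paper's evaluation pipeline; the checkpoint choice and the clip_score helper name are assumptions.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def clip_score(image: Image.Image, prompt: str) -> float:
        # Encode the image and the prompt into CLIP's joint embedding space.
        inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            img = model.get_image_features(pixel_values=inputs["pixel_values"])
            txt = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Higher values indicate a closer match between image and prompt.
        return 100.0 * torch.clamp((img * txt).sum(), min=0.0).item()

Calling clip_score(Image.open("sample.png"), "a photo of a red car") returns a scalar roughly in [0, 100]. Pickscore, HPSv2.1, and ImageReward are likewise embedding- or reward-model-based scores, but trained to track human preference judgments rather than raw CLIP similarity.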

The paper also discusses the computational benefits and challenges of dual captioning, particularly its cost-effectiveness for model alignment compared to more extensive caption-generation pipelines. The findings suggest DCPO could be extended to further refine how diffusion models capture human preferences, with potential applications in personalized content generation, safety features, and style transfer in computer vision tasks.

More broadly, the paper highlights how nuanced semantic differences between captions can improve the learning efficiency of diffusion models. Future work may integrate DCPO with stronger backbones such as Stable Diffusion XL (SDXL) and apply it to other domains that demand close alignment with human preferences, expanding the usability of these technologies in entertainment, educational tools, and creative industries.

In conclusion, DCPO stands out as an effective optimization mechanism that leverages human feedback to enhance the alignment quality of text-to-image diffusion models, opening avenues for further research in the fine-tuning and applicability of semantic preference optimization in AI-related fields.

Authors (7)
  1. Amir Saeidi (8 papers)
  2. Yiran Luo (11 papers)
  3. Agneet Chatterjee (7 papers)
  4. Shamanthak Hegde (4 papers)
  5. Bimsara Pathiraja (7 papers)
  6. Yezhou Yang (119 papers)
  7. Chitta Baral (152 papers)