Overview of "Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models"
Introduction
The paper "Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models," authored by Prin Phunyaphibarn et al., presents an innovative approach to enhancing the quality of conditional generation in diffusion models. Current methodologies utilize Classifier-Free Guidance (CFG) to train diffusion models for conditional generation tasks. However, this method often results in suboptimal unconditional priors, particularly when models are fine-tuned, adversely affecting the conditional generation quality. The authors propose leveraging richer unconditional noise predictions from a separate pretrained model to substantially enhance the performance of fine-tuned conditional diffusion models.
Background and Problem Statement
Diffusion models have become a predominant choice for generative tasks across modalities such as images, videos, and audio, owing to their strong performance and flexible training. CFG is a core technique in which one model learns both conditional and unconditional noise predictions, and the two are combined at sampling time. When a pretrained model is fine-tuned for a new conditional task, however, most of its capacity is devoted to the conditional objective, and its unconditional noise predictions typically degrade. Because CFG-based sampling mixes this unconditional prior with the conditional prediction, the degradation leads to poorer generation results.
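For reference, the following is a minimal sketch of how a standard CFG sampling step combines the two predictions. The `model` callable, the conditioning tensors, and the default guidance scale are illustrative placeholders, not the paper's code.

```python
def cfg_noise_prediction(model, x_t, t, cond, null_cond, guidance_scale=7.5):
    """Standard classifier-free guidance: one network produces both the
    conditional and unconditional noise predictions, which are combined
    at every denoising step."""
    eps_cond = model(x_t, t, cond)         # conditional noise prediction
    eps_uncond = model(x_t, t, null_cond)  # unconditional prior (null condition)
    # Extrapolate from the unconditional prior toward the conditional prediction.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

When the same fine-tuned network supplies both terms, any degradation of its unconditional prediction is baked into every guided step.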
Methodology
The authors propose a straightforward yet effective fix: replace the unconditional noise predictions of the fine-tuned model with those of a pretrained model that has a stronger unconditional prior, as sketched below. The approach requires no additional training or architectural modifications, so it can be applied as a drop-in change at sampling time. Remarkably, the paper shows that these unconditional noise predictions can even come from models with different architectures, or trained on different datasets, than the base model used for fine-tuning.
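To make the change concrete, here is a minimal sketch of the modified guidance step under the same assumptions as above: the only difference from standard CFG is that the unconditional branch is evaluated by the pretrained base model rather than the fine-tuned model. The model callables and argument names are illustrative, not the authors' implementation.

```python
def cfg_with_base_unconditional(finetuned_model, base_model, x_t, t, cond,
                                null_cond, guidance_scale=7.5):
    """CFG sampling step in which the unconditional noise prediction comes
    from the pretrained base model, while the fine-tuned model supplies the
    conditional prediction (the drop-in replacement described in the paper)."""
    eps_cond = finetuned_model(x_t, t, cond)    # task-specific conditional prediction
    eps_uncond = base_model(x_t, t, null_cond)  # richer unconditional prior
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Because only the source of the unconditional branch changes, no weights are updated, and an existing sampler can adopt the fix with a one-line modification to its guidance step.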
Experimental Evaluation
The proposed method was empirically validated across a diverse set of conditional diffusion models spanning several generation tasks, including Zero-1-to-3 for novel view synthesis, Versatile Diffusion for image variation, InstructPix2Pix for image editing, and DynamiCrafter for video generation. The intervention yielded notable improvements in generation quality across these models. For Versatile Diffusion, for instance, the approach improved image alignment and aesthetic quality, reflected in lower FID scores, a standard quantitative measure of image quality. Similarly, for novel view synthesis, improved LPIPS scores indicated better perceptual similarity to the target views.
Implications
This research highlights the significance of unconditional priors in CFG-based diffusion models, particularly during fine-tuning. Practically, the method improves generation quality across diverse tasks at essentially no extra cost, since it requires no retraining. Theoretically, it suggests that keeping the unconditional and conditional noise predictions separate, rather than jointly re-learning both during fine-tuning, can be a beneficial strategy.
Conclusion and Future Directions
The authors have laid the groundwork for further exploration of unconditional priors in generative diffusion models. Future work could tune the CFG scale jointly with unconditional noise replacement to further improve output quality, or plug stronger pretrained diffusion models into the framework to obtain better unconditional priors for specific tasks.
In summary, this paper provides a valuable perspective on improving diffusion model performance and points the way toward better handling of unconditional priors during fine-tuning.