
Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models (2503.20240v2)

Published 26 Mar 2025 in cs.CV

Abstract: Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious reason for degrading the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.

Summary

Overview of "Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models"

Introduction

The paper "Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models," authored by Prin Phunyaphibarn et al., presents an innovative approach to enhancing the quality of conditional generation in diffusion models. Current methodologies utilize Classifier-Free Guidance (CFG) to train diffusion models for conditional generation tasks. However, this method often results in suboptimal unconditional priors, particularly when models are fine-tuned, adversely affecting the conditional generation quality. The authors propose leveraging richer unconditional noise predictions from a separate pretrained model to substantially enhance the performance of fine-tuned conditional diffusion models.

Background and Problem Statement

Diffusion models have become a predominant choice for generative tasks across modalities such as images, videos, and audio, owing to their robust performance and flexible training mechanisms. CFG-based training has a single network learn both conditional and unconditional noise predictions, but because the condition is dropped only with a small probability, the unconditional branch receives limited training signal. During fine-tuning, the unconditional predictions therefore typically degrade, leading to poorer generation results when they are combined with conditional predictions in CFG-based sampling, as in the sketch below.
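For reference, a minimal sketch of a standard CFG sampling step (the function name, call signature, and guidance scale are illustrative assumptions, not code from the paper): the same network produces both predictions, which are blended with a guidance weight.

```python
def cfg_noise(model, x_t, t, cond, guidance_scale=7.5):
    """Standard CFG step: one network provides both the conditional and the
    unconditional (null-conditioned) noise predictions. Signatures and the
    guidance_scale value are illustrative assumptions."""
    eps_cond = model(x_t, t, cond)      # conditional noise prediction
    eps_uncond = model(x_t, t, None)    # unconditional prediction (condition dropped)
    # CFG blend: eps = eps_uncond + w * (eps_cond - eps_uncond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```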

Methodology

The authors propose a straightforward yet effective fix: replacing the unconditional noise predictions of fine-tuned models with those from a pretrained model exhibiting superior unconditional generation capabilities. This approach does not necessitate additional training or architectural modifications, allowing for easy implementation. Remarkably, the paper shows that these unconditional noise predictions can come from models trained with different architectures or on distinct datasets from the original base model used for fine-tuning.
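The core change can be illustrated with a minimal sketch under assumed interfaces (the model handles and signatures below are hypothetical): the conditional term still comes from the fine-tuned model, while the unconditional term is taken from a separate pretrained model with a better unconditional prior.

```python
def cfg_noise_with_base_prior(finetuned, base, x_t, t, cond, guidance_scale=7.5):
    """Sketch of the proposed fix, not the authors' code: keep the fine-tuned
    model's conditional prediction, but replace the unconditional prediction
    with that of a separate pretrained model."""
    eps_cond = finetuned(x_t, t, cond)  # conditional term from the fine-tuned model
    eps_uncond = base(x_t, t, None)     # unconditional term swapped in from the base model
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Because only the source of the unconditional prediction changes, the fix can be applied to existing fine-tuned models at inference time without retraining.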

Experimental Evaluation

The proposed method was empirically validated across a diverse set of conditional diffusion models and generation tasks, including Zero-1-to-3 for novel view synthesis, Versatile Diffusion for image variations, InstructPix2Pix for image editing, DiT for class-conditional image generation, and DynamiCrafter for video generation. The intervention yielded notable improvements in generation quality across these models. For instance, it improved image alignment and aesthetic quality in Versatile Diffusion, resulting in lower FID scores, a quantitative measure of image quality. Similarly, in novel view synthesis, the LPIPS metric indicated improved perceptual similarity.

Implications

This research highlights the significance of unconditional priors in CFG-based diffusion models, particularly during fine-tuning. The methodology has practical implications for computational efficiency and for maintaining high-quality generation outputs across diverse tasks. Theoretically, it suggests that decoupling unconditional and conditional noise prediction, rather than modeling them jointly during fine-tuning, can be a beneficial strategy.

Conclusion and Future Directions

The authors have laid the groundwork for further exploration of unconditional priors in generative diffusion models. Future work could explore optimizing the CFG scale in conjunction with unconditional noise replacement to further enhance output quality. In addition, integrating more advanced pretrained diffusion models into this framework presents opportunities for further gains in task-specific generation quality and fidelity.

In summary, this paper provides a valuable perspective on improving diffusion model performance and establishes a pathway for advancing generative model development through better handling of unconditional priors during fine-tuning processes.
