CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models (2406.08070v2)

Published 12 Jun 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation. Although effective, CFG has notable drawbacks. For instance, DDIM with CFG lacks invertibility, complicating image editing; furthermore, high guidance scales, essential for high-quality outputs, frequently result in issues like mode collapse. Contrary to the widespread belief that these are inherent limitations of diffusion models, this paper reveals that the problems actually stem from the off-manifold phenomenon associated with CFG, rather than the diffusion models themselves. More specifically, inspired by the recent advancements of diffusion model-based inverse problem solvers (DIS), we reformulate text-guidance as an inverse problem with a text-conditioned score matching loss and develop CFG++, a novel approach that tackles the off-manifold challenges inherent in traditional CFG. CFG++ features a surprisingly simple fix to CFG, yet it offers significant improvements, including better sample quality for text-to-image generation, invertibility, smaller guidance scales, reduced mode collapse, etc. Furthermore, CFG++ enables seamless interpolation between unconditional and conditional sampling at lower guidance scales, consistently outperforming traditional CFG at all scales. Moreover, CFG++ can be easily integrated into high-order diffusion solvers and naturally extends to distilled diffusion models. Experimental results confirm that our method significantly enhances performance in text-to-image generation, DDIM inversion, editing, and solving inverse problems, suggesting a wide-ranging impact and potential applications in various fields that utilize text guidance. Project Page: https://cfgpp-diffusion.github.io/.


CFG++: Manifold-Constrained Classifier Free Guidance for Diffusion Models

The paper "CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models" presents an approach that addresses the limitations of Classifier-Free Guidance (CFG) in diffusion models, especially for text-guided image generation. The authors identify drawbacks of CFG, such as mode collapse at high guidance scales and the lack of invertibility of DDIM (Denoising Diffusion Implicit Models) sampling, and argue that these stem from the off-manifold phenomenon associated with CFG rather than being inherent to diffusion models. Building on diffusion model-based inverse problem solvers (DIS), the paper reformulates text guidance as an inverse problem with a text-conditioned score matching loss and proposes CFG++, a manifold-constrained guidance technique that mitigates these problems.

Key Contributions and Methodology

The paper proposes CFG++ as a solution to address the manifold-related pitfalls of traditional CFG by reframing text-guidance as an inverse problem. CFG++ leverages text-conditioned score matching losses within a novel sampling method to achieve improved performance and robustness in diffusion models. This results in several enhancements:

Improved Sample Quality: CFG++ improves sample quality for text-to-image generation and interpolates smoothly between unconditional and conditional sampling while using smaller guidance scales than CFG.
Enhanced Invertibility: Unlike standard CFG, CFG++ supports near-perfect DDIM inversion by adopting a reformulated sampling strategy inspired by diffusion inverse problem solvers. This inversion ability is crucial for tasks such as image editing where reconstruction fidelity is paramount.
Reduction in Mode Collapse: By correcting the off-manifold trajectory shift associated with CFG, CFG++ yields smoother transitions during the reverse diffusion process. This reduces the artifacts and mode collapse that CFG exhibits at high guidance scales.
Integration with Existing Solvers: The proposed method maintains compatibility with high-order solvers and can extend naturally to distilled diffusion models without introducing computational overhead.
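
The difference between the two guidance schemes can be sketched for a single deterministic DDIM step. The snippet below is a minimal illustration, not the authors' implementation. It assumes the update rule as commonly described for CFG++ (the guided noise is used for the denoised estimate, while the unconditional noise is used for renoising), which this summary does not spell out. The function names, the `abar` notation for the cumulative noise schedule, and the plain-list tensors are all hypothetical choices made for readability.

```python
import math

def guided_noise(eps_uncond, eps_cond, scale):
    # Linear combination of unconditional and conditional noise predictions.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def denoise(x_t, eps, abar_t):
    # Estimate of the clean sample from x_t and a noise prediction.
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for x, e in zip(x_t, eps)]

def renoise(x0, eps, abar_prev):
    # Move the clean estimate back to the previous (less noisy) timestep.
    return [math.sqrt(abar_prev) * x + math.sqrt(1.0 - abar_prev) * e
            for x, e in zip(x0, eps)]

def ddim_step_cfg(x_t, eps_uncond, eps_cond, abar_t, abar_prev, scale):
    # Standard CFG: the guided noise is used for both denoising and renoising.
    eps = guided_noise(eps_uncond, eps_cond, scale)
    x0 = denoise(x_t, eps, abar_t)
    return renoise(x0, eps, abar_prev)

def ddim_step_cfgpp(x_t, eps_uncond, eps_cond, abar_t, abar_prev, scale):
    # CFG++ (assumed form): guided noise for denoising,
    # unconditional noise for renoising.
    eps = guided_noise(eps_uncond, eps_cond, scale)
    x0 = denoise(x_t, eps, abar_t)
    return renoise(x0, eps_uncond, abar_prev)

if __name__ == "__main__":
    # Toy two-dimensional "latent" with made-up noise predictions.
    x_t = [0.5, -1.0]
    eps_uncond = [0.1, 0.2]
    eps_cond = [0.3, -0.4]
    print(ddim_step_cfg(x_t, eps_uncond, eps_cond, 0.5, 0.8, 7.5))
    print(ddim_step_cfgpp(x_t, eps_uncond, eps_cond, 0.5, 0.8, 0.6))
```

Under this reading, the two steps differ only in the renoising term, which matches the abstract's description of a "surprisingly simple fix". In the sketch, a scale of 0 reduces both functions to unconditional sampling. The scale values in the usage example (7.5 for CFG, 0.6 for CFG++) are illustrative assumptions, not settings taken from the paper.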

The theoretical analysis identifies the geometric differences that let CFG++ avoid off-manifold phenomena, which show up as smoother denoising trajectories than those of standard CFG. This foundation positions CFG++ not only as a drop-in replacement for CFG but also as an enhancement to existing frameworks that rely on diffusion models.

Experimental Validation

The authors present extensive experiments evaluating CFG++ against CFG across various tasks:

Text-to-Image Generation: With Stable Diffusion v1.5 and SDXL, CFG++ consistently achieves better FID scores than CFG across guidance scales. This points to better image quality alongside robust text-image alignment and concept fidelity.
Image Inversion: Experimental results show that CFG++ improves DDIM inversion, yielding higher-quality reconstructions on real-world image datasets.
Text-Conditioned Inverse Problems: CFG++ was applied to several inverse problems, improving performance on tasks such as super-resolution and deblurring on the FFHQ dataset with latent diffusion inverse solvers.

Implications and Future Directions

The formulation of CFG++ opens avenues for refining text-conditioned generative processes by offering guidance paths that remain on-manifold, which leads to more stable and reliable outputs. The implications extend to any domain where text-guided diffusion models are applicable, such as art generation, scientific visualization, and realistic media synthesis.

Future work could extend the CFG++ framework beyond the image domain to text or audio generation, where diffusion models also play a critical role. Additionally, the insights gained could spur the development of more refined manifold guidance methods suitable for other generative architectures.

In summary, this paper extends the capabilities of diffusion models by addressing the limitations of CFG and providing a solid foundation for manifold-constrained guidance, with meaningful implications for a wide range of generative tasks.