
Score Distillation Sampling with Learned Manifold Corrective (2401.05293v2)

Published 10 Jan 2024 in cs.CV

Abstract: Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects such as oversaturation or repeated detail. Instead, we train a shallow network mimicking the timestep-dependent frequency bias of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.


Summary

  • The paper introduces LMC-SDS, a correction to SDS that trains a shallow network to model the diffusion model's timestep-dependent frequency bias so it can be factored out of the gradients, mitigating artifacts in image synthesis.
  • It demonstrates that lowering text guidance is feasible while maintaining stability and improving the overall visual quality of generated images.
  • Empirical results validate LMC-SDS across diverse applications, including text-to-image, image editing, and text-to-3D synthesis, with enhanced detail and clarity.

Introduction

The authors present an in-depth analysis of Score Distillation Sampling (SDS), a method that uses a pretrained image diffusion model to steer optimization problems with text prompts. SDS underpins creative applications such as text-to-image and text-to-3D synthesis and has shown remarkable capabilities, but it is not free of drawbacks: noisy gradients, the need for high text guidance, and the resulting side effects on image quality. To address these issues, the authors introduce an improved loss formulation, SDS with Learned Manifold Corrective (LMC-SDS).

Understanding the Original SDS

SDS uses a pretrained text-to-image diffusion model to measure how well an image matches a textual description. While powerful, the loss can degrade faithfulness to image observations, match the text prompt too aggressively, or produce largely uninformative gradients that inject noise into the optimization objective. The paper decomposes the SDS loss into its constituent factors and isolates the component responsible for these weaknesses, observing that the high text guidance used in the original formulation to compensate for the noisy component is itself what causes oversaturation and repeated detail.
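To make the discussion concrete, the following sketch writes out the SDS gradient in the standard DreamFusion-style notation and splits its residual into a text-conditioning factor and a denoising factor, mirroring the decomposition described above. The labels delta_cond and delta_proj are illustrative, not necessarily the paper's exact notation.

```latex
% SDS gradient: w(t) is a timestep weighting, \epsilon the sampled noise, and
% \hat{\epsilon}_\phi the classifier-free-guided prediction with weight \omega.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
      \bigl(\hat{\epsilon}_\phi(\mathbf{x}_t; y, t) - \epsilon\bigr)\,
      \frac{\partial \mathbf{x}}{\partial \theta} \right],
\quad
\hat{\epsilon}_\phi(\mathbf{x}_t; y, t)
  = \epsilon_\phi(\mathbf{x}_t; t)
  + \omega \bigl(\epsilon_\phi(\mathbf{x}_t; y, t) - \epsilon_\phi(\mathbf{x}_t; t)\bigr).

% The residual then splits into a text-alignment direction and a noisy
% denoising direction:
\hat{\epsilon}_\phi(\mathbf{x}_t; y, t) - \epsilon
  = \underbrace{\omega \bigl(\epsilon_\phi(\mathbf{x}_t; y, t)
      - \epsilon_\phi(\mathbf{x}_t; t)\bigr)}_{\delta_{\mathrm{cond}}}
  \; + \;
  \underbrace{\epsilon_\phi(\mathbf{x}_t; t) - \epsilon}_{\delta_{\mathrm{proj}}}.
```

For any single sample of epsilon, the delta_proj term is far from its expectation, which is why the original formulation needs a large guidance weight omega to let delta_cond dominate, and why that workaround produces the artifacts noted above.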

The LMC-SDS Solution

The paper proposes a simple yet effective fix: train a shallow network to mimic the timestep-dependent denoising deficiencies of the image diffusion model so that this systematic error can be factored out of the gradients. LMC-SDS thereby yields more informative gradients, permits lower text guidance, and improves the visual quality of the results. The researchers support these claims with a variety of experiments that showcase the robustness and flexibility of LMC-SDS across multiple applications.
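A minimal sketch of one plausible realization of this idea, assuming a PyTorch-style setup: a shallow network h is trained to reproduce the frozen diffusion model's one-step denoised estimate of a clean image at each timestep, thereby capturing the model's timestep-dependent frequency bias. All names here (CorrectiveNet, frozen_eps, scheduler, to_x0) are hypothetical placeholders, not the paper's actual architecture or training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrectiveNet(nn.Module):
    """Hypothetical shallow conv net h(x, t) meant to mimic the frozen
    diffusion model's timestep-dependent frequency bias (blurring at high t)."""
    def __init__(self, channels: int = 3, hidden: int = 64):
        super().__init__()
        self.t_embed = nn.Linear(1, hidden)          # simple timestep embedding
        self.conv_in = nn.Conv2d(channels, hidden, 3, padding=1)
        self.conv_mid = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, t):
        te = self.t_embed(t.float().view(-1, 1))[:, :, None, None]
        h = F.relu(self.conv_in(x) + te)             # inject t per feature map
        h = F.relu(self.conv_mid(h))
        return self.conv_out(h)

def corrective_train_step(h, frozen_eps, scheduler, x0, opt):
    """One training step: h learns to map a clean image x0 to the frozen
    model's one-step denoised estimate x0_hat, i.e. the degradation the
    diffusion model systematically applies at timestep t. `frozen_eps(x_t, t)`
    and `scheduler` stand in for a real diffusion model and its noise
    schedule (assumed interfaces, not a specific library)."""
    t = torch.randint(0, scheduler.num_train_timesteps, (x0.shape[0],))
    eps = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, eps, t)            # forward diffusion
    with torch.no_grad():
        eps_hat = frozen_eps(x_t, t)                 # unconditional prediction
        x0_hat = scheduler.to_x0(x_t, eps_hat, t)    # one-step denoised estimate
    loss = F.mse_loss(h(x0, t), x0_hat)              # h reproduces the bias
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Once trained, h can be applied to the current image before it is compared against the model's denoised estimate, so the systematic blur appears on both sides of the comparison and largely cancels, leaving a cleaner gradient without resorting to high text guidance.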

Empirical Evidence and Applications

Extensive testing shows the benefits of LMC-SDS in tasks such as optimization-based image synthesis and editing, zero-shot image-to-image translation network training, and text-to-3D synthesis. In 3D asset generation, for instance, LMC-SDS produces more detailed and sharper results than the original SDS formulation. In image editing, fixing selected parameters during optimization yields diverse outcomes from the same prompt, demonstrating versatility in creative contexts; a sketch of such a loop follows.
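As an illustration of the editing use case, here is a hypothetical optimization loop in the same PyTorch style. The name lmc_sds_loss stands in for the corrected loss, and masking the gradient is one straightforward way to fix selected pixels during optimization; neither is taken from the paper's code.

```python
import torch

def edit_image(x_init, lmc_sds_loss, freeze_mask, steps=500, lr=0.01):
    """Hypothetical optimization-based editing loop. The image is optimized
    directly in pixel (or latent) space; `freeze_mask` (1 = optimizable,
    0 = fixed) keeps selected regions unchanged, which is one way to obtain
    diverse edits from the same prompt. `lmc_sds_loss(x)` is a stand-in for
    the paper's corrected loss and is assumed to return a scalar tensor."""
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        lmc_sds_loss(x).backward()     # gradients flow into x
        x.grad.mul_(freeze_mask)       # zero out gradients on frozen regions
        opt.step()
    return x.detach()
```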

Conclusion and Future Directions

LMC-SDS resolves a critical issue that arises when diffusion models are embedded in optimization problems. The careful decomposition of the SDS loss and the proposed learned corrective yield markedly cleaner gradients for image manipulation tasks. The new formulation is a substantial first step toward stable and meaningful applications, and the authors anticipate further work on strengthening the corrective component and on applying these insights in practical creative scenarios.