Score Distillation Sampling with Learned Manifold Corrective (2401.05293v2)
Abstract: Score Distillation Sampling (SDS) is a recent but already widely popular method that relies on an image diffusion model to control optimization problems using text prompts. In this paper, we conduct an in-depth analysis of the SDS loss function, identify an inherent problem with its formulation, and propose a surprisingly easy but effective fix. Specifically, we decompose the loss into different factors and isolate the component responsible for noisy gradients. In the original formulation, high text guidance is used to account for the noise, leading to unwanted side effects such as oversaturation or repeated detail. Instead, we train a shallow network mimicking the timestep-dependent frequency bias of the image diffusion model in order to effectively factor it out. We demonstrate the versatility and the effectiveness of our novel loss formulation through qualitative and quantitative experiments, including optimization-based image synthesis and editing, zero-shot image translation network training, and text-to-3D synthesis.
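To make the role of text guidance in the abstract concrete, here is a minimal sketch of how the residual used in an SDS update splits into additive parts once classifier-free guidance is applied. It assumes the standard DreamFusion-style formulation, in which the update direction is the guided noise prediction minus the injected noise. It is not the paper's exact decomposition or its learned corrective. All function names are hypothetical, and random vectors stand in for the outputs of a real diffusion model.

```python
import random


def cfg_prediction(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sds_residual_terms(eps_uncond, eps_cond, noise, scale):
    """Split the SDS residual (guided prediction - noise) into two additive parts.

    guidance:  scale * (eps_cond - eps_uncond), the text-dependent direction
    denoising: eps_uncond - noise, the text-independent remainder
    """
    guidance = [scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
    denoising = [u - n for u, n in zip(eps_uncond, noise)]
    return guidance, denoising


# Stand-in data (assumption): random vectors instead of real model outputs.
rng = random.Random(0)
dim = 8
noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
eps_uncond = [n + rng.gauss(0.0, 0.5) for n in noise]
eps_cond = [u + rng.gauss(0.0, 0.1) for u in eps_uncond]
scale = 100.0  # illustrative "high guidance" value

eps_hat = cfg_prediction(eps_uncond, eps_cond, scale)
residual = [e - n for e, n in zip(eps_hat, noise)]

guidance, denoising = sds_residual_terms(eps_uncond, eps_cond, noise, scale)
recombined = [g + d for g, d in zip(guidance, denoising)]

# The two parts sum back to the full residual (up to float rounding).
max_err = max(abs(r - s) for r, s in zip(residual, recombined))
```

Only the guidance term is multiplied by the guidance scale, so raising the scale increases the weight of the text-dependent direction relative to the text-independent remainder. This is one way to read the abstract's remark that high text guidance is used to account for the noisy component.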