Rethinking Score Distillation as a Bridge Between Image Distributions (2406.09417v2)

Published 13 Jun 2024 in cs.CV, cs.GR, and cs.LG

Abstract: Score distillation sampling (SDS) has proven to be an important tool, enabling the use of large-scale diffusion priors for tasks operating in data-poor domains. Unfortunately, SDS has a number of characteristic artifacts that limit its usefulness in general-purpose applications. In this paper, we make progress toward understanding the behavior of SDS and its variants by viewing them as solving an optimal-cost transport path from a source distribution to a target distribution. Under this new interpretation, these methods seek to transport corrupted images (source) to the natural image distribution (target). We argue that current methods' characteristic artifacts are caused by (1) linear approximation of the optimal path and (2) poor estimates of the source distribution. We show that calibrating the text conditioning of the source distribution can produce high-quality generation and translation results with little extra overhead. Our method can be easily applied across many domains, matching or beating the performance of specialized methods. We demonstrate its utility in text-to-2D, text-based NeRF optimization, translating paintings to real images, optical illusion generation, and 3D sketch-to-real. We compare our method to existing approaches for score distillation sampling and show that it can produce high-frequency details with realistic colors.

Summary

The paper reframes score distillation as an optimal transport problem: methods such as SDS transport corrupted images (the source) toward the natural image distribution (the target).
It attributes the characteristic artifacts to a first-order (linear) approximation of the optimal path and to poor estimates of the source distribution, and proposes calibrating the text conditioning of the source distribution.
Experiments on text-to-image generation, NeRF optimization, and painting-to-real translation show fewer artifacts and higher fidelity with little extra computational overhead.

Introduction

The paper "Rethinking Score Distillation as a Bridge Between Image Distributions" by McAllister et al. analyzes existing score distillation sampling (SDS) methods and presents a framework for understanding and improving them. The authors argue that the characteristic artifacts of current methods stem from a linear approximation of the optimal transport path and from poor estimates of the source distribution, and they propose a remedy grounded in this analysis. Their approach reduces these errors and improves results across several applications, including text-to-2D generation, text-based NeRF optimization, painting-to-real translation, optical illusion generation, and 3D sketch-to-real.

Background and Motivation

Diffusion models have been highly successful at generating high-quality images. They work well in data-rich domains, but their applicability is limited in domains with less data. SDS makes large-scale diffusion priors usable for tasks in these data-poor domains. However, SDS often exhibits artifacts such as oversaturation and oversmoothing, which limit its utility. The paper proposes understanding SDS through the lens of optimal transport and Schrödinger Bridge (SB) problems, and uses this interpretation of SDS and its variants to explain and address these artifacts.

Core Contributions and Methodology

The paper’s core contributions and methodology can be summarized as follows:

SDS as an Optimal Transport Problem:
- The authors cast SDS as finding an optimal-cost transport path between a source and a target image distribution. The source is the distribution of corrupted images, while the target is the natural image distribution.
- Under this view, SDS and its variants attempt to transport images along these paths, and their artifacts arise from a linear approximation of the optimal path and from poor estimates of the source distribution.
Error Analysis:
- Two primary sources of errors in the current methods are identified:
  - First-order approximation error: Common methods use single-step noising and denoising, leading to errors that can be mitigated by multi-step methods.
  - Source distribution mismatch: Incorrect representation of the current optimized image distribution leads to errors. The paper demonstrates that recent methods can be seen as efforts to reduce these errors.
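As a toy illustration of the first-order approximation error (not taken from the paper), the sketch below assumes a one-dimensional Gaussian data distribution and a variance-exploding noising process, for which the score and the exact probability-flow endpoint are known in closed form. The function names and the Gaussian setup are assumptions made for illustration only. It compares a single Euler denoising step with a multi-step solve.

```python
import math

def score(x, sigma, mu=0.0, s=1.0):
    """Score of the noised marginal N(mu, s^2 + sigma^2) (toy assumption)."""
    return -(x - mu) / (s**2 + sigma**2)

def euler_denoise(x, sigma, n_steps, mu=0.0, s=1.0):
    """Integrate dx/dsigma = -sigma * score(x, sigma) from sigma down to 0
    with n_steps uniform Euler steps. n_steps=1 is the single-step estimate."""
    sigmas = [sigma * (1 - i / n_steps) for i in range(n_steps + 1)]
    for cur, nxt in zip(sigmas[:-1], sigmas[1:]):
        dx_dsigma = -cur * score(x, cur, mu, s)
        x = x + (nxt - cur) * dx_dsigma
    return x

def exact_denoise(x, sigma, mu=0.0, s=1.0):
    """Closed-form endpoint of the same ODE for the Gaussian toy case."""
    return mu + (x - mu) * s / math.sqrt(s**2 + sigma**2)

if __name__ == "__main__":
    x_noisy, sigma = 3.0, 2.0
    target = exact_denoise(x_noisy, sigma)
    for n in (1, 10, 1000):
        estimate = euler_denoise(x_noisy, sigma, n)
        print(f"steps={n:5d}  estimate={estimate:.4f}  error={abs(estimate - target):.4f}")
```

With `x_noisy = 3.0` and `sigma = 2.0`, the single-step estimate is about 0.6 while the exact endpoint is about 1.34; increasing `n_steps` closes the gap, which is the sense in which multi-step methods reduce the first-order error.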
Proposed Solution:
- The paper proposes a simple yet effective method: describe the current source distribution with text. Because large-scale text-to-image diffusion models are trained on vast numbers of caption-image pairs, text conditioning can be used to bridge the distribution of the currently optimized image to the natural image distribution with little computational overhead.
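One way to read the proposed solution, sketched here under assumptions rather than taken from the authors' code: suppose the update direction is the difference between two noise predictions at the same noisy image, one conditioned on a target prompt and one on a prompt describing the current (source) image. The toy noise predictor, the prompt strings, and all function names below are hypothetical stand-ins for a real text-conditioned diffusion model.

```python
import random

# Hypothetical prompts: each one is mapped to the mean of a unit-variance
# 1D Gaussian so that the "model" below has a closed form.
TARGET_PROMPT = "a photo of a cat"
SOURCE_PROMPT = "a photo of a cat, blurry, oversaturated"
TOY_PROMPT_MEANS = {TARGET_PROMPT: 1.0, SOURCE_PROMPT: -1.0}

def toy_eps_model(x_t, sigma, prompt):
    """Stand-in noise predictor: eps = -sigma * score of N(mean, 1 + sigma^2)."""
    mean = TOY_PROMPT_MEANS[prompt]
    return sigma * (x_t - mean) / (1.0 + sigma**2)

def bridge_direction(eps_model, x_t, sigma, target_prompt, source_prompt):
    """Difference of two text-conditioned noise predictions at the same x_t."""
    eps_target = eps_model(x_t, sigma, target_prompt)
    eps_source = eps_model(x_t, sigma, source_prompt)
    return eps_target - eps_source

def distillation_step(x, sigma=1.0, lr=0.1, rng=random):
    """Noise the current value, then move against the bridge direction."""
    x_t = x + sigma * rng.gauss(0.0, 1.0)
    direction = bridge_direction(toy_eps_model, x_t, sigma,
                                 TARGET_PROMPT, SOURCE_PROMPT)
    return x - lr * direction

if __name__ == "__main__":
    x = -1.0  # start at the toy "source" mean
    x = distillation_step(x)
    print(x)  # moved toward the toy "target" mean
```

In this toy setting the source description enters only through the second noise prediction, so swapping in a different source prompt changes the direction without any additional training.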

Experimental Results

The proposed method was tested across various tasks to evaluate its effectiveness:

Text-to-Image Generation:
- Using COCO captions, the method achieved lower FID scores than several existing SDS variants at reduced computational cost.
- Visual results indicated that the proposed method produced realistic images with fewer artifacts than conventional SDS, NFSD, and CSD.
Text-Guided NeRF Optimization:
- The method was tested for text-to-3D generation, showing improved results over SDS without requiring the computational overhead of training a LoRA model as in VSD.
- Quantitative comparisons in terms of CLIP similarity showed competitive performance, and qualitative results highlighted better geometric and color fidelity.
Painting-to-Real Translation:
- In this application, the proposed method effectively transformed paintings into high-quality realistic images, outperforming image restoration baselines.
3D Sketch-to-Real and Optical Illusion Generation:
- For 3D sketch-to-real tasks, the method successfully transformed coarse 3D sketches into detailed realistic objects.
- In optical illusion generation, the method produced more convincing results compared to SDS, overcoming issues like color artifacts.
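For readers unfamiliar with the FID metric cited in the text-to-image results, the sketch below computes a Fréchet distance between two sets of feature vectors. It is a simplification under a stated assumption: covariances are treated as diagonal so that only the standard library is needed, whereas the standard FID uses full covariance matrices of Inception features and a matrix square root. The function names and the example feature lists are illustrative only.

```python
import math
from statistics import fmean, pvariance

def gaussian_stats(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances

def frechet_distance_diag(stats_a, stats_b):
    """Frechet distance between two Gaussians with diagonal covariance:
    sum (mu_a - mu_b)^2 + sum (sqrt(var_a) - sqrt(var_b))^2."""
    (mu_a, var_a), (mu_b, var_b) = stats_a, stats_b
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_a, var_b))
    return mean_term + cov_term

if __name__ == "__main__":
    # Made-up 2D "features"; real FID would use Inception activations.
    reference = [[0, 0], [2, 0], [0, 2], [2, 2]]
    generated = [[1, 1], [3, 1], [1, 3], [3, 3]]
    print(frechet_distance_diag(gaussian_stats(reference),
                                gaussian_stats(generated)))
```

Lower values mean the two feature distributions are closer, which is why lower FID is read as better sample quality.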

Implications and Future Directions

The research has both theoretical and practical implications. By reinterpreting SDS in terms of optimal transport and SB problems, it provides a more general framework for understanding and improving score distillation methods.

The method's effectiveness across a range of tasks suggests that textual descriptions of the source distribution can serve as a practical alternative to more computationally intensive approaches. This could be particularly beneficial for applications that are sensitive to computation time, such as augmented reality and interactive content creation.

Future research directions could include exploring multi-step estimation techniques to further mitigate first-order approximation errors and investigating the robustness of the proposed method across even broader domains and tasks. The integration of this approach with emerging high-quality video diffusion models could also be a promising avenue, potentially extending the benefits observed in still images to dynamic content.

Conclusion

The paper by McAllister et al. offers a comprehensive and methodologically sound solution to improve score distillation sampling by reframing it as an optimal transport problem. Through insightful analysis and robust experimentation, the proposed method demonstrates significant improvements in various applications, suggesting that the field may benefit greatly from this new perspective. By reducing computational overhead without compromising quality, this research opens new pathways for applying diffusion models in data-poor domains.
