- The paper introduces TF-ICON, leveraging diffusion models for training-free, cross-domain image composition in only 20 sampling steps.
- It achieves superior image inversion accuracy by pairing high-order ODE solvers with a content-free "exceptional prompt", outperforming current state-of-the-art methods.
- The framework generates photorealistic compositions while preserving semantic layouts and consistent foreground-background details across diverse art styles.
TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition
The paper introduces TF-ICON, a framework for cross-domain image composition built on text-driven diffusion models. Its goal is to seamlessly integrate objects from different visual domains into a given scene without any additional training or fine-tuning. By leveraging off-the-shelf diffusion models, TF-ICON avoids the costly per-instance optimization common in existing image composition methods, while preserving the models' rich prior knowledge and completing generation in only 20 sampling steps.
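To make the "training-free" claim concrete, here is a minimal sketch (not the authors' code) of the setup TF-ICON builds on: an off-the-shelf text-to-image diffusion model paired with a high-order DPM-Solver++ scheduler, run for only 20 sampling steps via Hugging Face diffusers. The checkpoint name is an illustrative assumption.

```python
# A minimal sketch (not the authors' code): an off-the-shelf diffusion model
# with a high-order DPM-Solver++ scheduler, 20 sampling steps, no fine-tuning.
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# The checkpoint name is an illustrative choice, not the paper's.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to("cuda")

# Swap in a second-order DPM-Solver++ scheduler; the model itself is untouched.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="dpmsolver++", solver_order=2
)

image = pipe("a corgi on a beach", num_inference_steps=20).images[0]
```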
Methodology
The TF-ICON framework is composed of two main stages: image inversion and composition generation.
- Image Inversion: The inversion process employs a novel "exceptional prompt", which contains no content or positional embeddings, to accurately invert real images into latent codes. The authors argue that high-order diffusion ODE solvers, such as DPM-Solver++, invert latent codes more faithfully than the commonly used DDIM, owing to better alignment between forward and backward trajectories (a sketch of this inversion appears after this list). The approach achieves competitive reconstruction of real images on datasets such as CelebA-HQ, COCO, and ImageNet.
- Composition Generation: Starting from the inverted noise, TF-ICON synthesizes a harmonized composition by incorporating noise and injecting composite self-attention maps (see the attention-compositing sketch below). The injected self-attention maps preserve the semantic layout, while cross-attention maps strengthen the cohesion between the main and reference images.
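The "exceptional prompt" can be read as conditioning the model on an embedding that carries neither token content nor positional information. The sketch below is a hedged illustration of deterministic diffusion inversion under such a prompt; for brevity it shows the first-order (DDIM-like) special case of the diffusion ODE, whereas the paper uses the higher-order DPM-Solver++. The tensor shapes and the constant-embedding construction are assumptions, not the paper's exact definition.

```python
# Hedged sketch: deterministic diffusion inversion under an "exceptional
# prompt" (a conditioning tensor with no content or positional information).
# First-order (DDIM-like) case shown; the paper uses DPM-Solver++ along the
# same ODE trajectory.
import torch

@torch.no_grad()
def invert(unet, scheduler, latents, exceptional_emb, num_steps=20):
    """Map a clean latent x0 to its noise code xT along the diffusion ODE."""
    scheduler.set_timesteps(num_steps)
    timesteps = scheduler.timesteps.flip(0)            # run t: 0 -> T
    alphas = scheduler.alphas_cumprod
    x, prev_t = latents, None
    for t in timesteps:
        eps = unet(x, t, encoder_hidden_states=exceptional_emb).sample
        a_t = alphas[t]
        a_prev = alphas[prev_t] if prev_t is not None else torch.tensor(1.0)
        # Invert the DDIM update x_t = sqrt(a_t)*x0_hat + sqrt(1-a_t)*eps,
        # reusing eps(x, t) as the usual inversion approximation.
        x0_hat = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a_t.sqrt() * x0_hat + (1 - a_t).sqrt() * eps
        prev_t = t
    return x

# Illustrative "exceptional prompt": constant embeddings, no positional
# encoding. Shape and value are placeholders, not the paper's construction.
exceptional_emb = torch.zeros(1, 77, 1024)
```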
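One plausible reading of the composite self-attention injection is sketched below with illustrative names: inside selected self-attention layers, attention scores computed from the main (background) image and from the reference (foreground) image are merged with the user-supplied region mask before the softmax, so the output inherits the background's layout and the foreground's appearance. This is an interpretation, not the authors' implementation.

```python
# Hedged sketch of composite self-attention: merge attention scores from the
# main and reference images with the object-region mask. Illustrative only.
import torch
import torch.nn.functional as F

def composite_self_attention(q_main, k_main, q_ref, k_ref, v, mask, scale):
    """Attend over values v using a mask-composited score matrix.

    mask: shape (N, 1), 1.0 for query positions inside the composited
          foreground region, 0.0 elsewhere; broadcast over the score matrix.
    """
    attn_main = (q_main @ k_main.transpose(-1, -2)) * scale   # (..., N, N)
    attn_ref = (q_ref @ k_ref.transpose(-1, -2)) * scale
    # Reference scores govern the object region, main scores the background.
    attn = mask * attn_ref + (1.0 - mask) * attn_main
    return F.softmax(attn, dim=-1) @ v
```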
Contributions and Results
The paper makes several significant contributions:
- Demonstrating higher accuracy in image inversion with high-order ODE solvers compared to DDIM.
- Introducing an exceptional prompt that significantly enhances invertibility, outperforming state-of-the-art baselines across various datasets.
- Presenting a training-free framework that effectively enables cross-domain image-guided composition.
- Empirically showing that TF-ICON surpasses current methodologies in both qualitative and quantitative assessments, particularly in the photorealism domain.
Quantitative evaluation in the photorealism domain, using metrics such as LPIPS and CLIP scores, confirmed TF-ICON's superior performance in maintaining both background consistency and foreground fidelity. Moreover, a user study showed that participants preferred TF-ICON's results across multiple domains, including oil painting, sketching, and cartoon animation, corroborating its compositional versatility.
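The metrics named above can be made concrete with a short sketch: LPIPS over the unchanged background region (lower means more consistent) and a CLIP image-image similarity over the foreground crop (higher means more faithful). The cropping scheme and model checkpoints below are assumptions for illustration; the paper defines its own protocol.

```python
# Hedged sketch of background-consistency (LPIPS) and foreground-fidelity
# (CLIP similarity) metrics. Crops and checkpoints are illustrative.
import torch
import lpips
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

to_tensor = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
])

def background_lpips(bg_before: Image.Image, bg_after: Image.Image) -> float:
    """LPIPS distance between background crops (inputs scaled to [-1, 1])."""
    net = lpips.LPIPS(net="alex")
    a = to_tensor(bg_before).unsqueeze(0) * 2 - 1
    b = to_tensor(bg_after).unsqueeze(0) * 2 - 1
    return net(a, b).item()

def foreground_clip_sim(fg_ref: Image.Image, fg_out: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings of foreground crops."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(images=[fg_ref, fg_out], return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[0] @ emb[1]).item()
```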
Implications and Future Directions
TF-ICON represents a notable step forward in the field of image synthesis, offering a practical solution for image-guided composition across diverse artistic styles. The framework's ability to function without dedicated training renders it particularly advantageous for applications requiring adaptability across various domains, such as digital art, media production, and advertising.
Future research could extend TF-ICON to generate novel viewpoints of composited objects, a capability currently limited by its reliance on preserving self-attention maps. Integrating techniques from personalized concept learning or volumetric rendering could help overcome this limitation. Addressing the biases and ethical concerns associated with large diffusion models also remains an open challenge.
In conclusion, TF-ICON establishes a robust approach to cross-domain image composition, presenting opportunities for enhanced content creation and manipulation, which could be pivotal in numerous creative and industrial applications.