Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator (2411.15466v2)

Published 23 Nov 2024 in cs.CV

Abstract: Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/

Summary

  • The paper introduces Diptych Prompting, reframing text-to-image generation as an inpainting task for zero-shot subject-driven image synthesis.
  • It employs techniques like background removal and reference attention enhancement to preserve fine-grained subject details and boost image fidelity.
  • Experimental results using DINO and CLIP metrics validate its superior performance and versatility in stylized generation and image editing tasks.

Analyzing Diptych Prompting: A Novel Approach for Zero-Shot Subject-Driven Text-to-Image Generation

The paper "Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator" introduces a novel approach, Diptych Prompting, for zero-shot subject-driven text-to-image generation. This method leverages the inpainting capabilities of large-scale text-to-image models, particularly focusing on the emergent property of diptych generation.

Methodology and Key Components

The authors recast zero-shot subject-driven text-to-image generation as an inpainting task within a diptych framework. Diptych Prompting arranges an incomplete diptych in which the left panel contains the reference image and the right panel is filled in by text-conditioned inpainting. The approach builds on the advanced text comprehension and image generation abilities of FLUX, a large-scale multimodal diffusion transformer that exhibits an emergent ability to generate coherent diptych images; a minimal sketch of the setup follows.
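The sketch below illustrates the pipeline under stated assumptions: it uses the rembg package for background removal and the diffusers FluxFillPipeline with the FLUX.1-Fill-dev checkpoint as the inpainting backend, which may differ from the paper's exact inpainting configuration; the file name and prompt wording are placeholders, not the authors' template.

```python
import torch
from PIL import Image
from rembg import remove                   # off-the-shelf background removal
from diffusers import FluxFillPipeline     # assumed FLUX inpainting backend

# 1. Remove the reference background to prevent content leakage,
#    then composite the subject onto white.
ref_rgba = remove(Image.open("reference.png"))        # RGBA, transparent bg
w, h = ref_rgba.size
ref = Image.new("RGB", (w, h), "white")
ref.paste(ref_rgba, mask=ref_rgba.split()[-1])        # alpha as paste mask

# 2. Build the incomplete diptych: reference on the left, blank right panel.
diptych = Image.new("RGB", (2 * w, h), "white")
diptych.paste(ref, (0, 0))

# 3. Inpainting mask: 0 = keep (left panel), 255 = regenerate (right panel).
mask = Image.new("L", (2 * w, h), 0)
mask.paste(255, (w, 0, 2 * w, h))

# 4. Text-conditioned inpainting of the right panel with a diptych-style
#    prompt (illustrative wording only).
pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")
prompt = ("A diptych with two side-by-side images of the same subject. "
          "On the right, replicate the subject exactly as a plush toy on a beach.")
result = pipe(prompt=prompt, image=diptych, mask_image=mask,
              height=h, width=2 * w, num_inference_steps=50).images[0]
subject_image = result.crop((w, 0, 2 * w, h))         # keep the right panel
```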

Two enhancements sharpen this methodology. First, the background of the reference image is removed so that unwanted content does not leak into the generated panel. Second, reference attention enhancement rescales the attention weights between the panels, letting the model better preserve fine-grained details of the subject in the reference image and improving the fidelity of the generated output.
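The rescaling can be sketched as below. It assumes diptych tokens are ordered left panel first, and that logits from right-panel queries to left-panel keys are multiplied by a factor greater than one before the softmax; the exact rule and scale value are assumptions, since the summary does not specify them.

```python
import torch

def reference_enhanced_attention(q, k, v, n_left: int, scale: float = 1.3):
    """Scaled dot-product attention over a diptych token sequence.

    q, k, v: (batch, heads, seq, dim); the first `n_left` tokens belong to
    the reference (left) panel. Pre-softmax logits from right-panel queries
    toward left-panel keys are multiplied by `scale` (> 1) so the generated
    panel attends more strongly to the reference subject.
    """
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, H, S, S)
    # Boost attention of right-panel queries (rows) to left-panel keys (cols).
    logits[..., n_left:, :n_left] = logits[..., n_left:, :n_left] * scale
    return logits.softmax(dim=-1) @ v
```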

Experimental Observations

Through extensive experimentation, Diptych Prompting demonstrates superior performance in subject-driven generation tasks compared to existing encoder-based zero-shot methods, as substantiated by human evaluation studies. Notably, the approach exhibits remarkable versatility, extending beyond subject-driven text-to-image generation to support tasks such as stylized image generation and subject-driven image editing.

Quantitatively, evaluations using DINO and CLIP-based scores show that the method matches or exceeds baseline models in both subject and text alignment. Qualitative results confirm the preservation of intricate subject details and faithful rendering of prompt-driven contexts across diverse subjects, and human preference studies favor the generated images over those of traditional methods in both perceptual quality and alignment with the target text.
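As a rough illustration of these metrics, the sketch below computes a CLIP text-image score and a DINO image-image (subject) score using off-the-shelf Hugging Face checkpoints; the specific models and preprocessing here are assumptions and may not match the paper's evaluation protocol.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ViTModel, ViTImageProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vits16")
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vits16")

@torch.no_grad()
def clip_text_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    img = F.normalize(clip.get_image_features(
        pixel_values=inputs["pixel_values"]), dim=-1)
    txt = F.normalize(clip.get_text_features(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"]), dim=-1)
    return (img @ txt.T).item()

@torch.no_grad()
def dino_score(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between DINO CLS embeddings of two images."""
    feats = []
    for im in (generated, reference):
        px = dino_proc(images=im, return_tensors="pt")["pixel_values"]
        cls = dino(pixel_values=px).last_hidden_state[:, 0]  # CLS token
        feats.append(F.normalize(cls, dim=-1))
    return (feats[0] @ feats[1].T).item()
```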

Implications and Future Directions

The implications of this research are multifaceted. Diptych Prompting exploits the inherent capabilities of large-scale text-to-image models such as FLUX in novel and efficient ways, pointing to pathways for improving the interpretability and efficiency of generative models. Future work might extend the methodology to multi-subject scenarios or integrate similar frameworks into other generative modalities, such as video and 3D content synthesis.

In conclusion, this paper provides a substantial contribution to the field of AI-driven image generation by devising a new perspective on zero-shot methods. The proposed framework not only capitalizes on the advanced capacities of large-scale models for generating high-quality, contextually accurate images but also opens up avenues for further innovation and application across a range of computational art and design tasks.
