
Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning (2307.11410v2)

Published 21 Jul 2023 in cs.CV

Abstract: Recent progress in personalized image generation using diffusion models has been significant. However, development in the area of open-domain and non-fine-tuning personalized image generation is proceeding rather slowly. In this paper, we propose Subject-Diffusion, a novel open-domain personalized image generation model that, in addition to not requiring test-time fine-tuning, also only requires a single reference image to support personalized generation of single- or multi-subject in any domain. Firstly, we construct an automatic data labeling tool and use the LAION-Aesthetics dataset to construct a large-scale dataset consisting of 76M images and their corresponding subject detection bounding boxes, segmentation masks and text descriptions. Secondly, we design a new unified framework that combines text and image semantics by incorporating coarse location and fine-grained reference image control to maximize subject fidelity and generalization. Furthermore, we also adopt an attention control mechanism to support multi-subject generation. Extensive qualitative and quantitative results demonstrate that our method outperforms other SOTA frameworks in single, multiple, and human customized image generation. Please refer to our project page: https://oppo-mente-lab.github.io/subject_diffusion/

Subject-Diffusion: Personalized Text-to-Image Synthesis

Subject-Diffusion represents a significant advance in personalized text-to-image generation with diffusion models. Unlike prior personalization methods that require test-time fine-tuning or multiple reference images, Subject-Diffusion enables single- and multi-subject image synthesis from just one reference image, with no fine-tuning at generation time. The paper focuses on building an open-domain model, trained on a large-scale dataset so that no per-subject model adjustment is needed.

Dataset and Framework

The authors introduce a novel dataset of 76 million images sourced from LAION-Aesthetics. Each image is annotated by an automatic data labeling pipeline that produces subject detection bounding boxes, segmentation masks, and text descriptions. This structured dataset helps overcome the data scarcity and annotation challenges that typically hamper personalized image generation; a sketch of such a pipeline follows.
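
As a rough illustration, the pipeline the paper describes (caption each image, extract subject phrases, then localize each subject) can be sketched as below. The `captioner`, `detector`, and `segmenter` callables are hypothetical stand-ins for a captioning model, an open-vocabulary detector, and a box-prompted segmenter; only the spaCy noun extraction is concrete, and the record layout is an assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # POS model used only for noun-phrase extraction

def label_image(image, captioner, detector, segmenter):
    """Produce per-subject annotations (caption, subject phrase, box, mask)
    for one image. The three callables are hypothetical stand-ins."""
    text = captioner(image)                       # full-image caption
    records = []
    for chunk in nlp(text).noun_chunks:           # candidate subject phrases
        for box in detector(image, chunk.text):   # open-vocabulary boxes for the phrase
            records.append({
                "caption": text,
                "subject": chunk.text,
                "bbox": box,                      # assumed (x0, y0, x1, y1)
                "mask": segmenter(image, box),    # box-prompted segmentation mask
            })
    return records
```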

Subject-Diffusion incorporates several innovations to maximize subject fidelity. The framework integrates:

  1. Text-Image Fusion: A unified approach blends text and image semantics, enriching the conditioning of the generative model. This is achieved by alignment strategies that combine coarse location inputs with fine-grained reference image features.
  2. Attention Control: To handle multi-subject generation effectively, an attention control mechanism is employed. It keeps each subject's representation distinct within the image, tackling the subject confusion that often arises with multiple entities. A minimal sketch of both mechanisms follows this list.
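
The following PyTorch sketch illustrates these two ideas under assumed tensor shapes and module names; it is an illustration, not the authors' implementation. `SubjectFusion` blends a projected reference-image embedding into the text sequence at the subject word's token position, and `mask_subject_attention` confines that token's cross-attention to the subject's box or mask region.

```python
import torch
import torch.nn as nn

class SubjectFusion(nn.Module):
    """Overwrite the subject word's token embedding with a blend of the
    text embedding and a projected reference-image embedding.
    Dimensions and wiring here are assumptions for illustration."""
    def __init__(self, text_dim=768, img_dim=1024):
        super().__init__()
        self.proj = nn.Linear(img_dim, text_dim)      # image feature -> text space
        self.mix = nn.Linear(2 * text_dim, text_dim)  # blend both views

    def forward(self, text_emb, img_emb, tok_idx):
        # text_emb: (B, L, text_dim); img_emb: (B, img_dim); tok_idx: int position
        fused = self.mix(torch.cat([text_emb[:, tok_idx], self.proj(img_emb)], dim=-1))
        out = text_emb.clone()
        out[:, tok_idx] = fused  # subject token now carries image semantics
        return out

def mask_subject_attention(attn, tok_idx, region, eps=1e-6):
    """Confine one subject token's cross-attention to its image region.
    attn: (B, H, P, L) attention probs over P image patches and L text tokens;
    region: (B, P) binary mask from the subject's box or segmentation mask."""
    attn = attn.clone()
    attn[..., tok_idx] = attn[..., tok_idx] * region[:, None, :]  # zero outside region
    return attn / (attn.sum(dim=-1, keepdim=True) + eps)          # renormalize per patch
```

Applying such a mask during training encourages each subject token's attention to stay inside its own region, which is what mitigates cross-subject leakage when multiple references are supplied.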

Evaluation and Results

In quantitative evaluations against established personalization methods on the DreamBench and OpenImages benchmarks, Subject-Diffusion shows strong fidelity and generalization. It achieves superior DINO scores, surpassing existing methods in preserving subject identity while remaining competitive on image-text alignment. Notably, it performs comparably to fine-tuned models such as DreamBooth in subject detail preservation and image-text alignment, without the expensive fine-tuning process.
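
The DINO score referenced here is the subject-fidelity metric popularized by the DreamBooth evaluation: cosine similarity between self-supervised DINO ViT features of the reference and generated images. A minimal sketch, assuming the public DINO checkpoint on torch hub and standard ImageNet preprocessing (exact preprocessing varies between papers):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Public self-supervised DINO ViT-S/16 from torch hub.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

# Standard ImageNet preprocessing (an assumption; evaluation setups differ slightly).
prep = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def dino_score(ref_path: str, gen_path: str) -> float:
    """Cosine similarity of DINO features for a reference/generated image pair."""
    ref, gen = (model(prep(Image.open(p).convert('RGB')).unsqueeze(0))
                for p in (ref_path, gen_path))
    return F.cosine_similarity(ref, gen).item()
```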

Practical and Theoretical Implications

Subject-Diffusion has wide-ranging implications for both theory and practice. Theoretically, it sheds light on how readily diffusion models adapt to reference-image conditioning, especially in maintaining subject fidelity across open domains. Practically, the framework could reshape how users interact with generative models in fields such as digital art, media production, and personalized content creation.

The zero-shot capability opens avenues for easily scalable applications that require minimal user input while maximizing creative output. Users can generate personalized images in diverse scenarios from a single reference image and deploy Subject-Diffusion in interactive workflows without the computational cost of per-subject fine-tuning.

Future Directions

Future advancements may focus on refining subject manipulation, especially attribute-level editing of generated subjects. Additional research could explore how such models scale to many concurrently interacting subjects across their respective domains. Expanding the dataset to still broader contexts and integrating richer semantic understanding into the generation process might further extend the efficacy and scope of Subject-Diffusion.

In conclusion, Subject-Diffusion marks a pivotal step in open-domain personalized image generation, offering robust performance in a novel framework that aligns well with current computational and practical demands. The presented work lays the groundwork for future innovations in the generative AI landscape.

Authors (4)
  1. Jian Ma (99 papers)
  2. Junhao Liang (10 papers)
  3. Chen Chen (752 papers)
  4. Haonan Lu (35 papers)
Citations (101)