JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Published 8 Jul 2024 in cs.CV and cs.GR | (2407.06187v1)

Abstract: Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve the personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower compared to their finetuning counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.

Citations (6)

Summary

  • The paper introduces JeDi, a finetuning-free method for personalized text-to-image generation that learns the joint distribution of multiple text-image pairs sharing a subject.
  • JeDi modifies the diffusion U-Net's self-attention layers with coupled self-attention to encode relationships between the multiple reference images during training and generation (a minimal sketch follows this list).
  • At test time, JeDi uses reference images and guidance, achieving state-of-the-art personalized generation with a masked DINO score of 0.8037 using three references.
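The coupled self-attention mentioned above can be pictured with a small sketch. The snippet below is a minimal PyTorch illustration under the assumption that the features of the N images of one joint sample are stacked along the first dimension; it is not the authors' implementation, and the class and variable names are hypothetical.

```python
# Minimal sketch of coupled self-attention (not the authors' code).
# Assumption: features for the N images of one joint sample arrive as a
# tensor of shape (N, L, D) -- N images, L spatial tokens, D channels.
# Coupling is expressed by letting every image's queries attend over the
# keys/values of all N images concatenated along the token axis.
import torch
import torch.nn.functional as F
from torch import nn

class CoupledSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, L, D) features of the N coupled images in one sample set
        n, l, d = x.shape
        h = self.num_heads
        q = self.to_q(x)                     # queries stay per-image: (N, L, D)
        kv = x.reshape(1, n * l, d)          # keys/values shared by the whole set
        k = self.to_k(kv).expand(n, -1, -1)  # (N, N*L, D)
        v = self.to_v(kv).expand(n, -1, -1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (N, T, D) -> (N, h, T, D/h)
            return t.reshape(n, t.shape[1], h, d // h).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(n, l, d)
        return self.to_out(out)
```

In a layer like this, each image in the sample set can draw on features from every other image, which is how appearance details from references can propagate to the generated image.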

This paper introduces Joint-Image Diffusion (JeDi), a finetuning-free method for personalized text-to-image generation that learns the joint distribution of multiple related text-image pairs that share a common subject.

  • A scalable synthetic dataset generation technique is proposed using LLMs and pre-trained single-image diffusion models to facilitate the training of the joint distribution.
  • The method modifies the self-attention layers of the diffusion U-Net so that attention is computed jointly over all images in a sample set (coupled self-attention): the attention blocks of the different input images are coupled, allowing each image to attend to features of the others in the set.
  • At test time, JeDi takes the reference images as input during the sampling process and uses guidance techniques to improve image alignment (see the sampling sketch below). It achieves state-of-the-art results against both finetuning-based and finetuning-free personalization baselines, with a masked DINO score of 0.8037 when using three reference images.
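To make the test-time procedure concrete, the sketch below shows one plausible way to inject reference images during the reverse diffusion loop, in an inpainting style consistent with the description above: the references are treated as known images of the joint sample and are re-noised and held fixed at each step, while only the target latent is denoised. It assumes a diffusers-style scheduler interface; `model`, `encode`, and the classifier-free-style guidance are placeholder assumptions, not the paper's exact procedure.

```python
# Schematic of finetuning-free personalization at sampling time (a sketch,
# not the paper's code). `model` denotes the joint-image denoiser, `encode`
# a latent encoder for the reference images, and `scheduler` a
# diffusers-style noise scheduler; all three are placeholder assumptions.
import torch

@torch.no_grad()
def personalized_sample(model, scheduler, ref_images, text_emb, guidance_scale=3.0):
    # Encode the reference images once; latents have shape (R, C, H, W).
    ref_latents = torch.stack([encode(img) for img in ref_images])
    target = torch.randn_like(ref_latents[:1])  # latent of the image to generate

    for t in scheduler.timesteps:
        # Keep the references "known": re-noise the clean reference latents
        # to the current timestep instead of denoising them.
        noise = torch.randn_like(ref_latents)
        noisy_refs = scheduler.add_noise(ref_latents, noise, t)

        # The joint sample = noisy reference latents + current target latent.
        joint = torch.cat([noisy_refs, target], dim=0)

        # Guidance: combine conditional and unconditional predictions
        # (classifier-free style; the paper's exact guidance may differ).
        eps_cond = model(joint, t, text_emb)
        eps_uncond = model(joint, t, None)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # Step the whole joint sample, but only keep the target latent;
        # the references are re-injected at the next iteration.
        joint = scheduler.step(eps, t, joint).prev_sample
        target = joint[-1:]

    return target
```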
