JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

Published 8 Jul 2024 in cs.CV and cs.GR | (2407.06187v1)

Abstract: Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve the personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower compared to their finetuning counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.

Citations (6)

Summary

  • The paper introduces JeDi, a finetuning-free method for personalized text-to-image generation that learns the joint distribution of multiple text-image pairs sharing a subject.
  • JeDi modifies the diffusion U-Net's self-attention layers with coupled self-attention to encode relationships between the multiple reference images during training and generation (a minimal sketch follows this list).
  • At test time, JeDi uses reference images and guidance, achieving state-of-the-art personalized generation with a masked DINO score of 0.8037 using three references.
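The coupled self-attention mentioned above can be pictured with a small sketch. The snippet below is a minimal PyTorch illustration under the assumption that the features of the N images of one joint sample are stacked along the first dimension; it is not the authors' implementation, and the class and variable names are hypothetical.

```python
# Minimal sketch of coupled self-attention (not the authors' code).
# Assumption: features for the N images of one joint sample arrive as a
# tensor of shape (N, L, D) -- N images, L spatial tokens, D channels.
# Coupling is expressed by letting every image's queries attend over the
# keys/values of all N images concatenated along the token axis.
import torch
import torch.nn.functional as F
from torch import nn

class CoupledSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, L, D) features of the N coupled images in one sample set
        n, l, d = x.shape
        h = self.num_heads
        q = self.to_q(x)                     # queries stay per-image: (N, L, D)
        kv = x.reshape(1, n * l, d)          # keys/values shared by the whole set
        k = self.to_k(kv).expand(n, -1, -1)  # (N, N*L, D)
        v = self.to_v(kv).expand(n, -1, -1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (N, T, D) -> (N, h, T, D/h)
            return t.reshape(n, t.shape[1], h, d // h).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(n, l, d)
        return self.to_out(out)
```

In a layer like this, each image in the sample set can draw on features from every other image, which is how appearance details from references can propagate to the generated image.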

This paper introduces Joint-Image Diffusion (JeDi), a finetuning-free method for personalized text-to-image generation that learns the joint distribution of multiple related text-image pairs that share a common subject.

  • A scalable synthetic dataset generation technique is proposed using LLMs and pre-trained single-image diffusion models to facilitate the training of the joint distribution.
  • The method modifies the self-attention layers of the diffusion U-Net so that attention is computed jointly over all images in a sample set (coupled self-attention): the attention blocks of the different input images are coupled, allowing each image to attend to features of the others in the set.
  • At test time, JeDi takes the reference images as input during the sampling process and uses guidance techniques to improve image alignment (see the sampling sketch below). It achieves state-of-the-art results against both finetuning-based and finetuning-free personalization baselines, with a masked DINO score of 0.8037 when using three reference images.
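To make the test-time procedure concrete, the sketch below shows one plausible way to inject reference images during the reverse diffusion loop, in an inpainting style consistent with the description above: the references are treated as known images of the joint sample and are re-noised and held fixed at each step, while only the target latent is denoised. It assumes a diffusers-style scheduler interface; `model`, `encode`, and the classifier-free-style guidance are placeholder assumptions, not the paper's exact procedure.

```python
# Schematic of finetuning-free personalization at sampling time (a sketch,
# not the paper's code). `model` denotes the joint-image denoiser, `encode`
# a latent encoder for the reference images, and `scheduler` a
# diffusers-style noise scheduler; all three are placeholder assumptions.
import torch

@torch.no_grad()
def personalized_sample(model, scheduler, ref_images, text_emb, guidance_scale=3.0):
    # Encode the reference images once; latents have shape (R, C, H, W).
    ref_latents = torch.stack([encode(img) for img in ref_images])
    target = torch.randn_like(ref_latents[:1])  # latent of the image to generate

    for t in scheduler.timesteps:
        # Keep the references "known": re-noise the clean reference latents
        # to the current timestep instead of denoising them.
        noise = torch.randn_like(ref_latents)
        noisy_refs = scheduler.add_noise(ref_latents, noise, t)

        # The joint sample = noisy reference latents + current target latent.
        joint = torch.cat([noisy_refs, target], dim=0)

        # Guidance: combine conditional and unconditional predictions
        # (classifier-free style; the paper's exact guidance may differ).
        eps_cond = model(joint, t, text_emb)
        eps_uncond = model(joint, t, None)
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # Step the whole joint sample, but only keep the target latent;
        # the references are re-injected at the next iteration.
        joint = scheduler.step(eps, t, joint).prev_sample
        target = joint[-1:]

    return target
```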
