DreamTuner: Single Image is Enough for Subject-Driven Generation (2312.13691v1)
Abstract: Diffusion-based models have demonstrated impressive text-to-image generation capabilities and hold promise for subject-driven generation, a personalized application that requires generating a customized concept from only one or a few reference images. However, existing fine-tuning-based methods fail to balance the trade-off between learning the subject and preserving the generation capabilities of the pretrained model, while methods that rely on an additional image encoder tend to lose important subject details due to encoding compression. To address these challenges, we propose DreamTuner, a novel method that injects reference information from coarse to fine for more effective subject-driven image generation. DreamTuner introduces a subject-encoder for coarse subject-identity preservation, in which compressed general subject features are injected through an attention layer placed before the visual-text cross-attention. We then modify the self-attention layers of the pretrained text-to-image model into self-subject-attention layers to refine the details of the target subject: in self-subject-attention, the generated image queries detailed features from both the reference image and itself. Notably, self-subject-attention is an effective, elegant, and training-free way to preserve the detailed features of a customized subject, and it can serve as a plug-and-play solution at inference time. Finally, with additional subject-driven fine-tuning, DreamTuner achieves remarkable performance in subject-driven image generation, controllable by text or other conditions such as pose. For further details, please visit the project page at https://dreamtuner-diffusion.github.io/.
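The self-subject-attention described above amounts to standard attention in which the queries come only from the image being generated, while the keys and values are drawn from the concatenation of the generated image's own features and the reference image's features. The PyTorch sketch below illustrates that idea; the function name, the `omega` weight on the reference tokens, and the tensor shapes are our assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def self_subject_attention(x_gen, x_ref, to_q, to_k, to_v,
                           num_heads, omega=1.0):
    """Hedged sketch of a self-subject-attention layer.

    x_gen: (B, N, C) self-attention input features of the image
           being generated at the current denoising step.
    x_ref: (B, M, C) features of the reference image at the same
           step and resolution (e.g. cached while denoising the
           inverted reference latent).
    to_q / to_k / to_v: the frozen projection layers of the
           pretrained self-attention block, reused unchanged,
           which is what keeps the mechanism training-free.
    omega: assumed scalar that re-weights the reference tokens.
    """
    B, N, C = x_gen.shape

    # Queries come only from the generated image.
    q = to_q(x_gen)

    # Keys and values come from both the generated image and the
    # reference image; scaling the reference keys by omega weights
    # the reference attention logits (q @ k_ref.T) accordingly.
    k = torch.cat([to_k(x_gen), omega * to_k(x_ref)], dim=1)
    v = torch.cat([to_v(x_gen), to_v(x_ref)], dim=1)

    # Split into heads and run standard scaled dot-product attention.
    def heads(t):
        return t.view(B, -1, num_heads, C // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    return out.transpose(1, 2).reshape(B, N, C)
```

Because only the key/value inputs change and the pretrained projections are reused as-is, a routine like this can be patched into the U-Net's self-attention blocks at inference without any retraining, consistent with the plug-and-play claim in the abstract.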