RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization (2403.00483v1)

Published 1 Mar 2024 in cs.CV

Abstract: Text-to-image customization, which aims to synthesize text-driven images for the given subjects, has recently revolutionized content creation. Existing works follow the pseudo-word paradigm, i.e., represent the given subjects as pseudo-words and then compose them with the given text. However, the inherent entangled influence scope of pseudo-words with the given text results in a dual-optimum paradox, i.e., the similarity of the given subjects and the controllability of the given text could not be optimal simultaneously. We present RealCustom that, for the first time, disentangles similarity from controllability by precisely limiting subject influence to relevant parts only, achieved by gradually narrowing the real text word from its general connotation to the specific subject and using its cross-attention to distinguish relevance. Specifically, RealCustom introduces a novel "train-inference" decoupled framework: (1) during training, RealCustom learns a general alignment between visual conditions and the original textual conditions by a novel adaptive scoring module to adaptively modulate influence quantity; (2) during inference, a novel adaptive mask guidance strategy is proposed to iteratively update the influence scope and influence quantity of the given subjects to gradually narrow the generation of the real text word. Comprehensive experiments demonstrate the superior real-time customization ability of RealCustom in the open domain, achieving both unprecedented similarity of the given subjects and controllability of the given text for the first time. The project page is https://corleone-huang.github.io/realcustom/.

Summary

  • The paper introduces RealCustom, an approach that iteratively refines real text words to balance similarity and controllability in image generation.
  • Its adaptive scoring module and mask guidance strategy dynamically adjust subject influence, achieving superior quantitative and qualitative results.
  • The method successfully resolves the dual-optimum paradox, opening avenues for advanced generative AI applications in personalized media and gaming.

Disentangling Similarity and Controllability in Text-to-Image Customization with RealCustom

Introduction to RealCustom

The emergence of text-to-image models has significantly impacted AI-driven content creation by tailoring visual content to textual descriptions. The dominant approach to customization represents a given subject as a pseudo-word and composes it with the text prompt for image generation. However, because the pseudo-word's influence is entangled with the rest of the prompt, this strategy cannot simultaneously optimize resemblance to the specific subject (similarity) and adherence to the descriptive context (controllability), a challenge the authors call the dual-optimum paradox.
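For context, here is a minimal PyTorch sketch of the pseudo-word idea, assuming a CLIP-style embedding table and a single learned vector; the names `token_table`, `pseudo_word`, and `embed_prompt` are illustrative, not from any particular implementation:

```python
import torch

vocab_size, dim = 49408, 768                          # CLIP-like text-encoder sizes
token_table = torch.randn(vocab_size, dim)            # frozen embedding table
pseudo_word = torch.randn(dim, requires_grad=True)    # the only trained parameter

def embed_prompt(token_ids: torch.Tensor, pseudo_pos: int) -> torch.Tensor:
    """Embed a prompt, splicing the learned pseudo-word in at pseudo_pos."""
    emb = token_table[token_ids].clone()
    emb[pseudo_pos] = pseudo_word      # "<sks>" replaces a placeholder slot
    return emb                          # goes on to the frozen text encoder

prompt_ids = torch.randint(0, vocab_size, (8,))       # toy token ids
cond = embed_prompt(prompt_ids, pseudo_pos=3)
```

Because the pseudo-word enters the prompt like any other token, its influence spreads over the whole generated image, which is the root of the entanglement described above.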

RealCustom departs from this paradigm by precisely limiting the subject's influence to the image regions where it is relevant, which allows similarity and controllability to be optimized independently. Unlike pseudo-words, which uniformly affect the entire generation, RealCustom iteratively narrows a real text word, such as "toy," from its general meaning toward the specific subject, like a "brown sloth toy," leveraging the model's built-in cross-attention to identify which regions the word governs.
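The narrowing mechanism hinges on reading the cross-attention map of the chosen real word. As a hedged illustration of how such a relevance mask could be derived (the function `subject_mask`, the top-k thresholding, and all shapes are assumptions rather than the paper's exact procedure):

```python
import torch

def subject_mask(attn: torch.Tensor, word_idx: int, top_ratio: float = 0.25) -> torch.Tensor:
    """attn: (heads, H*W, tokens) cross-attention from one denoising step.
    Returns a binary mask over spatial locations most attended by the word."""
    relevance = attn[:, :, word_idx].mean(dim=0)      # (H*W,) averaged over heads
    k = max(1, int(top_ratio * relevance.numel()))
    thresh = relevance.topk(k).values.min()
    return (relevance >= thresh).float()              # 1 = subject-relevant location

heads, hw, tokens = 8, 64 * 64, 77
attn = torch.rand(heads, hw, tokens).softmax(dim=-1)  # stand-in attention
mask = subject_mask(attn, word_idx=5)                 # word 5 = "toy" (assumed position)
```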

The RealCustom Paradigm

  • Training and Inference: RealCustom decouples training from inference. During training, it learns a general alignment between visual conditions and the original textual conditions through a novel adaptive scoring module, which modulates the influence quantity based on the currently generated features and the textual input. During inference, an adaptive mask guidance strategy iteratively adjusts the subject's influence scope and quantity to narrow the generation toward the specific subject; a toy skeleton of this loop is sketched after this list.
  • Technical Contributions:
    • The adaptive scoring module modulates the influence quantity, selecting key visual features for incorporation into the generative process based on both visual and textual relevance.
    • The adaptive mask guidance during inference smoothly transitions the representation from a general concept to a specific subject, employing an innovative method that refines both the scope and quantity of influence.
  • Quantitative and Qualitative Outcomes: RealCustom demonstrates superior performance in various aspects:
    • It achieves notable improvements in similarity and controllability metrics over existing text-to-image customization methods.
    • The qualitative analysis shows RealCustom producing more accurate and contextually relevant images compared to state-of-the-art alternatives, confirming its ability to resolve the dual-optimum paradox effectively.
    • RealCustom's iterative refinement strategy ensures the generated images faithfully represent the given subjects while accurately following textual descriptions, showcasing remarkable open-domain customization capabilities.
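Based only on the description above, the following toy PyTorch skeleton shows how the two pieces could interact at inference time: an adaptive-scoring stand-in decides how much of the subject enters the generation (influence quantity), and a cross-attention-derived mask decides where (influence scope). Every class, function, and shape here is an illustrative assumption, not the paper's implementation:

```python
import torch
import torch.nn as nn

H = W = 16                                   # tiny latent grid for illustration
N, dim = 16, 64                              # subject feature tokens and width

class AdaptiveScoring(nn.Module):
    """Toy stand-in for the adaptive scoring module: weights each subject
    feature by its relevance to the text and the current generation."""
    def __init__(self, dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(dim, dim)
        self.context_proj = nn.Linear(dim, dim)

    def forward(self, visual_feats, text_emb, latent_ctx):
        context = self.context_proj(text_emb + latent_ctx)          # (dim,)
        scores = self.visual_proj(visual_feats) @ context           # (N,)
        return torch.sigmoid(scores).unsqueeze(-1) * visual_feats   # influence quantity

def influence_scope(latent, word_emb, top_ratio):
    """Binary mask from the (stand-in) cross-attention of the real word."""
    attn = torch.softmax((latent @ word_emb).flatten(), dim=0)
    k = max(1, int(top_ratio * attn.numel()))
    return (attn >= attn.topk(k).values.min()).float().view(H, W)

def narrow(steps: int = 4, top_ratio: float = 0.25):
    latent = torch.randn(H, W, dim)          # latent being denoised
    word_emb = torch.randn(dim)              # embedding of the real word, e.g. "toy"
    visual_feats = torch.randn(N, dim)       # encoded subject image
    text_emb = torch.randn(dim)
    scorer = AdaptiveScoring(dim)
    for _ in range(steps):
        quantity = scorer(visual_feats, text_emb, latent.mean(dim=(0, 1)))
        scope = influence_scope(latent, word_emb, top_ratio)        # (H, W)
        subject = quantity.mean(dim=0)                              # (dim,)
        # Inject subject features only where the real word attends; the
        # rest of the image stays governed by the plain text prompt.
        latent = latent + scope.unsqueeze(-1) * subject
    return latent

out = narrow()
```

In the real system the mask would come from the diffusion U-Net's actual cross-attention maps, and the injection would happen inside its attention layers rather than on the latent directly; this skeleton only illustrates the interplay of scope and quantity across denoising steps.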

Implications and Future Directions

RealCustom's methodology has broad implications for the development of generative AI and its application in content creation. By disentangling the intertwined goals of similarity and controllability, it opens up new avenues for more nuanced and flexible content generation that can cater to a wide range of real-world applications, from personalized media to dynamic content generation for gaming and virtual environments.

This work also sets the stage for future research in improving the efficiency and effectiveness of text-to-image models. Potential directions include exploring more sophisticated mechanisms for influence modulation, extending the approach to video and other media types, and further enhancing model generalization to unseen subjects and contexts.

In conclusion, RealCustom marks a significant advance in the field of text-to-image customization. Its novel framework not only addresses the limitations of existing approaches but also broadens the horizon for creative and practical applications of generative AI technologies.