RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization (2403.00483v1)

Published 1 Mar 2024 in cs.CV

Abstract: Text-to-image customization, which aims to synthesize text-driven images for the given subjects, has recently revolutionized content creation. Existing works follow the pseudo-word paradigm, i.e., represent the given subjects as pseudo-words and then compose them with the given text. However, the inherent entangled influence scope of pseudo-words with the given text results in a dual-optimum paradox, i.e., the similarity of the given subjects and the controllability of the given text could not be optimal simultaneously. We present RealCustom that, for the first time, disentangles similarity from controllability by precisely limiting subject influence to relevant parts only, achieved by gradually narrowing real text word from its general connotation to the specific subject and using its cross-attention to distinguish relevance. Specifically, RealCustom introduces a novel "train-inference" decoupled framework: (1) during training, RealCustom learns general alignment between visual conditions to original textual conditions by a novel adaptive scoring module to adaptively modulate influence quantity; (2) during inference, a novel adaptive mask guidance strategy is proposed to iteratively update the influence scope and influence quantity of the given subjects to gradually narrow the generation of the real text word. Comprehensive experiments demonstrate the superior real-time customization ability of RealCustom in the open domain, achieving both unprecedented similarity of the given subjects and controllability of the given text for the first time. The project page is https://corleone-huang.github.io/realcustom/.

Summary

The paper introduces RealCustom, an approach that iteratively refines real text words to balance similarity and controllability in image generation.
Its adaptive scoring module and mask guidance strategy dynamically adjust subject influence, achieving superior quantitative and qualitative results.
The method successfully resolves the dual-optimum paradox, opening avenues for advanced generative AI applications in personalized media and gaming.

Disentangling Similarity and Controllability in Text-to-Image Customization with RealCustom

Introduction to RealCustom

Text-to-image models have significantly changed AI-driven content creation by generating visual content from textual descriptions. A typical customization approach represents the given subject as a pseudo-word and composes it with the text prompt. Because the pseudo-word's influence is entangled with the rest of the text, this strategy struggles to optimize both the resemblance to the specific subject (similarity) and the adherence to the text (controllability) at the same time. The paper calls this the dual-optimum paradox.

RealCustom departs from the pseudo-word paradigm by limiting the subject's influence to the relevant parts of the image only, so that similarity and controllability no longer compete. Where pseudo-words affect the entire generation uniformly, RealCustom gradually narrows a real text word, such as "toy", from its general connotation to the given subject, such as a "brown sloth toy". The cross-attention of that real word is used to decide which parts of the image are relevant to the subject.
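To make the idea of using a word's cross-attention to decide relevance more concrete, here is a minimal sketch in plain Python. It assumes the attention map for the real word is already available as a small 2D grid of non-negative weights. The function name `influence_mask`, the `keep_ratio` parameter, and the top-fraction selection rule are illustrative assumptions, not the paper's exact procedure.

```python
def influence_mask(attn, keep_ratio=0.25):
    """Turn one word's cross-attention map into a soft influence mask.

    attn: 2D list of non-negative attention weights (rows x cols).
    keep_ratio: hypothetical fraction of locations the subject may influence.
    """
    flat = sorted((v for row in attn for v in row), reverse=True)
    if not flat or flat[0] <= 0:
        # No attention signal at all: the subject influences nothing.
        return [[0.0 for _ in row] for row in attn]
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]  # k-th largest weight; ties may keep more than k cells
    peak = flat[0]
    # Keep the strongest locations, scaled to [0, 1]; zero out the rest.
    return [[v / peak if v >= threshold else 0.0 for v in row] for row in attn]


# Toy 2x2 attention map for the word "toy" (made-up values).
toy_attention = [[0.1, 0.8], [0.4, 0.2]]
mask = influence_mask(toy_attention, keep_ratio=0.5)
print(mask)
```

Locations where the word attends strongly keep a non-zero weight, and everything else is zeroed, so text-controlled regions such as the background stay untouched by the subject.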

The RealCustom Paradigm

Training and Inference: RealCustom decouples training from inference. During training, it learns a general alignment between visual conditions and the original textual conditions through an adaptive scoring module, which modulates how much the subject influences generation based on the text and the features generated so far. During inference, an adaptive mask guidance strategy iteratively updates the scope and quantity of the subject's influence, gradually narrowing the generation of the real text word towards the specific subject.
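As a rough illustration of what modulating influence quantity could look like, the sketch below scores each subject image token for textual and visual relevance and keeps only the top-scoring ones. The function `select_subject_tokens`, the linear weighting via `text_weight`, and the `keep` count are assumptions made for this example; the paper's scoring module is learned and is not reproduced here.

```python
def select_subject_tokens(text_scores, visual_scores, text_weight=0.5, keep=2):
    """Pick the subject tokens with the highest combined relevance.

    text_scores / visual_scores: one relevance score per subject token
    (hypothetical inputs; in practice these would come from a trained model).
    Returns the indices of the kept tokens in their original order.
    """
    if len(text_scores) != len(visual_scores):
        raise ValueError("score lists must have the same length")
    combined = [
        text_weight * t + (1.0 - text_weight) * v
        for t, v in zip(text_scores, visual_scores)
    ]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:keep])


# Four subject tokens with made-up relevance scores.
kept = select_subject_tokens([0.9, 0.1, 0.5, 0.2], [0.1, 0.2, 0.9, 0.3])
print(kept)
```

Keeping fewer tokens reduces how much subject information enters generation, and keeping more increases it.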
Technical Contributions:
- The adaptive scoring module modulates the influence quantity, selecting key visual features for incorporation into the generative process based on both visual and textual relevance.
- The adaptive mask guidance strategy, applied during inference, narrows the real text word from its general connotation to the specific subject by iteratively updating both the scope and the quantity of the subject's influence.
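The two contributions above can be pictured with a small sketch: a mask sets the scope of the subject's influence and a scalar sets its quantity. The additive combination, the function name `apply_masked_influence`, and the `scale` parameter are assumptions chosen for illustration, not the paper's exact update rule.

```python
def apply_masked_influence(text_feat, subject_feat, mask, scale=1.0):
    """Add subject features to text-driven features only where the mask allows.

    text_feat, subject_feat, mask: 2D lists with identical shapes.
    mask values in [0, 1] set the influence scope; scale sets the quantity.
    """
    return [
        [t + scale * m * s for t, s, m in zip(t_row, s_row, m_row)]
        for t_row, s_row, m_row in zip(text_feat, subject_feat, mask)
    ]


# Toy one-row feature maps (made-up values): the first location is outside
# the subject's scope, the second is partly inside it.
text_feat = [[1.0, 1.0]]
subject_feat = [[2.0, 2.0]]
mask = [[0.0, 0.5]]
print(apply_masked_influence(text_feat, subject_feat, mask, scale=2.0))
```

In an iterative setting, the mask would be recomputed at every denoising step from the current cross-attention of the real word, so the influenced region can tighten as the subject takes shape.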
Quantitative and Qualitative Outcomes: RealCustom demonstrates superior performance in various aspects:
- It achieves notable improvements in similarity and controllability metrics over existing text-to-image customization methods.
- In qualitative comparisons, RealCustom produces images that match both the subject and the prompt more closely than state-of-the-art alternatives, which supports its claim to resolve the dual-optimum paradox.
- Its iterative refinement keeps the generated images faithful to the given subjects while still following the text, including for open-domain subjects.

Implications and Future Directions

RealCustom's methodology has broad implications for the development of generative AI and its application in content creation. By disentangling the intertwined goals of similarity and controllability, it opens up new avenues for more nuanced and flexible content generation that can cater to a wide range of real-world applications, from personalized media to dynamic content generation for gaming and virtual environments.

This work also sets the stage for future research in improving the efficiency and effectiveness of text-to-image models. Potential directions include exploring more sophisticated mechanisms for influence modulation, extending the approach to video and other media types, and further enhancing model generalization to unseen subjects and contexts.

In conclusion, RealCustom marks a significant advance in the field of text-to-image customization. Its novel framework not only addresses the limitations of existing approaches but also broadens the horizon for creative and practical applications of generative AI technologies.
