DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation

Published 2 Oct 2024 in cs.CV (arXiv:2410.02067v2)

Abstract: In the realm of image generation, creating customized images from a visual prompt with additional textual instructions emerges as a promising endeavor. However, existing methods, both tuning-based and tuning-free, struggle with interpreting the subject-essential attributes from the visual prompt. This leads to subject-irrelevant attributes infiltrating the generation process, ultimately compromising the personalization quality in both editability and ID preservation. In this paper, we present DisEnvisioner, a novel approach for effectively extracting and enriching the subject-essential features while filtering out subject-irrelevant information, enabling exceptional customization performance in a tuning-free manner and using only a single image. Specifically, the features of the subject and other irrelevant components are effectively separated into distinct visual tokens, enabling much more accurate customization. To further improve ID consistency, we enrich the disentangled features, sculpting them into more granular representations. Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and overall image quality, highlighting the effectiveness and efficiency of DisEnvisioner. Project page: https://disenvisioner.github.io/.


Summary

  • The paper introduces DisEnvisioner, a method that disentangles and enriches visual prompts for achieving high-fidelity customized image generation.
  • It utilizes two components—DisVisioner for feature disentanglement and EnVisioner for feature enhancement—to preserve subject identity while filtering noise.
  • Extensive experiments on the DreamBooth dataset demonstrate improved text and image alignment metrics along with efficient inference.

The paper introduces "DisEnvisioner," a novel approach addressing the challenges inherent in customized image generation. The method improves customization fidelity by disentangling and enriching visual prompts for a diffusion model, requiring neither per-subject fine-tuning nor multiple reference images: a single image suffices.

Overview of the Approach

DisEnvisioner tackles the pervasive issue in existing methodologies where subject-irrelevant attributes from visual prompts infiltrate and degrade the customization quality. The primary innovation lies in the framework's ability to efficiently isolate and amplify subject-essential features while filtering out extraneous details. This is achieved by leveraging two key components—DisVisioner and EnVisioner—which work collaboratively to ensure accurate and high-quality image generation.

  • DisVisioner: This component operates as an image tokenizer, disentangling the image features into distinct subject-essential and irrelevant tokens. The use of a mutual exclusivity principle in the attention mechanism ensures the segregation of meaningful attributes from noise.
  • EnVisioner: Following disentanglement, EnVisioner refines and enhances the subject-essential features into more granular tokens. This enrichment bolsters ID consistency, ensuring high fidelity in the depiction of subject identity and overall visual quality. A minimal code sketch of both components follows this list.
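
The sketch below is a hypothetical illustration, not the paper's actual architecture. It assumes CLIP-style patch embeddings as input, models the mutual-exclusivity idea as a softmax over two learnable query tokens (subject vs. irrelevant), and models enrichment as additional granular queries re-attending to the patches; all module names, dimensions, and details are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DisVisionerSketch(nn.Module):
    """Hypothetical sketch: two learnable query tokens (subject / irrelevant) attend
    to image patch features. A softmax across the two queries per patch acts as a
    soft mutual-exclusivity constraint, so each patch feeds mostly one token."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(2, dim) * 0.02)  # [subject, irrelevant]
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (B, N, dim), e.g. CLIP ViT patch embeddings of the visual prompt
        k, v = self.to_k(patch_feats), self.to_v(patch_feats)
        attn = torch.einsum("qd,bnd->bqn", self.queries, k) / k.shape[-1] ** 0.5
        assign = attn.softmax(dim=1)                      # softmax over the query axis
        tokens = torch.einsum("bqn,bnd->bqd", assign, v)  # (B, 2, dim)
        return tokens[:, 0], tokens[:, 1]                 # subject token, irrelevant token

class EnVisionerSketch(nn.Module):
    """Hypothetical sketch: expand the single subject token into K finer-grained
    tokens by letting K learnable queries, offset by the subject token,
    re-attend to the image patches."""
    def __init__(self, dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.granular_queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, subject_tok: torch.Tensor, patch_feats: torch.Tensor):
        # subject_tok: (B, dim), patch_feats: (B, N, dim)
        q = self.granular_queries.unsqueeze(0) + subject_tok.unsqueeze(1)  # (B, K, dim)
        enriched, _ = self.attn(q, patch_feats, patch_feats)               # (B, K, dim)
        return enriched  # granular subject tokens
```

In the full pipeline, the resulting subject tokens would condition the diffusion model (presumably through cross-attention, as in adapter-style conditioning), while the irrelevant token serves only to absorb extraneous content during disentanglement; the sketch omits that integration.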

Experimental Validation

The method was evaluated extensively on the DreamBooth dataset, encompassing 158 images across 30 distinct subjects with 25 editing prompts per subject. The paper presents comprehensive experiments demonstrating DisEnvisioner’s superiority over existing methods across several key metrics:

  1. Text-Alignment (C-T): CLIP similarity between the generated image and the editing prompt, indicating how faithfully the method follows textual instructions (editability).
  2. Image-Alignment (C-I and D-I): CLIP- and DINO-based similarities between the generated image and the reference subject image, indicating stronger ID consistency than existing models (a sketch of how these alignment metrics are typically computed follows this list).
  3. Internal Variance (IV): Low variance across generations, indicating resilience against subject-irrelevant visual disturbances in the prompt.
  4. Inference Time (T): The method offers competitive inference speed, beneficial for practical application scenarios.
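
For context, the C-T/C-I/D-I scores above are standard alignment metrics in subject-driven generation, typically computed as cosine similarities in CLIP (and DINO) feature spaces. The sketch below assumes the open_clip package and PIL images; the paper's exact evaluation protocol (backbone choices, averaging over subjects and prompts) may differ.

```python
import torch
import torch.nn.functional as F
import open_clip

# Standard CLIP-based alignment metrics (a sketch, not the paper's exact protocol).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def text_alignment(gen_image, prompt: str) -> float:
    """C-T: cosine similarity between the generated image and the editing prompt."""
    img = F.normalize(model.encode_image(preprocess(gen_image).unsqueeze(0)), dim=-1)
    txt = F.normalize(model.encode_text(tokenizer([prompt])), dim=-1)
    return (img * txt).sum().item()

@torch.no_grad()
def image_alignment(gen_image, ref_image) -> float:
    """C-I: cosine similarity between generated and reference subject images.
    D-I is computed analogously with a DINO ViT as the feature extractor."""
    a = F.normalize(model.encode_image(preprocess(gen_image).unsqueeze(0)), dim=-1)
    b = F.normalize(model.encode_image(preprocess(ref_image).unsqueeze(0)), dim=-1)
    return (a * b).sum().item()
```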

The results underline DisEnvisioner's effectiveness in balancing editability and ID consistency, vital for realistic and faithful image customization.

Implications and Future Directions

DisEnvisioner successfully bridges the gap between accurate customization and efficient processing in image generation tasks. By disentangling visual prompts to focus solely on subject-essential features, it sets a new standard for how customized images can be generated without heavy computational burdens. The implications for practical application are significant, especially in fields requiring real-time and personalized image synthesis, such as virtual reality, video games, and media content creation.

Looking forward, this research could spur advancements in several domains:

  • Scalability and Adaptability: Enhancing DisEnvisioner to handle more complex scenes with multiple subjects or dynamic environments could broaden its application scope.
  • Integration with Other Modalities: Merging this approach with multimodal learning could offer richer, more context-aware image generation.
  • Theoretical Insights: Further exploration into disentanglement principles using the proposed method could provide deeper insights into feature representation in machine learning.

In summary, DisEnvisioner presents a forward-thinking solution to the complex issue of personalized image synthesis, offering both theoretical contributions and practical utility in artificial intelligence and computer vision fields.
