
Object-level Visual Prompts for Compositional Image Generation

Published 2 Jan 2025 in cs.CV, cs.AI, and cs.GR | (2501.01424v1)

Abstract: We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.

Summary

  • The paper introduces a novel KV-mixed cross-attention mechanism that uses separate encoders for layout control (keys) and appearance details (values) to enable compositional image generation with visual prompts.
  • The method incorporates object-level compositional guidance during inference to enhance identity preservation and layout adherence for individual visual prompts within complex scene compositions.
  • Results show the method outperforms existing techniques in maintaining object identity and inter-object differentiation in complex compositions, validated by objective metrics and qualitative examples.

The paper "Object-level Visual Prompts for Compositional Image Generation" extends text-to-image diffusion models with object-level visual prompts. These prompts enable the generation of semantically coherent compositions driven not only by text but also by a blend of foreground and background visual elements. The methodology focuses on preserving the identity of each visual prompt while still allowing diverse, flexible compositions across different scenes and styles.

Key Contributions

  1. KV-Mixed Cross-Attention Mechanism:
    • The central technical contribution is the KV-mixed cross-attention mechanism, which learns keys and values from two distinct visual encoders, separating layout control from appearance detail.
    • Keys come from a small-bottleneck encoder with constrained capacity, which guides object layout; values come from a larger-bottleneck encoder that captures fine-grained appearance attributes.
  2. Compositional Guidance During Inference:
    • The method incorporates novel object-level compositional guidance, which enhances both identity preservation and layout adherence during generation. This guidance mitigates a common failure of prior approaches, in which the characteristics of different objects merge or are obscured.
  3. Trade-off Between Identity and Diversity:
    • The study thoroughly examines the identity preservation-diversity trade-off that occurs when using image-based prompts through different encoder configurations. The mixed-granularity encoder approach ensures that identity specifics are maintained without sacrificing the potential for layout diversity.
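The KV-mixed cross-attention described above can be illustrated with a minimal sketch. The shapes, token counts, and projection matrices here are illustrative assumptions, not the paper's actual architecture; the point is only that keys and values are projected from two different encoder outputs for the same set of object tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kv_mixed_cross_attention(queries, layout_tokens, appearance_tokens, w_k, w_v):
    """Cross-attention in which keys come from a small-bottleneck (layout)
    encoder and values from a larger-bottleneck (appearance) encoder.

    queries:           (n_q, d)   image latent tokens
    layout_tokens:     (n_p, d_l) per-object tokens from the layout encoder
    appearance_tokens: (n_p, d_a) per-object tokens from the appearance encoder
    w_k: (d_l, d) key projection; w_v: (d_a, d) value projection
    """
    k = layout_tokens @ w_k       # keys control where each object goes
    v = appearance_tokens @ w_v   # values carry fine-grained appearance
    d = queries.shape[-1]
    attn = softmax(queries @ k.T / np.sqrt(d))  # (n_q, n_p)
    return attn @ v                             # (n_q, d)

# Toy shapes: 16 query tokens attending over 4 object prompts.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 32))
layout = rng.normal(size=(4, 8))       # small bottleneck
appearance = rng.normal(size=(4, 64))  # larger bottleneck
out = kv_mixed_cross_attention(q, layout, appearance,
                               rng.normal(size=(8, 32)),
                               rng.normal(size=(64, 32)))
print(out.shape)  # (16, 32)
```

Because the attention weights depend only on the low-capacity layout keys, the model can rearrange objects freely, while the high-capacity values preserve each object's appearance.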

Methodology

  • The paper utilizes a pre-trained text-to-image diffusion backbone, incorporating new architectures for embedding visual prompts. These are processed through dual encoders to produce keys focused on spatial layout and values oriented towards maintaining fine appearance details.
  • The layout tokens and appearance tokens are fused into the diffusion model's process, effectively influencing scene composition.
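The summary does not give the exact formula for the object-level compositional guidance. A common pattern in diffusion guidance, which this sketch assumes purely for illustration, is to extrapolate from an unconditional noise estimate toward each object-conditioned estimate with a per-object weight:

```python
import numpy as np

def guided_noise_estimate(eps_uncond, eps_per_object, weights):
    """Classifier-free-guidance-style combination (an assumption, not the
    paper's exact formula): push the unconditional noise prediction toward
    each object-conditioned prediction, one term per visual prompt.

    eps_uncond:     (d,) unconditional noise prediction
    eps_per_object: list of (d,) object-conditioned predictions
    weights:        guidance scale per object
    """
    eps = eps_uncond.copy()
    for eps_obj, w in zip(eps_per_object, weights):
        eps += w * (eps_obj - eps_uncond)
    return eps

eps_u = np.zeros(4)
eps_objs = [np.ones(4), np.full(4, 2.0)]
eps_guided = guided_noise_estimate(eps_u, eps_objs, [1.5, 0.5])
print(eps_guided)  # [2.5 2.5 2.5 2.5]
```

Applying guidance per object, rather than once for the whole prompt, is what lets the method reinforce each prompt's identity and placement individually.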

Results

  • The proposed method generates complex scene compositions that respect the unique characteristics of each visual prompt. It outperforms existing methods such as IP-Adapter and single-concept personalization models, which either fail to maintain object identity or lack inter-object differentiation.
  • Objective assessments using DINOv2- and CLIP-based identity metrics validate the method's ability to balance detail fidelity with compositional diversity.
  • Qualitative examples illustrate the model's robustness in maintaining prompt-specific object identities while adapting those identities in diverse and plausible configurations.
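The identity metrics above typically reduce to embedding similarity. A minimal sketch, assuming object embeddings have already been extracted with a vision encoder such as DINOv2 or CLIP (the exact metric used in the paper may differ):

```python
import numpy as np

def identity_score(ref_embs, gen_embs):
    """Mean cosine similarity between reference-object embeddings and the
    matching generated-object embeddings. The embeddings themselves are
    assumed to come from a pretrained image encoder computed elsewhere.

    ref_embs, gen_embs: (n_objects, d)
    """
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    return float((ref * gen).sum(axis=1).mean())

# Identical embeddings give a perfect score of 1.0.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
score = identity_score(a, a)
print(score)  # 1.0
```

A higher score indicates better identity preservation; comparing scores across methods on the same object crops gives the kind of objective assessment reported in the results.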

Conclusion

The approach outlined provides a powerful new framework for compositional image generation using visual prompts, transcending the limitations of purely text-based prompts and offering more granular control over generated content. The results obtained using this method show clear advances in maintaining identity characteristics while providing flexibility in generated scene aesthetics, opening avenues for more tailored generative applications in creative fields.

Overall, this paper's contribution lies in its novel integration of visual prompts, along with detailed methodological innovations to harmonize identity preservation with compositional flexibility in image synthesis tasks. It sets a promising precedent for future research in semantically controlled, visually coherent image generation based on both textual and visual inputs.
