
Key-Locked Rank One Editing for Text-to-Image Personalization

Published 2 May 2023 in cs.CV, cs.AI, and cs.GR | arXiv:2305.01644v2

Abstract: Text-to-image models (T2I) offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual-fidelity and textual-alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, allowing to portray personalized object interactions in unprecedented ways, even in one-shot settings.

Citations (139)

Summary

  • The paper presents Perfusion, a novel method that uses a gated rank-1 editing mechanism to dynamically balance user-defined visual concepts and textual cues.
  • It innovatively splits the cross-attention module into distinct 'Where' and 'What' pathways, enabling precise control over spatial layout and detail enrichment.
  • Quantitative results show that Perfusion outperforms state-of-the-art baselines like DreamBooth while maintaining a compact 100KB model per concept for efficient deployment.

Essay on "Key-Locked Rank One Editing for Text-to-Image Personalization"

The paper introduces Perfusion, a novel method for Text-to-Image (T2I) personalization that addresses the challenges of integrating user-provided visual concepts into existing diffusion-based T2I models. The method stands out for striking a fine balance between visual fidelity and textual alignment.

Perfusion improves upon existing T2I approaches by deploying a dynamic rank-1 editing mechanism that modulates the behavior of T2I models without succumbing to overfitting. It does so by viewing the cross-attention module as two distinct pathways: "Where" and "What." The Keys govern the "Where" pathway by shaping the attention-map layout, while the Values dictate the "What" by supplying the visual details of the generated output. Perfusion stabilizes the Keys of user-defined concepts by anchoring them to their broader, superordinate categories, a process the authors call "Key-Locking." This mechanism mitigates overfitting by preventing the attention assigned to a novel word from spreading beyond its natural spatial extent, a failure mode the authors observe in traditional Textual Inversion-style methods.
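The key-locking idea can be sketched as a ROME-style rank-1 edit of a key-projection matrix: the edited matrix is constrained to map the new concept's embedding to the key of its superordinate category. The snippet below is a toy numpy illustration under our own simplified parameterization (plain least-squares rank-1 edit, no covariance weighting), not the paper's exact formulation; all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension; real cross-attention layers are much wider

# Pretrained key-projection matrix of one cross-attention layer.
W_k = rng.normal(size=(d, d))

# e_concept: embedding of the new personalized token (e.g. "my dog").
# e_super:   embedding of its superordinate category (e.g. "dog").
e_concept = rng.normal(size=d)
e_super = rng.normal(size=d)

# Key-locking (sketch): force the edited projection to map the concept
# embedding to the *category's* key via a rank-1 update:
#   W'_k = W_k + (k_target - W_k e) e^T / (e^T e)
k_target = W_k @ e_super
delta = np.outer(k_target - W_k @ e_concept, e_concept) / (e_concept @ e_concept)
W_edited = W_k + delta

# The concept token now produces the category's key, so its attention
# layout follows the supercategory; other inputs are perturbed only in
# proportion to their projection onto e_concept.
assert np.allclose(W_edited @ e_concept, k_target)
```

Because the update is rank-1 along the concept's embedding direction, inputs nearly orthogonal to that direction pass through the layer almost unchanged, which is what limits interference with the rest of the model.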

A central innovation of Perfusion lies in its gated rank-1 approach to updating the weights of the projection matrices. This framework not only tunes the influence of individual concepts at inference time but also permits combining several concepts within a unified framework. The resulting model is lean, measuring merely 100KB per concept, roughly five orders of magnitude smaller than prevailing personalization models. This diminutive size underscores its practical applicability, particularly in settings requiring on-demand or distributed deployment.
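The gating idea can be sketched as follows: the rank-1 contribution of a concept only fires when the layer's input resembles the concept's learned input direction, with a sigmoid controlling how sharply it switches on. This is a minimal sketch under assumed names and hyperparameters (`gated_rank1_forward`, `bias`, `temp` are ours), not the paper's exact parameterization.

```python
import numpy as np

def gated_rank1_forward(W, x, e_in, v_out, bias=0.7, temp=0.1):
    """Forward pass of one projection carrying a gated rank-1 edit (sketch).

    W     : frozen pretrained projection matrix
    x     : input activation
    e_in  : learned input direction identifying the concept
    v_out : learned rank-1 output contribution for the concept
    The sigmoid gate opens only when x is similar to e_in, so the edit
    stays dormant for unrelated inputs; bias/temp are adjustable at
    inference time without retraining.
    """
    sim = (x @ e_in) / (np.linalg.norm(x) * np.linalg.norm(e_in))
    gate = 1.0 / (1.0 + np.exp(-(sim - bias) / temp))  # sigmoid gate
    return W @ x + gate * v_out
```

Combining multiple personalized concepts then amounts to summing each concept's gated rank-1 term onto the same frozen projection, which is why the per-concept storage stays tiny.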

Quantitative assessments highlight Perfusion's superior alignment with desired text prompts compared to state-of-the-art baselines, including DreamBooth and Custom Diffusion. Runtime-adjustable sigmoid gating parameters further extend control over the trade-off between visual fidelity and textual congruence. During experimentation, the model exhibits resilience to overfitting and versatility across a broad spectrum of tasks, from intricate object deformation to composing multi-concept scenes.
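The runtime trade-off described above can be illustrated by sweeping the gate's bias at inference time: a permissive gate injects the concept strongly (higher visual fidelity), while a strict gate defers to the base model (better textual alignment). The numbers below are toy values for illustration, not the paper's hyperparameters.

```python
import numpy as np

sim = 0.8  # assumed cosine similarity between an input and the concept direction
gates = []
for bias in (0.5, 0.7, 0.9):
    # Same sigmoid gate as in the gated rank-1 update; only bias changes.
    gate = 1.0 / (1.0 + np.exp(-(sim - bias) / 0.1))
    gates.append(gate)
    print(f"bias={bias:.1f} -> concept strength {gate:.3f}")
```

Each bias setting corresponds to a different operating point on the fidelity/alignment Pareto front, reached without any additional training.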

Beyond practical implications, the theoretical framing of Perfusion suggests promising avenues for generalizable key-locking in attention-based models. Because key-locking operationally resembles stabilization mechanisms in feedback control, it may yield insights for broader applications within machine learning where spatially and contextually adaptive attention is pivotal.

Future AI research may benefit from extending this gating-and-locking paradigm. Potential areas of exploration include cross-domain applications where maintaining conceptual integrity amid substantial context shifts is critical. Given the rapid growth of vision-language models, Perfusion's compact representation offers an intriguing scaffold upon which subsequent lightweight, adaptive personalization frameworks might be built.

In conclusion, Perfusion articulates an innovative stride in the quest for high-fidelity, contextually rich T2I models. Its dual focus on avoiding conceptual overfitting and maximizing alignment fidelity marks a significant contribution to diffusion model customization, with implications that reach beyond graphics to computational creativity at large. By combining compactness with perceptual richness, the method sets a benchmark of significant import as AI moves deeper into personalized generative modeling.
