TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder (2409.08248v1)

Published 12 Sep 2024 in cs.CV

Abstract: Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.

Authors (3)
  1. NaHyeon Park (3 papers)
  2. Kunhee Kim (8 papers)
  3. Hyunjung Shim (47 papers)

Summary

TextBoost: Advancements in One-Shot Personalization of Text-to-Image Models via Text Encoder Fine-tuning

The paper "TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder" presents a novel approach to address the challenges inherent in personalizing text-to-image models with a singular reference image. The key innovation lies in the selective fine-tuning of the text encoder, diverging from traditional methodologies that predominantly focus on the image module.

Challenges in Existing Methods

Current popular techniques, including DreamBooth and Textual Inversion, necessitate multiple reference images (usually 3 to 5) to generate high-quality outputs. Such methods, when constrained to a single image, tend to overfit, leading to near-duplicate outputs regardless of varying text inputs. This limitation significantly hampers their applicability in scenarios where only one reference image is available.

Proposed Methodology: TextBoost

TextBoost introduces a selective fine-tuning strategy centered on the text encoder. This approach is underpinned by the observation that the text encoder weights undergo the most substantial changes during fine-tuning, compared to the relatively minor alterations seen in the U-Net layers of the image module. The proposed method introduces three complementary techniques to enhance the performance of one-shot personalization:

  1. Augmentation Tokens: This technique mitigates overfitting by disentangling subject-related features from irrelevant ones. Paired data augmentation is applied with augmentation tokens inserted into the text prompt, so the model learns to map each augmentation-specific transformation to its token. The result is more generalized, diverse image outputs without the augmentations leaking into the generated images (see the first sketch after this list).
  2. Knowledge Preservation Loss: To counteract language drift, in which a fine-tuned model loses its ability to interpret diverse natural language prompts, TextBoost employs a knowledge-preservation loss that maintains the cosine similarity between text embeddings from the original and fine-tuned text encoders, ensuring the model retains its generalization capabilities across varied prompts (see the second sketch after this list).
  3. SNR-Weighted Timestep Sampling: Because the influence of the text prompt diminishes as the noise level decreases during denoising, TextBoost biases timestep sampling toward higher noise levels. This focuses training on the phases where the text input has the greatest effect, enhancing the model's ability to incorporate text-driven modifications (see the third sketch after this list).
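
The following is a minimal sketch of the augmentation-token idea, assuming torchvision-style transforms and hypothetical placeholder tokens such as `<crop>` and `<jitter>`; the paper's actual token set, subject identifier, and insertion scheme may differ.

```python
import random
from torchvision import transforms

# Hypothetical pairing of image augmentations with placeholder tokens.
# During fine-tuning, the chosen transform is applied to the reference image
# and its token is inserted into the prompt, so augmentation-specific changes
# are absorbed by the token rather than by the subject identifier.
AUG_PAIRS = {
    "<crop>":   transforms.RandomResizedCrop(512, scale=(0.6, 1.0)),
    "<jitter>": transforms.ColorJitter(brightness=0.3, contrast=0.3),
    "<flip>":   transforms.RandomHorizontalFlip(p=1.0),
}

def augment_sample(image, subject_prompt="a photo of <sks> dog"):
    """Return an augmented reference image and a prompt carrying the matching token."""
    token, transform = random.choice(list(AUG_PAIRS.items()))
    return transform(image), f"{subject_prompt}, {token}"
```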
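
A minimal sketch of a knowledge-preservation term of this kind is shown below, assuming a Hugging Face CLIPTextModel-style text encoder, a frozen copy of the pretrained encoder, and a set of generic prior prompts; the exact prompts, pooling, and loss weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def knowledge_preservation_loss(text_encoder, frozen_text_encoder, tokenizer,
                                prior_prompts, device="cuda"):
    """Penalize drift of the fine-tuned text encoder away from the original
    by keeping their embeddings of generic prompts cosine-aligned."""
    tokens = tokenizer(prior_prompts, padding="max_length", truncation=True,
                       return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        ref = frozen_text_encoder(tokens)[0]   # embeddings from the original encoder
    cur = text_encoder(tokens)[0]              # embeddings from the fine-tuned encoder
    # 1 - cosine similarity, averaged over tokens and prompts
    return (1.0 - F.cosine_similarity(cur, ref, dim=-1)).mean()
```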
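
For the timestep bias, the sketch below assumes a diffusers-style noise scheduler and draws timesteps with probability decreasing in the signal-to-noise ratio, so that high-noise steps (where the text prompt matters most) are sampled more often; the 1/(1+SNR) weighting is an illustrative choice, not necessarily the one used in the paper.

```python
import torch

def snr_weighted_timesteps(scheduler, batch_size, device="cuda"):
    """Sample diffusion timesteps biased toward high-noise (low-SNR) steps."""
    alphas_cumprod = scheduler.alphas_cumprod.to(device)   # shape [T]
    snr = alphas_cumprod / (1.0 - alphas_cumprod)          # SNR(t), large at low noise
    weights = 1.0 / (1.0 + snr)                            # favor high-noise steps
    probs = weights / weights.sum()
    return torch.multinomial(probs, batch_size, replacement=True)
```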

Experimental Validation

Extensive experiments demonstrate that TextBoost effectively generates high-quality, diverse images using only a single reference image. The model's performance is quantitatively validated using metrics such as CLIP image embedding similarity (CLIP-I) for subject fidelity and CLIP text-image similarity (CLIP-T) for text fidelity. TextBoost exhibits a marked improvement in text-image alignment, outperforming other methods both in terms of image quality and computational efficiency. It requires significantly fewer trainable parameters—0.7 million, compared to DreamBooth's 865.9 million—and much less storage per customized model (5.1 MB as opposed to 3.3 GB for DreamBooth).
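
For readers unfamiliar with these metrics, one common way to compute them is with an open-source CLIP model, as sketched below; the CLIP variant and averaging protocol used in the paper are not reproduced here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(reference_image, generated_image, prompt):
    """CLIP-I: image-image similarity (subject fidelity).
       CLIP-T: text-image similarity (text fidelity)."""
    inputs = processor(text=[prompt], images=[reference_image, generated_image],
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    clip_i = (img[0] @ img[1]).item()   # reference image vs. generated image
    clip_t = (txt[0] @ img[1]).item()   # prompt vs. generated image
    return clip_i, clip_t
```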

User Study and Qualitative Results

A large-scale user study corroborates the quantitative findings: participants favor the images generated by TextBoost for their alignment with both the subject and the text prompt. Qualitative evaluations further illustrate TextBoost's proficiency across varied prompts, whether they modify object properties, styles, or scenes.

Disentanglement and Diversity

Further analysis of cross-attention maps shows that TextBoost excels in disentangling the subject from background elements, unlike other methods that struggle in scenarios involving complex backgrounds. Additionally, the method demonstrates higher diversity in generated outputs, as quantified by inter-image similarity metrics, ensuring that the generated images do not suffer from excessive uniformity.
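
Such an inter-image similarity metric can be computed, for example, as the average pairwise cosine similarity of CLIP image embeddings over a set of generations, with lower values indicating greater diversity; the sketch below illustrates this general idea rather than the paper's exact protocol.

```python
import itertools
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def mean_pairwise_similarity(images):
    """Average pairwise cosine similarity of CLIP image embeddings;
    lower values indicate more diverse generated images."""
    pixel_values = processor(images=images, return_tensors="pt")["pixel_values"]
    feats = model.get_image_features(pixel_values=pixel_values)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    pairs = list(itertools.combinations(range(len(images)), 2))
    return sum((feats[i] @ feats[j]).item() for i, j in pairs) / len(pairs)
```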

Future Directions

The theoretical underpinnings and empirical successes of TextBoost suggest several promising avenues for future research. Potential developments include extending this approach to multi-concept personalization, enhancing fine-tuning efficiency for even broader text-to-image model applications, and exploring its applicability in real-time user engagement scenarios.

In conclusion, TextBoost represents a substantial advancement in the personalization of text-to-image models, particularly in contexts constrained by single reference images. By fine-tuning the text encoder and implementing augmentation tokens, knowledge preservation loss, and SNR-weighted sampling, TextBoost addresses the prevalent issues of overfitting and limited diversity, setting a new benchmark for one-shot personalization in generative image models.