- The paper introduces a multi-token prompt strategy that disentangles character-specific features from shared style, achieving high fidelity in few-shot settings.
- It employs LoRA for parameter-efficient fine-tuning, cutting training time from ~36 hours to ~2 hours and mitigating overfitting.
- Quantitative metrics and human evaluations confirm that the clustering-based approach enhances style retention and output diversity across varied datasets.
Few-shot Multi-token DreamBooth with LoRA for Style-consistent Character Generation
Introduction and Motivation
The paper addresses the challenge of generating an unlimited number of novel, stylistically consistent characters from a small set of human-designed references—a task of high relevance in animation, gaming, and digital content creation. The core difficulty lies in learning both the shared artistic style and the unique visual traits of a character family from few-shot data, where textual prompts are insufficient to capture the nuances of style, and overfitting is a persistent risk.
Traditional approaches, such as GANs and text-to-image diffusion models (e.g., Stable Diffusion, DALL-E), require large datasets or rely on explicit textual conditioning, which is not feasible for bespoke or niche styles. DreamBooth, a popular fine-tuning method for diffusion models, enables the integration of new visual concepts via prompt-based token association but is limited by overfitting, high computational cost, and insufficient diversity when applied in few-shot, style-centric scenarios.
Methodology: Multi-token DreamBooth with LoRA
The proposed method extends DreamBooth by introducing a multi-token strategy and parameter-efficient fine-tuning via LoRA. The approach is designed to disentangle character-specific features from global style, enabling scalable and diverse character generation.
Multi-token Prompt Structure
During training, each character is associated with a unique token ([specific_id]), while a shared token ([shared_id]) captures the common style. The prompt format is:
[specific_id] [shared_id] style
- [specific_id]: Unique per character; encodes individual traits.
- [shared_id]: Shared across all characters; encodes the common style.
- style: Class descriptor, leveraging the model's prior on artistic styles.
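As a concrete illustration, the following minimal sketch assembles training prompts under this scheme; the token strings and character names are hypothetical placeholders, not the paper's actual identifiers.

```python
# Illustrative prompt construction for multi-token DreamBooth training.
# All token strings and character names below are hypothetical placeholders.

SHARED_ID = "zwx"  # one shared token encoding the family's common style

# One unique rare token per training character (assumed mapping).
SPECIFIC_IDS = {
    "character_01": "sks",
    "character_02": "qrt",
    "character_03": "vbl",
}

def training_prompt(character: str) -> str:
    """Compose the '[specific_id] [shared_id] style' prompt for a character."""
    return f"{SPECIFIC_IDS[character]} {SHARED_ID} style"

for name in SPECIFIC_IDS:
    print(name, "->", training_prompt(name))
```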
Token selection for the character-specific tokens is performed via two strategies, corresponding to the DB_MT-RL and DB_MT-CL variants evaluated below: direct selection from the rare-token pool, and clustering of rare-token embeddings to obtain well-separated identifiers. A possible implementation of both is sketched next.
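The sketch below uses the Stable Diffusion v1-5 tokenizer and text encoder from Hugging Face together with scikit-learn's KMeans; the rare-token pool (tail of the vocabulary) and the cluster count are assumptions made for illustration, not the paper's exact procedure.

```python
# Sketch of the two token-selection strategies (assumed heuristics).
# Adjust the model id if this mirror is unavailable.
import numpy as np
from sklearn.cluster import KMeans
from transformers import CLIPTextModel, CLIPTokenizer

MODEL = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(MODEL, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(MODEL, subfolder="text_encoder")

# Input embedding table of the text encoder: one row per vocabulary token.
emb = text_encoder.get_input_embeddings().weight.detach().numpy()

# Assumed rare-token pool: the tail of the vocabulary (infrequent tokens).
rare_ids = np.arange(40000, len(tokenizer) - 1)

# Strategy 1 (random): uniformly sample one rare token per character.
rng = np.random.default_rng(0)
random_ids = rng.choice(rare_ids, size=5, replace=False)

# Strategy 2 (clustering): k-means over rare-token embeddings, keeping the
# token closest to each centroid so identifiers are spread out in embedding space.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(emb[rare_ids])
cluster_ids = [
    int(rare_ids[np.argmin(np.linalg.norm(emb[rare_ids] - c, axis=1))])
    for c in kmeans.cluster_centers_
]

print("random tokens:   ", tokenizer.convert_ids_to_tokens(random_ids.tolist()))
print("clustered tokens:", tokenizer.convert_ids_to_tokens(cluster_ids))
```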
LoRA-based Fine-tuning
LoRA (Low-Rank Adaptation) is employed to reduce the number of trainable parameters, enabling efficient adaptation of large diffusion models in few-shot settings. This significantly reduces training time (from ~36 hours for full DreamBooth to ~2 hours with LoRA) and mitigates overfitting.
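For intuition, here is a self-contained PyTorch sketch of the low-rank adaptation idea (not the actual training code used in the paper): the frozen pretrained weight is augmented with a trainable rank-r update, so only a small fraction of parameters is optimized.

```python
# Minimal LoRA layer sketch in PyTorch (illustrative toy module).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weight
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)    # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the trainable low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap a (hypothetical) 768 -> 768 attention projection.
layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")
```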
Generation Phase: Unlimited Character Synthesis
At inference, new characters are generated by varying either the identifier token or its embedding in the prompt:
- Random Tokens: Sample unused rare tokens.
- Random Embeddings: Inject sampled embeddings (univariate or multivariate Gaussian) into the text encoder, leveraging the distribution of rare token embeddings to synthesize novel identities.
This enables theoretically unlimited character generation while maintaining style coherence.
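The "random embeddings" strategy could look like the following sketch, which fits a per-dimension Gaussian to an assumed pool of rare-token embeddings and writes a sampled vector into a placeholder token's slot; the pool boundary, the diagonal-covariance choice, and the placeholder token are illustrative assumptions.

```python
# Sketch: sample a novel [specific_id] embedding from a Gaussian fitted to
# rare-token embeddings (pool boundary and placeholder token are assumed).
import numpy as np
import torch
from transformers import CLIPTextModel, CLIPTokenizer

MODEL = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(MODEL, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(MODEL, subfolder="text_encoder")

emb_table = text_encoder.get_input_embeddings().weight    # (vocab_size, dim)
rare = emb_table[40000:].detach().numpy()                 # assumed rare-token pool

# Univariate (per-dimension) Gaussian over the rare-token embeddings.
mean, std = rare.mean(axis=0), rare.std(axis=0)
new_embedding = np.random.default_rng(0).normal(mean, std)

# Overwrite a placeholder rare token's embedding so a prompt such as
# "sks zwx style" picks up the sampled identity ("sks" is a placeholder).
placeholder_id = tokenizer("sks", add_special_tokens=False).input_ids[0]
with torch.no_grad():
    emb_table[placeholder_id] = torch.tensor(new_embedding, dtype=emb_table.dtype)
```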
Experimental Setup
Datasets
Five few-shot datasets (10–30 images each) were curated, each containing characters with a consistent but hard-to-describe style:
- Virus: Cartoon viruses (simple, color variation).
- Scary: Cartoon monsters (complex, vibrant).
- Daily: Full-body cartoon humans.
- Hipster: Busts of cartoon humans (detailed faces).
- Trans: Line-art humans with prosthetics (no color, high complexity).
Figure 2: Character datasets used for experiments.
Baselines and Training
- Textual Inversion (TI)
- DreamBooth (DB)
- DreamBooth + LoRA (DBL)
- Proposed Multi-token DreamBooth + LoRA (DB_MT-RL, DB_MT-CL)
All models were trained on Stable Diffusion v1-5 with identical hyperparameters. For each method and dataset, five independent models were trained, and 400 images per model were generated using each generation strategy.
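An illustrative generation loop with the diffusers library is shown below; the model id, LoRA path, and prompt tokens are placeholders, and the fine-tuned weights would of course need to be loaded before sampling.

```python
# Illustrative evaluation-time generation loop (model id, LoRA path, and
# prompt tokens are hypothetical placeholders).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# pipe.load_lora_weights("path/to/fine-tuned-lora")  # hypothetical adapter path

shared_id = "zwx"                       # shared style token (placeholder)
specific_ids = ["sks", "qrt", "vbl"]    # sampled rare tokens (placeholders)

images = []
for specific_id in specific_ids:
    prompt = f"{specific_id} {shared_id} style"
    result = pipe(prompt, num_inference_steps=30, guidance_scale=7.5)
    images.append(result.images[0])
```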
Evaluation Metrics
- Fidelity: Measures style alignment using CLIP-based cosine similarity between generated and reference images.
- Diversity: Standard deviation of CLIP embeddings across generations.
- Validity: Proportion of outputs passing criteria for non-duplication, non-defectiveness, and style/subject consistency.
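The Fidelity and Diversity scores above could be computed roughly as in the following sketch, using a CLIP image encoder from transformers; the exact CLIP variant and aggregation used in the paper may differ.

```python
# Sketch of CLIP-based Fidelity and Diversity (aggregation details are assumed).
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(images: list[Image.Image]) -> np.ndarray:
    """L2-normalized CLIP image embeddings, one row per image."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.numpy()

def fidelity(generated: list[Image.Image], references: list[Image.Image]) -> float:
    """Mean cosine similarity between generated and reference embeddings."""
    g, r = clip_embed(generated), clip_embed(references)
    return float((g @ r.T).mean())

def diversity(generated: list[Image.Image]) -> float:
    """Std of CLIP embeddings across generations, averaged over dimensions."""
    return float(clip_embed(generated).std(axis=0).mean())
```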
Human evaluation was conducted via pairwise comparisons (general participants) and absolute ratings (professional artists).
Results
Validity Analysis
The baseline DreamBooth (DB) exhibited severe overfitting, with >99% of outputs being near-copies of training images. Textual Inversion (TI) produced a high rate of defective images. DreamBooth+LoRA (DBL) reduced overfitting but increased defective and multi-subject outputs, especially in human character datasets.
The proposed multi-token methods (DB_MT-RL, DB_MT-CL) achieved the lowest invalidity rates across all datasets, demonstrating robust generalization and style retention.
Figure 3: Examples of generated images that fail the validity criteria, grouped by rejection reason.
Human Evaluation
General participants and professional artists both ranked the clustering-based multi-token method (DB_MT-CL) highest in terms of stylistic fit and perceived quality, sometimes even preferring generated images over original dataset samples (notably in the Trans dataset).
Figure 5: Representative samples of generated characters shown to participants, organized by dataset and training method. Original datasets are included for reference.
Ablation Study
Ablation confirmed that:
- Multi-token training is the primary driver of Fidelity gains.
- Removing the regularization set eliminates major sources of invalidity and further improves style alignment, at a modest cost to diversity.
Implementation Considerations
- Computational Requirements: LoRA-based fine-tuning enables practical training on consumer GPUs (e.g., an NVIDIA RTX 3080 Ti), with training times reduced to ~2 hours per model.
- Scaling: The method is scalable to larger datasets and more complex styles, but the underlying diffusion model's biases and limitations (e.g., Stable Diffusion v1-5) may affect results.
- Deployment: The approach is suitable for integration into creative pipelines for animation, gaming, and digital content, enabling rapid prototyping and expansion of character rosters with minimal manual intervention.
- Limitations: Quantitative metrics may not fully capture subtle stylistic nuances; human evaluation remains essential for high-quality assessment. The method's effectiveness is contingent on the base model's representational capacity.
Implications and Future Directions
The proposed multi-token DreamBooth with LoRA establishes a robust framework for few-shot, style-consistent character generation, overcoming the overfitting and diversity limitations of prior methods. The explicit disentanglement of style and identity via token clustering, combined with parameter-efficient adaptation, enables scalable and practical deployment in creative industries.
Future work should explore:
- Application to newer diffusion architectures (e.g., Flux, Stable Diffusion 3.5).
- Development of refined, style-sensitive evaluation metrics.
- Broader and more diverse human evaluations.
- Extension to multi-modal or cross-domain style transfer scenarios.
Conclusion
This work demonstrates that clustering-based multi-token DreamBooth with LoRA enables high-fidelity, diverse, and scalable character generation from limited references, validated by both quantitative metrics and human judgment. The approach is computationally efficient and directly applicable to real-world creative workflows, with clear avenues for further research and refinement.