InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework (2504.12395v1)

Published 16 Apr 2025 in cs.CV

Abstract: Current learning-based subject customization approaches, predominantly relying on U-Net architectures, suffer from limited generalization ability and compromised image quality. Meanwhile, optimization-based methods require subject-specific fine-tuning, which inevitably degrades textual controllability. To address these challenges, we propose InstantCharacter, a scalable framework for character customization built upon a foundation diffusion transformer. InstantCharacter demonstrates three fundamental advantages: first, it achieves open-domain personalization across diverse character appearances, poses, and styles while maintaining high-fidelity results. Second, the framework introduces a scalable adapter with stacked transformer encoders, which effectively processes open-domain character features and seamlessly interacts with the latent space of modern diffusion transformers. Third, to effectively train the framework, we construct a large-scale character dataset containing 10-million-level samples. The dataset is systematically organized into paired (multi-view character) and unpaired (text-image combinations) subsets. This dual-data structure enables simultaneous optimization of identity consistency and textual editability through distinct learning pathways. Qualitative experiments demonstrate the advanced capabilities of InstantCharacter in generating high-fidelity, text-controllable, and character-consistent images, setting a new benchmark for character-driven image generation. Our source code is available at https://github.com/Tencent/InstantCharacter.

Summary

  • The paper introduces a scalable full-transformer adapter that fuses dual-stream features from pre-trained vision encoders into the diffusion model's latent space.
  • The paper proposes a progressive three-stage training strategy on a curated 10M sample dataset to enhance character consistency, text controllability, and high-resolution fidelity.
  • Experiments show that InstantCharacter outperforms previous methods by preserving character identity and details while accurately following complex text prompts and diverse artistic styles.

This paper introduces InstantCharacter, a framework designed for high-fidelity, open-domain character personalization in image generation, built upon a Diffusion Transformer (DiT) architecture. It addresses the limitations of existing methods: U-Net based approaches often lack generalization and image quality, while optimization-based methods require slow, subject-specific fine-tuning that degrades text controllability. Furthermore, traditional adapters designed for U-Nets don't scale well to large DiT models.

InstantCharacter leverages the power of modern DiTs (specifically FLUX.1-dev) and proposes two main innovations:

  1. A Scalable Full-Transformer Adapter: This adapter connects the input character image features to the DiT's latent space.
    • It uses pre-trained vision encoders (SigLIP and DINOv2) instead of just CLIP to capture finer-grained details and robust features from the character image. Features from both encoders are concatenated.
    • Intermediate transformer encoders process these features using a dual-stream strategy: one stream extracts low-level features from shallow encoder layers, and the other extracts region-level features by processing image patches. These are refined and fused.
    • A timestep-aware Q-former acts as a projection head, mapping the refined character features into the diffusion model's denoising process via cross-attention layers. This design allows the adapter to scale effectively with large DiT models.
  2. A Progressive Three-Stage Training Strategy: This strategy utilizes a large-scale (10 million samples) dataset specifically curated for this task, containing both paired (multi-view character images for text control) and unpaired (single character images for consistency) data.
    • Stage 1 (Low-Res Pretraining): Uses unpaired data at 512x512 resolution to train the model for character consistency by reconstructing the input character.
    • Stage 2 (Low-Res Paired Training): Continues training at 512x512 resolution using paired data. This stage improves text controllability, teaching the model to render the character in new poses, actions, and scenes from text prompts rather than simply copying the reference image.
    • Stage 3 (High-Res Joint Training): Fine-tunes the model at high resolution using both paired and unpaired data to enhance image fidelity and texture details.
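The adapter's feature-injection idea can be sketched in miniature. The toy code below is illustrative only: the real framework uses full SigLIP/DINOv2 encoders, stacked transformer blocks, multi-head attention, and timestep conditioning inside FLUX, none of which is reproduced here. All dimensions, weight matrices, and the single-head attention are hypothetical stand-ins; the sketch only shows the core mechanism of concatenating two encoders' patch features and letting a small set of learnable queries cross-attend to them to produce a fixed-size token set for injection into the diffusion model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_cross_attention(queries, features, Wq, Wk, Wv):
    """Toy single-head cross-attention: learnable queries attend to the
    concatenated image features, yielding a fixed-size set of character
    tokens (the real Q-former is also conditioned on the timestep)."""
    Q = queries @ Wq                     # (n_queries, d)
    K = features @ Wk                    # (n_features, d)
    V = features @ Wv                    # (n_features, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return attn @ V                      # (n_queries, d)

rng = np.random.default_rng(0)
d = 32  # illustrative feature dimension, far smaller than the real model's

# Hypothetical stand-ins for SigLIP and DINOv2 patch features of one
# character image: 49 patches each, d-dimensional.
siglip_feats = rng.normal(size=(49, d))
dinov2_feats = rng.normal(size=(49, d))

# Dual-stream fusion by concatenation along the token axis.
fused = np.concatenate([siglip_feats, dinov2_feats], axis=0)  # (98, d)

# A small set of queries (learnable parameters in the real adapter).
queries = rng.normal(size=(8, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

char_tokens = qformer_cross_attention(queries, fused, Wq, Wk, Wv)
print(char_tokens.shape)  # (8, 32)
```

In the actual framework these character tokens are injected into the DiT's denoising process through cross-attention layers, so the diffusion backbone itself stays frozen while only the adapter is trained.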

Qualitative experiments show that InstantCharacter outperforms previous FLUX-based methods (OminiControl, EasyControl, ACE+, UNO) in preserving character identity and details while accurately following text prompts, even complex ones involving specific actions. Its results are shown to be comparable to the closed-source GPT-4o. The framework also demonstrates flexibility in applying different artistic styles (e.g., Ghibli, Makoto) using LoRAs without losing character consistency or text control.

The paper concludes that InstantCharacter sets a new benchmark for character-driven image generation by effectively combining a scalable adapter architecture with a tailored training strategy for DiTs, offering high fidelity, consistency, and controllability.
