Character Customization Protocol
- Character Customization Protocol is a framework that maps diverse user inputs—text, images, or keys—to structured character representations in interactive systems.
- It employs data-driven synthesis, supervised fine-tuning, and adapter techniques to ensure output consistency, diversity, and in-character fidelity across modalities.
- The protocol integrates parameter-space avatar control, diffusion processes, and knowledge graphs to enhance narrative coherence and visual realism in digital applications.
Character customization protocols encompass technical workflows and mathematical objectives for specifying, instantiating, and controlling persona features in interactive systems, digital games, LLMs, and image/video generative pipelines. These protocols define how user inputs—text, reference media, profile keys, or other controls—are mapped to internal character or persona representations, which then condition model outputs to achieve fidelity, diversity, and consistency. Contemporary approaches span data-driven (large-scale synthetic or user-generated datasets), optimization-based, and adapter-based (parameter-efficient/frozen-backbone) frameworks. This domain integrates text, image, and multi-modal machine learning, and underpins applications in conversational agents, storytelling, avatar embodiment, and role-playing games.
1. Synthetic Persona Generation and Fine-Tuning in LLMs
Protocols such as OpenCharacter (Wang et al., 26 Jan 2025) and SimsChat systematize persona-driven LLM customization by leveraging large-scale synthetic data and structured persona representations.
- Persona Corpus Construction: Large sets (e.g. 200,000) of persona seeds—short descriptors from resources like Persona Hub—are enriched via LLM prompting (e.g. GPT-4o), which deterministically expands each seed into a structured character profile with attributes: name, age, gender, race, birthplace, appearance, general experience, and personality.
- Instruction Data Synthesis:
- Response Rewriting (OpenCharacter-R): Existing instruction–response pairs from datasets like LIMA or Alpaca are systematically aligned with sampled persona profiles by prompting a high-capacity LLM to rewrite responses in the assigned style, background, or personality.
- Response Generation (OpenCharacter-G): For each instruction/profile pair, the LLM generates a new response from scratch, emphasizing style and adherence to the persona.
- Supervised Fine-Tuning (SFT): The synthetic corpus (e.g. ~306,000 instruction–persona–response triples) is used for SFT of a base LLM (LLaMA-3-8B), minimizing the cross-entropy loss over response tokens conditioned on the instruction and persona profile (see the sketch after this list):

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid y_{<t},\, x,\, c\right),$$

where $x$ denotes the instruction, $c$ the persona profile, and $y$ the target response.
- Persona Injection at Inference: The system prompt unambiguously enumerates the persona, and downstream responses must remain in-character for the session. For systems like SimsChat, JSON-based structured persona profiles are injected verbatim at each dialogue turn.
- Evaluation: Benchmark suites such as PersonaGym (Wang et al., 26 Jan 2025) or WikiRoleEval (Yang et al., 2024) assess role-playing agents via consistency, action justification, toxicity control, linguistic habits, and expected action; metrics are averaged across held-out personas and axes.
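To make the data-assembly step concrete, here is a minimal sketch (not the released OpenCharacter/SimsChat code) of formatting an instruction–persona–response triple for SFT with loss masking, so that only response tokens contribute to the cross-entropy. The persona fields, toy whitespace tokenizer, and the -100 ignore-index convention are illustrative assumptions.

```python
import json

# Build a persona-conditioned SFT example; token "ids" are faked as word strings
# so the sketch stays self-contained (a real pipeline would use the base model's
# tokenizer and feed the masked labels to a standard causal-LM trainer).

PERSONA = {
    "name": "Mira Chen", "age": 34, "gender": "female",
    "birthplace": "Taipei", "appearance": "short silver hair, round glasses",
    "personality": "dry humor, meticulous, fiercely loyal",
}

def build_prompt(persona: dict, instruction: str) -> str:
    """System prompt enumerating the persona, followed by the user instruction."""
    return (
        "You are the following character and must stay in character:\n"
        f"{json.dumps(persona, ensure_ascii=False)}\n\n"
        f"User: {instruction}\nAssistant:"
    )

def make_sft_example(persona, instruction, response, ignore_index=-100):
    """Mask prompt positions so the loss covers only the response y,
    i.e. L_SFT = -sum_t log p(y_t | y_<t, instruction, persona)."""
    prompt_toks = build_prompt(persona, instruction).split()
    response_toks = response.split()
    return {
        "input": prompt_toks + response_toks,
        "labels": [ignore_index] * len(prompt_toks) + response_toks,
    }

example = make_sft_example(
    PERSONA,
    "Explain how you keep your lab notebook organized.",
    "Color-coded tabs, one experiment per spread, and nothing leaves my desk.",
)
print(len(example["input"]), sum(l != -100 for l in example["labels"]))
```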
2. Parameter-Space Protocols for 3D/Avatar Customization
Many protocols model character appearance as a vector of continuous and discrete parameters interpretable by 3D engines or avatar-generation systems. This paradigm underlies methods such as ICE (Wu et al., 2024), EasyCraft (Wang et al., 3 Mar 2025), and T2P (Zhao et al., 2023).
- Parameter Representation: Each avatar is parameterized by high-dimensional continuous vectors (e.g. 269D–450D for bone positions, facial features, makeup) and discrete selections (e.g. hairstyle category, beard style, makeup type).
- Photo- or Text-to-Parameter Translation:
- Image-Based: Photo inputs are processed via a pretrained vision transformer (ViT; self-supervised MAE (Wang et al., 3 Mar 2025)), whose [CLS] embedding is mapped to engine parameters with a multi-head MLP. Losses enforce reconstruction of known pairs and categorical alignment for discrete attributes.
- Text-Based: Text prompts are encoded by the CLIP text encoder and mapped either directly (by fine-tuned neural translators (Zhao et al., 2023)) or via text-to-image diffusion (engine-style) followed by the image pipeline (EasyCraft).
- Multi-Round Editing: Protocols such as ICE define a dialogue-feedback loop in which each user utterance is parsed by an LLM into a structured edit command and intensity, which are then used by a latent parameter solver (guided by CLIP loss and prior regularization) to update only targeted regions of the parameter vector (a sketch follows this list).
- Editing via Semantic Localization: Transformer-based localizers produce sparse masks indicating controllable parameters for a given edit instruction; masked optimizations update only relevant parameters (ICE Eq. 4 & 5).
- Unified Learning Objectives: Training combines parameter-reconstruction losses on known image–parameter pairs, classification losses for discrete attributes, CLIP-based image/text alignment for edit instructions, and prior-regularization terms that keep edited parameters within plausible ranges.
- Performance: Standard metrics include identity similarity (ArcFace cosine), Inception Score, FID (Fréchet Inception Distance), CLIP score, and subjective user preference (Wang et al., 3 Mar 2025, Wu et al., 2024). ICE achieves near-instant iterative parameter editing (<10 s/round) and superior text consistency over prior systems.
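The mask-restricted parameter editing described above can be sketched roughly as follows. This is not ICE's released implementation: `render` (a differentiable engine proxy) and `clip_loss` are assumed callables, and the Gaussian prior statistics and hyperparameters are placeholders.

```python
import torch

def edit_parameters(theta, edit_mask, text_feat, render, clip_loss,
                    prior_mean, prior_std, steps=50, lr=0.05, lambda_reg=0.1):
    """Optimize only the masked entries of the avatar parameter vector `theta`
    toward a CLIP text target, with a prior term keeping edits plausible."""
    delta = torch.zeros_like(theta, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        theta_new = theta + edit_mask * delta            # untouched params stay fixed
        img = render(theta_new)                          # differentiable renderer / proxy
        loss = clip_loss(img, text_feat)                 # semantic alignment with the edit text
        reg = (((theta_new - prior_mean) / prior_std) ** 2).mean()
        (loss + lambda_reg * reg).backward()
        opt.step()
    return (theta + edit_mask * delta).detach()
```

The sparse `edit_mask` plays the role of the semantic localizer output, so an instruction such as "make the eyebrows thicker" only ever touches the corresponding parameter entries.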
3. Adapter-Based and Diffusion Transformer Protocols
Recent advances introduce scalable, parameter-efficient adapters for open-domain character-driven image generation, targeting text-to-image, multi-character, and video synthesis scenarios.
- Modular Adapter Injection: InstantCharacter (Tao et al., 16 Apr 2025) and CharCom (Wang et al., 11 Oct 2025) both employ frozen text-to-image diffusion backbones, with character identity specified via external, parameter-efficient adapters.
- InstantCharacter: Stacked Transformer-based adapters extract regional and low-level features from reference images (using pre-trained SigLIP/DINOv2), fuse them, and feed distilled query tokens into cross-attention layers of a DiT backbone at each diffusion step. Identity consistency and textual editability are jointly enforced via composite loss during staged training.
- CharCom: Each character receives a dedicated low-rank LoRA adapter with text/image “trigger” tokens. At inference, prompt analysis computes per-character weights by CLIP/T5 similarity, and adapters are composed as weighted sums directly into the frozen U-Net weights (see the sketch after this list):

$$W' = W_0 + \sum_i w_i\, B_i A_i,$$

where $W_0$ is a frozen base weight, $B_i A_i$ the low-rank update for character $i$, and $w_i$ its prompt-derived weight. Parameter efficiency is substantial (≈21K LoRA parameters per character vs. full fine-tuning of an 860M-parameter backbone) (Wang et al., 11 Oct 2025).
- Region and Prompt-Guided Control: Character-Adapter (Ma et al., 2024) decomposes reference and layout images into semantic regions using prompt-guided segmentation, then fuses dynamic cross-attention from these regions (CLIP-encoded) during denoising. Mask-based and soft-attention fusion mechanisms ensure fidelity and flexibility at inference.
- Diffusion Alignment for New Characters: Protocols such as EpicEvo (Wang et al., 2024) address new-character insertion for story visualization by leveraging adversarial character-alignment losses in the internal latent space, knowledge distillation to prevent catastrophic forgetting, and single-example adaptation using a modified DDIM sampler.
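A compact sketch of the weighted LoRA composition used by CharCom-style protocols; the tensor shapes, rank, and the softmax weighting over prompt–character similarities are assumptions for illustration.

```python
import torch

def compose_lora(W0: torch.Tensor,
                 adapters: list[tuple[torch.Tensor, torch.Tensor]],
                 weights: torch.Tensor) -> torch.Tensor:
    """W' = W0 + sum_i w_i * (B_i @ A_i); W0 is (d_out, d_in) and stays frozen,
    each adapter i contributes a low-rank delta B_i (d_out x r) @ A_i (r x d_in)."""
    delta = sum(w * (B @ A) for w, (B, A) in zip(weights, adapters))
    return W0 + delta

d_out, d_in, r = 320, 768, 4                          # low rank keeps per-character storage tiny
W0 = torch.randn(d_out, d_in)                         # frozen base U-Net weight (stand-in)
adapters = [(torch.randn(d_out, r) * 0.01, torch.randn(r, d_in) * 0.01) for _ in range(3)]
sims = torch.tensor([0.82, 0.15, 0.40])               # e.g. CLIP/T5 prompt-character similarities
weights = torch.softmax(sims / 0.1, dim=0)            # normalize into composition weights
print(compose_lora(W0, adapters, weights).shape)      # torch.Size([320, 768])
```

Because the base weight never changes, adding or retiring a character only changes which low-rank deltas enter the sum.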
4. Knowledge Graph and World Modeling Protocols
Recent customization paradigms extend from parameter or prompt-centric approaches to knowledge-enhanced models of character identity and interrelationships, enabling narrative and contextual coherence.
- Character Graphs (CGs): StoryWeaver (Zhang et al., 2024) formalizes the narrative world as an attributed directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, whose vertices represent characters and their attributes (appearance, clothing, traits) and whose typed edges encode interpersonal relations and actions (a minimal data-structure sketch follows this list).
- Construction: Character images and captions are analyzed by VLMs and parsers to extract and aggregate attribute mappings and inter-character relations.
- Customization: Edits to node attributes are incorporated by appending or overwriting relevant properties in the graph, with semantic conflict checks.
- Spatial Guidance: Each character's graph-encoded appearance informs per-frame spatial priors (Gaussian fields), guiding diffusion cross-attention assignments in multi-character visual scenes.
- Generation Pipeline: StoryWeaver concatenates character appearances and attributes, event descriptions, and style text into a joint generation prompt, then applies knowledge-enhanced spatial gating during U-Net sampling. Performance is quantitatively validated by improvements in identity preservation (DINO-I) and text alignment (CLIP-T).
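The graph bookkeeping can be illustrated with a small data structure; the attribute names, the lock-based conflict rule, and the API below are hypothetical, not StoryWeaver's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterGraph:
    nodes: dict = field(default_factory=dict)    # character name -> {attribute: value}
    edges: list = field(default_factory=list)    # (src, relation_type, dst)
    locked: dict = field(default_factory=dict)   # character name -> immutable attributes

    def add_character(self, name: str, lock: tuple = (), **attributes):
        self.nodes[name] = dict(attributes)
        self.locked[name] = set(lock)

    def relate(self, src: str, relation: str, dst: str):
        self.edges.append((src, relation, dst))

    def customize(self, name: str, **updates):
        """Append or overwrite node attributes; reject edits to locked identity fields."""
        conflicts = self.locked[name] & set(updates)
        if conflicts:
            raise ValueError(f"conflicting edit on locked attributes: {conflicts}")
        self.nodes[name].update(updates)

g = CharacterGraph()
g.add_character("Ava", lock=("species",), species="human",
                appearance="red coat, braided hair", trait="curious")
g.add_character("Rex", species="dog", appearance="grey husky")
g.relate("Ava", "owns", "Rex")
g.customize("Ava", clothing="rain poncho")           # appended attribute, no conflict
print(g.nodes["Ava"], g.edges)
```

At generation time, each node's aggregated appearance string would seed the per-character spatial prior that steers cross-attention, as described above.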
5. Extensions to Video and One-Shot Customization
Protocols for customizable character video synthesis (MovieCharacter (Qiu et al., 2024)) and low-shot/new-character adaptation (EpicEvo (Wang et al., 2024)) introduce modular and tuning-free systems for dynamic applications (a module-interface sketch follows the list below):
- MovieCharacter Protocol: Decomposes video synthesis into independent modules—segmentation/tracking (SAM2), video inpainting (ProPainter), motion imitation (pose-guided latent diffusion), and compositional video assembly (PCTNet harmonization, edge-aware refinement). APIs are specified for each module, supporting external motion input, multi-character handling, and alternative module replacement.
- One-Shot Protocols: EpicEvo aligns new characters in generative diffusion models by adversarially training discriminators in latent space using limited reference frames. Knowledge distillation from frozen teachers preserves prior background and character features during adaptation. These approaches demonstrate superior integration of new characters into existing narrative arcs with only a single example.
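The modular decomposition lends itself to a thin interface layer so individual modules can be swapped out; the interfaces below are a hypothetical sketch and do not invoke the real SAM2, ProPainter, or PCTNet APIs.

```python
from typing import Any, Protocol

class Segmenter(Protocol):
    def track(self, video: Any, character_ref: Any) -> Any: ...        # per-frame masks

class Inpainter(Protocol):
    def remove(self, video: Any, masks: Any) -> Any: ...               # character-free plates

class Animator(Protocol):
    def imitate(self, character_ref: Any, poses: Any) -> Any: ...      # pose-driven character frames

class Compositor(Protocol):
    def blend(self, plates: Any, frames: Any, masks: Any) -> Any: ...  # harmonized composite

def customize_video(video, character_ref, poses,
                    seg: Segmenter, inp: Inpainter, anim: Animator, comp: Compositor):
    """Run the four-stage pipeline; any module satisfying its interface can be plugged in."""
    masks = seg.track(video, character_ref)
    plates = inp.remove(video, masks)
    frames = anim.imitate(character_ref, poses)
    return comp.blend(plates, frames, masks)
```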
6. Best Practices, Data Requirements, and Deployment
- Data Scale and Diversity: Successful protocols require extensive persona/profile libraries (e.g., hundreds of thousands of profiles and synthetic dialogues (Wang et al., 26 Jan 2025)), multi-view image sets, and paired/unpaired datasets supporting both identity learning and textual control (Tao et al., 16 Apr 2025, Wang et al., 3 Mar 2025).
- Hyperparameters: Effective training employs large frozen backbones (LLMs/Diffusion Transformers), Adam or AdamW optimizers, linear decay/warmup schedules, and multi-GPU parallelism (e.g., Megatron-LM tensor parallelism (Wang et al., 26 Jan 2025)); a minimal optimizer/schedule sketch follows this list.
- User/Developer Guidelines: Maintaining internal profile consistency, richly describing attributes, and careful semantic disambiguation are critical for downstream in-character performance. Customization engines should preserve the design schema, support iterative editing, and align text and character features through explicit loss functions.
- Evaluation: Protocols are benchmarked on role-playing, identity preservation, text-image compliance, and perceptual realism using held-out suites and human/auto evaluation metrics. Methods are compared quantitatively on FID, CLIP alignment, temporal consistency scores, and user study preference ratios, with ablative studies ensuring the necessity of each protocol component.
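For the optimizer and schedule guidance above, a minimal PyTorch scaffold might look like the following; the learning rate, step counts, and the stand-in trainable module (an adapter/translator head atop a frozen backbone) are placeholder assumptions, not values reported in the cited papers.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def warmup_linear_decay(warmup_steps: int, total_steps: int):
    """Linear warmup to the peak LR, then linear decay to zero."""
    def schedule(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return schedule

model = torch.nn.Linear(768, 768)                    # stand-in for the trainable head/adapter
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_linear_decay(500, 10_000))

for step in range(3):                                # training-loop skeleton
    loss = model(torch.randn(8, 768)).pow(2).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```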
7. Protocol Landscape: Schematic Comparison
| Protocol Family | Input Modality | Core Mechanism | Output Domain | Key References |
|---|---|---|---|---|
| Persona Data SFT | Persona seed, text | Synthetic profile+SFT | LLM dialogue/role-play | (Wang et al., 26 Jan 2025, Yang et al., 2024) |
| Text/Image-to-Param | Text/image, dialogue | Encoders, neural transl. | Param. avatar/game model | (Wang et al., 3 Mar 2025, Wu et al., 2024, Zhao et al., 2023) |
| Adapter Diffusion | Reference images, prompt | Adapter-injected DiT | Text-to-image, video, story | (Tao et al., 16 Apr 2025, Ma et al., 2024, Wang et al., 11 Oct 2025) |
| Knowledge Graph | Image/caption, attribute set | CG + spatial guidance | Multi-character narrative scenes | (Zhang et al., 2024) |
| Video/One-shot | Frames, pose, 1-shot story | Seg-inpaint-diffusion, adv. alignment | Video, new stories | (Qiu et al., 2024, Wang et al., 2024) |
This typology reflects a progression: from discrete parametric avatar encoding (sliders, selectors), through dialog-based LLM role-play, to compositional adapter-based story and visual narrative generation, all united by rigorous data schemas, prompt/representation engineering, and mathematically principled training/fine-tuning objectives.