Identity-Preserving Text-to-Video Generation

Updated 9 May 2026

Identity-preserving text-to-video (IPT2V) methods build on diffusion transformer architectures with specialized identity-aware modules that keep subject features consistent across video frames.
They employ multi-modal fusion and spatial–temporal decoupling techniques to balance prompt semantics against high-fidelity identity preservation.
Experimental evaluations report improvements in metrics such as FaceSim and FID, indicating robust identity maintenance and realistic motion dynamics.

Identity-preserving text-to-video generation (IPT2V) refers to the class of generative models, primarily based on large-scale diffusion transformers, that synthesize videos adhering to a text prompt while keeping the visual identity of a given subject consistent at high fidelity. The subject is typically specified via one or more reference images or video clips. IPT2V poses algorithmic and architectural challenges not encountered in generic text-to-video generation: the model must preserve identity under diverse motion, poses, and occlusions while still allowing prompt-driven semantic control, which creates spatial, temporal, and semantic trade-offs.

1. Problem Formulation and Core Challenges

IPT2V can be formalized as follows: given

a natural language text prompt $T$ describing the desired scene, action, and/or attributes,
and a reference visual corpus $\mathcal{R}$ (ranging from a single face image to multiple images or even reference videos) describing the subject whose identity should be maintained,

generate a video $V = \{F_1, \dots, F_N\}$ of $N$ frames such that (1) every frame $F_i$ semantically conforms to $T$, (2) the subject depicted in every $F_i$ matches the identity in $\mathcal{R}$, and (3) identity, motion, and appearance remain consistent across all frames.

The principal challenges underlying IPT2V include:

High-fidelity multi-frame identity preservation: Maintaining face/body structure, texture, and idiosyncratic features under large pose, expression, lighting, and occlusion variations—especially in unconstrained scenes (Wang et al., 6 May 2026).
Spatial–temporal trade-off: Optimizations that emphasize static spatial layout often degrade temporal dynamics (causing frozen or jerky motion), while prioritizing motion can lead to identity drift and semantic mismatch (Wang et al., 7 Jul 2025).
Semantic–identity conflicts: Discrepancies between prompt instructions (e.g., changing clothing or age) and the identity reference may induce incoherent results or ambiguous attribute mixing (Wei et al., 23 Jan 2025, Gao et al., 1 Sep 2025).
Tuning cost and generality: Earlier T2V personalization methods relied on fine-tuning for each identity or target prompt, which incurs prohibitive data and compute costs and introduces training–inference distribution shifts that degrade motion and semantics (Li et al., 2024).

2. Architectural Paradigms: Diffusion Transformers and Specialized Components

The dominant IPT2V approaches are built atop large diffusion transformers (DiTs) or similarly expressive latent video diffusion models. To enforce identity fidelity, these architectures integrate specialized design patterns:

Feature-level multimodal fusion: Dual-branch or cross-attention modules fuse identity embeddings (from ArcFace, CLIP, etc.) with prompt text at various backbone locations (Wei et al., 23 Jan 2025, Xie et al., 5 Aug 2025, Mai et al., 8 Dec 2025).
Spatial–temporal decoupling: Decoupled pipelines inject spatial (identity) information early—often via T2I pre-generation and global feature adapters—then modulate temporal dynamics via separate motion prompts or temporal adapters (Wang et al., 7 Jul 2025).
Local/part-based routing: Fine-grained facial segmentation (e.g., per-region token routing) enables transformer models to preserve distinctive local characteristics and suppress global feature interference, as in the local router of LaVieID (Song et al., 11 Aug 2025).
Pose-aware and 3D priors: Incorporating explicit 3D geometry via mesh-based modules (e.g., DECA/SpiralNet++ in FantasyID) or pose-prior encoders allows for pose-faithful identity maintenance under large viewpoint changes (Wang et al., 6 May 2026, Zhang et al., 19 Feb 2025).
Temporal autoregressive or chunkwise rectification: Temporal consistency is promoted by chunking latent sequences and autoregressively refining token biases, as realized in LaVieID’s temporal autoregressive module (Song et al., 11 Aug 2025) and MoCA’s hierarchical temporal pooling (Xie et al., 5 Aug 2025).
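
The feature-level fusion pattern above can be illustrated with a minimal sketch. It assumes a generic two-branch additive rule (one attention pass over text tokens, one over identity tokens, summed with a scale factor) and is not the mechanism of any specific cited model. The names `attend`, `fuse_identity`, and `id_scale` are hypothetical, and the learned query/key/value projections of a real transformer layer are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def fuse_identity(query, text_tokens, id_tokens, id_scale=1.0):
    """Add a scaled identity branch to the text branch (illustrative rule).

    Tokens serve as both keys and values here; real layers would apply
    learned projections first.
    """
    text_out = attend(query, text_tokens, text_tokens)
    id_out = attend(query, id_tokens, id_tokens)
    return [t + id_scale * i for t, i in zip(text_out, id_out)]
```

In this sketch `id_scale` plays the role of an injection strength: setting it to zero recovers pure text conditioning, while larger values push the output toward the identity tokens.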

3. Training Objectives, Losses, and Reward Formulations

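IPT2V methods typically start from the standard latent diffusion objective, which supervises the denoising prediction at randomly sampled timesteps: $\mathcal{L}_{\rm diff} = \mathbb{E}_{t, z_0, \epsilon} \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t\right) \right\|^2$, where $z_0$ is the clean video latent, $\epsilon$ is the injected Gaussian noise, $\bar\alpha_t$ is the cumulative noise-schedule coefficient at timestep $t$, and $\epsilon_\theta$ is the denoising network (its conditioning on the prompt and identity reference is omitted for brevity). On top of this base loss, several identity-aware objectives are used: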

Identity similarity loss: Cosine similarity in ArcFace or CurricularFace space between generated and reference frames, typically aggregated across sequence (Wang et al., 6 May 2026, Li et al., 2024, Zhang et al., 19 Feb 2025).
Contrastive pose-invariant loss: InfoNCE-style losses enforce that different poses of the same ID are mapped to closely aligned embeddings (FaithfulFaces) (Wang et al., 6 May 2026).
Region-aware or face-centric weighting: Diffusion loss is sparsified or weighted in high-motion facial regions for enhanced dynamic fidelity (MotionCharacter) (Fang et al., 2024).
Adversarial and perceptual regularization: Encourage realism and perceptual similarity in facial and background regions (Xie et al., 5 Aug 2025).
Reinforcement learning with direct preference optimization (DPO), grouped PPO, or reward maximization: Human rater preference datasets and identity/semantic rewards are used for RLHF-style fine-tuning (AnyID, ID-Composer) (Wang et al., 26 Mar 2026, Pan et al., 1 Nov 2025).
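
As a rough illustration of how the first few objectives in this section can be computed, the sketch below implements a one-sample estimate of the noise-prediction loss, a cosine-based identity loss, and an InfoNCE loss on plain Python lists. All function names, the `alpha_bars` schedule argument, and the `temperature` default are assumptions made for the example. In practice the embeddings would come from a face encoder such as ArcFace, and `eps_model` would be the denoising network.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diffusion_loss(z0, alpha_bars, eps_model, rng):
    """One-sample estimate of the noise-prediction loss at a random timestep."""
    t = rng.randrange(len(alpha_bars))
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = alpha_bars[t]
    # Forward noising: mix the clean latent with Gaussian noise.
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    eps_hat = eps_model(z_t, t)
    # Squared L2 distance between true and predicted noise.
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat))


def identity_loss(ref_emb, frame_embs):
    """One minus the mean cosine similarity between reference and frames."""
    sims = [cosine(ref_emb, f) for f in frame_embs]
    return 1.0 - sum(sims) / len(sims)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss; the positive is another pose of the anchor's identity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]
```

One plausible way to combine these terms is a weighted sum, for example `diffusion_loss(...) + lambda_id * identity_loss(...)`, where `lambda_id` is a hypothetical hyperparameter; the weighting schemes actually used by the cited methods are not specified here.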

4. Algorithmic Innovations, Trade-offs, and Ablations

Methodological advances across recent IPT2V work include:

Multi-reference fusion: AnyID and similar models fuse heterogeneous references—faces, portraits, videos—using VAE encoders and time/feature axis concatenation, with a primary-reference anchor to resolve inter-image conflicts (Wang et al., 26 Mar 2026).
Prompt engineering and input alignment: Systems like TPIGE and ContextAnyone enhance prompts via attribute extraction (GPT-4o) and correct image–prompt mismatch via identity-preserving image generators (Gao et al., 1 Sep 2025, Mai et al., 8 Dec 2025).
Mixture-of-experts and dynamic adaptation: MoCA injects direct and temporal cross-attention at several timescales through hierarchical temporal pooling (HTP) experts, and a gating router weights the experts according to the spatiotemporal context each input requires (Xie et al., 5 Aug 2025).
Training-free adapters and injection: I2V-Adapter and similar frameworks plug into frozen T2V models without further backbone tuning, leveraging cross-frame attention and identity priors for light-weight deployment (Guo et al., 2023).
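
To make the idea of router-gated experts at different timescales concrete, here is a minimal sketch of a softly gated mixture over temporal pooling windows. It is an illustration only, not MoCA's implementation: the causal mean pooling, the `windows` argument, and the externally supplied `gate_logits` are all assumptions. A learned router would normally compute the gate logits from the input features.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_pool(frames, window):
    """Average each frame feature with up to window-1 preceding frames."""
    pooled = []
    for i in range(len(frames)):
        chunk = frames[max(0, i - window + 1): i + 1]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def gated_mixture(frames, windows, gate_logits):
    """Blend pooling experts of different window sizes with softmax gates."""
    weights = softmax(gate_logits)
    experts = [temporal_pool(frames, w) for w in windows]
    dim = len(frames[0])
    return [
        [sum(wt * ex[i][d] for wt, ex in zip(weights, experts)) for d in range(dim)]
        for i in range(len(frames))
    ]
```

A window of 1 leaves the per-frame features untouched, so gates that favor it preserve fast motion, while gates that favor longer windows smooth features over time.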

Ablation studies consistently show that removal or weakening of spatial, temporal, or frequency-adaptive modules leads to significant drops in FaceSim, FID, or temporal smoothness, underscoring the necessity of integrated, multi-level identity control (Song et al., 11 Aug 2025, Wang et al., 7 Jul 2025, Xie et al., 5 Aug 2025, Yuan et al., 2024, Zhang et al., 19 Feb 2025).

5. Quantitative Results and Benchmarks

State-of-the-art IPT2V models are systematically evaluated on curated identity-video datasets (e.g., CelebIPVid, custom in-house corpora) using metrics including:

FaceSim (ArcFace/CurricularFace cosine similarity)
FID/FVD (visual fidelity, both global and face-cropped)
Temporal Consistency Score (TCS)
Subject/background consistency (VBench)
CLIPScore and semantic alignment
Human raters for identity and realism
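
As a small illustration of an embedding-based temporal metric, the sketch below scores a clip by the mean cosine similarity between consecutive frame embeddings. This is one common way to define such a score and is an assumption here; the exact Temporal Consistency Score used by the cited benchmarks may differ. The function names are hypothetical, and the embeddings are assumed to be nonzero vectors from some image or face encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_consistency(frame_embs):
    """Mean cosine similarity between each pair of consecutive frames."""
    if len(frame_embs) < 2:
        raise ValueError("need at least two frame embeddings")
    sims = [cosine(a, b) for a, b in zip(frame_embs, frame_embs[1:])]
    return sum(sims) / len(sims)
```

A perfectly static clip scores 1.0 under this definition, so the value is best read alongside a motion or dynamics measure rather than on its own.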

Notable comparative results (FS-Cur = FaceSim computed with CurricularFace embeddings, FID = Fréchet Inception Distance; ↑ higher is better, ↓ lower is better):

Method	FS-Cur↑	FID↓	Temporal↑	Noteworthy Features
LaVieID	0.425	174.1	0.773	Local router + TAM
MoCA	0.60	—	0.976	MoCA, HTP, perceptual loss
ConsisID	0.60	151.8	—	Frequency decomposition (no tuning)
FaithfulFaces	0.568	164.2	—	Pose-faithful aligner
AnyID	0.735*	—	>91.12%	Multi-view fusion, RLHF

(*Holi-Arc, not directly comparable; full table in (Wang et al., 6 May 2026, Song et al., 11 Aug 2025, Xie et al., 5 Aug 2025, Yuan et al., 2024, Wang et al., 26 Mar 2026))

User studies and domain-specific challenges (e.g., ACM Multimedia 2025) further underscore gains in identity, prompt-adherence, and subject–scene consistency for top models.

6. Limitations and Future Directions

Despite progress, the field faces several open issues:

Generalization to multiple and non-human subjects: Most pipelines are limited to single faces/bodies, with multi-person interaction still challenging for identity disentanglement (Pan et al., 1 Nov 2025).
Pose/expression and occlusion robustness: Even advanced aligner models can fail under extreme head rotations or heavy occlusion, indicating the need for more pose-invariant or generative 3D priors (Wang et al., 6 May 2026, Zhang et al., 19 Feb 2025).
Semantic–identity trade-offs: Over-strong identity injection risks "copy-paste" artifacts or resistance to prompt-driven edits; weak injection causes drift or genericization (Wei et al., 23 Jan 2025, Yuan et al., 2024).
Data and metric constraints: Truly high-diversity, pose-varied video datasets remain scarce, and proxy metrics are still insufficiently aligned with human perception of identity (Yuan et al., 2024, Wang et al., 6 May 2026).
Long-horizon and scene-scale consistency: Most benchmarks operate on short temporal windows; future work will require scalable hierarchical modeling and more explicit temporal abstractions (Atzmon et al., 2024, Mai et al., 8 Dec 2025).
Plug-and-play deployment: There is continued demand for robust, training-free modules compatible with arbitrary T2V backbones (e.g., community models) (Guo et al., 2023).

Potential directions include:

Adaptive frequency or part-aware identity controllers
RLHF and preference-driven training at scale
Explicit 3D-aware consistency
Perceptually aligned evaluation metrics
Extending to stylized, open-domain, or multi-actor IPT2V scenarios

7. Representative Methods: Comparative Overview

Model	Architectural Core	Distinctive Mechanism	Tuning-Free	Multi-Ref	Pose/3D Prior	RL/Reward
LaVieID (Song et al., 11 Aug 2025)	DiT w/ local + TAM	Local router, temporal AR correction	✓
MoCA (Xie et al., 5 Aug 2025)	DiT w/ MoE CA layers	Hierarchical temporal pooling	✓
ConsisID (Yuan et al., 2024)	DiT + Freq. injection	Low/high-freq face extractors	✓
AnyID (Wang et al., 26 Mar 2026)	DiT, flow-matching	Multi-ref fusion, anchor/delta prompt, RLHF	✓	✓		✓
FantasyID (Zhang et al., 19 Feb 2025)	DiT + 3D/2D fusion	3D mesh prior, multi-view, learnable injection	✓		✓
FaithfulFaces (Wang et al., 6 May 2026)	DiT + pose-shared aligner	Euler angle embed, pose-invariant contrastive	✓		✓
TPIGE (Gao et al., 1 Sep 2025)	DiT + prompt/img GE	Black-box prompt & ID enhancement, plug-in	✓
ID-Composer (Pan et al., 1 Nov 2025)	DiT + hierarchical attn	Multi-subject attn, VLM-guidance, RLVR		✓		✓
I2V-Adapter (Guo et al., 2023)	Cross-frame attention	Training-free, plug-in, frame similarity prior	✓

Taken together, these developments move IPT2V toward high-fidelity, prompt-controllable, and robust identity-preserving video synthesis, with practical deployment targets in creative, entertainment, and telepresence domains.