FantasyID: Identity-Preserving Synthesis & Forensics

Updated 9 May 2026

FantasyID Framework is a set of methods integrating face-enhanced text-to-video synthesis and digital forgery detection to preserve visual identity and dynamic facial attributes.
It employs a frozen DiT backbone augmented with multi-view 2D/3D feature extraction and layer-wise adaptive injection to maintain high fidelity in generated facial motion.
The framework includes a public dataset for realistic identity document forgery detection, challenging existing forensic detectors with sophisticated manipulation techniques.

The FantasyID framework is a set of methods, models, and datasets for robust, identity-preserving generation and forensic detection in human-centric visual synthesis. The designation "FantasyID" covers two principal lines of research: (1) a face-knowledge-enhanced, tuning-free framework for identity-preserving text-to-video generation using diffusion transformers, and (2) a public dataset and detection testbed for analyzing digital manipulations of identity documents, with applications in security and document forensics. Both lines use strong priors and adaptive strategies either to maintain or to interrogate the integrity of digital identities in visual media, in response to the challenges posed by large-scale generative models and sophisticated manipulation techniques (Zhang et al., 19 Feb 2025, Korshunov et al., 28 Jul 2025).

1. Identity-Preserving Video Generation with Frozen DiT Backbones

The FantasyID video generation framework uses a large, pre-trained Diffusion Transformer (DiT) model as a frozen backbone for text-to-video synthesis, aiming to preserve the facial identity of a reference image with high fidelity across generated video frames (Zhang et al., 19 Feb 2025). Rather than fine-tuning the DiT weights, the framework injects external face-conditioning signals at each transformer block through lightweight learned modules. The key workflow includes:

Constructing a multi-view pool of six face crops per subject, spanning maximal head-pose diversity.
Extracting 2D appearance features via a convolutional face abstractor.
Computing a 3D vertex cloud from each reference view using the DECA facial shape estimator, which disentangles FLAME shape, pose, and expression parameters.
Fusing 2D and 3D representations through a fusion transformer to yield a unified face descriptor.
At generation time, adaptively injecting the fused descriptor into every DiT transformer block via learnable, layer-specific convolutional adapters, without modifying the core denoising network.

Training is restricted to the 2D/3D abstractors, fusion layers, and adapters. At inference, a single reference frame suffices and the backbone remains entirely unchanged.

2. Multi-View and 3D Priors for Facial Structure and Motion Dynamics

Preserving identity and dynamic plausibility requires disentangling invariants (identity) from dynamic attributes (expression, pose, motion) during video generation. FantasyID addresses this with:

A 3D facial geometry prior: DECA outputs a vertex set encoding the FLAME shape (identity only, omitting expression/pose), which is further processed by a SpiralNet++ graph network and augmented with depth-based encoding.
Multi-view augmentation: Six distinct face crops with maximal head-pose variation are selected from each training video, with head pose estimated from facial landmarks (for example, those produced by a detector such as RetinaFace). This discourages degenerate "copy-paste" behavior and encourages the model to learn dynamic facial movements.
At each training step, a randomly drawn face view ensures exposure to substantial pose and expression diversity, compelling the diffusion model to synthesize realistic and expressive facial dynamics beyond static replication.

This strategy is designed to retain identity robustly while still allowing rich facial motion, with geometrically plausible trajectories throughout the video sequence.
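One simple way to pick a pose-diverse pool is greedy farthest-point sampling over per-frame head-pose angles. The sketch below is illustrative only and is not taken from the FantasyID paper: the function names, the (yaw, pitch, roll) representation in degrees, the Euclidean distance, and the choice to seed with the most frontal frame are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (yaw, pitch, roll) in degrees; assumed representation


def pose_distance(a: Pose, b: Pose) -> float:
    """Euclidean distance between two pose triples (an assumed diversity measure)."""
    return math.dist(a, b)


def select_diverse_views(poses: Sequence[Pose], k: int = 6) -> List[int]:
    """Greedy farthest-point sampling: indices of k mutually distant poses."""
    if not poses:
        return []
    k = min(k, len(poses))
    # Seed with the most frontal frame (smallest overall angle magnitude).
    origin = (0.0, 0.0, 0.0)
    first = min(range(len(poses)), key=lambda i: pose_distance(poses[i], origin))
    selected = [first]
    # min_dist[i] = distance from pose i to its nearest already-selected pose.
    min_dist = [pose_distance(p, poses[first]) for p in poses]
    while len(selected) < k:
        chosen = set(selected)
        # Pick the unselected frame that is farthest from everything chosen so far.
        nxt = max(
            (i for i in range(len(poses)) if i not in chosen),
            key=lambda i: min_dist[i],
        )
        selected.append(nxt)
        for i, p in enumerate(poses):
            min_dist[i] = min(min_dist[i], pose_distance(p, poses[nxt]))
    return selected


def sample_training_view(pool: Sequence[int], rng: random.Random) -> int:
    """Draw one view index from the pool at random, once per training step."""
    return rng.choice(list(pool))
```

Near-duplicate poses (for example two almost frontal frames) are skipped in favor of frames that add new yaw, pitch, or roll coverage, which is the property the multi-view pool relies on.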

3. Feature Fusion and Layer-Aware Adaptive Injection

Integrating multimodal face priors robustly into the video diffusion process demands precise, context-aware fusion. FantasyID pursues this by:

Channel-aligning 2D (local appearance) and 3D (geometry) features via a learned MLP, then concatenating and refining them through a transformer and a stack of 1D residual convolutions.
Yielding a fused embedding that carries both spatial detail and global geometric context.
For each DiT block $l$ , computing adaptive injection signals using attention-based fusion between the DiT latent features and the unified face descriptor. The injection is parameterized by a unique, learned adapter $F_l$ , permitting sophisticated, depth-sensitive modulation (coarse-to-fine control) of the identity cues at each layer.

This layer-aware adaptive strategy circumvents the pitfalls of naive cross-attention or uniform injection, delivering a balanced tradeoff between detailed identity retention and expressive facial motion.
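The idea of a frozen backbone with a separate learned adapter per block can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the published architecture: the exact form of $F_l$ is not given here, so the adapter below uses plain linear projections, single-head cross-attention from latent tokens to face tokens, and a scalar per-layer gate. `LayerAdapter`, `frozen_block`, and `forward` are hypothetical names.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LayerAdapter:
    """Hypothetical stand-in for the per-layer adapter F_l (the trainable part)."""

    def __init__(self, d_model: int, d_face: int,
                 rng: np.random.Generator, gate: float = 0.0):
        self.w_q = rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, d_model))
        self.w_k = rng.normal(0.0, 1.0 / np.sqrt(d_face), (d_face, d_model))
        self.w_v = rng.normal(0.0, 1.0 / np.sqrt(d_face), (d_face, d_model))
        # Layer-specific strength; a gate of 0 leaves the backbone output untouched.
        self.gate = gate

    def __call__(self, hidden: np.ndarray, face: np.ndarray) -> np.ndarray:
        q = hidden @ self.w_q                              # (T, d_model)
        k = face @ self.w_k                                # (M, d_model)
        v = face @ self.w_v                                # (M, d_model)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (T, M)
        return hidden + self.gate * (attn @ v)             # residual injection


def frozen_block(hidden: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Placeholder for one frozen DiT block; its weight is never updated."""
    return np.tanh(hidden @ weight)


def forward(hidden, face, block_weights, adapters):
    """Run every frozen block, injecting the face descriptor after each one."""
    for weight, adapter in zip(block_weights, adapters):
        hidden = frozen_block(hidden, weight)
        hidden = adapter(hidden, face)
    return hidden
```

Because each block owns its own adapter, the injection strength and projections can differ with depth, while the backbone weights in `block_weights` stay fixed; only the adapter parameters would receive gradients in training.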

4. Evaluation Criteria and Empirical Findings

FantasyID is benchmarked against state-of-the-art tuning-free identity-preserving video generation baselines (ID-Animator, ConsisID), using metrics that capture both fidelity and dynamics:

Reference Similarity (RS), assessed via ArcFace embeddings: FantasyID achieves 0.57, surpassing ConsisID (0.47) and ID-Animator (0.35).
Face Motion (FM), measured through dense optical flow: 0.61 for FantasyID vs. 0.54 (ConsisID) and 0.18 (ID-Animator).
Face-region Fréchet Inception Distance (FID, lower is better): 142.5 for FantasyID, close to ID-Animator (138.3) and better than ConsisID (149.7).
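An embedding-based similarity score of the Reference Similarity kind can be illustrated as follows. The sketch assumes that face embeddings have already been extracted by some recognition model (ArcFace is not invoked here) and that the score is the mean cosine similarity between the reference embedding and each generated frame; the averaging rule and the function names are assumptions, not the paper's exact definition.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def reference_similarity(reference: Sequence[float],
                         frames: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between a reference embedding and per-frame embeddings."""
    if not frames:
        raise ValueError("need at least one frame embedding")
    return sum(cosine_similarity(reference, f) for f in frames) / len(frames)
```

A video whose frames all match the reference scores near 1, while frames showing unrelated faces pull the average toward 0.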

In qualitative user studies (32 participants), FantasyID is rated highest for overall quality, face similarity, structural stability, and facial dynamics. Component ablations establish that omitting any single innovation (multi-view pools, 3D prior, fusion transformer, layer-aware injection) degrades identity preservation and motion consistency significantly (Zhang et al., 19 Feb 2025).

5. Data-Driven Document Generation and Forgery Detection

A parallel contribution under the FantasyID designation is a synthetic dataset for digital manipulation detection in identity documents. Notable features include:

13 ID-card templates in ten languages, populated with real (non-synthetic) faces sampled from diverse public datasets.
All 362 unique cards were physically printed and recaptured with three device types (two mobile cameras and one flatbed scanner) to simulate "bonafide" images.
Three attack types for forgeries: text inpainting/injection, face swapping, and "unseen" test manipulations using advanced generative editing tools (e.g., InSwapper, FaceDancer, TextDiffuser2, LaMA).

The resulting train/validation and test splits provide both in-domain (familiar algorithm) and out-of-domain (held-out method or template) conditions (Korshunov et al., 28 Jul 2025).

6. Baseline Detectors and Benchmarking Protocols

Detection experiments employ four leading open-source forensic detectors:

TruFor: Multi-branch transformer aggregating RGB plus noiseprint features.
MMFusion: Adds SRM and Bayar filter branches to TruFor for early fusion.
UniFD: A CLIP-ViT-L-based linear classifier.
FatFormer: CLIP-ViT-L with forgery-adapter and language-guided components.

The evaluation protocol selects each detector's decision threshold so that the false positive rate (FPR) on the validation set is 10%, then reports the resulting false negative rate (FNR), half-total error rate (HTER), AUC, and F1 on the test set. Results confirm a challenging regime: FNRs approach or exceed 50% in realistic conditions, particularly for face-only manipulations, showing that current detectors struggle with highly localized, blended, or photorealistic tampering.

Method	ACC (%)	AUC (%)	F1 (%)	FPR (%)	FNR (%)	HTER (%)
TruFor	65.9	93.5	80.7	4.9	62.0	33.4
MMFusion	55.1	94.4	73.7	4.0	47.7	25.8
UniFD	50.0	52.0	7.7	8.3	92.7	50.5
FatFormer	48.8	53.5	15.6	6.5	92.3	49.4
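The threshold-then-test protocol can be made concrete with a short sketch. It assumes that higher scores mean "more likely forged", that forged samples are the positive class, and that a sample is flagged when its score is strictly above the threshold; these conventions and the function names are assumptions for illustration. HTER is the mean of FPR and FNR.

```python
from typing import Sequence, Tuple


def threshold_at_fpr(bonafide_scores: Sequence[float],
                     target_fpr: float = 0.10) -> float:
    """Lowest threshold whose FPR on the given bona fide scores is <= target_fpr."""
    ranked = sorted(bonafide_scores, reverse=True)
    # Number of bona fide samples that may score strictly above the threshold.
    allowed = int(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def error_rates(bonafide_scores: Sequence[float],
                forged_scores: Sequence[float],
                threshold: float) -> Tuple[float, float, float]:
    """Return (FPR, FNR, HTER) for a fixed threshold."""
    fpr = sum(s > threshold for s in bonafide_scores) / len(bonafide_scores)
    fnr = sum(s <= threshold for s in forged_scores) / len(forged_scores)
    return fpr, fnr, (fpr + fnr) / 2.0


# Usage with made-up scores: tune on validation, then report on test.
val_bonafide = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
threshold = threshold_at_fpr(val_bonafide, target_fpr=0.10)
test_fpr, test_fnr, test_hter = error_rates(
    [0.10, 0.20, 0.50, 0.30], [0.20, 0.60, 0.70, 0.40], threshold
)
```

Because the threshold is fixed on validation data, the test-set FPR need not equal 10%, which is why the table reports FPR values below that target. The HTER column agrees with the mean of the FPR and FNR columns up to rounding, for example (4.0 + 47.7) / 2 ≈ 25.85 for MMFusion.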

The hardest detection targets are face blends, which lack sharp copy-paste boundaries. Text forgery is somewhat easier for noiseprint and SRM-based approaches. Detection robustness is further degraded by spatial downsampling (to 224×224 or 256×256), which erases subtle artifacts relevant to document forensics.
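The effect of downsampling on fine-grained traces can be seen in a deliberately contrived toy case. The sketch below assumes 2×2 average pooling as the resampling filter and a synthetic one-pixel checkerboard as the "artifact"; real pipelines typically use bilinear or bicubic resizing and real tampering traces are less regular, so this only illustrates the mechanism. All names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]


def average_pool(img: Image, factor: int) -> Image:
    """Downsample by averaging non-overlapping factor x factor blocks."""
    height, width = len(img), len(img[0])
    out: Image = []
    for r in range(0, height - height % factor, factor):
        row = []
        for c in range(0, width - width % factor, factor):
            block = [img[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def max_abs_difference(a: Image, b: Image) -> float:
    """Largest per-pixel absolute difference between two same-sized images."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def make_images(size: int = 8, amplitude: float = 4.0) -> Tuple[Image, Image]:
    """A flat gray image and a copy carrying a one-pixel checkerboard residue."""
    clean = [[128.0] * size for _ in range(size)]
    tampered = [
        [128.0 + amplitude * (1 if (r + c) % 2 == 0 else -1) for c in range(size)]
        for r in range(size)
    ]
    return clean, tampered


clean, tampered = make_images()
diff_full = max_abs_difference(clean, tampered)
diff_pooled = max_abs_difference(average_pool(clean, 2), average_pool(tampered, 2))
```

At full resolution the two images differ by the full artifact amplitude, but after pooling each block averages the positive and negative residues together and the difference vanishes, leaving a detector that only sees the small image with nothing to work from.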

7. Implications, Limitations, and Future Directions

The FantasyID framework demonstrates that tuning-free, large-scale generative models can synthesize identity- and motion-faithful face videos by leveraging 3D geometric priors, multi-view augmentation, multimodal feature fusion, and depth-adaptive signal injection, all without fine-tuning the diffusion backbone (Zhang et al., 19 Feb 2025). Quantitative and qualitative benchmarks favor this modular approach over prior identity-preserving text-to-video (IPT2V) methods, suggesting broad applicability to non-static identity conditioning tasks.

In the domain of document forensics, the FantasyID dataset sets a new standard for evaluating forgery detection under realistic acquisition and attack pipelines, revealing critical failure modes (e.g., photorealistic face swaps, localized text changes) for current detectors (Korshunov et al., 28 Jul 2025). This suggests an urgent need for detectors that analyze images at high resolution, localize tampered regions, and integrate both facial and document-specific features.

A plausible implication is that integrated frameworks, such as those outlined for controllable multi-modal avatar generation in DreamID-Omni (Guo et al., 12 Feb 2026), could be extended with the FantasyID strategies to enable both robust identity-preserving synthesis and adversarially aware forgery detection, even in fantasy or multi-character settings.

The public availability of the dataset and code under permissive licenses supports reproducibility and extension. Recommended avenues include explicit pixel-level tamper localization, multi-scale forensic feature fusion, and adaptation to newer generative editing pipelines. As generative models proliferate, FantasyID provides both a rigorous synthesis standard and a critical detection benchmark.