DreamID-V: Diffusion Transformer Face Swap

Updated 4 July 2026
  • DreamID-V is a diffusion transformer-based video face swapping framework that transfers a source identity onto a target video while preserving key attributes like pose, expression, lighting, and dynamic background.
  • It employs a paired-data construction pipeline (SyncID-Pipe) with explicit supervision and modality-aware conditioning to bridge the gap between image and video face swapping techniques.
  • The framework enhances temporal coherence and identity stability through a two-stage synthetic-to-real training curriculum combined with identity-focused reinforcement learning.

DreamID-V is a video face swapping framework that addresses the problem of transferring a source identity into a target video while preserving the target video’s pose, expression, lighting, background, and dynamics. In its own formulation, Video Face Swapping (VFS) requires generating a video in which the identity matches the source face and the attributes follow the driving video, with temporal consistency across frames. The system is presented as the first Diffusion Transformer-based VFS framework and is built around a paired-data construction pipeline, a video DiT conditioned on identity, pose, and context, a curriculum from synthetic to real supervision, an identity-focused reinforcement-learning procedure, and a dedicated benchmark, IDBench-V (Guo et al., 4 Jan 2026).

1. Problem formulation and design rationale

DreamID-V formalizes VFS as the generation of a video that combines the identity of a source face with the attributes of a target video. The attributes explicitly include background, pose, expression, lighting, and dynamics. Given a source identity image and a target video, the output video is required to preserve source identity, transfer target pose and expression accurately, retain lighting and background, and remain temporally coherent, both in motion smoothness and in identity stability across frames (Guo et al., 4 Jan 2026).

A central motivation of the framework is the gap between Image Face Swapping (IFS) and VFS. The supplied synthesis states that IFS methods such as DreamID and FaceAdapter achieve high identity and attribute fidelity at the image level, but frame-by-frame application to video causes flickering, jitter, and identity drift. Existing VFS methods improve temporal coherence, yet generally underperform strong IFS systems on identity similarity and fine attribute preservation. DreamID-V is therefore designed to “transfer the superiority” of IFS to the video domain through explicit supervision rather than weakly supervised or purely inpainting-based training.

The framework’s goals are correspondingly specific: high-fidelity identity preservation, strong attribute preservation, superior temporal coherence, robustness under small faces, large pose, occlusions, complex expressions, heavy motion, cluttered scenes, and animation-like or stylized inputs, as well as adaptability to other human-centric swap tasks. This suggests that DreamID-V is not only a generation architecture but also a data-construction and supervision strategy for identity-consistent video editing.

2. SyncID-Pipe and the construction of explicit supervision

The data side of DreamID-V is SyncID-Pipe, which constructs paired, identity-aligned video data using an Identity-Anchored Video Synthesizer (IVS) together with a strong IFS model. In the supplied description, the IFS model is DreamID, a diffusion-based face swapping method that uses Triplet ID Group supervision and SD Turbo for explicit, fast image-level swapping (Ye et al., 20 Apr 2025). DreamID-V uses this image-side strength as a teacher signal rather than attempting to discover video supervision implicitly.

SyncID-Pipe begins with a real image–video pair $(I_r, V_r)$ of identity $A$ and a target identity image $I_g$ of identity $B$. DreamID is applied to the first and last frames of $V_r$, swapping identity $B$ onto those keyframes to obtain $I_{\text{ref1}}$ and $I_{\text{ref2}}$. IVS then synthesizes a full video $V_g$ whose identity is $B$ and whose motion follows the pose sequence extracted from $V_r$. The resulting supervision structure is the bidirectional ID quadruplet

$$(I_r, V_r, I_g, V_g).$$

This quadruplet supports two kinds of training data. The forward-generated paired data uses the source image $I_r$, the synthetic target video $V_g$, and the real ground-truth video $V_r$. The backward-real paired data reverses the roles and is used for real augmentation training. The explicit alignment between identity and motion is the point: identity quality is inherited from the image swapper, motion is inherited from the real driving video, and supervision is no longer merely heuristic.

Two auxiliary mechanisms refine the quadruplets. Expression adaptation uses a 3D face reconstruction model to extract identity coefficients from the identity image and expression-plus-pose coefficients from each frame of the driving video, then projects the recomposed 3D face to retargeted landmarks. Enhanced background recomposition uses SAM2 for foreground masks, MiniMaxRemover for background extraction, and feathering for blending. The resulting composite video is used only as a conditional input, while the supervision target remains the real video. This design is intended to preserve realism without teaching the model to imitate compositing artifacts.
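The feathered blending step can be illustrated on a single row of pixels. This is a minimal sketch, assuming a box-blur feather and scalar intensities; `feather_mask` and `composite` are hypothetical helpers written for illustration and are not functions from SAM2, MiniMaxRemover, or the DreamID-V code.

```python
def feather_mask(mask, radius):
    """Soften a binary 1-D mask with a box blur (assumed feathering scheme)."""
    n = len(mask)
    soft = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = mask[lo:hi]
        soft.append(sum(window) / len(window))
    return soft


def composite(foreground, background, alpha):
    """Per-pixel alpha blend: alpha * foreground + (1 - alpha) * background."""
    return [a * f + (1.0 - a) * b for f, b, a in zip(foreground, background, alpha)]


# Hypothetical 9-pixel row: the foreground mask covers the middle three pixels.
mask = [0, 0, 0, 1, 1, 1, 0, 0, 0]
alpha = feather_mask(mask, radius=1)

# White foreground over black background, so the blend equals the soft mask.
blended = composite([1.0] * 9, [0.0] * 9, alpha)
```

The soft mask ramps from 0 to 1 across the boundary instead of switching abruptly, which is what hides the seam between the foreground and the recomposed background.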

3. Diffusion Transformer backbone and Modality-Aware Conditioning

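On the model side, DreamID-V adopts a video Diffusion Transformer trained with flow matching rather than a DDPM-style objective. In the supplied formulation, with clean video latent $x_1$, Gaussian noise $x_0 \sim \mathcal{N}(0, I)$, and timestep $t \in [0, 1]$, the latent interpolation is

$$x_t = t\,x_1 + (1 - t)\,x_0,$$

and the model predicts a velocity field $v_\theta(x_t, t, c)$ under conditions $c$ with loss

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0, x_1, t}\left[\left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2\right].$$

The clean video latent $x_1$ is produced by a video VAE, then patchified into spatio-temporal tokens for Transformer processing (Guo et al., 4 Jan 2026).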
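The flow-matching objective can be sketched on a toy latent. The snippet below is a minimal illustration, assuming the common linear-path convention in which `x1` is the clean latent and `x0` is Gaussian noise; the function names are hypothetical and the three-element vector merely stands in for a video latent.

```python
import random


def interpolate(x0, x1, t):
    """Linear path x_t = t * x1 + (1 - t) * x0 (x0: noise, x1: clean latent)."""
    return [t * c + (1.0 - t) * n for n, c in zip(x0, x1)]


def velocity_target(x0, x1):
    """Velocity along the linear path: d x_t / d t = x1 - x0."""
    return [c - n for n, c in zip(x0, x1)]


def flow_matching_loss(predicted_v, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = velocity_target(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted_v, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                     # stand-in for a clean video latent
x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise sample
xt = interpolate(x0, x1, 0.25)            # noisy latent that a model would receive

oracle_prediction = velocity_target(x0, x1)
loss_oracle = flow_matching_loss(oracle_prediction, x0, x1)
loss_zero_prediction = flow_matching_loss([0.0] * len(x1), x0, x1)
```

An oracle that outputs exactly `x1 - x0` drives the loss to zero, while any other prediction is penalized by its squared distance from that constant velocity.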

The core conditioning mechanism is Modality-Aware Conditioning (MAC), which decouples three signal classes. Spatio-temporal context consists of the reference video and a dilated face mask; it is concatenated with the noisy latent along the channel dimension, providing local spatial-temporal alignment for background, lighting, clothing, and the region to be edited. Structural guidance is the pose sequence, encoded by a Pose Guider and injected through Pose-Attention. Identity information is encoded from a source identity image and appended as tokens along the token dimension, allowing full interaction between identity tokens and video tokens across all frames.
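The two concatenation axes used by MAC can be made concrete with shape arithmetic. The following is a sketch under stated assumptions: the latent shape, the single-channel mask, the patch size, and the number of identity tokens are all hypothetical values chosen for illustration, not figures from the paper.

```python
def channel_concat_shape(noisy, context, mask):
    """Concatenate (C, T, H, W) tensors along C; T, H, W must match."""
    if not (noisy[1:] == context[1:] == mask[1:]):
        raise ValueError("temporal and spatial dimensions must agree")
    return (noisy[0] + context[0] + mask[0],) + tuple(noisy[1:])


def num_video_tokens(shape, patch):
    """Number of spatio-temporal tokens after patchifying a (C, T, H, W) latent."""
    _, t, h, w = shape
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


# Hypothetical latent shapes (channels, frames, height, width).
noisy_latent = (16, 8, 32, 32)
context_latent = (16, 8, 32, 32)   # encoded reference video
face_mask = (1, 8, 32, 32)         # dilated face mask, one channel

# Spatio-temporal context: channel-wise concatenation keeps the token count fixed.
dit_input = channel_concat_shape(noisy_latent, context_latent, face_mask)
video_tokens = num_video_tokens(dit_input, patch=(1, 2, 2))

# Identity: extra tokens appended along the token dimension.
identity_tokens = 256
sequence_length = video_tokens + identity_tokens
```

Channel concatenation widens each token without adding tokens, so the context stays pixel-aligned with the noisy latent. Token concatenation lengthens the sequence, so every video token can attend to the identity tokens.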

The Pose-Attention branch is inherited from IVS and operates on the latent features together with the pose features produced by the Pose Guider. Only the pose-path parameters are trained in IVS adaptation, while the original DiT attention remains frozen. In DreamID-V this becomes a structural control mechanism that guides motion and expression without collapsing the underlying video prior.

The architectural logic is explicit. Context channels impose local alignment with the target scene, pose-guided attention imposes structural motion control, and identity tokens impose a global identity constraint. The paper’s interpretation is that this separation helps disentangle identity from attributes and reduces identity leakage. This suggests that MAC is less a generic conditioning block than a modality-specific factorization of the face-swapping problem.

4. Training curriculum and identity-coherence reinforcement learning

DreamID-V uses a two-stage Synthetic-to-Real Curriculum. The first stage, Synthetic Training (ST), uses the forward-generated data. Because the synthetic video $V_g$ is produced by IVS and thus remains closer to the base video model’s distribution, this stage is described as accelerating convergence and producing high identity similarity. The second stage, Real Augmentation Training (RAT), uses backward-real paired data built around the composite video and the real target video; this stage improves visual realism and background consistency by adapting the model toward real-world distributions (Guo et al., 4 Jan 2026).

The ablation summary in the supplied data identifies complementary behavior. Removing ST yields better FVD but worse ID-Arc similarity; removing RAT yields better identity similarity but worse FVD. Combining both produces strong identity, good FVD, and low variance. The intended effect is therefore sequential: solve identity alignment in a synthetic-friendly regime, then recover realism while preserving identity performance.

Identity-Coherence Reinforcement Learning (IRL) addresses residual frame-wise identity instability. The generative model is treated as a policy, conditions are the state, generated frames are the action, and identity difficulty is quantified by a frame-wise $Q$-value. The $Q$-value is computed from the cosine similarity between the face-recognition embedding of a generated frame and that of the target identity image. Lower cosine similarity implies a higher $Q$-value and thus a harder frame.
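The relationship between cosine similarity and frame difficulty can be sketched as follows. The exact functional form is not given here, so the snippet assumes the simplest decreasing map, Q = 1 - cosine similarity; the two-dimensional embeddings and the helper names are hypothetical stand-ins for face-recognition features.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def identity_difficulty(frame_embedding, identity_embedding):
    """Assumed form: Q = 1 - cos(frame, identity); higher means a harder frame."""
    return 1.0 - cosine_similarity(frame_embedding, identity_embedding)


identity = [1.0, 0.0]         # embedding of the identity image (toy values)
frontal_frame = [0.9, 0.1]    # close to the identity direction
profile_frame = [0.5, 0.5]    # drifts away from the identity direction

q_frontal = identity_difficulty(frontal_frame, identity)
q_profile = identity_difficulty(profile_frame, identity)
```

Under this assumption the profile-like frame receives the larger value, matching the stated rule that lower similarity marks a harder frame.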

IRL is implemented as Q-weighted flow matching. After a no-gradient forward sampling pass, DreamID-V computes frame-wise $Q$-values, averages them within each VAE chunk, and reweights the flow-matching loss by the resulting chunk-level values. The supplied synthesis emphasizes that no separate Q-network or TD learning is required because identity similarity is directly computable. In effect, DreamID-V biases optimization toward profile frames, fast-motion frames, and other cases in which identity otherwise degrades.
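The chunk averaging and reweighting can be sketched in a few lines. This is an illustration under assumptions: the chunk size of four frames, the normalization of weights to mean one, and the function names are hypothetical choices, and the per-chunk losses are made-up numbers rather than model outputs.

```python
def chunk_means(frame_values, chunk_size):
    """Average frame-wise values within consecutive chunks of frames."""
    means = []
    for start in range(0, len(frame_values), chunk_size):
        chunk = frame_values[start:start + chunk_size]
        means.append(sum(chunk) / len(chunk))
    return means


def q_weighted_loss(chunk_losses, chunk_q):
    """Reweight per-chunk losses by Q, normalized so weights average to 1 (assumed)."""
    mean_q = sum(chunk_q) / len(chunk_q)
    if mean_q == 0.0:
        weights = [1.0] * len(chunk_q)   # fall back to uniform weighting
    else:
        weights = [q / mean_q for q in chunk_q]
    loss = sum(w * l for w, l in zip(weights, chunk_losses)) / len(chunk_losses)
    return loss, weights


# Frame-wise difficulty from a no-gradient sampling pass (toy values):
# the second chunk contains harder frames than the first.
frame_q = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3, 0.3, 0.3]
chunk_q = chunk_means(frame_q, chunk_size=4)

# Hypothetical per-chunk flow-matching losses.
loss, weights = q_weighted_loss([1.0, 2.0], chunk_q)
```

Because the weights are computed from sampled frames without gradients, they act as fixed coefficients on the loss, which is why no learned Q-network is needed.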

5. Benchmarking and reported empirical performance

DreamID-V introduces IDBench-V, a benchmark of 200 video–image pairs. The benchmark is explicitly described as covering real-world, challenging scenes, including small faces, extreme head poses, severe occlusions, complex or dynamic expressions, and cluttered multi-person scenes. Evaluation is organized into three metric groups: identity consistency, attribute preservation, and video quality (Guo et al., 4 Jan 2026).

Identity consistency is measured with ArcFace, InsightFace, and CurricularFace, reported as ID-Arc, ID-Ins, and ID-Cur, together with the variance of frame-wise identity similarities. Attribute preservation uses pose error from HopeNet, expression error from Deep3DFaceRecon, and VBench metrics for background consistency, subject consistency, and motion smoothness. Video quality is measured by Fréchet Video Distance (FVD) with a ResNeXt feature extractor.
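The identity score and its variance can be computed from frame-wise similarities as in the sketch below. The similarity values are illustrative and are not results from the paper; whether the benchmark uses population or sample variance is not stated, so the population form is an assumption.

```python
from statistics import mean, pvariance

# Illustrative frame-wise cosine similarities between swapped frames
# and the source identity (not results from the paper).
frame_similarities = [0.62, 0.60, 0.58, 0.61, 0.59]

id_score = mean(frame_similarities)           # per-video identity similarity
id_variance = pvariance(frame_similarities)   # population variance across frames (assumed)
```

A low variance alongside a high mean indicates that identity is both strong and stable across frames, which is what the variance column is meant to capture.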

On IDBench-V, the reported quantitative results place DreamID-V ahead of the listed baselines on identity consistency: its ID-Arc is higher than those of DreamID and CanonSwap, and its ID-Ins and ID-Cur are higher than DreamID’s. Its variance is slightly lower than CanonSwap’s. On attribute preservation, pose error is slightly worse than CanonSwap’s, while expression error is the best among the compared methods. Background and Subject consistency are also reported. FVD is close to CanonSwap’s and substantially better than that of the cited IFS baselines such as REFace.

A user study with 19 evaluators rated identity similarity, attribute preservation, and video quality on a 1–5 scale. The synthesis reports DreamID-V’s scores on all three criteria as the top scores in the comparison. Qualitative analysis emphasizes robustness under occlusions, complex expressions, extreme poses, animation-like inputs, and dynamic backgrounds.

The ablations are unusually central to the paper’s claims. Removing quadruplet supervision reduces ID-Arc and increases variance. Removing IRL leaves identity similarity at a reasonable level but raises variance. These results support the paper’s central argument that explicit supervision and chunk-weighted identity optimization, rather than the DiT backbone alone, are what close the IFS–VFS gap.

6. Place within the DreamID family and later extensions

DreamID-V sits between DreamID, an image face swapping model, and DreamID-Omni, an audio-video extension. The image-side DreamID framework is a diffusion-based face swapper that uses Triplet ID Group supervision, SD Turbo, and the modules SwapNet, FaceNet, and ID Adapter to achieve high-fidelity and fast image swapping at $512 \times 512$ resolution in $0.6$ seconds (Ye et al., 20 Apr 2025). DreamID-V explicitly uses DreamID as the IFS component inside SyncID-Pipe. In that sense, DreamID-V is not merely inspired by image swapping; it operationalizes image-level identity supervision as training data for video.

The later DreamID-Omni paper describes DreamID-V as a video-only face-swapping or character-replacement system and presents DreamID-Omni as its audio-video generalization and successor. DreamID-Omni extends the line to joint audio-video generation, unifies reference-based generation, editing, and audio-driven animation, adds timbre control and multi-person disentanglement, and introduces a Symmetric Conditional Diffusion Transformer, Synchronized RoPE, Structured Captions, and Multi-Task Progressive Training (Guo et al., 12 Feb 2026). In that description, DreamID-V becomes the video-only editing slice of a broader identity-preserving generative family.

DreamID-V also claims broader versatility within video. By replacing the IFS model in SyncID-Pipe with a more general image editing model, the same pipeline can supervise other swap-related tasks, including outfit, accessory, headphone, and hairstyle swapping. The paper states that no architectural changes are required; only the data-construction step changes what attribute is being swapped. This suggests that DreamID-V’s most reusable contribution may be the supervision mechanism rather than the face-swapping endpoint alone.

7. Limitations, ethical considerations, and nomenclatural ambiguity

The supplied description identifies several limitations. DreamID-V depends on large foundation models, including large DiT video models and VAEs, and therefore incurs high compute and memory costs. It also depends on substantial pretraining data: IVS is described as requiring 1000 hours of data, and DreamID-V uses 100+ hours of synthetic and real data. Failure modes are implied for very extreme or rare poses, highly stylized inputs beyond the training distribution, and heavy occlusions that fully hide the face (Guo et al., 4 Jan 2026).

The ethical concerns are explicit. DreamID-V can produce high-fidelity, temporally consistent deepfakes, with stated misuse risks that include non-consensual pornography, impersonation, fraud, and political disinformation. The cited mitigation is a click-through license prohibiting malicious or privacy-violating uses, together with a requirement that users obtain explicit consent from identifiable individuals before publishing outputs. Future directions in the supplied synthesis include watermarking, detection, longer videos, higher resolutions, and more fine-grained control over expression, identity, and partial swaps.

A plausible source of confusion arises from naming. The earlier paper "DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation" consistently uses the name DreamIdentity rather than DreamID-V, and it concerns optimization-free identity-preserving text-to-image generation with a multi-word multi-scale ID encoder and self-augmented editability learning (Chen et al., 2023). The supplied synthesis suggests that “DreamID” or “DreamID-V” may have been used informally as shorthand for that method, but this is an interpretation rather than the explicit nomenclature of the paper itself. In the published arXiv record, DreamID-V refers specifically to the Diffusion Transformer-based video face swapping framework of (Guo et al., 4 Jan 2026).
