DreamID-V: Diffusion Transformer Face Swap

Updated 4 July 2026

DreamID-V is a diffusion transformer-based video face swapping framework that transfers a source identity onto a target video while preserving key attributes like pose, expression, lighting, and dynamic background.
It employs a paired-data construction pipeline (SyncID-Pipe) with explicit supervision and modality-aware conditioning to bridge the gap between image and video face swapping techniques.
The framework enhances temporal coherence and identity stability through a two-stage synthetic-to-real training curriculum combined with identity-focused reinforcement learning.

DreamID-V is a video face swapping framework that addresses the problem of transferring a source identity into a target video while preserving the target video’s pose, expression, lighting, background, and dynamics. In its own formulation, Video Face Swapping (VFS) requires generating a video in which the identity matches the source face and the attributes follow the driving video, with temporal consistency across frames. The system is presented as the first Diffusion Transformer-based VFS framework and is built around a paired-data construction pipeline, a video DiT conditioned on identity, pose, and context, a curriculum from synthetic to real supervision, an identity-focused reinforcement-learning procedure, and a dedicated benchmark, IDBench-V (Guo et al., 4 Jan 2026).

1. Problem formulation and design rationale

DreamID-V formalizes VFS as the generation of a video that combines the identity of a source face with the attributes of a target video. The attributes explicitly include background, pose, expression, lighting, and dynamics. Given a source identity image and a target video, the output video is required to preserve source identity, transfer target pose and expression accurately, retain lighting and background, and remain temporally coherent, both in motion smoothness and in identity stability across frames (Guo et al., 4 Jan 2026).

A central motivation of the framework is the gap between Image Face Swapping (IFS) and VFS. IFS methods such as DreamID and FaceAdapter achieve high identity and attribute fidelity at the image level, but applying them frame by frame to video causes flickering, jitter, and identity drift. Existing VFS methods improve temporal coherence, yet generally underperform strong IFS systems on identity similarity and fine attribute preservation. DreamID-V is therefore designed to “transfer the superiority” of IFS to the video domain through explicit supervision rather than weakly supervised or purely inpainting-based training.

The framework’s goals are correspondingly specific: high-fidelity identity preservation, strong attribute preservation, superior temporal coherence, robustness under small faces, large pose, occlusions, complex expressions, heavy motion, cluttered scenes, and animation-like or stylized inputs, as well as adaptability to other human-centric swap tasks. This suggests that DreamID-V is not only a generation architecture but also a data-construction and supervision strategy for identity-consistent video editing.

2. SyncID-Pipe and the construction of explicit supervision

The data side of DreamID-V is SyncID-Pipe, which constructs paired, identity-aligned video data using an Identity-Anchored Video Synthesizer (IVS) together with a strong IFS model. The IFS model is DreamID, a diffusion-based face swapping method that uses Triplet ID Group supervision and SD Turbo for explicit, fast image-level swapping (Ye et al., 20 Apr 2025). DreamID-V uses this image-side strength as a teacher signal rather than attempting to discover video supervision implicitly.

SyncID-Pipe begins with a real image–video pair $(I_r, V_r)$ of identity $A$ and a target identity image $I_g$ of identity $B$. DreamID is applied to the first and last frames of $V_r$, swapping identity $B$ onto those keyframes to obtain $I_{\text{ref1}}$ and $I_{\text{ref2}}$. IVS then synthesizes a full video $V_g$ whose identity is $B$ and whose motion follows the pose sequence extracted from $V_r$. The resulting supervision structure is the bidirectional ID quadruplet

$(I_r, V_r, I_g, V_g)$

This quadruplet supports two kinds of training data. The forward-generated paired data uses the source image $I_r$, the synthetic target video $V_g$, and the real ground-truth video $V_r$. The backward-real paired data reverses the roles and is used for Real Augmentation Training. The explicit alignment between identity and motion is the point: identity quality is inherited from the image swapper, motion is inherited from the real driving video, and supervision is no longer merely heuristic.
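The quadruplet and the forward pairing can be pictured with a small data structure. This is a minimal sketch rather than the authors' code: the class names `IDQuadruplet` and `TrainingSample`, the function `forward_sample`, and the use of file paths as stand-ins for images and videos are all hypothetical, and only the forward direction described above is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IDQuadruplet:
    """Hypothetical container for one SyncID-Pipe quadruplet."""
    I_r: str  # real image of identity A
    V_r: str  # real video of identity A (provides the motion)
    I_g: str  # image of identity B
    V_g: str  # IVS-synthesized video of identity B following V_r's motion


@dataclass(frozen=True)
class TrainingSample:
    source_image: str  # identity to inject
    target_video: str  # video whose attributes are preserved
    ground_truth: str  # explicit supervision target


def forward_sample(q: IDQuadruplet) -> TrainingSample:
    # Forward-generated pair: swap identity A onto the synthetic video of B;
    # the real video of A is the ground truth.
    return TrainingSample(source_image=q.I_r, target_video=q.V_g, ground_truth=q.V_r)


quad = IDQuadruplet(I_r="a.png", V_r="a.mp4", I_g="b.png", V_g="b_synth.mp4")
sample = forward_sample(quad)
```

Because the ground truth is a real video, the model is never asked to reproduce synthetic frames in this direction; the synthetic video only appears on the input side.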

Two auxiliary mechanisms refine the quadruplets. Expression adaptation uses a 3D face reconstruction model to extract identity coefficients from $I_g$ and expression-plus-pose coefficients from each frame of $V_r$, then projects the recomposed 3D face to retargeted landmarks. Enhanced background recomposition uses SAM2 for foreground masks, MiniMaxRemover for background extraction, and feathering for blending. The resulting composite video is used only as a conditional input, while the supervision target remains the real video $V_r$. This design is intended to preserve realism without teaching the model to imitate compositing artifacts.
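The feathered blending step of background recomposition amounts to alpha compositing with a softened mask. The following is a minimal NumPy illustration under stated assumptions: the box-blur feathering, the function names `feather_mask` and `composite_frame`, and the toy array sizes are choices made here, and the foreground mask and clean background are taken as given inputs rather than produced by SAM2 or MiniMaxRemover.

```python
import numpy as np


def feather_mask(mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Soften a binary (H, W) mask with a separable box blur (assumed feathering)."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(mask.astype(np.float64), radius, mode="edge")
    # Blur along rows, then along columns; 'valid' restores the original size.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def composite_frame(fg, bg, mask, radius=1):
    """Blend a foreground frame over a background frame, both (H, W, 3) in [0, 1]."""
    alpha = feather_mask(mask, radius)[..., None]
    return alpha * fg + (1.0 - alpha) * bg


H, W = 8, 8
fg = np.ones((H, W, 3))    # stand-in for the synthesized subject (all white)
bg = np.zeros((H, W, 3))   # stand-in for the recovered background (all black)
mask = np.zeros((H, W))
mask[2:6, 2:6] = 1.0       # toy foreground region
out = composite_frame(fg, bg, mask, radius=1)
```

In this toy case, pixels deep inside the mask keep the foreground, pixels far outside keep the background, and pixels on the mask boundary take intermediate values, which is the soft seam that feathering is meant to provide.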

3. Diffusion Transformer backbone and Modality-Aware Conditioning

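On the model side, DreamID-V adopts a video Diffusion Transformer trained with flow matching rather than a DDPM-style objective. Written in the standard rectified-flow convention, with noise $x_0 \sim \mathcal{N}(0, I)$ and timestep $t \in [0, 1]$, the latent interpolation is

$x_t = t\,x_1 + (1 - t)\,x_0$

and the model predicts a velocity field $v_\theta(x_t, t, c)$, where $c$ denotes the conditions, with loss

$\mathcal{L}_{\text{FM}} = \mathbb{E}_{x_0, x_1, t}\left[\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \rVert_2^2\right]$

The clean video latent $x_1$ is produced by a video VAE, then patchified into spatio-temporal tokens for Transformer processing (Guo et al., 4 Jan 2026).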
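A flow-matching training step can be sketched in a few lines of NumPy. The sketch assumes the common rectified-flow convention in which $x_1$ is the clean latent, $x_0$ is Gaussian noise, and the regression target is the constant velocity $x_1 - x_0$; the paper's exact notation may differ. The function name `flow_matching_loss`, the toy tensor sizes, and the stand-in velocity functions are illustrative only.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, rng):
    """One Monte Carlo estimate of a rectified-flow style loss.

    x1: clean latents of shape (B, ...). Assumed convention:
    x_t = t * x1 + (1 - t) * x0 with x0 ~ N(0, I), target velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    # One timestep per sample, broadcast over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = t * x1 + (1.0 - t) * x0
    target = x1 - x0
    pred = velocity_fn(xt, t)
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
# Toy latent video batch: (batch, channels, frames, height, width).
x1 = rng.standard_normal((4, 2, 3, 8, 8))

# An oracle that knows x1 recovers the exact velocity from (x_t, t).
oracle = lambda xt, t: (x1 - xt) / (1.0 - t)
loss_oracle = flow_matching_loss(oracle, x1, rng)

# A predictor that always outputs zero incurs a large loss.
loss_zero = flow_matching_loss(lambda xt, t: np.zeros_like(xt), x1, rng)
```

The oracle is only a sanity check: it shows that the target velocity is fully determined by the clean latent and the interpolated point, which is what the network is trained to estimate without access to $x_1$.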

The core conditioning mechanism is Modality-Aware Conditioning (MAC), which decouples three signal classes. Spatio-temporal context consists of the reference video and a dilated face mask; it is concatenated with the noisy latent along the channel dimension, providing local spatial-temporal alignment for background, lighting, clothing, and the region to be edited. Structural guidance is the pose sequence, encoded by a Pose Guider and injected through Pose-Attention. Identity information is encoded from a source identity image and appended as tokens along the token dimension, allowing full interaction between identity tokens and video tokens across all frames.
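The two concatenation paths of MAC can be made concrete at the level of tensor shapes. This is a shape-only sketch with assumed toy sizes: the channel counts, patch size, token width, number of identity tokens, and the random linear projection standing in for the patch embedding are all hypothetical, and the pose path is omitted because it enters through attention rather than concatenation.

```python
import numpy as np

B, C, T, H, W = 1, 4, 3, 8, 8  # toy latent sizes (assumed)
D, P = 16, 2                    # token width and spatial patch size (assumed)

noisy = np.zeros((B, C, T, H, W))      # noisy video latent
ref_video = np.zeros((B, C, T, H, W))  # reference video latent (context)
face_mask = np.zeros((B, 1, T, H, W))  # dilated face mask (context)

# 1) Spatio-temporal context: concatenate along the channel axis.
dit_input = np.concatenate([noisy, ref_video, face_mask], axis=1)

# 2) Patchify into spatio-temporal tokens, then project to width D.
rng = np.random.default_rng(0)
c_in = dit_input.shape[1]
patches = dit_input.reshape(B, c_in, T, H // P, P, W // P, P)
patches = patches.transpose(0, 2, 3, 5, 1, 4, 6)
patches = patches.reshape(B, T * (H // P) * (W // P), c_in * P * P)
video_tokens = patches @ rng.standard_normal((c_in * P * P, D))

# 3) Identity: append identity tokens along the token axis.
id_tokens = rng.standard_normal((B, 4, D))  # 4 identity tokens (assumed count)
sequence = np.concatenate([video_tokens, id_tokens], axis=1)
```

Channel concatenation keeps context pixel-aligned with the latent being denoised, whereas token concatenation lets every video token attend to the identity tokens regardless of position or frame.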

The Pose-Attention branch is inherited from IVS. With latent features $I_g$ 6 and pose features $I_g$ 7, DreamID-V uses

$I_g$ 8

Only the pose-path parameters are trained in IVS adaptation, while the original DiT attention remains frozen. In DreamID-V this becomes a structural control mechanism that guides motion and expression without collapsing the underlying video prior.
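As an illustration only, one common way to add a trainable pose branch beside frozen attention is an additive cross-attention term. The sketch below assumes that form, which may differ from the paper's exact Pose-Attention formula; the function names, the dictionary layout of the weights, and the zero-initialized pose value projection are all assumptions introduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def pose_attention(z, p, frozen, pose):
    """Assumed additive form: frozen self-attention plus pose cross-attention."""
    q = z @ frozen["Wq"]
    base = attention(q, z @ frozen["Wk"], z @ frozen["Wv"])  # frozen DiT path
    extra = attention(q, p @ pose["Wk"], p @ pose["Wv"])     # trainable pose path
    return base + extra


rng = np.random.default_rng(0)
D = 8
z = rng.standard_normal((6, D))  # latent video tokens
p = rng.standard_normal((6, D))  # pose tokens from a pose encoder
frozen = {name: rng.standard_normal((D, D)) for name in ("Wq", "Wk", "Wv")}
# Zero-initialized value projection: the pose branch starts as a no-op.
pose = {"Wk": rng.standard_normal((D, D)), "Wv": np.zeros((D, D))}

out = pose_attention(z, p, frozen, pose)
base_only = attention(z @ frozen["Wq"], z @ frozen["Wk"], z @ frozen["Wv"])
```

With the pose value projection at zero, the output equals the frozen attention output, so training only the pose-path weights can introduce structural control gradually without disturbing the pretrained video prior at initialization.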

The architectural logic is explicit. Context channels impose local alignment with the target scene, pose-guided attention imposes structural motion control, and identity tokens impose a global identity constraint. The paper’s interpretation is that this separation helps disentangle identity from attributes and reduces identity leakage. This suggests that MAC is less a generic conditioning block than a modality-specific factorization of the face-swapping problem.

4. Training curriculum and identity-coherence reinforcement learning

DreamID-V uses a two-stage Synthetic-to-Real Curriculum. The first stage, Synthetic Training (ST), uses the forward-generated data. Because the target video $V_g$ is synthesized by IVS and thus remains closer to the base video model’s distribution, this stage is described as accelerating convergence and producing high identity similarity. The second stage, Real Augmentation Training (RAT), uses backward-real paired data built around the composite video and the real target $V_r$; this stage improves visual realism and background consistency by adapting the model toward real-world distributions (Guo et al., 4 Jan 2026).

The ablations identify complementary behavior. Removing ST yields better FVD but worse ID-Arc similarity; removing RAT yields better identity similarity but worse FVD. Combining both produces strong identity, good FVD, and low variance. The intended effect is therefore sequential: solve identity alignment in a synthetic-friendly regime, then recover realism while preserving identity performance.

Identity-Coherence Reinforcement Learning (IRL) addresses residual frame-wise identity instability. The generative model is treated as a policy, conditions are the state, generated frames are the action, and identity difficulty is quantified by

$B$ 3

where the embedding is produced by a face-recognition model and the reference is the target identity image. Lower cosine similarity implies a higher $Q$-value and thus a harder frame.

IRL is implemented as Q-weighted flow matching. After a no-gradient forward sampling pass, DreamID-V computes frame-wise $Q$-values, averages them within each VAE chunk, and uses the chunk-level averages to reweight the flow-matching loss. The paper emphasizes that no separate Q-network or TD learning is required because identity similarity is directly computable. In effect, DreamID-V biases optimization toward profile frames, fast-motion frames, and other cases in which identity otherwise degrades.
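The chunk-level reweighting can be sketched as follows. Several details are assumptions made for illustration rather than the paper's definitions: the difficulty score is taken to be $Q = 1 - \cos(\cdot,\cdot)$, the weights are normalized by their sum, and face embeddings and per-chunk losses are supplied as plain arrays instead of coming from a face-recognition model and a DiT.

```python
import numpy as np


def cosine(a, b):
    return np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )


def chunk_weights(frame_emb, id_emb, frames_per_chunk):
    """Frame-wise Q (assumed Q = 1 - cosine similarity), averaged per VAE chunk."""
    q = 1.0 - cosine(frame_emb, id_emb[None, :])         # (num_frames,)
    return q.reshape(-1, frames_per_chunk).mean(axis=1)  # (num_chunks,)


def q_weighted_loss(per_chunk_loss, weights):
    """Reweight per-chunk flow-matching losses; weights act as constants."""
    return float(np.sum(weights * per_chunk_loss) / np.sum(weights))


id_emb = np.array([1.0, 0.0])
# Four sampled frames: the first two match the identity, the last two do not.
frame_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
weights = chunk_weights(frame_emb, id_emb, frames_per_chunk=2)

per_chunk_loss = np.array([1.0, 3.0])  # stand-in flow-matching losses
loss = q_weighted_loss(per_chunk_loss, weights)
```

In this toy case the first chunk already matches the identity and receives zero weight, so the loss is driven entirely by the chunk where identity degraded. Because the weights come from a no-gradient sampling pass, they rescale the loss without adding a learned critic.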

5. Benchmarking and reported empirical performance

DreamID-V introduces IDBench-V, a benchmark of 200 video–image pairs. The benchmark is explicitly described as covering real-world, challenging scenes, including small faces, extreme head poses, severe occlusions, complex or dynamic expressions, and cluttered multi-person scenes. Evaluation is organized into three metric groups: identity consistency, attribute preservation, and video quality (Guo et al., 4 Jan 2026).

Identity consistency is measured with ArcFace, InsightFace, and CurricularFace, reported as ID-Arc, ID-Ins, and ID-Cur, together with the variance of frame-wise identity similarities. Attribute preservation uses pose error from HopeNet, expression error from Deep3DFaceRecon, and VBench metrics for background consistency, subject consistency, and motion smoothness. Video quality is measured by Fréchet Video Distance (FVD) with a ResNeXt feature extractor.
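The identity-consistency numbers reduce to simple statistics over frame-wise cosine similarities. The sketch below assumes that embeddings from a face-recognition model such as ArcFace are already available as arrays; the function name `identity_metrics` and the use of population variance are choices made here, not details confirmed by the paper.

```python
import numpy as np


def identity_metrics(frame_emb, src_emb):
    """Mean and variance of cosine similarity between each frame and the source."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    s = src_emb / np.linalg.norm(src_emb)
    sims = f @ s  # one similarity per frame
    return float(sims.mean()), float(sims.var())


src_emb = np.array([1.0, 0.0])
frame_emb = np.array([[2.0, 0.0], [0.0, 3.0]])  # toy embeddings for two frames
mean_sim, var_sim = identity_metrics(frame_emb, src_emb)
```

The mean captures how close the swapped face is to the source on average, while the variance captures identity stability over time: a method can score well on the mean and still flicker between frames.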

On IDBench-V, the reported quantitative results place DreamID-V ahead of the listed baselines on identity consistency: ID-Arc $B$ 9 versus DreamID $V_r$ 0 and CanonSwap $V_r$ 1, ID-Ins $V_r$ 2 versus DreamID $V_r$ 3, and ID-Cur $V_r$ 4 versus DreamID $V_r$ 5. Variance is reported as $V_r$ 6, slightly lower than CanonSwap’s $V_r$ 7. On attribute preservation, pose error is $V_r$ 8, slightly worse than CanonSwap’s $V_r$ 9, while expression error is $B$ 0, the best among the compared methods. Background and Subject are reported as $B$ 1 and $B$ 2. FVD is $B$ 3, close to CanonSwap’s $B$ 4 and substantially better than the cited IFS baselines such as REFace at $B$ 5.

A user study with 19 evaluators rated identity similarity, attribute preservation, and video quality on a 1–5 scale. The synthesis reports approximately $B$ 6 for identity similarity, $B$ 7 for attribute preservation, and $B$ 8 for video quality, all identified as the top scores in the comparison. Qualitative analysis emphasizes robustness under occlusions, complex expressions, extreme poses, animation-like inputs, and dynamic backgrounds.

The ablations are unusually central to the paper’s claims. Removing quadruplet supervision reduces ID-Arc to $B$ 9 and increases variance. Removing IRL leaves reasonable identity similarity at $I_{\text{ref1}}$ 0 but raises variance to $I_{\text{ref1}}$ 1. These numbers support the paper’s central argument that explicit supervision and chunk-weighted identity optimization, rather than the DiT backbone alone, are what close the IFS–VFS gap.

6. Place within the DreamID family and later extensions

DreamID-V sits between DreamID, an image face swapping model, and DreamID-Omni, an audio-video extension. The image-side DreamID framework is a diffusion-based face swapper that uses Triplet ID Group supervision, SD Turbo, and the modules SwapNet, FaceNet, and ID Adapter to achieve high-fidelity and fast image swapping at $512 \times 512$ resolution in $0.6$ seconds (Ye et al., 20 Apr 2025). DreamID-V explicitly uses DreamID as the IFS component inside SyncID-Pipe. In that sense, DreamID-V is not merely inspired by image swapping; it operationalizes image-level identity supervision as training data for video.

The later DreamID-Omni paper describes DreamID-V as a video-only face-swapping or character-replacement system and presents DreamID-Omni as its audio-video generalization and successor. DreamID-Omni extends the line to joint audio-video generation, unifies reference-based generation, editing, and audio-driven animation, adds timbre control and multi-person disentanglement, and introduces a Symmetric Conditional Diffusion Transformer, Synchronized RoPE, Structured Captions, and Multi-Task Progressive Training (Guo et al., 12 Feb 2026). In that description, DreamID-V becomes the video-only editing slice of a broader identity-preserving generative family.

DreamID-V also claims broader versatility within video. By replacing the IFS model in SyncID-Pipe with a more general image editing model, the same pipeline can supervise other swap-related tasks, including outfit, accessory, headphone, and hairstyle swapping. The paper states that no architectural changes are required; only the data-construction step changes what attribute is being swapped. This suggests that DreamID-V’s most reusable contribution may be the supervision mechanism rather than the face-swapping endpoint alone.

7. Limitations, ethical considerations, and nomenclatural ambiguity

The paper identifies several limitations. DreamID-V depends on large foundation models, including large DiT video models and VAEs, and therefore incurs high compute and memory costs. It also depends on substantial pretraining data: IVS is described as requiring 1000 hours of data, and DreamID-V uses 100+ hours of synthetic and real data. Failure modes are implied for very extreme or rare poses, highly stylized inputs beyond the training distribution, and heavy occlusions that fully hide the face (Guo et al., 4 Jan 2026).

The ethical concerns are explicit. DreamID-V can produce high-fidelity, temporally consistent deepfakes, with stated misuse risks that include non-consensual pornography, impersonation, fraud, and political disinformation. The cited mitigation is a click-through license prohibiting malicious or privacy-violating uses, together with a requirement that users obtain explicit consent from identifiable individuals before publishing outputs. Stated future directions include watermarking, detection, longer videos, higher resolutions, and more fine-grained control over expression, identity, and partial swaps.

A plausible source of confusion arises from naming. The earlier paper "DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation" consistently uses the name DreamIdentity rather than DreamID-V, and it concerns optimization-free identity-preserving text-to-image generation with a multi-word multi-scale ID encoder and self-augmented editability learning (Chen et al., 2023). “DreamID” or “DreamID-V” may occasionally be used informally as shorthand for that method, but this is an interpretation rather than the explicit nomenclature of the paper itself. In the published arXiv record, DreamID-V refers specifically to the Diffusion Transformer-based video face swapping framework of Guo et al. (4 Jan 2026).