DreamSwapV: End-to-End Video Subject Swapping

Updated 3 July 2026

DreamSwapV is a mask-guided, subject-agnostic framework that replaces a masked source subject with a target using reference-guided diffusion.
It employs a diffusion-transformer backbone with a novel Condition Fusion Module to align multi-modal inputs, ensuring temporal and spatial coherence.
Adaptive mask strategies and a two-phase training regimen enable state-of-the-art performance on benchmarks while preserving video motion and context.

DreamSwapV is a mask-guided, subject-agnostic, end-to-end video editing framework targeting generic subject swapping in unconstrained videos. Formulated as a specialized video-inpainting task, DreamSwapV replaces a masked source subject in each frame with a target subject specified via a user-supplied mask and reference image, while preserving the original video’s motion, background, and contextual interactions. The system builds on a diffusion-transformer backbone with a dedicated condition fusion approach, utilizes adaptive mask strategies to accommodate subjects of varying scales and attributes, and undergoes a two-phase dataset construction and training regimen. DreamSwapV establishes new state-of-the-art performance on automatic quantitative and perceptual benchmarks, notably on the DreamSwapV-Benchmark and VBench suite (Wang et al., 20 Aug 2025).

1. Problem Formulation and Framework Objectives

DreamSwapV formalizes subject swapping within video editing as the inpainting of a masked region, driven by a reference image, while maintaining all remaining spatiotemporal content. Let $V=\{v_t\}_{t=1}^T$ be a sequence of source video frames, with associated binary masks $M^s=\{m_t^s\}_{t=1}^T$ delineating the subject $s$ to swap. The user specifies $m_0^s$ on frame 0; subsequent masks are tracked throughout the sequence. Auxiliary per-frame motion information $P=\{p_t\}_{t=1}^T$ (encompassing 2D/3D human pose and 3D hand pose) provides guidance for accurately modeling dynamic interactions. The target appearance is encoded by a reference image $r^s$.

The objective is to learn a function

$f_\theta: (V \odot (1-M^s), M^s, P, r^s) \rightarrow V'$

where $V'$ recovers the background outside the mask and inpaints the masked region with the new subject appearance, preserving motion, temporal coherence, and realistic subject–context relationships. Training is self-supervised: a reference $r'$ is sampled from a random source frame via $r' = v_i \odot m_i^s$, and the model is trained to minimize a base diffusion-model loss $\mathcal{L}_{\text{base}}$ for reconstructing $V$ from its ablated version. At test time, substituting $r^s$ for $r'$ accomplishes subject swapping (Wang et al., 20 Aug 2025).
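The self-supervised pairing described above can be sketched in a few lines. This is a minimal illustration, not the authors' data pipeline: frames are simplified to single-channel H x W lists, and the helper names (`hadamard`, `invert`, `make_training_sample`) are hypothetical.

```python
import random


def hadamard(frame, mask):
    """Element-wise product of a frame and a binary mask (both H x W lists)."""
    return [[p * m for p, m in zip(frow, mrow)] for frow, mrow in zip(frame, mask)]


def invert(mask):
    """Return 1 - mask for a binary H x W mask."""
    return [[1 - m for m in row] for row in mask]


def make_training_sample(video, masks, rng=random):
    """Build (agnostic_video, reference) for one self-supervised sample.

    Assumption: single-channel frames; real frames would carry colour channels.
    """
    # Agnostic video: V ⊙ (1 - M^s), i.e. the subject is blanked out in every frame.
    agnostic = [hadamard(v, invert(m)) for v, m in zip(video, masks)]
    # Reference: r' = v_i ⊙ m_i^s for a randomly chosen frame index i.
    i = rng.randrange(len(video))
    reference = hadamard(video[i], masks[i])
    return agnostic, reference


# Two 2x2 frames with the subject at the top-right pixel.
video = [[[5, 6], [7, 8]], [[1, 2], [3, 4]]]
masks = [[[0, 1], [0, 0]], [[0, 1], [0, 0]]]
agnostic, reference = make_training_sample(video, masks, random.Random(0))
```

At test time the same inputs would be built from the source video, with the user-supplied reference taking the place of `reference`.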

2. Model Architecture and Condition Fusion

DreamSwapV builds on the Wan-I2V-14B diffusion-transformer (a DiT variant) architecture with a 3D-VAE encoder/decoder as its foundation. Its principal architectural innovation is the Condition Fusion Module (CFM), which enables strictly spatio-temporally aligned integration of five conditioning signals: mask, agnostic video (background), pose, hand, and reference.

Latent space projection (via the pretrained 3D-VAE) encodes both the agnostic video $V \odot (1-M^s)$ and the motion sequence $P$ into latent tensors of a common shape. The reference image $r^s$ is encoded separately into its own latent, which is then temporally duplicated and concatenated with the video latent along the time axis. This temporal concatenation, combined with a custom self-attention mask enforcing that video tokens attend to all tokens while reference tokens attend only to themselves, enables fine-grained reference injection without feature misalignment or parameter inflation.
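The asymmetric attention rule can be illustrated with a small boolean mask. This is a sketch under stated assumptions: tokens are ordered video-first and then reference, "attend only to themselves" is read as reference tokens attending to the reference group, and `build_attention_mask` is a hypothetical helper rather than part of any released code.

```python
def build_attention_mask(num_video_tokens, num_ref_tokens):
    """Return allowed[q][k]: True when query token q may attend to key token k.

    Assumed token order: video tokens first, reference tokens last.
    """
    total = num_video_tokens + num_ref_tokens
    allowed = []
    for q in range(total):
        if q < num_video_tokens:
            # Video tokens attend to every token, including the reference.
            row = [True] * total
        else:
            # Reference tokens attend only to reference tokens.
            row = [k >= num_video_tokens for k in range(total)]
        allowed.append(row)
    return allowed


attn = build_attention_mask(num_video_tokens=3, num_ref_tokens=2)
```

In an attention layer, the `False` entries would typically be set to negative infinity in the logits before the softmax, so the reference features stay unaffected by the noisy video tokens while still being readable by them.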

The binary mask sequence $M^s$ is reshaped by grouping raw frames and downsampling spatially to match the latents. Following zero-padding of the reference in the background and pose streams, final feature fusion is performed via concatenation along the channel dimension:

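$z_{\text{fused}} = \operatorname{Concat}_{\text{channel}}\big(z_{\text{mask}},\; z_{\text{agnostic}},\; z_{\text{motion}}\big)$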

The fused representation, augmented by reference tokens in self-attention, is then processed by the diffusion transformer to predict noise residuals at each denoising step (Wang et al., 20 Aug 2025).

3. Adaptive Mask Strategy

To mitigate “shape leakage” (overfitting to mask contours) and artifacts from overly coarse masks, DreamSwapV incorporates two complementary mask augmentations:

Adaptive Grid Sizing: Each frame is divided into a grid of blocks, with the grid parameters along height and width inversely proportional to subject size. Any grid cell containing mask pixels is dilated. The height parameter follows one rule during training and another at inference, and the width parameter is set analogously. Larger subjects thus maintain finer mask fidelity, while smaller subjects generalize more effectively.
Shape Augmentation: For 30% of samples, only a bounding-box dilation is applied to produce the coarsest mask. Otherwise, randomly selected geometric shapes (circles, triangles, rectangles) are overlaid at grid boundaries to decouple shape cues from subject content.

These augmentations enhance the robustness of the model across subject categories and scales (Wang et al., 20 Aug 2025).
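The grid-based dilation can be sketched as follows. This is an illustration only: "dilated" is interpreted here as filling every grid cell that contains at least one mask pixel, the grid counts `n_rows` and `n_cols` are passed in by hand (the rule that derives them from subject size is not modelled), and `grid_dilate` is a hypothetical name.

```python
def grid_dilate(mask, n_rows, n_cols):
    """Fill every grid cell that contains at least one mask pixel.

    mask is a binary H x W list; n_rows and n_cols are assumed grid counts.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for gr in range(n_rows):
        r0, r1 = gr * h // n_rows, (gr + 1) * h // n_rows
        for gc in range(n_cols):
            c0, c1 = gc * w // n_cols, (gc + 1) * w // n_cols
            touched = any(mask[r][c] for r in range(r0, r1) for c in range(c0, c1))
            if touched:
                for r in range(r0, r1):
                    for c in range(c0, c1):
                        out[r][c] = 1
    return out


mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
coarse = grid_dilate(mask, 2, 2)  # large cells hide the subject's contour
fine = grid_dilate(mask, 4, 4)    # small cells keep the mask tight
```

A coarser grid discards more of the original contour, which is the property that counters shape leakage; a finer grid stays closer to the tracked mask.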

4. Training Objectives and Two-Phase Training Regimen

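The training objective is built around a diffusion noise-prediction loss $\mathcal{L}_{\text{base}}$, commonly used in pre-trained diffusion backbone models. To further bias inpainting accuracy on masked regions, especially for small subjects, a subject-region re-weighting loss $\mathcal{L}_{\text{subj}}$ is introduced:

$\mathcal{L}_{\text{subj}} = \frac{N}{N^s}\,\big\lVert M^s \odot (\epsilon - \epsilon_\theta) \big\rVert_2^2$

where $\epsilon$ and $\epsilon_\theta$ are the sampled and predicted noise, and $N$ and $N^s$ are pixel counts of the full frame and the subject mask, respectively. The combined objective is

$\mathcal{L} = \mathcal{L}_{\text{base}} + \lambda\,\mathcal{L}_{\text{subj}}$

with $\lambda$ set empirically.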
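A toy version of the combined objective is sketched below, assuming the re-weighting factor is the ratio of full-frame to subject-mask pixel counts applied to a squared error restricted to the mask. The exact form used in the paper may differ; flat Python lists stand in for tensors, and the names `base_loss`, `subject_loss`, `total_loss`, and `lam` are hypothetical.

```python
def base_loss(pred, target):
    """Mean squared error over all pixels (stand-in for the base diffusion loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def subject_loss(pred, target, mask):
    """Masked squared error, scaled by N_full / N_subject (assumed form)."""
    n_full = len(mask)
    n_subj = sum(mask)
    if n_subj == 0:
        return 0.0  # no subject pixels, so nothing to re-weight
    masked = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / n_full
    return (n_full / n_subj) * masked


def total_loss(pred, target, mask, lam=1.0):
    """Base loss plus the weighted subject-region term; lam is set by hand."""
    return base_loss(pred, target) + lam * subject_loss(pred, target, mask)


# One of four pixels belongs to the subject, and it is the only wrong one.
pred = [0.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 2.0]
mask = [0, 0, 0, 1]
loss = total_loss(pred, target, mask, lam=0.5)
```

With this scaling, the subject term equals the mean error over subject pixels alone, so a small subject contributes as much as a large one instead of being averaged away by the background.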

Dataset construction and training proceed in two phases:

Phase 1 (Pre-training): Starting with 8K HumanVID videos, subject captions are generated with TikTok-VFM-7B, and per-frame masks with TrackingSAM. Filtering ensures a balanced subject-type distribution (humans:garments:small:large ≈ 1:0.2:1:1). Poses are extracted using the DWPose and HaMeR detectors. Reference augmentation (scaling, rotation, flips, brightness) prevents “reference leakage”. Only the self-attention layers are trained in this phase (15K iterations, 32 H100 GPUs).
Phase 2 (Quality Tuning): High-quality reference–video pairs (~800) from AnyInsertion, Subject200K, and AnchorCrafter-400 are converted to video format by Wan-I2V. All parameters are unfrozen for 10K iterations of fine-tuning, enhancing generalization to cross-domain inputs.

During inference, for very small masks (<5% area), a “tunnel” inpainting technique (crop → inpaint → blend) is applied to concentrate modeling capacity on fine details (Wang et al., 20 Aug 2025).
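The crop, inpaint, and blend steps can be sketched for a single frame as below. This is a simplified illustration: the blend is a hard paste inside the mask rather than a feathered composite, `inpaint_fn` is a placeholder for the actual model call, and the names `tunnel_inpaint`, `threshold`, and `pad` are hypothetical. For a video, a bounding box shared across frames would likely be needed.

```python
def mask_area_ratio(mask):
    """Fraction of pixels in a binary H x W mask that are set."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))


def bounding_box(mask, pad):
    """Padded bounding box (r0, r1, c0, c1) of a non-empty mask, end-exclusive."""
    h, w = len(mask), len(mask[0])
    rows = [r for r in range(h) if any(mask[r])]
    cols = [c for c in range(w) if any(mask[r][c] for r in range(h))]
    return (max(rows[0] - pad, 0), min(rows[-1] + 1 + pad, h),
            max(cols[0] - pad, 0), min(cols[-1] + 1 + pad, w))


def tunnel_inpaint(frame, mask, inpaint_fn, threshold=0.05, pad=1):
    """Crop around a tiny mask, inpaint the crop, and paste the result back."""
    ratio = mask_area_ratio(mask)
    if ratio == 0 or ratio >= threshold:
        return inpaint_fn(frame, mask)  # large or empty mask: no tunnelling
    r0, r1, c0, c1 = bounding_box(mask, pad)
    crop = [row[c0:c1] for row in frame[r0:r1]]
    crop_mask = [row[c0:c1] for row in mask[r0:r1]]
    filled = inpaint_fn(crop, crop_mask)
    out = [row[:] for row in frame]
    for r in range(r0, r1):
        for c in range(c0, c1):
            if mask[r][c]:  # hard blend: only masked pixels are replaced
                out[r][c] = filled[r - r0][c - c0]
    return out


seen_sizes = []


def toy_inpaint(frame, mask):
    """Placeholder model: records the input size and writes 9 into masked pixels."""
    seen_sizes.append((len(frame), len(frame[0])))
    return [[9 if m else p for p, m in zip(fr, mr)] for fr, mr in zip(frame, mask)]


frame = [[0] * 5 for _ in range(5)]
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1  # 1 of 25 pixels, i.e. 4% of the frame
result = tunnel_inpaint(frame, mask, toy_inpaint)
```

Here the placeholder model only ever sees the 3 x 3 crop, which is the point of the technique: the subject occupies a much larger share of the model's input than it does in the full frame.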

5. Evaluation Protocols and Quantitative Benchmarks

Comprehensive evaluation employs the DreamSwapV-Benchmark, a curated challenge set of 100 videos sourced from Pexels.com, encompassing four aspect ratios and 167 annotated subject instances across diverse categories (humans, animals, apparel, handheld and large objects). The quantitative protocol includes the VBench metrics:

Subject Consistency (SC)
Background Consistency (BC)
Motion Smoothness (MS)
Dynamic Degree (DD)
Aesthetic Quality (AQ)

with per-metric and average scores reported as percentages or normalized values. Two additional automatic metrics are introduced:

Reference Appearance (RA): measures similarity between swapped subject and reference image.
Background Preservation (BP): quantifies background similarity between source and output.

A user study collects crowdsourced ratings (1–5 scale) on Reference Detail, Subject Interaction, and Visual Fidelity. Table 1 from the original study summarizes outcomes across methods:

Method	Avg. VBench	RA (%)	BP (%)	Reference Detail	Subject Interaction	Visual Fidelity
AnyV2V	—	—	—	—	—	—
VACE	—	—	—	—	—	—
HunyuanCustom	—	—	—	—	—	—
Kling 1.6	79.79	—	—	—	—	—
DreamSwapV	80.44	45.22	52.49	3.35	3.39	3.32

DreamSwapV attains the highest VBench average, RA, and BP, and leads human ratings across all perceptual criteria. Notably, AnyV2V suffers from background collapses, while Kling 1.6 sometimes regenerates backgrounds destructively, underscoring DreamSwapV’s robustness (Wang et al., 20 Aug 2025).

6. Implementation, Practical Constraints, and Future Directions

DreamSwapV’s training utilizes 32 NVIDIA H100 GPUs, with approximately 12.5 days for both pre-training (15K iterations) and tuning (10K iterations). Inference for 720p videos executes at ∼2 seconds/frame. Processing of longer videos is managed via overlapping segments, using a “first-frame dummy reference” to ensure temporal coherence.
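The splitting of a long video into overlapping segments can be sketched as below. The segment length and overlap are illustrative values, not figures from the source, and the "first-frame dummy reference" mechanism is not modelled; `overlapping_segments` is a hypothetical helper.

```python
def overlapping_segments(num_frames, seg_len, overlap):
    """Return end-exclusive (start, end) frame ranges covering the whole video.

    Consecutive segments share `overlap` frames; seg_len and overlap are
    assumed parameters chosen by the caller.
    """
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    if num_frames <= 0:
        return []
    segments = []
    start = 0
    while True:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start = end - overlap  # step back so the next segment re-uses shared frames
    return segments


segments = overlapping_segments(num_frames=10, seg_len=4, overlap=1)
```

The shared frames give each new segment a view of already-generated content, which is one common way to keep appearance consistent across segment boundaries.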

Limitations include potential minor hallucinations when references are extremely cross-domain (e.g., large differences in scale/lighting), and the need for manual threshold adjustment in tunnel inpainting for tiny objects. The model currently supports up to 720p resolution.

Future research directions identified include extending DreamSwapV to higher resolutions, integrating explicit temporal-consistency or adversarial losses, augmenting with audio editing capabilities, and enabling combined text + reference conditioning for enhanced control (Wang et al., 20 Aug 2025).

References

Wang et al. (2025). DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing.
