DreamActor-M2: Universal Image Animation

Updated 5 February 2026
  • DreamActor-M2 is a universal character image animation framework that transfers motion from driving videos to static reference images without using explicit pose priors.
  • It employs a two-stage design combining spatiotemporal in-context learning with self-bootstrapped data synthesis to balance identity preservation and motion fidelity.
  • Benchmark results on AWBench demonstrate superior imaging quality, motion smoothness, and temporal consistency compared to prior state-of-the-art methods.

DreamActor-M2 is a universal character image animation framework that synthesizes high-fidelity video sequences by transferring motion from a driving sequence to a static reference image. Distinct from previous approaches, DreamActor-M2 eliminates the trade-off between identity preservation and motion fidelity and forgoes explicit pose priors, achieving robust cross-domain generalization for arbitrary characters—including non-humanoid types—via spatiotemporal in-context learning and self-bootstrapped data synthesis (Luo et al., 29 Jan 2026).

1. Problem Formulation and Motivation

Character image animation requires generating an output sequence $\hat{Y} \in \mathbb{R}^{T \times H \times W \times 3}$ in which the subject from a static reference image $I_\text{ref} \in \mathbb{R}^{H \times W \times 3}$ performs the motions observed in a driving video $D \in \mathbb{R}^{T \times H \times W \times 3}$. Previous approaches struggled with two central issues:

  • See-saw Trade-off: Methods reliant on channel-wise pose injection (e.g., skeleton concatenation) enforce strong spatial alignment but leak reference structure ("shape leakage"), deforming identity. Cross-attention injection, while decoupling identity, compresses motion signals, reducing temporal detail fidelity.
  • Pose Prior Dependence: Most frameworks rely on explicit pose priors (e.g., 2D skeletons, SMPL), leading to poor generalization to non-humanoid or occluded scenarios. Approaches that avoid direct pose inputs typically require per-video adaptation or pose-based supervision during training, constraining scalability.

DreamActor-M2 addresses both, introducing a unified latent space for identity and motion without explicit pose signals through its two-stage design (Luo et al., 29 Jan 2026).
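For concreteness, the following is a minimal sketch of the task's input and output shapes; PyTorch tensors and the illustrative values are assumptions, not specifics from the paper.

import torch

T, H, W = 16, 512, 512              # clip length and frame size (illustrative values)
I_ref = torch.rand(H, W, 3)         # static reference image providing the identity
D = torch.rand(T, H, W, 3)          # driving video providing the motion
# The model outputs Y_hat of shape (T, H, W, 3): the reference subject
# performing the driving motion.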

2. Model Design: Two-Stage Paradigm

2.1 Stage 1: Spatiotemporal In-Context Learning

Instead of separate modules for motion injection, DreamActor-M2 utilizes a pre-trained video diffusion backbone (Seedance 1.0, an MMDiT transformer). It constructs a spatiotemporal context tensor $C \in \mathbb{R}^{T \times H \times 2W \times 3}$, which spatially concatenates the reference frame with each driving frame:

  • At $t = 0$: $C[0] = I_\text{ref} \oplus D[0]$, where $\oplus$ denotes width-wise spatial concatenation
  • At $t > 0$: $C[t] = \mathbf{0} \oplus D[t]$, where $\mathbf{0} \in \mathbb{R}^{H \times W \times 3}$ is a blank placeholder

Corresponding binary masks are generated: $M_r$ (reference mask) and $M_m$ (motion mask), concatenated into $M$. Both $C$ and $M$ are encoded by a 3D VAE into a latent $Z \in \mathbb{R}^{T \times h \times 2w \times c}$. The diffusion transformer $\epsilon_\theta$ denoises the noisy latent $Z_t$ conditioned on the context latent $Z$ and mask $M$. The diffusion loss:

$L_\text{diff} = \mathbb{E}_{t,\epsilon} \left\| \epsilon - \epsilon_\theta(Z_t, \{Z, M\}, t) \right\|_2^2$

enables joint reasoning about identity and motion.
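Below is a minimal sketch of the context and mask construction described above, assuming PyTorch; the mask layout and helper names are illustrative, not the released implementation.

import torch

def build_context(I_ref, D):
    """I_ref: (H, W, 3); D: (T, H, W, 3) -> C: (T, H, 2W, 3), M: (T, H, 2W, 2)."""
    T, H, W, _ = D.shape
    blank = torch.zeros(H, W, 3)
    left = torch.stack([I_ref] + [blank] * (T - 1))    # reference frame only at t = 0
    C = torch.cat([left, D], dim=2)                    # width-wise concat: (T, H, 2W, 3)
    M_r = torch.cat([torch.ones(T, H, W, 1),
                     torch.zeros(T, H, W, 1)], dim=2)  # reference mask covers the left half...
    M_r[1:] = 0.0                                      # ...and is active only at t = 0
    M_m = torch.cat([torch.zeros(T, H, W, 1),
                     torch.ones(T, H, W, 1)], dim=2)   # motion mask covers the driving half
    M = torch.cat([M_r, M_m], dim=-1)                  # stack the two masks along channels
    return C, M

# Training (schematic): encode C and M with the 3D VAE to obtain Z, noise the
# target latent to Z_t, and train eps_theta to predict the added noise
# conditioned on (Z, M), as in L_diff above.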

2.2 Stage 2: Self-Bootstrapped Data Synthesis for End-to-End Training

To enable direct RGB-guided animation without skeletons, Stage 2 synthesizes pseudo cross-identity training triplets using the pose-based model from Stage 1. The synthesis procedure, in pseudocode:

triplets = []                                  # synthesized training set
for V_src in web_crawled_videos:               # source videos from the web crawl
    P_src = PoseExtractor(V_src)               # extract the source pose sequence
    I_o = choose_random_reference_image()      # reference image with a different identity
    V_o = M_pose(P_src, I_o)                   # Stage 1 pose-based model re-renders the motion
    if QualityFilter(V_o, V_src):              # Video-Bench score > 4.5 plus manual check
        I_ref = V_src[0]                       # first frame of the source video as reference
        triplets.append((V_o, I_ref, V_src))   # pseudo cross-identity triplet

QualityFilter requires a Video-Bench score $> 4.5$ plus manual validation. The resulting ~60K triplets supervise the end-to-end RGB-driven model, which reconstructs $V_\text{src}$ from the (generated $V_o$, $I_\text{ref}$) context. The model is warm-started from Stage 1 weights to inherit motion priors and ensure sample efficiency.
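Under the same placeholder names as the Section 2.1 sketch, one synthesized triplet might be consumed in Stage 2 roughly as follows: the generated clip V_o acts as the RGB driving signal and V_src is the reconstruction target. The noising step is simplified (no scheduler) and mask encoding is omitted; this is a sketch, not the paper's training code.

import torch.nn.functional as F

def stage2_step(vae, eps_theta, build_context, V_o, I_ref, V_src, t, noise):
    C, M = build_context(I_ref, V_o)      # RGB driving context, no skeleton input
    Z = vae.encode(C)                     # conditioning latents (mask encoding omitted here)
    Z_tgt = vae.encode(V_src)             # reconstruction target in latent space
    Z_t = Z_tgt + t * noise               # schematic noising; a real scheduler is assumed
    eps_hat = eps_theta(Z_t, Z, M, t)     # predict the added noise
    return F.mse_loss(eps_hat, noise)     # diffusion reconstruction loss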

3. Network Components and Optimization Objectives

  • Encoder/Decoder: a 3D VAE ($\xi$, $\xi^{-1}$) maps RGB frames ($H \times 2W$) to and from latents ($h \times 2w \times c$)
  • Diffusion Transformer ($\epsilon_\theta$): multimodal MMDiT backbone stacked along space and time; receives the context and mask via channel concatenation at each layer
  • LoRA Adaptation: low-rank adapters ($r = 256$) in each transformer FFN, allowing the backbone parameters to remain frozen (a generic sketch follows this list)
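A generic low-rank adapter for a frozen linear projection, at the rank quoted above; this is a standard LoRA sketch in PyTorch, not the paper's exact backbone integration.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank residual of rank r."""
    def __init__(self, base: nn.Linear, r: int = 256, alpha: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze backbone weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)        # freeze backbone bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)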

Main losses:

  1. Diffusion Reconstruction (primary): $L_\text{diff} = \mathbb{E}_{x, t, \epsilon} \|\epsilon - \epsilon_\theta(z_t, c, t)\|_2^2$
  2. Optional Identity Consistency (via ArcFace): $L_\text{id} = \mathbb{E}_{I_\text{ref}, \hat{Y}}\big[1 - \cos\big(\phi_\text{id}(I_\text{ref}), \phi_\text{id}(\hat{Y})\big)\big]$
  3. Optional Motion Consistency: $L_\text{motion} = \mathbb{E}\big\| \psi(\hat{Y}) - P_\text{tar} \big\|_2^2$
  4. Adversarial Loss: $L_\text{adv} = \mathbb{E}_\text{real}[-\log D(Y)] + \mathbb{E}_\text{fake}[-\log (1 - D(\hat{Y}))]$

Overall training objective:

$L_\text{total} = \lambda_\text{diff} L_\text{diff} + \lambda_\text{id} L_\text{id} + \lambda_\text{motion} L_\text{motion} + \lambda_\text{adv} L_\text{adv}$

with $\lambda_\text{diff} = 1.0$, $\lambda_\text{id} = 0.1$, $\lambda_\text{motion} = 0.1$, $\lambda_\text{adv} = 0.01$ (grid-searched).
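A minimal sketch of the weighted combination with the values quoted above; the individual loss terms are assumed to be computed elsewhere (e.g., by the Stage 2 step sketched earlier).

def total_loss(l_diff, l_id=0.0, l_motion=0.0, l_adv=0.0,
               w_diff=1.0, w_id=0.1, w_motion=0.1, w_adv=0.01):
    """Combine the diffusion, identity, motion, and adversarial terms."""
    return w_diff * l_diff + w_id * l_id + w_motion * l_motion + w_adv * l_adv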

4. Benchmarking: AWBench and Evaluation Metrics

AWBench ("Animate in the Wild") is introduced to enable comprehensive evaluation of universal character animation:

  • Driving corpus: 100 videos spanning humans (various part-/full-body types and activities), animals, and cartoons
  • Reference set: 200 static images across matching categories, including multi-subject scenes
  • Scenarios: one-to-one, one-to-many, and many-to-many cross-identity transfers

Evaluation Metrics:

  • Video-Bench human-aligned automatic scores (Han et al., 2025): Imaging Quality, Motion Smoothness, Temporal Consistency, Appearance Consistency
  • Human User Study: 12 participants, 5-point scale ratings for imaging, motion, and appearance
  • Common generative metrics (not directly applicable in cross-identity settings): FID, LPIPS, ID-Score

5. Experimental Analysis

DreamActor-M2 demonstrates state-of-the-art performance on AWBench:

  • Pose-based DreamActor-M2: 4.68/4.53/4.61/4.28 (Imaging/Motion/Temporal/Appearance) on Video-Bench, versus $\leq 4.21$ for the previous DreamActor-M1 and $\leq 4.06$ for other prior SOTA methods
  • End-to-end DreamActor-M2: 4.72/4.56/4.69/4.35, further surpassing Stage 1
  • Human studies: 4.27 ± 0.18 (Imaging), 4.24 ± 0.23 (Motion), 4.20 ± 0.29 (Appearance), exceeding all baselines by at least 0.3 points
  • Platform-level (GSB): +9.66% over commercial Kling 2.6, +51% over DreamActor-M1

Noteworthy findings:

  • Fine-grained hand/body preservation across domains
  • Robustness to incomplete driving signals (e.g., hallucinated lower body from half-body input)
  • Effective multi-subject and non-human (animal→animal, cartoon→cartoon) animation
  • Ablations: Removing spatiotemporal ICL, pose augmentation, or text guidance each degrade specific metrics; end-to-end outperforms pose-based in ambiguous cases

6. Limitations and Future Directions

DreamActor-M2 exhibits failure cases in complex multi-person interactions (e.g., interlocking/orbiting trajectories) due to insufficient training data covering such scenarios. The architecture is computationally intensive (3D VAE + transformer, ≥ 24 GB of GPU memory, ~4 s per 16-frame clip). Future research priorities:

  • Curation of multi-person interaction datasets
  • Exploration of architectures with sparser attention or dynamic tokenization for efficiency
  • Integration of 3D scene priors or trajectory-aware modules to better model character interactions and crossing paths

DreamActor-M2 establishes a unified, plug-and-play framework for universal character image animation, balancing identity fidelity and motion realism via spatiotemporal in-context learning while eliminating reliance on explicit pose priors (Luo et al., 29 Jan 2026).
