DreamActor-M2: Universal Image Animation

Updated 5 February 2026

DreamActor-M2 is a universal character image animation framework that transfers motion from driving videos to static reference images without using explicit pose priors.
It employs a two-stage design combining spatiotemporal in-context learning with self-bootstrapped data synthesis to balance identity preservation and motion fidelity.
Benchmark results on AWBench demonstrate superior imaging, motion smoothness, and temporal consistency compared to prior state-of-the-art methods.

DreamActor-M2 is a universal character image animation framework that synthesizes high-fidelity video sequences by transferring motion from a driving sequence to a static reference image. Distinct from previous approaches, DreamActor-M2 eliminates the trade-off between identity preservation and motion fidelity and forgoes explicit pose priors, achieving robust cross-domain generalization for arbitrary characters—including non-humanoid types—via spatiotemporal in-context learning and self-bootstrapped data synthesis (Luo et al., 29 Jan 2026).

1. Problem Formulation and Motivation

Character image animation requires generating an output sequence $\hat{Y} \in \mathbb{R}^{T \times H \times W \times 3}$ where the subject from a static reference image %%%%1%%%% performs the motions observed in a driving video $D \in \mathbb{R}^{T \times H \times W \times 3}$ . Previous approaches struggled with two central issues:

See-saw Trade-off: Methods reliant on channel-wise pose injection (e.g., skeleton concatenation) enforce strong spatial alignment but leak reference structure ("shape leakage"), deforming identity. Cross-attention injection, while decoupling identity, compresses motion signals, reducing temporal detail fidelity.
Pose Prior Dependence: Most frameworks rely on explicit pose priors (e.g., 2D skeletons, SMPL), leading to poor generalization to non-humanoid or occluded scenarios. Approaches avoiding direct poses necessitate per-video adaptation or pose-based supervision at training, constraining scalability.

DreamActor-M2 addresses both, introducing a unified latent space for identity and motion without explicit pose signals through its two-stage design (Luo et al., 29 Jan 2026).

2. Model Design: Two-Stage Paradigm

2.1 Stage 1: Spatiotemporal In-Context Learning

Instead of separate modules for motion injection, DreamActor-M2 utilizes a pre-trained video diffusion backbone (Seedance 1.0, MMDiT-transformer). It constructs a spatiotemporal context tensor $C \in \mathbb{R}^{T \times H \times 2W \times 3}$ , which spatially concatenates the reference frame with each driving frame:

At $t = 0$ : $C[0] = I_\text{ref} \oplus D[0]$ (spatial concat, $\oplus$ )
At $t > 0$ : $C[t] = 0 \oplus D[t]$ , where $0 \in \mathbb{R}^{H \times W \times 3}$ is a blank mask

Corresponding binary masks are generated: $M_r$ (reference mask) and $M_m$ (motion mask), concatenated into $M$ . Both $C$ and $M$ are encoded by a 3D VAE into latent $Z \in \mathbb{R}^{T \times h \times 2w \times c}$ . The diffusion transformer, $\epsilon_\theta$ , predicts the original latent from noisy $Z_t$ conditioned on $(Z, Z_\text{noise}, M)$ . The diffusion loss:

$L_\text{diff} = \mathbb{E}_{t,\epsilon} \| \epsilon - \epsilon_\theta(Z_t, \{Z, M\}, t) \|_2^2$

enables joint reasoning about identity and motion.

2.2 Stage 2: Self-Bootstrapped Data Synthesis for End-to-End Training

To enable direct RGB-guided animation without skeletons, Stage 2 synthesizes pseudo cross-identity training triplets using the pose-based model from Stage 1. Pseudocode is:

Dataset D = {}
for each source video V_src in web crawl:
    P_src = PoseExtractor(V_src)
    choose random reference image I_o
    V_o = M_pose(P_src, I_o)
    if QualityFilter(V_o, V_src):
        I_ref = V_src[0]
        D.add( (V_o, I_ref, V_src) )

QualityFilter requires Video-Bench score $>4.5$ plus manual validation. The resulting $\sim$ 60K triplets enable supervision for the end-to-end RGB-driven model, which reconstructs $V_\text{src}$ from the (generated $V_o$ , $I_\text{ref}$ ) context. The model is warm-started from Stage 1 weights to inherit motion priors and ensure sample efficiency.

3. Network Components and Optimization Objectives

Encoder/Decoder: 3D VAE ( $\xi, \xi^{-1}$ ) encodes/decodes from RGB ( $H \times 2W$ ) to latent ( $h \times 2w \times c$ )
Diffusion Transformer ( $\epsilon_\theta$ ): Multimodal MMDiT backbone, stacked along space and time, receives context and mask via channel concatenation at each layer
LoRA Adaptation: Low-rank adapters ( $r=256$ ) in each transformer FFN, allowing backbone parameter freezing

Main losses:

Diffusion Reconstruction (primary): $L_\text{diff} = \mathbb{E}_{x, t, \epsilon} \|\epsilon - \epsilon_\theta(z_t, c, t)\|_2^2$
Optional Identity Consistency (via ArcFace): $L_\text{id} = \mathbb{E}_{I_\text{ref}, \hat{Y}}[1 - \cos(\phi_\text{id}(I_\text{ref}), \phi_\text{id}(\hat{Y}))]$
Optional Motion Consistency: $L_\text{motion} = \mathbb{E}\| \psi(\hat{Y}) - P_\text{tar} \|_2^2$
Adversarial Loss: $L_\text{adv} = \mathbb{E}_\text{real}[-\log D(Y)] + \mathbb{E}_\text{fake}[-\log (1 - D(\hat{Y}))]$

Overall training objective:

$L_\text{total} = \lambda_\text{diff} L_\text{diff} + \lambda_\text{id} L_\text{id} + \lambda_\text{motion} L_\text{motion} + \lambda_\text{adv} L_\text{adv}$

with $\lambda_\text{diff}=1.0, \lambda_\text{id}=0.1, \lambda_\text{motion}=0.1, \lambda_\text{adv}=0.01$ (grid-searched).

4. Benchmarking: AWBench and Evaluation Metrics

AWBench ("Animate in the Wild") is introduced to facilitate comprehensive, universal character animation evaluation:

Component	Description
Driving corpus	100 videos: humans (various part/body types, activities), animals, and cartoons
Reference set	200 static images, matching categories including multi-subject scenes
Scenarios	One-to-one, one-to-many, many-to-many cross-identity transfers

Evaluation Metrics:

Video-Bench human-aligned automatic scores Han et al. 2025: Imaging Quality, Motion Smoothness, Temporal Consistency, Appearance Consistency
Human User Study: 12 participants, 5-point scale ratings for imaging, motion, and appearance
Common generative metrics (not directly applicable in cross-identity settings): FID, LPIPS, ID-Score

5. Experimental Analysis

DreamActor-M2 demonstrates state-of-the-art performance on AWBench:

Pose-based DreamActor-M2: 4.68/4.53/4.61/4.28 (Imaging/Motion/Temporal/Appearance) on Video-Bench; previous DreamActor-M1 and other SOTA $\leq 4.21$ and $\leq 4.06$ respectively
End-to-end DreamActor-M2: 4.72/4.56/4.69/4.35, further surpassing Stage 1
Human studies: 4.27 ± 0.18 (Imaging), 4.24 ± 0.23 (Motion), 4.20 ± 0.29 (Appearance), exceeding all baselines by $\geq$ 0.3 points
Platform-level (GSB): +9.66% over commercial Kling 2.6, +51% over DreamActor-M1

Noteworthy findings:

Fine-grained hand/body preservation across domains
Robustness to incomplete driving signals (e.g., hallucinated lower body from half-body input)
Effective multi-subject and non-human (animal $\rightarrow$ animal, cartoon $\rightarrow$ cartoon) animation
Ablations: Removing spatiotemporal ICL, pose augmentation, or text guidance each degrade specific metrics; end-to-end outperforms pose-based in ambiguous cases

6. Limitations and Future Directions

DreamActor-M2 exhibits failure cases in complex multi-person interactions (e.g., interlocking/orbiting trajectories) due to insufficient training data covering such scenarios. The architecture is computationally intensive (3D VAE + transformer, $\geq$ 24 GB GPU memory, $\sim$ 4 s per 16-frame clip). Future research priorities:

Curation of multi-person interaction datasets
Exploration of architectures with sparser attention or dynamic tokenization for efficiency
Integration of 3D scene priors or trajectory-aware modules to better model character interactions and crossing paths

DreamActor-M2 establishes a unified, plug-and-play framework for universal character image animation, balancing identity fidelity and motion realism via spatiotemporal in-context learning while eliminating reliance on explicit pose priors (Luo et al., 29 Jan 2026).

Markdown Upgrade to Chat

References (1)

DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to DreamActor-M2.

DreamActor-M2: Universal Image Animation

1. Problem Formulation and Motivation

2. Model Design: Two-Stage Paradigm

2.1 Stage 1: Spatiotemporal In-Context Learning

2.2 Stage 2: Self-Bootstrapped Data Synthesis for End-to-End Training

3. Network Components and Optimization Objectives

4. Benchmarking: AWBench and Evaluation Metrics

5. Experimental Analysis

6. Limitations and Future Directions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research

DreamActor-M2: Universal Image Animation

1. Problem Formulation and Motivation

2. Model Design: Two-Stage Paradigm

2.1 Stage 1: Spatiotemporal In-Context Learning

2.2 Stage 2: Self-Bootstrapped Data Synthesis for End-to-End Training

3. Network Components and Optimization Objectives

4. Benchmarking: AWBench and Evaluation Metrics

5. Experimental Analysis

6. Limitations and Future Directions

Topic to Video (Beta)

Whiteboard

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research