DreamActor-M2: Universal Image Animation

Updated 5 February 2026
  • DreamActor-M2 is a universal character image animation framework that transfers motion from driving videos to static reference images without using explicit pose priors.
  • It employs a two-stage design combining spatiotemporal in-context learning with self-bootstrapped data synthesis to balance identity preservation and motion fidelity.
  • Benchmark results on AWBench demonstrate superior imaging, motion smoothness, and temporal consistency compared to prior state-of-the-art methods.

DreamActor-M2 is a universal character image animation framework that synthesizes high-fidelity video sequences by transferring motion from a driving sequence to a static reference image. Distinct from previous approaches, DreamActor-M2 eliminates the trade-off between identity preservation and motion fidelity and forgoes explicit pose priors, achieving robust cross-domain generalization for arbitrary characters—including non-humanoid types—via spatiotemporal in-context learning and self-bootstrapped data synthesis (Luo et al., 29 Jan 2026).

1. Problem Formulation and Motivation

Character image animation requires generating an output sequence $\hat{Y} \in \mathbb{R}^{T \times H \times W \times 3}$ in which the subject from a static reference image $I_\text{ref} \in \mathbb{R}^{H \times W \times 3}$ performs the motions observed in a driving video $D \in \mathbb{R}^{T \times H \times W \times 3}$. Previous approaches struggled with two central issues:

  • See-saw Trade-off: Methods reliant on channel-wise pose injection (e.g., skeleton concatenation) enforce strong spatial alignment but leak reference structure ("shape leakage"), deforming identity. Cross-attention injection, while decoupling identity, compresses motion signals, reducing temporal detail fidelity.
  • Pose Prior Dependence: Most frameworks rely on explicit pose priors (e.g., 2D skeletons, SMPL), leading to poor generalization to non-humanoid or occluded scenarios. Approaches avoiding direct poses necessitate per-video adaptation or pose-based supervision at training, constraining scalability.

DreamActor-M2 addresses both, introducing a unified latent space for identity and motion without explicit pose signals through its two-stage design (Luo et al., 29 Jan 2026).

2. Model Design: Two-Stage Paradigm

2.1 Stage 1: Spatiotemporal In-Context Learning

Instead of separate modules for motion injection, DreamActor-M2 utilizes a pre-trained video diffusion backbone (Seedance 1.0, an MMDiT transformer). It constructs a spatiotemporal context tensor $C \in \mathbb{R}^{T \times H \times 2W \times 3}$, which spatially concatenates the reference frame with each driving frame:

  • At $t = 0$: $C[0] = I_\text{ref} \oplus D[0]$, where $\oplus$ denotes spatial (width-wise) concatenation
  • At $t > 0$: $C[t] = 0 \oplus D[t]$, where $0 \in \mathbb{R}^{H \times W \times 3}$ is a blank placeholder
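The context construction above can be sketched in NumPy; the `build_context` helper below is an illustrative stand-in, not the paper's code:

```python
import numpy as np

def build_context(ref_img, driving, blank_value=0.0):
    """Spatially concatenate the reference frame (t = 0 only) with each
    driving frame, mirroring the Stage-1 context construction.

    ref_img:  (H, W, 3) static reference image
    driving:  (T, H, W, 3) driving video
    returns:  (T, H, 2W, 3) spatiotemporal context tensor
    """
    T, H, W, _ = driving.shape
    blank = np.full((H, W, 3), blank_value, dtype=driving.dtype)
    frames = []
    for t in range(T):
        left = ref_img if t == 0 else blank  # reference appears only at t = 0
        frames.append(np.concatenate([left, driving[t]], axis=1))  # width-wise
    return np.stack(frames, axis=0)

C = build_context(np.ones((4, 6, 3)), np.zeros((5, 4, 6, 3)))
print(C.shape)  # (5, 4, 12, 3)
```

Because the reference occupies only the left half of frame 0, the model must propagate identity forward in time through attention rather than through a per-frame injection channel.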

Corresponding binary masks are generated: a reference mask marking the reference half at $t = 0$ and a motion mask marking the driving half, concatenated into a single mask tensor $M$. Both $C$ and $M$ are encoded by a 3D VAE into a latent $z$. The diffusion transformer $\epsilon_\theta$ predicts the original latent from its noised version $z_t$, conditioned on the encoded context and mask. The diffusion loss

$$\mathcal{L}_\text{diff} = \mathbb{E}_{z, \epsilon, t}\big[\, \lVert \epsilon_\theta(z_t, t, C, M) - \epsilon \rVert_2^2 \,\big]$$

enables joint reasoning about identity and motion.
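A toy version of the diffusion training loss (standard $\epsilon$-prediction with an illustrative noise level; not the paper's exact schedule or conditioning):

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(z0, eps_theta, alpha_bar=0.7):
    """Noise a clean latent z0, then score the model's noise prediction.

    eps_theta: callable (z_t, alpha_bar) -> predicted noise, same shape as z0.
    Returns the mean-squared error between predicted and true noise.
    """
    eps = rng.standard_normal(z0.shape)                       # true noise
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1 - alpha_bar) * eps
    return np.mean((eps_theta(z_t, alpha_bar) - eps) ** 2)

# A dummy zero-predictor: its loss is just the mean squared noise, so > 0.
loss = diffusion_loss(np.zeros((2, 8)), lambda z_t, a: np.zeros_like(z_t))
print(loss > 0)  # True
```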

2.2 Stage 2: Self-Bootstrapped Data Synthesis for End-to-End Training

To enable direct RGB-guided animation without skeletons, Stage 2 synthesizes pseudo cross-identity training triplets using the pose-based model from Stage 1: motion extracted from a source video is used to animate a different reference identity, and the generated clip then serves as the driving signal while the original video provides ground truth.

A QualityFilter step requires a Video-Bench score above a threshold plus manual validation. The resulting ~60K triplets supervise the end-to-end RGB-driven model, which reconstructs the ground-truth video from the (generated driving clip, reference image) context. The model is warm-started from Stage 1 weights to inherit motion priors and ensure sample efficiency.
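The synthesis loop can be sketched as follows; `animate_with_pose`, `video_bench_score`, and the 0.8 threshold are hypothetical stand-ins for the Stage-1 model and the QualityFilter described above:

```python
import random

# Hypothetical stand-ins: the real Stage-1 animator and Video-Bench
# scorer are heavyweight models, stubbed here so the loop is runnable.
def animate_with_pose(reference, motion_source):
    return {"frames": f"{reference}-moving-like-{motion_source}"}

def video_bench_score(video):
    return random.random()  # stand-in quality score in [0, 1]

def synthesize_triplets(videos, references, threshold=0.8):
    """Build pseudo cross-identity (driving, reference, target) triplets."""
    triplets = []
    for src in videos:
        ref = random.choice(references)               # a different identity
        driven = animate_with_pose(ref, src)          # cross-identity clip
        if video_bench_score(driven) >= threshold:    # QualityFilter
            # Generated clip drives; the original video is the target.
            triplets.append((driven, ref, src))
    return triplets

random.seed(0)
data = synthesize_triplets(["vidA", "vidB", "vidC"], ["refX", "refY"])
print(len(data) <= 3)  # True
```

The key property is that every surviving triplet pairs an RGB driving clip with a pixel-level ground-truth target, so no pose extraction is needed at training or inference time.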

3. Network Components and Optimization Objectives

  • Encoder/Decoder: a 3D VAE encodes RGB frames into a compressed spatiotemporal latent and decodes generated latents back to RGB
  • Diffusion Transformer: multimodal MMDiT backbone, stacked along space and time, receives context and mask via channel concatenation at each layer
  • LoRA Adaptation: low-rank adapters in each transformer FFN, allowing the backbone parameters to remain frozen
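A minimal LoRA-style adapter in NumPy; the rank, scaling, and placement are illustrative assumptions, not values from the paper:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))     # frozen backbone weight
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-proj
        self.B = np.zeros((d_out, r))                   # trainable up-proj, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus low-rank residual path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(16, 16)
x = np.ones((2, 16))
# With B zero-initialized, the adapter starts as an exact no-op,
# so fine-tuning begins from the pre-trained backbone's behavior.
print(np.allclose(layer(x), x @ layer.W.T))  # True
```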

Main losses:

  1. Diffusion Reconstruction (primary): the denoising objective $\mathcal{L}_\text{diff}$ from Stage 1
  2. Optional Identity Consistency: $\mathcal{L}_\text{id}$, an ArcFace embedding distance between generated and reference faces
  3. Optional Motion Consistency: $\mathcal{L}_\text{mot}$, penalizing deviation from the driving motion
  4. Adversarial Loss: $\mathcal{L}_\text{adv}$

Overall training objective:

$$\mathcal{L} = \mathcal{L}_\text{diff} + \lambda_\text{id}\,\mathcal{L}_\text{id} + \lambda_\text{mot}\,\mathcal{L}_\text{mot} + \lambda_\text{adv}\,\mathcal{L}_\text{adv}$$

with the weights $\lambda$ selected by grid search.
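The weighted objective can be sketched directly; the λ values and loss numbers below are illustrative placeholders, since the paper selects its weights by grid search:

```python
def total_loss(losses, lam_id=0.1, lam_mot=0.1, lam_adv=0.05):
    """L = L_diff + λ_id·L_id + λ_mot·L_mot + λ_adv·L_adv.

    `losses` maps component names to scalar values; the λ defaults
    here are made-up, not the paper's grid-searched weights.
    """
    return (losses["diff"] + lam_id * losses["id"]
            + lam_mot * losses["mot"] + lam_adv * losses["adv"])

# Toy component values for a single training step.
val = {"diff": 0.4, "id": 0.1, "mot": 0.2, "adv": 0.05}
print(round(total_loss(val), 4))
```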

4. Benchmarking: AWBench and Evaluation Metrics

AWBench ("Animate in the Wild") is introduced to facilitate comprehensive, universal character animation evaluation:

  • Driving corpus: 100 videos spanning humans (various body coverage and activities), animals, and cartoons
  • Reference set: 200 static images across matching categories, including multi-subject scenes
  • Scenarios: one-to-one, one-to-many, and many-to-many cross-identity transfers

Evaluation Metrics:

  • Video-Bench human-aligned automatic scores (Han et al., 2025): Imaging Quality, Motion Smoothness, Temporal Consistency, Appearance Consistency
  • Human User Study: 12 participants, 5-point scale ratings for imaging, motion, and appearance
  • Common generative metrics (not directly applicable in cross-identity settings): FID, LPIPS, ID-Score

5. Experimental Analysis

DreamActor-M2 demonstrates state-of-the-art performance on AWBench:

  • Pose-based DreamActor-M2: 4.68/4.53/4.61/4.28 (Imaging/Motion/Temporal/Appearance) on Video-Bench, outperforming the previous DreamActor-M1 and other state-of-the-art baselines on all four axes
  • End-to-end DreamActor-M2: 4.72/4.56/4.69/4.35, further surpassing Stage 1
  • Human studies: 4.27 ± 0.18 (Imaging), 4.24 ± 0.23 (Motion), 4.20 ± 0.29 (Appearance), exceeding all baselines by roughly 0.3 points
  • Platform-level (GSB): +9.66% over commercial Kling 2.6, +51% over DreamActor-M1

Noteworthy findings:

  • Fine-grained hand/body preservation across domains
  • Robustness to incomplete driving signals (e.g., hallucinated lower body from half-body input)
  • Effective multi-subject and non-human (animal→animal, cartoon→cartoon) animation
  • Ablations: Removing spatiotemporal ICL, pose augmentation, or text guidance each degrade specific metrics; end-to-end outperforms pose-based in ambiguous cases

6. Limitations and Future Directions

DreamActor-M2 exhibits failure cases in complex multi-person interactions (e.g., interlocking or orbiting trajectories) due to insufficient training data covering such scenarios. The architecture is also computationally intensive (3D VAE + transformer, roughly 24 GB of GPU memory and about 4 s per 16-frame clip). Future research priorities:

  • Curation of multi-person interaction datasets
  • Exploration of architectures with sparser attention or dynamic tokenization for efficiency
  • Integration of 3D scene priors or trajectory-aware modules to better model character interactions and crossing paths

DreamActor-M2 establishes a unified, plug-and-play framework for universal character image animation, balancing identity fidelity and motion realism via spatiotemporal in-context learning while eliminating reliance on explicit pose priors (Luo et al., 29 Jan 2026).
