DreamActor-M2: Universal Image Animation
- DreamActor-M2 is a universal character image animation framework that transfers motion from driving videos to static reference images without using explicit pose priors.
- It employs a two-stage design combining spatiotemporal in-context learning with self-bootstrapped data synthesis to balance identity preservation and motion fidelity.
- Benchmark results on AWBench demonstrate superior imaging, motion smoothness, and temporal consistency compared to prior state-of-the-art methods.
DreamActor-M2 is a universal character image animation framework that synthesizes high-fidelity video sequences by transferring motion from a driving sequence to a static reference image. Distinct from previous approaches, DreamActor-M2 eliminates the trade-off between identity preservation and motion fidelity and forgoes explicit pose priors, achieving robust cross-domain generalization for arbitrary characters—including non-humanoid types—via spatiotemporal in-context learning and self-bootstrapped data synthesis (Luo et al., 29 Jan 2026).
1. Problem Formulation and Motivation
Character image animation requires generating an output sequence $\hat{V}$ in which the subject from a static reference image $I_{\text{ref}}$ performs the motions observed in a driving video $V_d$. Previous approaches struggled with two central issues:
- See-saw Trade-off: Methods reliant on channel-wise pose injection (e.g., skeleton concatenation) enforce strong spatial alignment but leak reference structure ("shape leakage"), deforming identity. Cross-attention injection, while decoupling identity, compresses motion signals, reducing temporal detail fidelity.
- Pose Prior Dependence: Most frameworks rely on explicit pose priors (e.g., 2D skeletons, SMPL), leading to poor generalization to non-humanoid or occluded scenarios. Approaches avoiding direct poses necessitate per-video adaptation or pose-based supervision at training, constraining scalability.
DreamActor-M2 addresses both, introducing a unified latent space for identity and motion without explicit pose signals through its two-stage design (Luo et al., 29 Jan 2026).
2. Model Design: Two-Stage Paradigm
2.1 Stage 1: Spatiotemporal In-Context Learning
Instead of separate modules for motion injection, DreamActor-M2 utilizes a pre-trained video diffusion backbone (Seedance 1.0, MMDiT transformer). It constructs a spatiotemporal context tensor $C$ that spatially concatenates the reference frame with each driving frame $D_t$:
- At $t = 0$: $C_0 = [I_{\text{ref}} \,\|\, D_0]$ (spatial concatenation along the width axis)
- At $t > 0$: $C_t = [B \,\|\, D_t]$, where $B$ is a blank placeholder
Corresponding binary masks are generated: $m^{\text{ref}} = 0$ over the reference region and $m^{\text{mot}} = 1$ over the motion region, concatenated into a mask tensor $m$. Both $C$ and $m$ are encoded by a 3D VAE into a conditioning latent $z_c$. The diffusion transformer $\epsilon_\theta$ predicts the original latent from the noisy latent $z_t$ conditioned on $z_c$. The diffusion loss

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\, \epsilon,\, t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, z_c) \right\rVert_2^2\right]$$

enables joint reasoning about identity and motion.
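The context construction above can be sketched as follows. This is a minimal illustration of the described layout; the concrete tensor arrangement, blank-placeholder value, and mask convention are assumptions, not the paper's exact implementation.

```python
import numpy as np

def build_context(ref_frame, driving_frames):
    """Spatiotemporal in-context construction (illustrative sketch).

    ref_frame:      (H, W, 3) static reference image
    driving_frames: (T, H, W, 3) driving video
    Returns a context tensor (T, H, 2W, 3) and a binary mask (T, H, 2W, 1):
    0 over the reference half, 1 over the motion half.
    """
    T, H, W, C = driving_frames.shape
    blank = np.zeros_like(ref_frame)  # placeholder used for t > 0 (assumed zeros)
    ctx = np.empty((T, H, 2 * W, C), dtype=driving_frames.dtype)
    mask = np.zeros((T, H, 2 * W, 1), dtype=np.float32)
    for t in range(T):
        left = ref_frame if t == 0 else blank   # reference appears only at t = 0
        ctx[t] = np.concatenate([left, driving_frames[t]], axis=1)
        mask[t, :, W:, 0] = 1.0                 # motion region flagged with 1
    return ctx, mask
```

In the full model, `ctx` and `mask` would be channel-concatenated and passed through the 3D VAE before conditioning the diffusion transformer.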
2.2 Stage 2: Self-Bootstrapped Data Synthesis for End-to-End Training
To enable direct RGB-guided animation without skeletons, Stage 2 synthesizes pseudo cross-identity training triplets using the pose-based model from Stage 1. In outline: a reference image of one identity is paired with a driving video of a different identity, the Stage-1 model renders the cross-identity animation, and a quality filter retains only high-scoring results.
QualityFilter requires a Video-Bench score above a fixed threshold plus manual validation. The resulting 160K triplets supervise the end-to-end RGB-driven model, which learns to reconstruct the pseudo ground-truth video from the (reference image, driving RGB) context. The model is warm-started from Stage 1 weights to inherit motion priors and ensure sample efficiency.
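The synthesis loop can be sketched as below. Here `pose_model`, `score_fn`, and `threshold` are hypothetical stand-ins for the Stage-1 pose-based animator, the automatic quality scorer (Video-Bench in the paper), and its acceptance cutoff; the sampling strategy is an assumption.

```python
import random

def synthesize_triplets(pose_model, score_fn, refs, drivers, threshold, n_target):
    """Self-bootstrapped cross-identity triplet synthesis (illustrative sketch).

    Pairs a reference identity with a driving video of another identity,
    renders the animation with the Stage-1 model, and keeps only results
    passing the automatic quality filter (manual validation omitted here).
    """
    triplets = []
    while len(triplets) < n_target:
        ref = random.choice(refs)        # identity A: static reference image
        drv = random.choice(drivers)     # identity B: motion source video
        gen = pose_model(ref, drv)       # pseudo cross-identity result
        if score_fn(gen) >= threshold:   # automatic quality filter
            triplets.append((ref, drv, gen))  # (reference, driving, target)
    return triplets
```

Each accepted triplet supplies (reference, driving RGB, pseudo ground-truth) supervision for the end-to-end model.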
3. Network Components and Optimization Objectives
- Encoder/Decoder: a 3D VAE encodes from RGB pixel space into the latent space and decodes back
- Diffusion Transformer ($\epsilon_\theta$): multimodal MMDiT backbone, stacked along space and time, receiving context and mask via channel concatenation at each layer
- LoRA Adaptation: low-rank adapters in each transformer FFN, allowing the backbone parameters to remain frozen
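The LoRA mechanism on a frozen linear layer can be illustrated as follows. The rank, scaling, and initialization shown are generic LoRA conventions, not the paper's reported settings.

```python
import numpy as np

class LoRALinear:
    """Low-rank adapter around a frozen weight matrix W (illustrative sketch)."""

    def __init__(self, W, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen backbone weight
        self.A = rng.normal(0, 0.02, (rank, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, rank))              # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * (x A^T) B^T
        # B is zero at init, so the adapter starts as the identity perturbation.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because only `A` and `B` receive gradients, the MMDiT backbone stays frozen while the adapters specialize it to the animation task.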
Main losses:
- Diffusion Reconstruction (primary): $\mathcal{L}_{\text{diff}}$, the epsilon-prediction MSE defined in Stage 1
- Optional Identity Consistency (via ArcFace embeddings): $\mathcal{L}_{\text{id}}$, penalizing cosine distance between face embeddings of the output and the reference
- Optional Motion Consistency: $\mathcal{L}_{\text{mot}}$, aligning motion features of the output with those of the driving video
- Adversarial Loss: $\mathcal{L}_{\text{adv}}$
Overall training objective:

$$\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}} + \lambda_{\text{mot}}\,\mathcal{L}_{\text{mot}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}$$

with the weights $\lambda_{\text{id}}, \lambda_{\text{mot}}, \lambda_{\text{adv}}$ selected by grid search.
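The objective can be sketched numerically as below; the weight values are placeholders for the grid-searched lambdas, which the summary does not reproduce.

```python
import numpy as np

def diffusion_loss(eps, eps_pred):
    """Epsilon-prediction MSE, the primary reconstruction term."""
    return float(np.mean((eps - eps_pred) ** 2))

def training_objective(losses, weights):
    """Weighted sum of the loss terms.

    losses:  dict of computed loss values, e.g. {"diff": ..., "id": ...}
    weights: dict of lambda values with the same keys (placeholder values;
             the paper's grid-searched weights are not reproduced here)
    """
    return sum(w * losses[k] for k, w in weights.items())
```

In practice `diff` would carry weight 1.0 and the auxiliary terms small lambdas, with the identity and motion terms enabled only when their guidance helps.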
4. Benchmarking: AWBench and Evaluation Metrics
AWBench ("Animate in the Wild") is introduced to facilitate comprehensive, universal character animation evaluation:
| Component | Description |
|---|---|
| Driving corpus | 100 videos: humans (various part/body types, activities), animals, and cartoons |
| Reference set | 200 static images, matching categories including multi-subject scenes |
| Scenarios | One-to-one, one-to-many, many-to-many cross-identity transfers |
Evaluation Metrics:
- Video-Bench human-aligned automatic scores (Han et al., 2025): Imaging Quality, Motion Smoothness, Temporal Consistency, Appearance Consistency
- Human User Study: 12 participants, 5-point scale ratings for imaging, motion, and appearance
- Common generative metrics (not directly applicable in cross-identity settings): FID, LPIPS, ID-Score
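As a concrete example of what a temporal-consistency score measures, the sketch below computes mean cosine similarity between consecutive per-frame feature vectors. This is an illustrative proxy, not Video-Bench's actual implementation, and the feature extractor is left abstract.

```python
import numpy as np

def temporal_consistency(frame_feats):
    """Mean cosine similarity of consecutive frame embeddings (illustrative proxy).

    frame_feats: (T, D) array of per-frame feature vectors, e.g. from a
    pretrained visual encoder. Returns a value in [-1, 1]; higher means
    smoother frame-to-frame appearance.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    return float(np.mean(np.sum(f[:-1] * f[1:], axis=1)))
```

A static clip scores 1.0; abrupt appearance flips pull the score toward zero or below.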
5. Experimental Analysis
DreamActor-M2 demonstrates state-of-the-art performance on AWBench:
- Pose-based DreamActor-M2: 4.68/4.53/4.61/4.28 (Imaging/Motion/Temporal/Appearance) on Video-Bench, outperforming the previous DreamActor-M1 and other state-of-the-art baselines
- End-to-end DreamActor-M2: 4.72/4.56/4.69/4.35, further surpassing Stage 1
- Human studies: 4.27 ± 0.18 (Imaging), 4.24 ± 0.23 (Motion), 4.20 ± 0.29 (Appearance), exceeding all baselines
- Platform-level (GSB): +9.66% over commercial Kling 2.6, +51% over DreamActor-M1
Noteworthy findings:
- Fine-grained hand/body preservation across domains
- Robustness to incomplete driving signals (e.g., hallucinated lower body from half-body input)
- Effective multi-subject and non-human (animal→animal, cartoon→cartoon) animation
- Ablations: Removing spatiotemporal ICL, pose augmentation, or text guidance each degrade specific metrics; end-to-end outperforms pose-based in ambiguous cases
6. Limitations and Future Directions
DreamActor-M2 exhibits failure cases in complex multi-person interactions (e.g., interlocking/orbiting trajectories) due to insufficient training data covering such scenarios. The architecture is computationally intensive (3D VAE + transformer, 124 GB GPU memory, 24 s per 16-frame clip). Future research priorities:
- Curation of multi-person interaction datasets
- Exploration of architectures with sparser attention or dynamic tokenization for efficiency
- Integration of 3D scene priors or trajectory-aware modules to better model character interactions and crossing paths
DreamActor-M2 establishes a unified, plug-and-play framework for universal character image animation, balancing identity fidelity and motion realism via spatiotemporal in-context learning while eliminating reliance on explicit pose priors (Luo et al., 29 Jan 2026).