DreamActor-M2: Universal Image Animation
- DreamActor-M2 is a universal character image animation framework that transfers motion from driving videos to static reference images without using explicit pose priors.
- It employs a two-stage design combining spatiotemporal in-context learning with self-bootstrapped data synthesis to balance identity preservation and motion fidelity.
- Benchmark results on AWBench demonstrate superior imaging, motion smoothness, and temporal consistency compared to prior state-of-the-art methods.
DreamActor-M2 is a universal character image animation framework that synthesizes high-fidelity video sequences by transferring motion from a driving sequence to a static reference image. Distinct from previous approaches, DreamActor-M2 eliminates the trade-off between identity preservation and motion fidelity and forgoes explicit pose priors, achieving robust cross-domain generalization for arbitrary characters—including non-humanoid types—via spatiotemporal in-context learning and self-bootstrapped data synthesis (Luo et al., 29 Jan 2026).
1. Problem Formulation and Motivation
Character image animation requires generating an output sequence in which the subject from a static reference image $I_{\text{ref}}$ performs the motions observed in a driving video $V_{\text{src}}$. Previous approaches struggled with two central issues:
- See-saw Trade-off: Methods reliant on channel-wise pose injection (e.g., skeleton concatenation) enforce strong spatial alignment but leak reference structure ("shape leakage"), deforming identity. Cross-attention injection, while decoupling identity, compresses motion signals, reducing temporal detail fidelity.
- Pose Prior Dependence: Most frameworks rely on explicit pose priors (e.g., 2D skeletons, SMPL), leading to poor generalization to non-humanoid or occluded scenarios. Approaches avoiding direct poses necessitate per-video adaptation or pose-based supervision at training, constraining scalability.
DreamActor-M2 addresses both, introducing a unified latent space for identity and motion without explicit pose signals through its two-stage design (Luo et al., 29 Jan 2026).
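Stated compactly, using notation introduced here for exposition (the generator $G_\theta$ and the shorthand operators $\mathrm{id}(\cdot)$ and $\mathrm{mot}(\cdot)$ are not symbols from the paper): given $I_{\text{ref}}$ and a driving video $V_{\text{src}} = (d_0, \dots, d_T)$, the model must produce

$$\hat{V} = G_\theta(I_{\text{ref}}, V_{\text{src}}) = (\hat{v}_0, \dots, \hat{v}_T), \qquad \mathrm{id}(\hat{v}_t) \approx \mathrm{id}(I_{\text{ref}}), \quad \mathrm{mot}(\hat{v}_t) \approx \mathrm{mot}(d_t) \;\; \forall t,$$

without extracting an explicit pose representation from $V_{\text{src}}$.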
2. Model Design: Two-Stage Paradigm
2.1 Stage 1: Spatiotemporal In-Context Learning
Instead of separate modules for motion injection, DreamActor-M2 utilizes a pre-trained video diffusion backbone (Seedance 1.0, MMDiT transformer). It constructs a spatiotemporal context tensor $C = (c_0, c_1, \dots, c_T)$ that spatially concatenates the reference frame with the driving sequence:
- At $t = 0$: $c_0 = [\,I_{\text{ref}}\,;\,d_0\,]$ (spatial concatenation along the width axis)
- At $t > 0$: $c_t = [\,B\,;\,d_t\,]$, where $B$ is a blank placeholder of the same size as $I_{\text{ref}}$
Corresponding binary masks are generated: $m_{\text{ref}}$ (reference mask, marking the reference region) and $m_{\text{mot}}$ (motion mask, marking the driving region), concatenated into $M$. Both $C$ and $M$ are encoded by a 3D VAE $\mathcal{E}$ into a conditioning latent $z_c$. The diffusion transformer $f_\theta$ predicts the original latent $z_0$ from the noisy latent $z_t$ conditioned on $z_c$. The diffusion loss

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\, t,\, \epsilon}\Big[\,\big\|\, f_\theta(z_t, t, z_c) - z_0 \,\big\|_2^2\,\Big]$$

enables joint reasoning about identity and motion.
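A minimal sketch of this context construction and training step, assuming a PyTorch-style interface; the width-wise layout, the two-channel mask, the `vae.encode`/`dit(...)` call signatures, and the interpolation-based noising are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn.functional as F

def build_context(ref_img, drv_frames):
    """Build the spatiotemporal context C and binary masks M (illustrative layout).

    ref_img:    (C, H, W)     static reference image
    drv_frames: (T, C, H, W)  driving video frames
    Returns:
        context: (T, C, H, 2W)  [reference | driving] at t=0, [blank | driving] at t>0
        mask:    (T, 2, H, 2W)  channel 0 = reference mask, channel 1 = motion mask
    """
    T, C, H, W = drv_frames.shape
    left = torch.zeros(T, C, H, W)
    left[0] = ref_img                                  # reference appears only in the first frame
    context = torch.cat([left, drv_frames], dim=-1)    # width-wise spatial concatenation
    mask = torch.zeros(T, 2, H, 2 * W)
    mask[0, 0, :, :W] = 1.0                            # reference mask: left half of frame 0
    mask[:, 1, :, W:] = 1.0                            # motion mask: right half of every frame
    return context, mask

def stage1_step(dit, vae, ref_img, drv_frames, target_frames):
    """One Stage-1 training step in x0-prediction form (simplified sketch)."""
    context, mask = build_context(ref_img, drv_frames)
    z_c = vae.encode(torch.cat([context, mask], dim=1))  # context + masks -> conditioning latent
    z0 = vae.encode(target_frames)                       # clean latent of the target video
    t = torch.rand(())                                   # diffusion time in [0, 1)
    z_t = (1 - t) * z0 + t * torch.randn_like(z0)        # simple interpolation noising (illustrative)
    z0_pred = dit(z_t, t, cond=z_c)                      # transformer predicts the clean latent z0
    return F.mse_loss(z0_pred, z0)
```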
2.2 Stage 2: Self-Bootstrapped Data Synthesis for End-to-End Training
To enable direct RGB-guided animation without skeletons, Stage 2 synthesizes pseudo cross-identity training triplets using the pose-based model $M_{\text{pose}}$ from Stage 1. In pseudocode:

```python
# Self-bootstrapped synthesis of pseudo cross-identity triplets (Stage 2 data)
D = []                                      # dataset of training triplets
for V_src in web_crawled_videos:            # real source videos from a web crawl
    P_src = PoseExtractor(V_src)            # driving pose sequence of the source
    I_o = sample_reference_image()          # random, unrelated reference identity
    V_o = M_pose(P_src, I_o)                # Stage-1 pose-based model renders a pseudo video
    if QualityFilter(V_o, V_src):           # Video-Bench score + manual validation
        I_ref = V_src[0]                    # first frame of the source as the reference
        D.append((V_o, I_ref, V_src))       # triplet: (pseudo driving, reference, target)
```
QualityFilter requires a sufficiently high Video-Bench score plus manual validation. The resulting 60K triplets provide supervision for the end-to-end RGB-driven model, which reconstructs $V_{\text{src}}$ from the context formed by the generated $V_o$ and $I_{\text{ref}}$. The model is warm-started from Stage 1 weights to inherit motion priors and ensure sample efficiency.
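Under the same assumptions, the Stage-2 end-to-end update can be sketched by reusing the hypothetical `stage1_step` helper from the Stage-1 sketch above, swapping the pose-driven conditioning for the synthesized RGB video:

```python
def stage2_step(dit, vae, triplet):
    """End-to-end RGB-driven training step on one pseudo cross-identity triplet.

    triplet = (V_o, I_ref, V_src):
        V_o    -- pseudo driving video rendered by the Stage-1 pose-based model
        I_ref  -- reference image (first frame of the real source video)
        V_src  -- real source video, used as the reconstruction target
    """
    V_o, I_ref, V_src = triplet
    # Same objective as Stage 1, but the driving signal is now raw RGB (V_o)
    # rather than an extracted pose sequence; weights are warm-started from Stage 1.
    return stage1_step(dit, vae, ref_img=I_ref, drv_frames=V_o, target_frames=V_src)
```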
3. Network Components and Optimization Objectives
- Encoder/Decoder: a 3D VAE ($\mathcal{E}$/$\mathcal{D}$) encodes RGB frames into the latent space and decodes latents back to RGB
- Diffusion Transformer ($f_\theta$): multimodal MMDiT backbone, stacked along space and time, receives the context latent and mask via channel concatenation at each layer
- LoRA Adaptation: low-rank adapters in each transformer FFN allow the backbone parameters to remain frozen (a minimal sketch follows this list)
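A minimal sketch of the LoRA-style adaptation: a frozen linear projection wrapped with a trainable low-rank update (the rank and scaling values here are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze backbone weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)         # freeze backbone bias
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Wrapping the FFN projections of each MMDiT block in this way leaves the Seedance backbone frozen while adding only a small fraction of trainable parameters.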
Main losses:
- Diffusion Reconstruction (primary): $\mathcal{L}_{\text{diff}}$, as defined in Stage 1
- Optional Identity Consistency (via ArcFace embeddings): $\mathcal{L}_{\text{id}}$
- Optional Motion Consistency: $\mathcal{L}_{\text{mot}}$
- Adversarial Loss: $\mathcal{L}_{\text{adv}}$
Overall training objective:

$$\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}} + \lambda_{\text{mot}}\,\mathcal{L}_{\text{mot}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}},$$

with the weights $\lambda_{\text{id}}$, $\lambda_{\text{mot}}$, $\lambda_{\text{adv}}$ selected by grid search.
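A small sketch of how the weighted objective could be assembled, with the optional terms gated off when unused; the default lambda values below are placeholders, since the paper selects them by grid search:

```python
def total_loss(l_diff, l_id=None, l_mot=None, l_adv=None,
               lambda_id=0.1, lambda_mot=0.1, lambda_adv=0.01):
    """Weighted sum of the training objectives; lambda defaults are placeholders."""
    loss = l_diff                          # primary diffusion reconstruction term
    if l_id is not None:
        loss = loss + lambda_id * l_id     # optional ArcFace identity consistency
    if l_mot is not None:
        loss = loss + lambda_mot * l_mot   # optional motion consistency
    if l_adv is not None:
        loss = loss + lambda_adv * l_adv   # adversarial term
    return loss
```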
4. Benchmarking: AWBench and Evaluation Metrics
AWBench ("Animate in the Wild") is introduced to facilitate comprehensive, universal character animation evaluation:
| Component | Description |
|---|---|
| Driving corpus | 100 videos: humans (various part/body types, activities), animals, and cartoons |
| Reference set | 200 static images, matching categories including multi-subject scenes |
| Scenarios | One-to-one, one-to-many, many-to-many cross-identity transfers |
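To make the benchmark composition concrete, the following hypothetical data layout illustrates AWBench cases and scenario types; the field names, paths, and labels are assumptions for illustration, not the released format:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AWBenchCase:
    """One cross-identity evaluation case (field names are illustrative)."""
    driving_video: str            # path into the 100-video driving corpus
    reference_images: List[str]   # one or more of the 200 static references
    category: str                 # "human", "animal", or "cartoon"
    scenario: str                 # "one-to-one", "one-to-many", or "many-to-many"

# Example: a single-subject cartoon case and a multi-subject transfer
cases = [
    AWBenchCase("driving/cartoon_017.mp4", ["refs/cartoon_fox.png"],
                category="cartoon", scenario="one-to-one"),
    AWBenchCase("driving/human_042.mp4", ["refs/person_a.png", "refs/person_b.png"],
                category="human", scenario="one-to-many"),
]
```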
Evaluation Metrics:
- Video-Bench human-aligned automatic scores (Han et al., 2025): Imaging Quality, Motion Smoothness, Temporal Consistency, Appearance Consistency
- Human User Study: 12 participants, 5-point scale ratings for imaging, motion, and appearance
- Common generative metrics (not directly applicable in cross-identity settings): FID, LPIPS, ID-Score
5. Experimental Analysis
DreamActor-M2 demonstrates state-of-the-art performance on AWBench:
- Pose-based DreamActor-M2: 4.68/4.53/4.61/4.28 (Imaging/Motion/Temporal/Appearance) on Video-Bench, outperforming DreamActor-M1 and other prior state-of-the-art methods across all four axes
- End-to-end DreamActor-M2: 4.72/4.56/4.69/4.35, further surpassing Stage 1
- Human studies: 4.27 ± 0.18 (Imaging), 4.24 ± 0.23 (Motion), 4.20 ± 0.29 (Appearance), exceeding all baselines by 0.3 points
- Platform-level (GSB): +9.66% over commercial Kling 2.6, +51% over DreamActor-M1
Noteworthy findings:
- Fine-grained hand/body preservation across domains
- Robustness to incomplete driving signals (e.g., hallucinated lower body from half-body input)
- Effective multi-subject and non-human (animal→animal, cartoon→cartoon) animation
- Ablations: Removing spatiotemporal ICL, pose augmentation, or text guidance each degrade specific metrics; end-to-end outperforms pose-based in ambiguous cases
6. Limitations and Future Directions
DreamActor-M2 exhibits failure cases in complex multi-person interactions (e.g., interlocking/orbiting trajectories) due to insufficient training data covering such scenarios. The architecture is computationally intensive (3D VAE + transformer, 24 GB GPU memory, 4 s per 16-frame clip). Future research priorities:
- Curation of multi-person interaction datasets
- Exploration of architectures with sparser attention or dynamic tokenization for efficiency
- Integration of 3D scene priors or trajectory-aware modules to better model character interactions and crossing paths
DreamActor-M2 establishes a unified, plug-and-play framework for universal character image animation, balancing identity fidelity and motion realism via spatiotemporal in-context learning while eliminating reliance on explicit pose priors (Luo et al., 29 Jan 2026).