DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Published 29 Jan 2026 in cs.CV and cs.AI | (2601.21716v1)

Abstract: Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a "see-saw" effect, and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AWBench, a versatile benchmark encompassing a wide spectrum of character types and motion scenarios. Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: https://grisoon.github.io/DreamActor-M2/

Summary

  • The paper introduces a novel spatiotemporal in-context learning method to seamlessly blend appearance and motion for enhanced character animation.
  • The paper employs a two-stage training approach with pose-based augmentation and end-to-end pseudo-pair synthesis, ensuring robust identity preservation and motion consistency.
  • The paper validates its approach using AWBench, demonstrating superior imaging quality and temporal consistency over state-of-the-art methods.

Overview and Motivation

DreamActor-M2 (2601.21716) addresses the core challenges in universal character image animation, particularly the simultaneous achievement of strong identity preservation and motion consistency across diverse domains. Traditional motion injection techniques, including pose-based and cross-attention methods, present critical trade-offs: pose-based approaches can degrade identity preservation due to structural leakage, while cross-attention compresses temporal motion details, limiting fidelity. Furthermore, reliance on explicit pose priors constrains generalization, failing in scenarios with non-humanoid subjects and complex multi-agent dynamics. DreamActor-M2 reframes motion conditioning as a spatiotemporal in-context learning problem, inspired by in-context learning (ICL) advances in LLMs and VLMs (Figure 1).

Figure 1: DreamActor-M2 demonstrates generalization by animating heterogeneous characters while maintaining consistent appearance.

Methodology

Spatiotemporal In-Context Learning

The architecture employs Seedance 1.0 as the video generation backbone (MMDiT), leveraging its multimodal generative priors. Motion control signals are concatenated spatiotemporally with the reference image to form a unified input sequence. The model receives composite input frames: (1) the first frame concatenates the reference appearance with the motion anchor; (2) subsequent frames pair motion frames with blank reference masks. Masking strategies distinguish identity and motion regions, facilitating robust context integration (Figure 2).

Figure 2: Schematic overview of the DreamActor-M2 pipeline illustrating unified spatial-temporal concatenation for motion transfer.
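
The summary does not specify the exact tensor layout, so the following is a minimal PyTorch sketch of the spatiotemporal concatenation idea: the first composite frame places the reference appearance beside the motion anchor, later frames pair a blank identity slot with each motion frame, and a binary mask marks which region carries identity. The shapes, the width-wise concatenation axis, and the function name are illustrative assumptions, not the paper's implementation.

```python
import torch

def build_incontext_sequence(reference, motion_frames):
    """Assemble a unified spatiotemporal input sequence (sketch).

    reference:     (C, H, W)    static reference image (identity)
    motion_frames: (T, C, H, W) driving motion sequence
    Returns:
        frames: (T, C, H, 2W)  each frame is [identity slot | motion slot]
        mask:   (T, 1, H, 2W)  1 where the identity region is populated
    """
    T, C, H, W = motion_frames.shape
    blank = torch.zeros(C, H, W)

    frames, mask = [], []
    for t in range(T):
        # First frame pairs the reference appearance with the motion anchor;
        # later frames pair a blank identity slot with the motion frame.
        identity_slot = reference if t == 0 else blank
        frames.append(torch.cat([identity_slot, motion_frames[t]], dim=-1))

        m = torch.zeros(1, H, 2 * W)
        if t == 0:
            m[..., :W] = 1.0  # mark the identity region for the model
        mask.append(m)

    return torch.stack(frames), torch.stack(mask)

# Example usage with dummy tensors
ref = torch.randn(3, 64, 64)
motion = torch.randn(8, 3, 64, 64)
seq, region_mask = build_incontext_sequence(ref, motion)
print(seq.shape, region_mask.shape)  # (8, 3, 64, 128) and (8, 1, 64, 128)
```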

Pose-Based and End-to-End Training Paradigms

The two-stage training paradigm begins with a pose-based variant that uses augmented 2D skeletons, reinforcing motion context while damping semantic identity leakage via bone-length scaling and box normalization. Multimodal LLMs further amplify semantic expressiveness by parsing motion and appearance semantics and fusing them into target-oriented prompts for enhanced controllability.
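
The paper's actual skeleton format and augmentation ranges are not given in this summary; the NumPy sketch below only illustrates the general mechanism for damping shape leakage, randomly rescaling each bone about its parent joint and normalizing the skeleton into a target box. The bone list, joint count, and scale range are hypothetical.

```python
import numpy as np

# Hypothetical parent-child bone pairs over a toy 2D skeleton (indices are illustrative).
BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def scale_bone_lengths(joints, rng, scale_range=(0.8, 1.2)):
    """Randomly rescale each bone about its parent joint (joints: (J, 2) array).

    Descendant joints are not propagated in this simplified sketch.
    """
    out = joints.copy()
    for parent, child in BONES:
        s = rng.uniform(*scale_range)
        out[child] = out[parent] + s * (out[child] - out[parent])
    return out

def normalize_to_box(joints, box):
    """Map the skeleton's bounding box onto a target box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    jmin, jmax = joints.min(axis=0), joints.max(axis=0)
    scale = np.array([x1 - x0, y1 - y0]) / np.maximum(jmax - jmin, 1e-6)
    return (joints - jmin) * scale + np.array([x0, y0])

rng = np.random.default_rng(0)
skeleton = rng.uniform(0, 256, size=(6, 2))
augmented = normalize_to_box(scale_bone_lengths(skeleton, rng), box=(0, 0, 1, 1))
print(augmented.round(3))
```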

Transitioning to the end-to-end variant, DreamActor-M2 leverages self-bootstrapped pseudo-pair synthesis: the pose-based model generates new high-quality videos that serve as supervision targets, paired with the original motion sources. Rigorous dual-stage quality filtering ensures that only motion-consistent and identity-faithful samples are retained. The end-to-end model subsequently learns to infer motion conditions directly from RGB inputs, eliminating explicit pose requirements and extending applicability to arbitrary subjects.
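
The filtering criteria and thresholds are not detailed in this summary; the sketch below outlines the self-bootstrapping loop at a high level, where `pose_model`, `motion_score`, and `identity_score` are placeholder callables standing in for the stage-one pose-based animator and the paper's dual-stage quality filters, and the threshold values are arbitrary.

```python
def synthesize_pseudo_pairs(pose_model, references, driving_videos,
                            motion_score, identity_score,
                            motion_thresh=0.8, identity_thresh=0.8):
    """Curate pseudo cross-identity training pairs (sketch).

    pose_model:      stage-1 pose-based animator, maps (reference, driving) -> video
    motion_score:    callable estimating motion consistency of a generated video
    identity_score:  callable estimating identity faithfulness to the reference
    Returns a list of (driving_video, generated_video) supervision pairs.
    """
    pairs = []
    for ref, drv in zip(references, driving_videos):
        gen = pose_model(ref, drv)

        # Dual-stage quality filtering: keep only samples that are both
        # motion-consistent with the driver and identity-faithful to the reference.
        if motion_score(gen, drv) < motion_thresh:
            continue
        if identity_score(gen, ref) < identity_thresh:
            continue

        # The end-to-end model is then trained to map the raw RGB driver
        # (plus the reference) directly to the curated target video.
        pairs.append((drv, gen))
    return pairs
```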

Benchmark: AWBench

DreamActor-M2 introduces AWBench, a comprehensive evaluation benchmark with 100 driving videos and 200 reference images, spanning humans, animals, animated/cartoon subjects, and multi-agent scenarios. It accommodates one-to-one, one-to-many, and many-to-many mappings, which previous works leave unaddressed, with rich intra- and cross-domain diversity. Figure 3 illustrates the diversity of reference and driving pairs.

Figure 3: Representative driving video and reference image examples from AWBench.
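
AWBench's actual file layout and metadata schema are not described here; the dataclass below is only a hypothetical way to organize the benchmark's 100 driving videos, 200 reference images, character categories, and one-to-one/one-to-many/many-to-many mappings for an evaluation harness. All field names and example paths are invented for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AWBenchCase:
    """One evaluation case: a driving video paired with one or more references."""
    driving_video: str            # path to one of the 100 driving videos (illustrative)
    reference_images: List[str]   # subset of the 200 reference images (illustrative)
    character_type: str           # e.g. "human", "animal", "cartoon"
    mapping: str                  # "one-to-one", "one-to-many", or "many-to-many"

# Example usage: filter cases by mapping type before running a model over them.
cases = [
    AWBenchCase("drive/dance_001.mp4", ["ref/human_012.png"], "human", "one-to-one"),
    AWBenchCase("drive/walk_017.mp4",
                ["ref/rabbit_03.png", "ref/anime_21.png"], "cartoon", "one-to-many"),
]
print(sum(c.mapping == "one-to-many" for c in cases))
```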

Experimental Results

Quantitative Evaluation

DreamActor-M2 surpasses state-of-the-art baselines (Animate-X++, MTVCrafter, DreamActor-M1, Wan2.2-Animate) in both automatic and human-aligned metrics as assessed by Video-Bench, with notable margins in imaging quality, motion smoothness, temporal consistency, and appearance consistency. For example, the end-to-end variant of DreamActor-M2 achieves 4.72 (imaging quality), 4.56 (motion smoothness), 4.69 (temporal consistency), and 4.35 (appearance consistency) on the Video-Bench automatic metrics.

Qualitative Comparison

Visual inspection on AWBench demonstrates DreamActor-M2's superior fidelity and robustness, even in challenging cross-domain and multi-person setups, where baselines suffer from blurring, artifacts, and structure collapse. The model accurately reproduces fine motion semantics, preserves body shape across identity transfers, and reliably synthesizes complex gestures. One-to-many and many-to-many mappings further validate scalability (Figure 4).

Figure 4: Qualitative comparison between DreamActor-M2 and SOTA methods on AWBench, emphasizing fidelity and identity preservation.

Generalization and Versatility

DreamActor-M2's architecture generalizes across various shot types, reference identities (humans, animals, cartoons, objects), driving domains, and multi-agent scenarios. The model extrapolates plausible motion for incomplete driving signals, adapts to non-human inputs, and consistently manages complex group dynamics (Figure 5).

Figure 5: Visualization of diverse shot types, demonstrating extrapolation when lower-body driving signals are absent.

Figure 6: Animation results for varied reference character types (rabbit, juice bottle, anime characters).

Figure 7: Animation results for diverse driving character types (Animal2Animal, Cartoon2Cartoon).

Figure 8: Robust multi-person character animation under one-to-many and many-to-many settings.

Ablation Analysis

Component-wise ablation studies confirm key architectural advantages:

  • Spatial-Temporal In-Context Injection: Outperforms temporal concatenation by preserving intricate motion semantics and structural details.
  • Pose Augmentation: Critical for counteracting shape leakage and maintaining identity consistency.
  • LLM-Guided Semantic Fusion: Elevates fine-grained motion control, leading to marked improvements in semantic fidelity for complex actions.
  • End-to-End Advantage: Demonstrates superior performance in ambiguous or occluded keypoint scenarios compared to pose-centric methods (Figure 9).

    Figure 9: Ablation study visualizations highlight benefits of spatiotemporal injection, augmentation, and targeted text guidance.

Implications and Future Directions

Practically, DreamActor-M2 democratizes high-fidelity character animation across arbitrary domains, circumventing pose estimator bottlenecks and extensive manual curation. The end-to-end adaptation mechanism paves the way for cost-efficient, generalizable, and scalable animation pipelines for gaming, entertainment, AR/VR, and creative content synthesis. Theoretically, this work substantiates in-context learning as a robust mechanism for spatiotemporal reasoning in generative video models, a direction ripe for further exploration in vision foundation models and causal/invariant motion transfer.

Future research should address limitations in handling intricate subject interactions, such as trajectory crossing in multi-agent scenes, and further expand training diversity to fully exploit model extrapolation capabilities. Security and ethical risk mitigation remain active concerns given the framework's proficiency in realistic human and character video synthesis.

Conclusion

DreamActor-M2 sets a new standard for universal character image animation by integrating motion and identity conditions via spatiotemporal in-context learning, achieving high fidelity and robust generalization across unprecedented subject diversity, and pioneering a practical end-to-end animation paradigm. The AWBench benchmark and ablation studies comprehensively validate its architecture, and its implications signal new possibilities for scalable, domain-agnostic animated content generation.
