- The paper introduces DreamActor-M1, a framework that overcomes fine-grained control challenges in human image animation via hybrid motion and appearance guidance.
- It employs a Diffusion Transformer with flow matching in a 3D VAE latent space and a progressive training strategy to ensure multi-scale adaptability and temporal coherence.
- Evaluation results show DreamActor-M1 outperforms state-of-the-art methods in metrics like FID, SSIM, and PSNR, achieving superior identity preservation and visual fidelity.
This paper introduces DreamActor-M1, a framework for human image animation designed to address limitations in existing methods regarding fine-grained holistic control, multi-scale adaptability, and long-term temporal coherence. The goal is to generate expressive and robust animations from a single image or multiple reference images, driven by motion sequences.
Core Problem:
Existing human animation techniques struggle with:
- Fine-grained Control: Difficulty in accurately controlling subtle facial expressions (like eye blinks, lip tremors) simultaneously with body movements.
- Multi-Scale Adaptability: Lack of robustness when handling inputs of varying scales (portrait, upper-body, full-body) within a single model.
- Temporal Coherence: Maintaining consistency, especially for unseen parts (like the back of clothing), during long video generation or complex movements like rotations.
Proposed Solution: DreamActor-M1
DreamActor-M1 is based on a Diffusion Transformer (DiT) architecture, specifically leveraging a pre-trained image-to-video DiT model (Seaweed (2501.08316)) and utilizing Flow Matching for training within the latent space of a 3D VAE. It introduces three key innovations:
- Hybrid Motion Guidance: To achieve fine-grained control, the system uses a combination of control signals:
- Implicit Facial Representations: A pre-trained face motion encoder (based on PDFGC (2306.05196)) extracts identity-independent expression features from cropped faces in the driving video. These implicit tokens control facial expressions via cross-attention within the DiT blocks, decoupling expression from identity and head pose. An optional audio-driven encoder can map speech to these tokens for lip-syncing without a driving video.
- 3D Head Spheres: Explicitly control head pose (rotation and scale). 3D facial parameters are extracted (using FaceVerse (2205.15175)), and a colored 2D sphere projection representing the head's position, scale (matched to reference), and orientation (encoded by color) is generated. This decouples head pose from facial expression control.
- 3D Body Skeletons: Control body movements. SMPL-X parameters are estimated (using 4DHumans (2310.17080) and HaMeR (2405.17410)), projected to 2D skeletons, and fed into a pose encoder. Using skeletons instead of full meshes allows the model to learn character shape from references. Bone length adjustment is performed during inference by comparing reference and driving subject proportions in an A-pose (using RTMPose (2303.07399) and SeedEdit (2411.06686)) to improve anatomical alignment.
- Complementary Appearance Guidance: To ensure long-term consistency and handle unseen areas (e.g., during turns), a multi-reference injection protocol is proposed:
- Training: Three frames representing diverse viewpoints (max, min, median yaw rotation) plus an optional cropped half-body frame (for full-body inputs) are selected as references.
- Inference (Optional Two-Stage): For challenging cases (e.g., animating a full-body turn from a frontal half-body reference), the model first generates a multi-view sequence ("pseudo-references"). Key frames are selected from this sequence and used as multiple references in a second generation pass, providing richer appearance information for unseen regions. Reference features are injected into the DiT not via a separate ReferenceNet, but by concatenating reference and video latents and using self-attention followed by cross-attention between reference and denoising tokens.
- Progressive Training Strategy: To handle multi-scale data and ensure stable learning, training occurs in three stages (a minimal Stage 2 freezing sketch follows this list):
- Stage 1: Train with only 3D body skeletons and head spheres to adapt the base model to human animation.
- Stage 2: Freeze most parameters and train only the face motion encoder and face attention layers, introducing implicit facial control.
- Stage 3: Unfreeze all parameters and fine-tune the entire model jointly.
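As a rough illustration of Stage 2, the sketch below freezes a DiT backbone except for its face cross-attention layers and keeps the face motion encoder trainable. The module names (`dit`, `face_motion_encoder`) and the `face_attn` naming convention are assumptions for illustration, not identifiers from the paper.

```python
# Minimal sketch of the Stage 2 freezing step, assuming a PyTorch-style DiT.
# The module names and the "face_attn" naming convention are illustrative,
# not the authors' actual identifiers.
import torch

def configure_stage2(dit: torch.nn.Module,
                     face_motion_encoder: torch.nn.Module) -> list:
    """Freeze the DiT backbone except its face cross-attention layers,
    and keep the face motion encoder trainable."""
    trainable = []
    for name, p in dit.named_parameters():
        if "face_attn" in name:   # face cross-attention layers stay trainable
            p.requires_grad = True
            trainable.append(p)
        else:                     # everything else is frozen in Stage 2
            p.requires_grad = False
    for p in face_motion_encoder.parameters():
        p.requires_grad = True
        trainable.append(p)
    return trainable

# e.g. optimizer = torch.optim.AdamW(configure_stage2(dit, face_encoder), lr=1e-5)
```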
Training uses a diverse 500-hour dataset (dancing, sports, films, speeches) with varying resolutions and clip lengths (25-121 frames) resized to a 960×640 area.
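Since training uses a flow matching objective in the 3D VAE latent space (see Implementation below), a single training step can be sketched as follows. The linear interpolation path and velocity-prediction target are assumptions in the spirit of rectified flow; the paper does not spell out its exact parameterization, and the model call signature is illustrative only.

```python
# Sketch of one flow matching training step in the 3D VAE latent space.
# Linear path and velocity target are assumptions (rectified-flow style).
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, x0, cond):
    """x0: clean video latents from the 3D VAE, shape (B, ...)."""
    b = x0.shape[0]
    x1 = torch.randn_like(x0)                # noise endpoint
    t = torch.rand(b, device=x0.device)      # uniform timestep per sample
    t_ = t.view(b, *([1] * (x0.dim() - 1)))  # broadcastable time
    xt = (1.0 - t_) * x0 + t_ * x1           # point on the linear path
    target_v = x1 - x0                       # constant velocity along the path
    pred_v = dit(xt, timestep=t, **cond)     # conditioned on reference/pose/face tokens
    return F.mse_loss(pred_v, target_v)
```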
Implementation & Architecture:
- Backbone: MMDiT (2406.00455) pre-trained on image-to-video.
- Training Objective: Flow Matching (2210.02747).
- Latent Space: Pre-trained 3D VAE (2310.05737).
- Reference Injection: Concatenated latent patches (reference + video) processed via self-attention and cross-attention within DiT blocks.
- Motion Control Injection: Implicit face tokens via cross-attention; Pose features (encoded head sphere + skeleton) concatenated with noise tokens.
- Inference: Generates 73-frame segments, using the last latent of a segment to initialize the next for consistency. Classifier-free guidance (CFG) is used for reference and motion signals.
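The paper states that CFG is applied to both the reference and the motion signals but does not give the exact combination; the nested two-scale formulation and scale values below are one common way to do this and should be read as an assumption, not the authors' formulation.

```python
# Hypothetical two-signal classifier-free guidance over reference and motion
# conditions; the nested combination and scales are assumptions.
def guided_velocity(dit_fn, x_t, t, ref, motion, s_ref=2.0, s_motion=3.0):
    v_uncond = dit_fn(x_t, t, ref=None, motion=None)    # both conditions dropped
    v_ref    = dit_fn(x_t, t, ref=ref,  motion=None)    # reference only
    v_full   = dit_fn(x_t, t, ref=ref,  motion=motion)  # reference + motion
    return (v_uncond
            + s_ref * (v_ref - v_uncond)
            + s_motion * (v_full - v_ref))
```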
```python
def animate(I_R, V_drive, num_segments, timesteps, use_two_stage_inference=False):
    """Pseudocode for the DreamActor-M1 inference pipeline.
    I_R: reference image(s); V_drive: driving video."""
    # Optional two-stage inference: generate multi-view pseudo-references first.
    if use_two_stage_inference:
        pseudo_references = generate_pseudo_references(I_R, V_drive)
        I_R = select_frames(pseudo_references)  # use generated frames as multi-reference

    # Hybrid motion guidance: implicit face tokens, 3D head spheres, 3D body skeletons.
    face_crops = crop_faces(V_drive)
    implicit_face_tokens = face_motion_encoder(face_crops)

    head_params = extract_3d_head_params(V_drive)          # e.g., using FaceVerse
    head_spheres = render_head_spheres(head_params, I_R)   # match scale to reference

    body_params = estimate_smplx(V_drive)                  # e.g., using 4DHumans, HaMeR
    raw_skeletons = project_to_2d_skeletons(body_params)
    adjusted_skeletons = adjust_bone_length(raw_skeletons, I_R)  # compare drive/ref in A-pose

    pose_maps = concatenate(adjusted_skeletons, head_spheres, dim="channel")
    pose_features = pose_encoder(pose_maps)

    # Complementary appearance guidance: encode reference image(s) with the 3D VAE.
    ref_latent = vae_encoder(I_R)

    output_video_latents = []
    prev_segment_latent = None  # last latent of the previous segment, for continuity

    for segment_idx in range(num_segments):
        segment_pose_features = pose_features[segment_idx]
        segment_face_tokens = implicit_face_tokens[segment_idx]

        # Each segment starts from fresh noise, conditioned on the previous segment's last latent.
        current_latent = sample_gaussian_noise(shape=segment_video_shape)

        # Flow matching / diffusion denoising steps.
        for t in timesteps:
            input_tokens = prepare_dit_input(
                ref_latent, current_latent, segment_pose_features, t,
                prev_latent=prev_segment_latent,
            )
            # Implicit face tokens are injected via cross-attention inside the DiT blocks.
            current_latent = dit_model(input_tokens, face_tokens=segment_face_tokens, timestep=t)

        output_video_latents.append(current_latent)
        # Use the last frame latent of this segment to initialize the next one.
        prev_segment_latent = current_latent[-1:, ...]

    final_video_latents = concatenate(output_video_latents)
    output_video = vae_decoder(final_video_latents)
    return output_video
```
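The `adjust_bone_length` step above can be fleshed out roughly as below: estimate per-bone length ratios between the reference and driving subjects from A-pose keypoints (obtained in the paper via RTMPose and SeedEdit), then rescale each driving-frame bone parent-to-child. The keypoint layout, bone list, and the split into two helpers are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical bone-length adjustment: rescale each driving-skeleton bone by
# the ratio of reference to driving limb lengths measured in an A-pose.
import numpy as np

# (parent, child) keypoint index pairs defining bones -- illustrative subset,
# listed parent-first so adjustments propagate down each limb.
BONES = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16)]

def bone_ratios(ref_apose: np.ndarray, drv_apose: np.ndarray) -> dict:
    """Per-bone length ratio reference / driving, from A-pose 2D keypoints (J, 2)."""
    ratios = {}
    for p, c in BONES:
        ref_len = np.linalg.norm(ref_apose[c] - ref_apose[p]) + 1e-6
        drv_len = np.linalg.norm(drv_apose[c] - drv_apose[p]) + 1e-6
        ratios[(p, c)] = ref_len / drv_len
    return ratios

def adjust_bone_length(skeleton: np.ndarray, ratios: dict) -> np.ndarray:
    """Rescale each bone of a driving-frame skeleton (J, 2) toward reference proportions."""
    out = skeleton.copy()
    for (p, c), r in ratios.items():
        out[c] = out[p] + (skeleton[c] - skeleton[p]) * r
    return out
```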
Evaluation:
DreamActor-M1 was compared against state-of-the-art body animation (Animate Anyone (2311.10324), Champ (2405.11368), MimicMotion (2406.19680), DisPose (2412.09349)) and portrait animation methods (LivePortrait (2407.03168), X-Portrait (2405.03179), SkyReels-A1 (2502.10841), Runway Act-One). Quantitative results (FID, SSIM, PSNR, LPIPS, FVD) on a collected dataset showed DreamActor-M1 outperformed competitors in both categories. Qualitative results demonstrated better fine-grained motion, identity preservation, temporal consistency, and fidelity. Ablation studies confirmed the benefits of the multi-reference protocol (especially pseudo-references for long videos) and the hybrid control signals (implicit face features superior to landmarks, 3D skeletons/spheres superior to 3D mesh).
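For context, the per-frame reconstruction metrics reported above (SSIM, PSNR, LPIPS) are typically computed along the lines of the sketch below; the library choices, input normalization, and the omission of any cropping or frame-sampling protocol are assumptions, and set-level metrics (FID, FVD) are left out since they require feature statistics over whole frame collections.

```python
# Sketch of per-frame reconstruction metrics as commonly computed for
# animation benchmarks; not the paper's exact evaluation protocol.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_model = lpips.LPIPS(net="alex")

def frame_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: HxWx3 uint8 frames of the generated and ground-truth videos."""
    ssim = structural_similarity(pred, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, pred)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_model(to_t(pred), to_t(gt)).item()
    return ssim, psnr, lp
```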
Limitations:
- Difficulty controlling dynamic camera movements.
- Inability to generate physical interactions with objects.
- Bone length adjustment can be unstable in edge cases, sometimes requiring manual intervention.
Ethics:
The paper acknowledges the potential misuse for creating fake videos and emphasizes the need for ethical guidelines. The authors state that they will restrict access to the models and code and that they used only publicly available data.