- The paper introduces DreamActor-M1, a framework that overcomes fine-grained control challenges in human image animation via hybrid motion and appearance guidance.
- It employs a Diffusion Transformer with flow matching in a 3D VAE latent space and a progressive training strategy to ensure multi-scale adaptability and temporal coherence.
- Evaluation results show DreamActor-M1 outperforms state-of-the-art methods in metrics like FID, SSIM, and PSNR, achieving superior identity preservation and visual fidelity.
This paper introduces DreamActor-M1, a framework for human image animation designed to address limitations in existing methods regarding fine-grained holistic control, multi-scale adaptability, and long-term temporal coherence. The goal is to generate expressive and robust animations from a single image or multiple reference images, driven by motion sequences.
Core Problem:
Existing human animation techniques struggle with:
- Fine-grained Control: Difficulty in accurately controlling subtle facial expressions (like eye blinks, lip tremors) simultaneously with body movements.
- Multi-Scale Adaptability: Lack of robustness when handling inputs of varying scales (portrait, upper-body, full-body) within a single model.
- Temporal Coherence: Maintaining consistency, especially for unseen parts (like the back of clothing), during long video generation or complex movements like rotations.
Proposed Solution: DreamActor-M1
DreamActor-M1 is based on a Diffusion Transformer (DiT) architecture, specifically leveraging a pre-trained image-to-video DiT model (Seaweed (2501.08316)) and utilizing Flow Matching for training within the latent space of a 3D VAE. It introduces three key innovations:
- Hybrid Motion Guidance: To achieve fine-grained control, the system uses a combination of control signals:
- Implicit Facial Representations: A pre-trained face motion encoder (based on PDFGC (2306.05196)) extracts identity-independent expression features from cropped faces in the driving video. These implicit tokens control facial expressions via cross-attention within the DiT blocks, decoupling expression from identity and head pose. An optional audio-driven encoder can map speech to these tokens for lip-syncing without a driving video.
- 3D Head Spheres: Explicitly control head pose (rotation and scale). 3D facial parameters are extracted (using FaceVerse (2205.15175)), and a colored 2D sphere projection representing the head's position, scale (matched to reference), and orientation (encoded by color) is generated. This decouples head pose from facial expression control.
- 3D Body Skeletons: Control body movements. SMPL-X parameters are estimated (using 4DHumans (2310.17080) and HaMeR (2405.17410)), projected to 2D skeletons, and fed into a pose encoder. Using skeletons instead of full meshes allows the model to learn character shape from references. Bone length adjustment is performed during inference by comparing reference and driving subject proportions in an A-pose (using RTMPose (2303.07399) and SeedEdit (2411.06686)) to improve anatomical alignment.
- Complementary Appearance Guidance: To ensure long-term consistency and handle unseen areas (e.g., during turns), a multi-reference injection protocol is proposed:
- Training: Three frames representing diverse viewpoints (max, min, median yaw rotation) plus an optional cropped half-body frame (for full-body inputs) are selected as references.
- Inference (Optional Two-Stage): For challenging cases (e.g., animating a full-body turn from a frontal half-body reference), the model first generates a multi-view sequence ("pseudo-references"). Key frames are selected from this sequence and used as multiple references in a second generation pass, providing richer appearance information for unseen regions. Reference features are injected into the DiT not via a separate ReferenceNet, but by concatenating reference and video latents and using self-attention followed by cross-attention between reference and denoising tokens.
- Progressive Training Strategy: To handle multi-scale data and ensure stable learning, training occurs in three stages (a minimal Stage 2 freezing sketch follows this list):
- Stage 1: Train with only 3D body skeletons and head spheres to adapt the base model to human animation.
- Stage 2: Freeze most parameters and train only the face motion encoder and face attention layers, introducing implicit facial control.
- Stage 3: Unfreeze all parameters and fine-tune the entire model jointly.
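As a rough illustration of Stage 2, the sketch below freezes a DiT backbone except for its face cross-attention layers and keeps the face motion encoder trainable. The module names (`dit`, `face_motion_encoder`) and the `face_attn` naming convention are assumptions for illustration, not identifiers from the paper.

```python
# Minimal sketch of the Stage 2 freezing step, assuming a PyTorch-style DiT.
# The module names and the "face_attn" naming convention are illustrative,
# not the authors' actual identifiers.
import torch

def configure_stage2(dit: torch.nn.Module,
                     face_motion_encoder: torch.nn.Module) -> list:
    """Freeze the DiT backbone except its face cross-attention layers,
    and keep the face motion encoder trainable."""
    trainable = []
    for name, p in dit.named_parameters():
        if "face_attn" in name:   # face cross-attention layers stay trainable
            p.requires_grad = True
            trainable.append(p)
        else:                     # everything else is frozen in Stage 2
            p.requires_grad = False
    for p in face_motion_encoder.parameters():
        p.requires_grad = True
        trainable.append(p)
    return trainable

# e.g. optimizer = torch.optim.AdamW(configure_stage2(dit, face_encoder), lr=1e-5)
```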
Training uses a diverse 500-hour dataset (dancing, sports, films, speeches) with varying resolutions and clip lengths (25-121 frames) resized to a 960×640 area.
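Since training uses a flow matching objective in the 3D VAE latent space (see Implementation below), a single training step can be sketched as follows. The linear interpolation path and velocity-prediction target are assumptions in the spirit of rectified flow; the paper does not spell out its exact parameterization, and the model call signature is illustrative only.

```python
# Sketch of one flow matching training step in the 3D VAE latent space.
# Linear path and velocity target are assumptions (rectified-flow style).
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, x0, cond):
    """x0: clean video latents from the 3D VAE, shape (B, ...)."""
    b = x0.shape[0]
    x1 = torch.randn_like(x0)                # noise endpoint
    t = torch.rand(b, device=x0.device)      # uniform timestep per sample
    t_ = t.view(b, *([1] * (x0.dim() - 1)))  # broadcastable time
    xt = (1.0 - t_) * x0 + t_ * x1           # point on the linear path
    target_v = x1 - x0                       # constant velocity along the path
    pred_v = dit(xt, timestep=t, **cond)     # conditioned on reference/pose/face tokens
    return F.mse_loss(pred_v, target_v)
```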
Implementation & Architecture:
- Backbone: MMDiT (2406.00455) pre-trained on image-to-video.
- Training Objective: Flow Matching (2210.02747).
- Latent Space: Pre-trained 3D VAE (2310.05737).
- Reference Injection: Concatenated latent patches (reference + video) processed via self-attention and cross-attention within DiT blocks.
- Motion Control Injection: Implicit face tokens via cross-attention; Pose features (encoded head sphere + skeleton) concatenated with noise tokens.
- Inference: Generates 73-frame segments, using the last latent of a segment to initialize the next for consistency. Classifier-free guidance (CFG) is used for reference and motion signals.
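The paper states that CFG is applied to both the reference and the motion signals but does not give the exact combination; the nested two-scale formulation and scale values below are one common way to do this and should be read as an assumption, not the authors' formulation.

```python
# Hypothetical two-signal classifier-free guidance over reference and motion
# conditions; the nested combination and scales are assumptions.
def guided_velocity(dit_fn, x_t, t, ref, motion, s_ref=2.0, s_motion=3.0):
    v_uncond = dit_fn(x_t, t, ref=None, motion=None)    # both conditions dropped
    v_ref    = dit_fn(x_t, t, ref=ref,  motion=None)    # reference only
    v_full   = dit_fn(x_t, t, ref=ref,  motion=motion)  # reference + motion
    return (v_uncond
            + s_ref * (v_ref - v_uncond)
            + s_motion * (v_full - v_ref))
```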
```python
def animate(I_R, V_drive, num_segments, timesteps, use_two_stage_inference=False):
    """Pseudocode for the DreamActor-M1 inference pipeline.
    I_R: reference image(s); V_drive: driving video."""
    # Optional two-stage inference: generate multi-view pseudo-references first.
    if use_two_stage_inference:
        pseudo_references = generate_pseudo_references(I_R, V_drive)
        I_R = select_frames(pseudo_references)  # use generated frames as multi-reference

    # Hybrid motion guidance: implicit face tokens, 3D head spheres, 3D body skeletons.
    face_crops = crop_faces(V_drive)
    implicit_face_tokens = face_motion_encoder(face_crops)

    head_params = extract_3d_head_params(V_drive)          # e.g., using FaceVerse
    head_spheres = render_head_spheres(head_params, I_R)   # match scale to reference

    body_params = estimate_smplx(V_drive)                  # e.g., using 4DHumans, HaMeR
    raw_skeletons = project_to_2d_skeletons(body_params)
    adjusted_skeletons = adjust_bone_length(raw_skeletons, I_R)  # compare drive/ref in A-pose

    pose_maps = concatenate(adjusted_skeletons, head_spheres, dim="channel")
    pose_features = pose_encoder(pose_maps)

    # Complementary appearance guidance: encode reference image(s) with the 3D VAE.
    ref_latent = vae_encoder(I_R)

    output_video_latents = []
    prev_segment_latent = None  # last latent of the previous segment, for continuity

    for segment_idx in range(num_segments):
        segment_pose_features = pose_features[segment_idx]
        segment_face_tokens = implicit_face_tokens[segment_idx]

        # Each segment starts from fresh noise, conditioned on the previous segment's last latent.
        current_latent = sample_gaussian_noise(shape=segment_video_shape)

        # Flow matching / diffusion denoising steps.
        for t in timesteps:
            input_tokens = prepare_dit_input(
                ref_latent, current_latent, segment_pose_features, t,
                prev_latent=prev_segment_latent,
            )
            # Implicit face tokens are injected via cross-attention inside the DiT blocks.
            current_latent = dit_model(input_tokens, face_tokens=segment_face_tokens, timestep=t)

        output_video_latents.append(current_latent)
        # Use the last frame latent of this segment to initialize the next one.
        prev_segment_latent = current_latent[-1:, ...]

    final_video_latents = concatenate(output_video_latents)
    output_video = vae_decoder(final_video_latents)
    return output_video
```
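The `adjust_bone_length` step above can be fleshed out roughly as below: estimate per-bone length ratios between the reference and driving subjects from A-pose keypoints (obtained in the paper via RTMPose and SeedEdit), then rescale each driving-frame bone parent-to-child. The keypoint layout, bone list, and the split into two helpers are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical bone-length adjustment: rescale each driving-skeleton bone by
# the ratio of reference to driving limb lengths measured in an A-pose.
import numpy as np

# (parent, child) keypoint index pairs defining bones -- illustrative subset,
# listed parent-first so adjustments propagate down each limb.
BONES = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16)]

def bone_ratios(ref_apose: np.ndarray, drv_apose: np.ndarray) -> dict:
    """Per-bone length ratio reference / driving, from A-pose 2D keypoints (J, 2)."""
    ratios = {}
    for p, c in BONES:
        ref_len = np.linalg.norm(ref_apose[c] - ref_apose[p]) + 1e-6
        drv_len = np.linalg.norm(drv_apose[c] - drv_apose[p]) + 1e-6
        ratios[(p, c)] = ref_len / drv_len
    return ratios

def adjust_bone_length(skeleton: np.ndarray, ratios: dict) -> np.ndarray:
    """Rescale each bone of a driving-frame skeleton (J, 2) toward reference proportions."""
    out = skeleton.copy()
    for (p, c), r in ratios.items():
        out[c] = out[p] + (skeleton[c] - skeleton[p]) * r
    return out
```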
Evaluation:
DreamActor-M1 was compared against state-of-the-art body animation (Animate Anyone (2311.10324), Champ (2405.11368), MimicMotion (2406.19680), DisPose (2412.09349)) and portrait animation methods (LivePortrait (2407.03168), X-Portrait (2405.03179), SkyReels-A1 (2502.10841), Runway Act-One). Quantitative results (FID, SSIM, PSNR, LPIPS, FVD) on a collected dataset showed DreamActor-M1 outperformed competitors in both categories. Qualitative results demonstrated better fine-grained motion, identity preservation, temporal consistency, and fidelity. Ablation studies confirmed the benefits of the multi-reference protocol (especially pseudo-references for long videos) and the hybrid control signals (implicit face features superior to landmarks, 3D skeletons/spheres superior to 3D mesh).
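For context, the per-frame reconstruction metrics reported above (SSIM, PSNR, LPIPS) are typically computed along the lines of the sketch below; the library choices, input normalization, and the omission of any cropping or frame-sampling protocol are assumptions, and set-level metrics (FID, FVD) are left out since they require feature statistics over whole frame collections.

```python
# Sketch of per-frame reconstruction metrics as commonly computed for
# animation benchmarks; not the paper's exact evaluation protocol.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_model = lpips.LPIPS(net="alex")

def frame_metrics(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: HxWx3 uint8 frames of the generated and ground-truth videos."""
    ssim = structural_similarity(pred, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, pred)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_model(to_t(pred), to_t(gt)).item()
    return ssim, psnr, lp
```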
Limitations:
- Difficulty controlling dynamic camera movements.
- Inability to generate physical interactions with objects.
- Bone length adjustment can be unstable in edge cases, sometimes requiring manual intervention.
Ethics:
The paper acknowledges the potential misuse for creating fake videos and emphasizes the need for ethical guidelines. The authors state that they will restrict access to the models and code and that they used only publicly available data.