
EchoMotion: Unified Human Video Synthesis

Updated 6 March 2026
  • EchoMotion is a unified framework for human-centric video generation that jointly models visual appearance and articulated 3D motion.
  • It employs a dual-modality Diffusion Transformer and synchronized 3D positional encoding to ensure precise temporal alignment and kinematic fidelity.
  • A two-stage training strategy starts with motion-only pretraining and then trains joint video-motion generation, improving anatomical plausibility and video quality over its baselines.

EchoMotion is a unified framework for human-centric video generation that jointly models appearance and articulated 3D motion. Departing from conventional pixel-only video generation pipelines, EchoMotion addresses the challenge of synthesizing complex human actions by learning the joint distribution of human motion and visual appearance in a dual-modality Diffusion Transformer (DiT) architecture. The system introduces a synchronized positional encoding scheme and a two-stage training regime, and it leverages a large-scale paired dataset to improve the anatomical plausibility, temporal coherence, and controllability of generated human videos (Yang et al., 21 Dec 2025).

1. Dual-Modality Architecture

EchoMotion extends pretrained video DiT backbones (e.g., Wan2.1 or Wan2.2) with a set of dual-branch “video-motion” blocks. Each block maintains separate linear projections for video and motion tokens: $Q_v$, $K_v$, $V_v$ for visual tokens from VAE-latent videos, and $Q_m$, $K_m$, $V_m$ for motion tokens derived from SMPL parameters representing full-body 3D kinematics. These are concatenated along the token dimension for joint multi-head self-attention:

$$Q_{mm} = [Q_v; Q_m],\quad K_{mm} = [K_v; K_m],\quad V_{mm} = [V_v; V_m]$$

After this joint attention, the resulting features are split back into their respective modalities, and each is further processed by modality-specific cross-attention (for text prompts) and feed-forward network (FFN) layers.
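As a rough illustration of the attention step described above, the sketch below runs single-head joint self-attention over concatenated video and motion tokens in plain Python. The helper names (`matmul`, `softmax`, `rand_matrix`, `joint_attention`), the random weights, and the tiny dimensions are assumptions made for illustration. The actual blocks are multi-head and include positional encoding, cross-attention, and FFN layers, none of which is shown here.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix, standing in for learned weights or token features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def joint_attention(video_tokens, motion_tokens, proj_v, proj_m):
    """Single-head joint self-attention over both modalities.

    proj_v and proj_m are dicts with keys "q", "k", "v", each a (dim x dim)
    matrix, so every modality keeps its own projections.
    """
    dim = len(video_tokens[0])
    # Project each modality separately, then concatenate along the token axis.
    q = matmul(video_tokens, proj_v["q"]) + matmul(motion_tokens, proj_m["q"])
    k = matmul(video_tokens, proj_v["k"]) + matmul(motion_tokens, proj_m["k"])
    v = matmul(video_tokens, proj_v["v"]) + matmul(motion_tokens, proj_m["v"])
    # Scaled dot-product attention over the joint token sequence.
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    out = matmul(weights, v)
    # Split the result back into the two modalities.
    n_video = len(video_tokens)
    return out[:n_video], out[n_video:], weights

rng = random.Random(0)
dim = 4
video = rand_matrix(6, dim, rng)   # 6 video tokens
motion = rand_matrix(3, dim, rng)  # 3 motion tokens
proj_v = {name: rand_matrix(dim, dim, rng) for name in ("q", "k", "v")}
proj_m = {name: rand_matrix(dim, dim, rng) for name in ("q", "k", "v")}
video_out, motion_out, weights = joint_attention(video, motion, proj_v, proj_m)
print(len(video_out), len(motion_out), len(weights[0]))  # 6 3 9
```

Each of the 9 attention rows spans both modalities, which is what lets motion tokens attend to video tokens and vice versa while the projections stay modality-specific.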

The token streams differ in temporal resolution: visual tokens are temporally downsampled by a factor of 4, while the motion stream keeps per-frame granularity, with 51 motion tokens per frame. SMPL parameters (shape $\beta \in \mathbb{R}^{10}$, rotation $\theta \in \mathbb{R}^{24 \times 6}$, global $\gamma \in \mathbb{R}^{6}$, root $v \in \mathbb{R}^{3}$, joints $\eta \in \mathbb{R}^{24 \times 3}$) are projected into the model's latent space using small MLPs.
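The token bookkeeping can be checked with a few lines of arithmetic. The sketch below assumes that every row of every SMPL parameter group becomes one token; the text does not state this, but it reproduces the 51 tokens per frame mentioned above. The function names, the requirement that the frame count be a multiple of 4, and the latent grid size in the usage lines are hypothetical.

```python
# (rows, width) of each SMPL parameter group, copied from the shapes listed above.
SMPL_GROUPS = {
    "shape": (1, 10),
    "rotation": (24, 6),
    "global": (1, 6),
    "root": (1, 3),
    "joints": (24, 3),
}
TEMPORAL_DOWNSAMPLE = 4  # video latents are 4x coarser in time than motion

def motion_tokens_per_frame(groups=SMPL_GROUPS):
    """Assumption: each row of each parameter group is projected to one token."""
    return sum(rows for rows, _width in groups.values())

def token_counts(num_frames, latent_h, latent_w):
    """Return (video_tokens, motion_tokens) for one clip.

    Assumes num_frames is a multiple of 4 and one video token per latent cell;
    the real VAE and patchification details may differ.
    """
    if num_frames % TEMPORAL_DOWNSAMPLE:
        raise ValueError("this sketch expects num_frames to be a multiple of 4")
    video_tokens = (num_frames // TEMPORAL_DOWNSAMPLE) * latent_h * latent_w
    motion_tokens = num_frames * motion_tokens_per_frame()
    return video_tokens, motion_tokens

print(motion_tokens_per_frame())  # 51
print(token_counts(16, 30, 52))   # (6240, 816)
```

Under these assumptions the motion stream adds only a small fraction of the video token count, even though it runs at the full frame rate.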

2. Motion-Video Synchronized RoPE (MVS-RoPE)

EchoMotion introduces Motion-Video Synchronized Rotary Positional Encoding (MVS-RoPE) to enable unified 3D positional embeddings for both modalities. MVS-RoPE creates a temporally aligned embedding by:

  • Applying standard 3D RoPE $\mathcal{R}$ to visual tokens at coordinates $(t, h, w)$:

$$\hat{f}^v_{t,h,w} = \mathcal{R}(t,h,w)\, f^v_{t,h,w}$$

  • For motion tokens at time $t$ and token index $i$:

$$\hat{f}^m_{t,i} = \mathcal{R}(t/4, H+i, W+i)\, f^m_{t,i}$$

Here, the $(H+i, W+i)$ spatial offset prevents index collisions with the $H \times W$ video grid, while $t/4$ temporally synchronizes the motion tokens (which have the higher frame rate) with the video tokens. This scheme preserves pretrained RoPE behavior for video while enforcing modality separation and a 1:4 temporal alignment (one video latent frame per four motion frames).
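The coordinate assignment can be sketched directly. The snippet below only builds the 3D positions that would be fed to RoPE and checks that the two modalities never share a coordinate; it does not implement the rotation itself. The function names, zero-based indexing, and the choice to keep $t/4$ fractional rather than rounded are assumptions.

```python
def video_positions(num_latent_frames, height, width):
    """(t, h, w) coordinates for every token of the video latent grid."""
    return [
        (t, h, w)
        for t in range(num_latent_frames)
        for h in range(height)
        for w in range(width)
    ]

def motion_positions(num_motion_frames, tokens_per_frame, height, width, ratio=4):
    """(t / ratio, H + i, W + i) coordinates for motion token i at motion frame t."""
    return [
        (t / ratio, height + i, width + i)
        for t in range(num_motion_frames)
        for i in range(tokens_per_frame)
    ]

H, W = 3, 3
video_pos = video_positions(2, H, W)        # 2 latent frames
motion_pos = motion_positions(8, 51, H, W)  # 8 motion frames, 4 per latent frame

# Spatial offsets start at (H, W), outside the video grid, so nothing collides.
print(set(video_pos).isdisjoint(motion_pos))  # True
# Motion frames 4..7 map to temporal positions 1.0, 1.25, 1.5, 1.75.
print([motion_pos[t * 51][0] for t in range(4, 8)])  # [1.0, 1.25, 1.5, 1.75]
```

Motion frames 4 to 7 all land in the temporal interval of video latent frame 1, which is the alignment the $t/4$ rescaling is meant to provide.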

3. Motion-Video Two-Stage Training Strategy

Training proceeds in two distinct phases:

Phase 1 (Motion-Only Pretraining): The video stream is frozen. The motion branch is trained using only motion data (HuMoVe and HumanML3D), optimizing a flow-matching MSE loss:

$$\mathcal{L}_{\text{motion}} = \mathbb{E}_{m_0, m_1, t}\, \left\| u_\theta(m_t, (y), t) - (m_1 - m_0) \right\|^2$$

This avoids interference from the larger video branch.
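For a single sample, the loss above can be sketched as follows. The text does not define $m_t$, so the linear interpolation between the two endpoints used here is an assumption (it is the usual choice when the regression target is $m_1 - m_0$). The function name, the `predict_velocity` callable, and the toy vectors are hypothetical, and text conditioning is omitted.

```python
def flow_matching_loss(predict_velocity, m0, m1, t):
    """Mean squared error between the predicted velocity and m1 - m0.

    predict_velocity(m_t, t) stands in for the motion branch u_theta.
    The mean over dimensions differs from the squared norm only by a
    constant factor 1 / len(m0).
    """
    # Assumed linear path between the endpoints: m_t = (1 - t) * m0 + t * m1.
    m_t = [(1.0 - t) * a + t * b for a, b in zip(m0, m1)]
    target = [b - a for a, b in zip(m0, m1)]
    pred = predict_velocity(m_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)

m0 = [0.0, 0.0]
m1 = [1.0, 2.0]

# A predictor that always outputs zero is penalized by the size of the target.
print(flow_matching_loss(lambda m_t, t: [0.0, 0.0], m0, m1, 0.5))  # 2.5
# A predictor that outputs exactly m1 - m0 reaches zero loss.
print(flow_matching_loss(lambda m_t, t: [1.0, 2.0], m0, m1, 0.5))  # 0.0
```

In training, the expectation would be approximated by averaging this quantity over sampled endpoint pairs and time steps.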

Phase 2 (Motion-Video Multi-Task Training): Both streams are unfrozen and trained on (video, motion, text) triples:

  • Joint generation (predict video $x$ and motion $m$ from text $y$)
  • Motion→Video (condition on $m$, predict $x$)
  • Video→Motion (condition on $x$, predict $m$)

A one-hot task embedding is added to the latent. For whichever modality serves as the condition, clean (noise-free) latents are provided. The loss remains a flow-matching MSE over whichever modality is targeted.
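One way to picture the multi-task setup is as a routing table that decides, per task, which modality receives noisy latents (and is supervised) and which receives clean latents (and acts as the condition). The sketch below is hypothetical: the task names, the `TARGETS` table, and `prepare_inputs` are illustrative, and how the one-hot vector is embedded and added to the latent is not modeled.

```python
TASKS = ("joint", "motion_to_video", "video_to_motion")

# Modalities that are noised and supervised for each task; the rest stay clean.
TARGETS = {
    "joint": ("video", "motion"),
    "motion_to_video": ("video",),
    "video_to_motion": ("motion",),
}

def one_hot(task):
    """One-hot task vector in the order given by TASKS."""
    return [1.0 if name == task else 0.0 for name in TASKS]

def prepare_inputs(task, clean, noisy):
    """Pick noisy latents for target modalities and clean latents for conditions.

    clean and noisy are dicts mapping "video" / "motion" to latents.
    Returns (model_inputs, target_modalities, task_vector).
    """
    targets = TARGETS[task]
    inputs = {
        name: (noisy[name] if name in targets else clean[name])
        for name in ("video", "motion")
    }
    return inputs, targets, one_hot(task)

clean = {"video": "x_clean", "motion": "m_clean"}
noisy = {"video": "x_t", "motion": "m_t"}
inputs, targets, task_vec = prepare_inputs("motion_to_video", clean, noisy)
print(inputs)    # {'video': 'x_t', 'motion': 'm_clean'}
print(targets)   # ('video',)
print(task_vec)  # [0.0, 1.0, 0.0]
```

The loss would then be computed only over the modalities listed in `targets`.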

In-Context Classifier-Free Guidance (ICCFG): During multi-task training, conditions (text, motion, video) are randomly dropped depending on the task. At sampling, guidance scales $(\omega_1, \omega_2)$ are applied to interpolate between various conditional generations, enabling the model to toggle between joint and cross-modal synthesis without additional control modules.
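The exact guidance formula is not given here. As a loudly labeled assumption, the sketch below uses a nested two-scale combination that is common when guiding on two conditions: the first scale moves from the unconditional prediction toward the text-conditioned one, and the second moves from the text-conditioned prediction toward the fully conditioned one. The function name and argument layout are hypothetical and should not be read as the paper's ICCFG definition.

```python
def guided_velocity(v_uncond, v_text, v_full, w1, w2):
    """Combine three velocity predictions with two guidance scales.

    v_uncond: prediction with all conditions dropped
    v_text:   prediction with text only
    v_full:   prediction with text plus the in-context modality
    Assumed form: v_uncond + w1 * (v_text - v_uncond) + w2 * (v_full - v_text).
    """
    return [
        u + w1 * (t - u) + w2 * (f - t)
        for u, t, f in zip(v_uncond, v_text, v_full)
    ]

v_uncond, v_text, v_full = [0.0], [1.0], [3.0]
print(guided_velocity(v_uncond, v_text, v_full, 1.0, 1.0))  # [3.0]
print(guided_velocity(v_uncond, v_text, v_full, 1.0, 0.0))  # [1.0]
print(guided_velocity(v_uncond, v_text, v_full, 2.0, 1.0))  # [4.0]
```

With both scales at 1 the result is the fully conditioned prediction, and setting the second scale to 0 falls back to text-only guidance, which illustrates how scale choices can switch between generation modes.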

4. Key Design Principles and Inductive Biases

Several architectural and training choices in EchoMotion contribute critical inductive biases:

  • Parametric SMPL encoding yields highly token-efficient, kinematically faithful motion representations compared to optical flow.
  • Dual-branch projections insulate video and motion features, reducing harmful feature interference.
  • MVS-RoPE enforces strict temporal alignment between modalities, which is visible as sharply diagonal attention maps.
  • Two-stage schedule establishes a strong unimodal motion prior, mitigating the risk of the video stream overpowering motion signals during joint modeling.
  • ICCFG enables flexible task switching for unconditional, text-conditioned, motion-conditioned, and joint generations within a unified model.

5. HuMoVe Dataset

EchoMotion’s training relies on HuMoVe, a large-scale corpus comprising approximately 80,000 human-centric video clips of 1–5 seconds each (16–24 fps). Each clip contains:

  • Video frames compressed via VAE
  • SMPL motion tracks with per-frame parameters (shape, rotation, global, root, and 3D skeleton)
  • Granular text captions detailing (i) subject appearance/clothing, (ii) background, (iii) precise action

The dataset is organized into 9 major categories and 38 subcategories. Stage 1 employs only the motion modality; stage 2 utilizes the full (video, motion, text) triplets.

6. Empirical Results and Capabilities

EchoMotion outperforms its baselines on text-to-video benchmarks and remains competitive with specialized models on cross-modal tasks:

A. Text→Video (∼270 prompts, Wan-5B scale):

  • Human Anatomy: 85.1 (EchoMotion) vs. 83.0/83.1 (baselines)
  • Motion Smoothness: 99.3 vs. 98.9/98.7
  • Dynamic Degree: 64.0 vs. 62.2/63.1
  • Aesthetic Quality: 58.3 (parity)

Human Evaluation (1–100):

  • Video Quality: 81.0 (EchoMotion) vs. 72.8/72.3
  • Prompt Following: 81.5 vs. 78.9/79.6
  • Posture Plausibility: 81.6 vs. 68.9/70.2

B. Motion→Video (VACE-Benchmark):

  • Aesthetic Quality: 59.2 (EchoMotion) vs. 61.4 (Animate-14B)
  • Motion Smoothness: 99.1 vs. 98.9
  • Pose Consistency: 82.2 vs. 87.2

C. Video→Motion (3DPW test):

  • PA-MPJPE: 59.8 mm (EchoMotion) vs. 58.7 mm (ChatHuman)
  • MPJPE: 94.1 mm vs. 91.3 mm
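For readers unfamiliar with the metric, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints, so lower is better. A minimal sketch follows; the function name and toy joints are illustrative, and benchmark protocols typically add steps such as root alignment that are not shown. PA-MPJPE applies the same average after a Procrustes (rigid plus scale) alignment, which is omitted here.

```python
import math

def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between matching 3D joints, in the input units."""
    if len(pred_joints) != len(gt_joints):
        raise ValueError("joint lists must have the same length")
    distances = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(distances) / len(distances)

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mpjpe(pred, gt))  # 3.5
```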

D. Motion-Only Generation:

  • FID: 10.9/11.3 (1.3B/5B params)
  • Pose Plausibility: 79.5/80.8
  • Prompt Following: 78.8/82.3
  • Motion Smoothness: 92.2/92.4

Qualitative assessments show that EchoMotion corrects physically implausible poses, maintains anatomical structure in challenging actions, and faithfully executes complex, multi-step instructions.

7. Implications and Research Significance

EchoMotion demonstrates that explicit representation of 3D human motion is highly complementary to traditional pixel-space modeling for video generation. The unified DiT, synchronized 3D positional encoding, and two-stage multitask training collectively enable significant gains in generating anatomically correct and temporally consistent videos, as well as supporting bidirectional motion↔video control with a single model (Yang et al., 21 Dec 2025). The dual-modality design and the use of expansive paired datasets establish new performance baselines and architectural standards for generative modeling of articulated human agents. This suggests that further advancements in human video generation may depend on even tighter integration of appearance and kinematic modeling, as well as more sophisticated multi-modal training regimes.

References (1)
