EchoMotion: Unified Human Video Synthesis

Updated 6 March 2026

EchoMotion is a unified framework for human-centric video generation that jointly models visual appearance and articulated 3D motion.
It employs a dual-modality Diffusion Transformer and synchronized 3D positional encoding to ensure precise temporal alignment and kinematic fidelity.
A two-stage training strategy, beginning with motion-only pretraining, refines joint video-motion generation, setting new benchmarks in anatomical plausibility and video quality.

Departing from conventional pixel-only video generation pipelines, EchoMotion addresses the challenge of synthesizing complex human actions by learning the joint distribution of human motion and visual appearance in a dual-modality Diffusion Transformer (DiT) architecture. The system introduces a synchronized positional encoding scheme and a two-stage training regime, and it leverages a large-scale paired dataset to improve the anatomical plausibility, temporal coherence, and controllability of generated human videos (Yang et al., 21 Dec 2025).

1. Dual-Modality Architecture

EchoMotion extends pretrained video DiT backbones (e.g., Wan2.1 or Wan2.2) with a set of dual-branch “video-motion” blocks. Each block maintains separate linear projections for video and motion tokens: $Q_v$, $K_v$, $V_v$ for visual tokens from VAE-latent videos, and $Q_m$, $K_m$, $V_m$ for motion tokens derived from SMPL parameters representing full-body 3D kinematics. These are concatenated along the token dimension for joint multi-head self-attention:

$Q_{mm} = [Q_v; Q_m],\quad K_{mm} = [K_v; K_m],\quad V_{mm} = [V_v; V_m]$

After this joint (cross-modal) self-attention, the resulting features are split back into the respective modalities, each further processed by modality-specific cross-attention (for text prompts) and feed-forward network (FFN) layers.
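The concatenate, attend, and split pattern described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name `joint_self_attention`, the weight dictionaries `W_v` and `W_m`, and the shared width `d` are assumptions made for the example, and the text cross-attention and FFN layers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(f_v, f_m, W_v, W_m, num_heads):
    """f_v: (N_v, d) video tokens, f_m: (N_m, d) motion tokens.
    W_v / W_m: dicts with keys 'q', 'k', 'v', each a (d, d) projection (hypothetical layout)."""
    n_v = f_v.shape[0]
    # Modality-specific projections, concatenated along the token dimension.
    q = np.concatenate([f_v @ W_v["q"], f_m @ W_m["q"]], axis=0)
    k = np.concatenate([f_v @ W_v["k"], f_m @ W_m["k"]], axis=0)
    v = np.concatenate([f_v @ W_v["v"], f_m @ W_m["v"]], axis=0)

    n, d = q.shape
    hd = d // num_heads

    def heads(x):
        # (n, d) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, hd).transpose(1, 0, 2)

    # Every token attends to every token of both modalities.
    attn = softmax(heads(q) @ heads(k).transpose(0, 2, 1) / np.sqrt(hd))
    out = (attn @ heads(v)).transpose(1, 0, 2).reshape(n, d)

    # Split back into the two modality streams.
    return out[:n_v], out[n_v:]
```

Because the attention matrix spans both token sets, changing the motion tokens changes the video outputs and vice versa, while the projections themselves stay modality-specific.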

The token streams differ in temporal resolution: visual tokens are temporally downsampled by a factor of 4, while motion tokens keep per-frame granularity, with 51 tokens per frame. SMPL parameters (shape $\beta \in \mathbb{R}^{10}$, rotation $\theta \in \mathbb{R}^{24 \times 6}$, global $\gamma \in \mathbb{R}^6$, root $v \in \mathbb{R}^3$, joints $\eta \in \mathbb{R}^{24 \times 3}$) are projected into this latent space using small MLPs.
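One way to arrive at 51 tokens per frame is to emit one token for each of $\beta$, $\gamma$, and $v$, and one token per joint for $\theta$ and $\eta$ (1 + 24 + 1 + 1 + 24 = 51). The source does not spell out this grouping, so the sketch below is an assumption; the latent width `D` is hypothetical, and single linear maps stand in for the small MLPs.

```python
import numpy as np

D = 64  # hypothetical latent width
rng = np.random.default_rng(0)

# One stand-in linear projection per parameter group (the source uses small MLPs).
proj = {
    "beta": rng.normal(size=(10, D)),
    "theta": rng.normal(size=(6, D)),
    "gamma": rng.normal(size=(6, D)),
    "root": rng.normal(size=(3, D)),
    "joints": rng.normal(size=(3, D)),
}

def tokenize_frame(beta, theta, gamma, root, joints):
    """Map one frame of SMPL parameters to a (51, D) token matrix (assumed grouping)."""
    tokens = [
        beta[None, :] @ proj["beta"],    # (1, D)  shape
        theta @ proj["theta"],           # (24, D) per-joint 6D rotation
        gamma[None, :] @ proj["gamma"],  # (1, D)  global
        root[None, :] @ proj["root"],    # (1, D)  root
        joints @ proj["joints"],         # (24, D) per-joint 3D position
    ]
    return np.concatenate(tokens, axis=0)
```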

2. Motion-Video Synchronized RoPE (MVS-RoPE)

EchoMotion introduces Motion-Video Synchronized Rotary Positional Encoding (MVS-RoPE) to enable unified 3D positional embeddings for both modalities. MVS-RoPE creates a temporally aligned embedding by:

Applying standard 3D RoPE $\mathcal{R}$ to visual tokens at coordinates $(t, h, w)$:

$\hat{f}^v_{t,h,w} = \mathcal{R}(t,h,w)\, f^v_{t,h,w}$

Applying the same $\mathcal{R}$ to motion tokens at time $t$ and token index $i$, with shifted coordinates:

$\hat{f}^m_{t,i} = \mathcal{R}(t/4, H+i, W+i)\, f^m_{t,i}$

Here, the $(H+i, W+i)$ spatial offset places motion tokens outside the $H \times W$ video grid, preventing index collisions, while dividing the motion frame index by 4 maps the higher-frame-rate motion tokens onto the time axis of the temporally downsampled video tokens. This scheme preserves pretrained RoPE behavior for video while enforcing precise modality separation and 1:4 temporal alignment.
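The coordinate assignment can be illustrated with a short NumPy sketch. The helper names (`video_coords`, `motion_coords`, `rope_1d`, `rope_3d`) are made up for this example, and the equal three-way channel split and the frequency base of 10000 are assumptions; pretrained backbones may allocate channels to the three axes differently.

```python
import numpy as np

def video_coords(T_v, H, W):
    """(t, h, w) coordinates for a T_v x H x W grid of video latent tokens."""
    t, h, w = np.meshgrid(np.arange(T_v), np.arange(H), np.arange(W), indexing="ij")
    return np.stack([t, h, w], axis=-1).reshape(-1, 3).astype(float)

def motion_coords(T_m, n_tokens, H, W, stride=4):
    """(t/4, H+i, W+i) coordinates for n_tokens motion tokens per motion frame."""
    t, i = np.meshgrid(np.arange(T_m), np.arange(n_tokens), indexing="ij")
    return np.stack([t / stride, H + i, W + i], axis=-1).reshape(-1, 3).astype(float)

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x (N, d) by angle pos * freq; d must be even."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # (d/2,)
    ang = pos[:, None] * freqs[None, :]        # (N, d/2); fractional positions are fine
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, coords):
    """Apply 1D RoPE per axis on three equal channel chunks (assumed split)."""
    chunks = np.split(x, 3, axis=-1)
    return np.concatenate(
        [rope_1d(c, coords[:, axis]) for axis, c in enumerate(chunks)], axis=-1
    )
```

With these coordinates, motion frames 0 to 3 receive temporal positions 0, 0.25, 0.5, and 0.75, so they sit between video latent frames 0 and 1, and every motion token has spatial indices at or beyond $H$ and $W$.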

3. Motion-Video Two-Stage Training Strategy

Training proceeds in two distinct phases:

Phase 1 (Motion-Only Pretraining): The video stream is frozen. The motion branch is trained using only motion data (HuMoVe and HumanML3D), optimizing a flow-matching MSE loss:

$\mathcal{L}_{\text{motion}} = \mathbb{E}_{m_0, m_1, t}\, \| u_\theta(m_t, y, t) - (m_1 - m_0) \|^2$

This avoids interference from the larger video branch.
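The loss above can be sketched as follows. The velocity target $m_1 - m_0$ is consistent with a straight-line path $m_t = (1 - t)\, m_0 + t\, m_1$; that path is an assumption here, since the source does not define $m_t$. The callable `u_theta` is a hypothetical stand-in for the motion branch, and the mean over elements differs from the squared norm only by a constant factor.

```python
import numpy as np

def flow_matching_loss(u_theta, m0, m1, t, y=None):
    """m0, m1: (B, ...) path endpoints; t: (B,) times in [0, 1]; y: optional text condition."""
    # Broadcast t over the trailing dimensions of the motion tensors.
    t_b = t.reshape(-1, *([1] * (m0.ndim - 1)))
    m_t = (1.0 - t_b) * m0 + t_b * m1  # assumed linear interpolation path
    target = m1 - m0                   # velocity target from the loss definition
    pred = u_theta(m_t, y, t)
    return float(np.mean((pred - target) ** 2))
```

A predictor that returns exactly $m_1 - m_0$ yields zero loss, which is a quick sanity check for the wiring.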

Phase 2 (Motion-Video Multi-Task Training): Both streams are unfrozen and trained on (video, motion, text) triples across three tasks:

Joint generation (predict video $x$ and motion $m$ from text $y$)
Motion $\rightarrow$ Video (condition on $m$, predict $x$)
Video $\rightarrow$ Motion (condition on $x$, predict $m$)

A one-hot task embedding is added to the latent to indicate the active task. Modalities that serve as conditions are provided as clean (noise-free) latents. The loss remains a flow-matching MSE over whichever modality is targeted.
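A minimal sketch of how inputs and targets might be assembled per task is given below. The names `TASKS`, `task_one_hot`, and `prepare_inputs` are hypothetical, and the convention that the data sample sits at $t = 1$ and Gaussian noise at $t = 0$ is an assumption; the source only states that conditioning modalities stay clean and that the loss covers the targeted modality.

```python
import numpy as np

TASKS = ("joint", "motion_to_video", "video_to_motion")

def task_one_hot(task):
    e = np.zeros(len(TASKS))
    e[TASKS.index(task)] = 1.0
    return e

def prepare_inputs(x_data, m_data, t, task, rng):
    """Return (video input, motion input, task embedding, loss targets) for one task.
    t is a scalar in [0, 1]; data at t=1 and noise at t=0 is an assumed convention."""
    x_noise = rng.standard_normal(x_data.shape)
    m_noise = rng.standard_normal(m_data.shape)
    x_in = (1.0 - t) * x_noise + t * x_data
    m_in = (1.0 - t) * m_noise + t * m_data

    targets = {}
    if task == "motion_to_video":
        m_in = m_data                          # motion is the condition: keep it clean
        targets["video"] = x_data - x_noise
    elif task == "video_to_motion":
        x_in = x_data                          # video is the condition: keep it clean
        targets["motion"] = m_data - m_noise
    else:                                      # joint generation: both are noised and supervised
        targets["video"] = x_data - x_noise
        targets["motion"] = m_data - m_noise
    return x_in, m_in, task_one_hot(task), targets
```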

In-Context Classifier-Free Guidance (ICCFG): During multi-task training, conditions (text, motion, video) are randomly dropped depending on the task. At sampling, guidance scales $(\omega_1, \omega_2)$ are applied to interpolate between various conditional generations, enabling the model to toggle between joint and cross-modal synthesis without additional control modules.
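The source does not give the exact ICCFG formula, so the sketch below only illustrates the general idea with a commonly used nested two-scale composition. The function names, the independent per-condition dropout, and the ordering of the two conditions are all assumptions for the example, not the paper's definition.

```python
import numpy as np

def drop_conditions(conds, p_drop, rng):
    """Training-time dropout: replace each condition with None with probability p_drop
    (independent dropout is an assumption; the source says dropping depends on the task)."""
    return {k: (None if rng.random() < p_drop else v) for k, v in conds.items()}

def two_scale_guidance(v_uncond, v_c1, v_c1c2, w1, w2):
    """Combine three model outputs: no condition, first condition only, both conditions.
    w1 scales the effect of the first condition, w2 the added effect of the second."""
    return v_uncond + w1 * (v_c1 - v_uncond) + w2 * (v_c1c2 - v_c1)
```

Under this composition, setting both scales to 1 recovers the fully conditioned prediction and setting both to 0 recovers the unconditional one; larger values extrapolate away from the less-conditioned outputs.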

4. Key Design Principles and Inductive Biases

Several architectural and training choices in EchoMotion contribute critical inductive biases:

Parametric SMPL encoding yields highly token-efficient, kinematically faithful motion representations compared to optical flow.
Dual-branch projections insulate video and motion features, reducing harmful feature interference.
MVS-RoPE enforces strict temporal alignment between modalities, which appears as sharply diagonal patterns in the visualized attention maps.
The two-stage schedule establishes a strong unimodal motion prior, mitigating the risk of the video stream overpowering motion signals during joint modeling.
ICCFG enables flexible task switching for unconditional, text-conditioned, motion-conditioned, and joint generations within a unified model.

5. HuMoVe Dataset

EchoMotion’s training relies on HuMoVe, a large-scale corpus comprising approximately 80,000 human-centric video clips of 1–5 seconds each (16–24 fps). Each clip contains:

Video frames compressed via VAE
SMPL motion tracks with per-frame parameters (shape, rotation, global, root, and 3D skeleton)
Granular text captions detailing (i) subject appearance/clothing, (ii) background, (iii) precise action

The dataset is organized into 9 major categories and 38 subcategories. Phase 1 of training employs only the motion modality; Phase 2 utilizes the full (video, motion, text) triplets.
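The per-clip contents listed above could be held in a record like the following. This is a hypothetical container for illustration: the class name, field names, and validation are assumptions, and the source does not describe the dataset's actual storage format or the layout of the video latents.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HuMoVeSample:
    """Hypothetical in-memory record for one clip with T motion frames."""
    video_latent: np.ndarray  # VAE-compressed frames (layout not specified in the source)
    beta: np.ndarray          # (T, 10)    shape
    theta: np.ndarray         # (T, 24, 6) rotation
    gamma: np.ndarray         # (T, 6)     global
    root: np.ndarray          # (T, 3)     root
    joints: np.ndarray        # (T, 24, 3) 3D skeleton
    caption: str              # appearance/clothing, background, action
    fps: float

    def __post_init__(self):
        tails = {"beta": (10,), "theta": (24, 6), "gamma": (6,),
                 "root": (3,), "joints": (24, 3)}
        T = self.beta.shape[0]
        # All SMPL tracks must cover the same number of frames.
        for name, tail in tails.items():
            shape = getattr(self, name).shape
            if shape != (T, *tail):
                raise ValueError(f"{name}: expected {(T, *tail)}, got {shape}")

    @property
    def duration_s(self):
        return self.beta.shape[0] / self.fps
```

For example, a clip with 32 motion frames at 16 fps has a duration of 2 seconds, inside the 1–5 second range stated above.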

6. Empirical Results and Capabilities

EchoMotion reports leading or on-par scores on text-to-video generation and results close to the listed baselines on motion-to-video and video-to-motion:

A. Text $\rightarrow$ Video (∼270 prompts, Wan-5B scale):

Human Anatomy: 85.1 (EchoMotion) vs. 83.0/83.1 (baselines)
Motion Smoothness: 99.3 vs. 98.9/98.7
Dynamic Degree: 64.0 vs. 62.2/63.1
Aesthetic Quality: 58.3 (parity)

Human Evaluation (1–100):

Video Quality: 81.0 (EchoMotion) vs. 72.8/72.3
Prompt Following: 81.5 vs. 78.9/79.6
Posture Plausibility: 81.6 vs. 68.9/70.2

B. Motion $\rightarrow$ Video (VACE-Benchmark):

Aesthetic Quality: 59.2 (EchoMotion) vs. 61.4 (Animate-14B)
Motion Smoothness: 99.1 (EchoMotion) vs. 98.9 (Animate-14B)
Pose Consistency: 82.2 (EchoMotion) vs. 87.2 (Animate-14B)

C. Video $\rightarrow$ Motion (3DPW test):

PA-MPJPE: 59.8 mm (EchoMotion) vs. 58.7 mm (ChatHuman); lower is better
MPJPE: 94.1 mm (EchoMotion) vs. 91.3 mm (ChatHuman); lower is better
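For readers unfamiliar with these metrics, MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and PA-MPJPE is the same distance after a Procrustes (similarity) alignment that removes global scale, rotation, and translation. The sketch below shows the standard single-frame definitions; it is not the evaluation code used in the paper, and benchmark protocols typically add details such as root alignment and averaging over frames.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pred and gt are (J, 3) in the same units (e.g. mm)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Best similarity transform (scale, rotation, translation) of pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance gives the optimal rotation (Kabsch/Umeyama).
    U, S, Vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    D = np.array([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ np.diag(D) @ Vt
    scale = (S * D).sum() / (p ** 2).sum()
    return scale * p @ R + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

A prediction that differs from the ground truth only by a global rotation, scale, and shift has a large MPJPE but a PA-MPJPE of essentially zero, which is why the two numbers are reported together.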

D. Motion-Only Generation:

FID: 10.9/11.3 (1.3B/5B params)
Pose Plausibility: 79.5/80.8
Prompt Following: 78.8/82.3
Motion Smoothness: 92.2/92.4

Qualitative assessments show that EchoMotion corrects physically implausible poses, maintains anatomical structure in challenging actions, and faithfully executes complex, multi-step instructions.

7. Implications and Research Significance

EchoMotion demonstrates that explicit representation of 3D human motion is highly complementary to traditional pixel-space modeling for video generation. The unified DiT, synchronized 3D positional encoding, and two-stage multitask training collectively enable significant gains in generating anatomically correct and temporally consistent videos, as well as supporting bidirectional motion $\leftrightarrow$ video control with a single model (Yang et al., 21 Dec 2025). The dual-modality design and the use of expansive paired datasets establish new performance baselines and architectural standards for generative modeling of articulated human agents. This suggests that further advancements in human video generation may depend on even tighter integration of appearance and kinematic modeling, as well as more sophisticated multi-modal training regimes.

References (1)

Yang et al. (2025). EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer.
