SynMotion: Custom Video Motion Synthesis
- SynMotion is a motion-customized video synthesis framework that disentangles subject and motion semantics through a dual-embedding mechanism.
- It integrates efficient visual adapters within a pre-trained video diffusion generator to achieve refined motion control and enhanced temporal consistency.
- The framework demonstrates state-of-the-art performance on MotionBench by achieving high motion accuracy, subject fidelity, and robust generalization across diverse scenarios.
SynMotion is a motion-customized video generation framework that enables precise transfer, adaptation, and control of human (and other subject) motions in video synthesis by combining dual-level semantic disambiguation with parameter-efficient visual adaptation. Addressing the limitations of semantic-only, visual-only, and naive concatenative approaches to video motion customization, SynMotion introduces a suite of architectural innovations and training paradigms that promote specificity, diversity, and fidelity of both motion and subject appearance. Its methodology is grounded in a dual-embedding semantic comprehension mechanism, parameter-efficient visual denoising adapters, a subject-prior regularization strategy, and a new benchmark for systematic evaluation.
1. Model Architecture and Component Integration
At its core, SynMotion builds on a pre-trained MM-DiT-based video diffusion generator (using HunyuanVideo as base) and introduces two complementary adaptation pathways:
- Semantic Pathway: Text prompts structured as "<subject, motion>" are processed using a Multimodal LLM (MLLM). These are decomposed into a subject embedding ($e_s$) and a motion embedding ($e_m$), incorporating learnable residuals and a dedicated embedding refiner. The mechanism ensures flexible recombination and discriminative control, allowing learned motion features to be reused with new or arbitrary subjects.
- Visual Pathway: Specialized, lightweight, trainable motion adapters (low-rank adapters) are inserted into each denoising block within the frozen video backbone. These adapters adjust only a small set of added weights ($W' = W + BA$), where $B$ and $A$ are small trainable low-rank matrices, infusing the generative process with enhanced motion realism and temporal smoothness.
The joint pipeline is trained by minimizing
$$\mathcal{L} = \mathbb{E}_{z_t,\,c,\,\epsilon,\,t}\Big[\big\|\epsilon - \epsilon_\theta(z_t, t, c)\big\|_2^2\Big],$$
where $z_t$ is the noisy latent, $c$ the composite subject-motion embedding, and $\epsilon_\theta(z_t, t, c)$ the denoiser prediction.
This architecture enables explicit subject-motion separation, fine-grained motion customization, and scalable adaptation to new scenarios with minimal additional data.
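A minimal PyTorch-style sketch of this training objective follows, assuming an epsilon-prediction formulation; the `denoiser`, `scheduler`, and embedding arguments are hypothetical placeholders, so this illustrates the loss above rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, scheduler, latents, subject_motion_emb):
    """One denoising-loss step conditioned on the composite <subject, motion> embedding.

    `denoiser` stands for the frozen MM-DiT backbone plus trainable low-rank adapters;
    `subject_motion_emb` is the composite embedding c from the semantic pathway.
    Names and the epsilon-prediction objective are illustrative assumptions.
    """
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_train_timesteps, (latents.size(0),),
                      device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)  # z_t

    # Predict the added noise, conditioned on the subject-motion embedding c.
    pred = denoiser(noisy_latents, t, encoder_hidden_states=subject_motion_emb)

    # L = E[ || eps - eps_theta(z_t, t, c) ||^2 ]
    return F.mse_loss(pred, noise)
```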
2. Dual-Embedding Semantic Comprehension Mechanism
The dual-embedding mechanism systematically disentangles subject and motion semantics:
- Input decomposition: Given a prompt like "<subject, motion>", the MLLM encoder outputs a joint feature that is split into a subject component ($e_s$) and a motion component ($e_m$).
- Residual enhancement & refinement: Each component is further augmented by a trainable residual ($r_s$, $r_m$) passed through a zero-convolution ($\mathcal{Z}$). The subject residual is randomly initialized to foster subject variation, while the motion residual is initialized from a phrase embedding (e.g., "a person claps") to ensure semantic grounding.
- Refiner module: An embedding refiner ($\mathcal{R}$) integrates the subject and motion representations, supporting bidirectional context.
The composite embedding is:
$$c = \mathcal{R}\big(\big[\,e_s + \mathcal{Z}(r_s);\ e_m + \mathcal{Z}(r_m)\,\big]\big).$$
This mechanism is crucial for avoiding semantic confusion—frequent in earlier approaches that use one embedding for both motion and subject—allowing for more accurate motion transfer across arbitrary subjects and supporting cross-domain generalization in both text-to-video (T2V) and image-to-video (I2V) settings.
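The following is a minimal PyTorch sketch of this dual-embedding pathway under the notation above; the zero-initialized linear layers (standing in for the zero-convolutions), the residual shapes, and the single transformer layer used as the refiner are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """Sketch: split an MLLM prompt feature into subject/motion parts, add learnable
    residuals through zero-initialized projections, and fuse them with a refiner.
    Shapes and module choices are illustrative assumptions."""

    def __init__(self, dim: int, num_tokens: int, motion_init: torch.Tensor):
        super().__init__()
        # Subject residual r_s: random init to encourage subject variation.
        self.r_s = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        # Motion residual r_m: initialized from a phrase embedding (e.g. "a person claps").
        self.r_m = nn.Parameter(motion_init.clone())
        # Zero-initialized linear layers play the role of the zero-convolutions Z(.).
        self.zero_s = nn.Linear(dim, dim)
        self.zero_m = nn.Linear(dim, dim)
        for layer in (self.zero_s, self.zero_m):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)
        # Refiner R(.): a small bidirectional transformer layer over both parts
        # (dim is assumed divisible by the number of heads).
        self.refiner = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, e_s: torch.Tensor, e_m: torch.Tensor) -> torch.Tensor:
        # e_s, e_m: (batch, num_tokens, dim) subject / motion features from the MLLM.
        e_s = e_s + self.zero_s(self.r_s)
        e_m = e_m + self.zero_m(self.r_m)
        # c = R([e_s + Z(r_s); e_m + Z(r_m)])
        return self.refiner(torch.cat([e_s, e_m], dim=1))
```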
3. Parameter-Efficient Visual Motion Adaptation
To address the observed limitations of semantic-only subject-motion transfer—especially inadequate motion fidelity and lack of temporal coherence—SynMotion employs trainable low-rank adapters within the video diffusion backbone:
- Adapter Structure: In every self-/cross-attention block, each adapted weight $W \in \mathbb{R}^{d \times k}$ is augmented as $W' = W + BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$. All other (original) model weights are frozen (a minimal code sketch appears at the end of this section).
- Role: These adapters allow effective, high-fidelity adaptation to the target motion while maintaining subject identity and global content, without catastrophic forgetting.
- Benefits: They support visually plausible amplitude, timing, and subtlety in motion sequences, yielding better performance in both dynamic and static subject rendering, particularly for rare or complex actions.
This parameter efficiency also permits rapid, scalable adaptation to new subject-motion pairs, supporting practical and industrial use cases.
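Below is a minimal low-rank adapter sketch consistent with the update rule $W' = W + BA$; the rank, scaling factor, and choice of which projections to wrap are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer W with a trainable low-rank update BA,
    so the effective weight is W' = W + (alpha / r) * B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the original weights
            p.requires_grad_(False)
        d_out, d_in = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A: r x k
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B: d x r, zero-init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank motion-adapter path.
        return self.base(x) + self.scale * (x @ self.A.t()) @ self.B.t()
```

In practice such a wrapper would typically be applied to the query/key/value and output projections of each attention block while the backbone weights stay frozen.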
4. Embedding-Specific Alternate Training with Subject Prior Videos
SynMotion introduces an alternately optimized embedding training strategy to balance subject generalization and motion specificity:
- Subject Prior Video (SPV) Dataset: A curated set of videos combining diverse subjects (animals, humans, objects) with generic motions. The SPV dataset is used to regularize the subject embedding and prevent overfitting to target-specific features.
- Alternate Training: During training, with probability $p$ the model is trained on target motion samples (updating both the subject and motion embedding residuals, $r_s$ and $r_m$), and with probability $1-p$ on SPV videos (updating only $r_s$, freezing $r_m$), as sketched after this list.
- Rationale: This division ensures that motion embeddings remain motion-specific, while subject embeddings maintain broader generalization capacity.
A plausible implication is that this mechanism prevents semantic collapse (where the model merges subject and motion roles) and supports robust transfer even for previously unseen or out-of-distribution subjects and motions.
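A schematic of one alternate optimization step is sketched below, reusing the residuals $r_s$ and $r_m$ from Section 2; the mixing probability, attribute names, and loss wrapper are hypothetical placeholders, not the authors' code.

```python
import random

def alternate_step(model, target_batch, spv_batch, p: float = 0.7):
    """One embedding-specific alternate training step (sketch).

    With probability p, train on a target-motion clip and update both the
    subject residual r_s and the motion residual r_m; otherwise train on a
    Subject Prior Video (SPV) clip and update only r_s, freezing r_m.
    The probability value and attribute names are illustrative assumptions.
    """
    use_target = random.random() < p
    batch = target_batch if use_target else spv_batch

    # Freeze the motion residual on SPV steps so it stays motion-specific.
    model.dual_embedding.r_m.requires_grad_(use_target)

    loss = model.denoising_loss(batch)  # hypothetical wrapper around the diffusion loss
    loss.backward()
    return loss
```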
5. MotionBench: A Benchmark for Motion-Customized Video Generation
MotionBench is a rigorously curated dataset and protocol developed to standardize evaluation for motion-customized video generation:
- Content: Encompasses 16 challenging motion categories, each paired with 6–10 diverse real-world videos. Motions were vetted for difficulty by verifying that existing SOTA models (e.g., HunyuanVideo) fail on them, ensuring a genuine evaluation challenge.
- Prompt Structure: All queries have the format "<subject, motion>", supporting compositional testing (subject-motion recombination).
- Purpose: Provides a robust testbed for cross-subject generalization, motion transfer specificity, motion consistency, and visual quality across both T2V and I2V paradigms.
This benchmark is necessary because conventional evaluation datasets did not adequately capture the complexity or compositional requirements of flexible motion transfer.
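As an illustration of the protocol, a MotionBench-style entry and its compositional prompts could be represented as follows; the field names and example values are hypothetical, inferred only from the "<subject, motion>" format described above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MotionBenchEntry:
    """Hypothetical record for one motion category in a MotionBench-style protocol."""
    motion: str                  # e.g. "claps hands"
    reference_videos: List[str]  # 6-10 real-world clips demonstrating the motion
    test_subjects: List[str]     # subjects used for compositional recombination

def compose_prompts(entry: MotionBenchEntry) -> List[str]:
    """Build "<subject, motion>"-style prompts for cross-subject evaluation."""
    return [f"<{subject}, {entry.motion}>" for subject in entry.test_subjects]

# Example with illustrative values only:
entry = MotionBenchEntry(
    motion="claps hands",
    reference_videos=["clip_001.mp4", "clip_002.mp4"],
    test_subjects=["a panda", "a robot", "an astronaut"],
)
print(compose_prompts(entry))
```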
6. Experimental Results and Outperformance
Extensive experiments on MotionBench and in generalized I2V settings demonstrate that SynMotion achieves state-of-the-art (SOTA) results:
- Motion Accuracy: 68.60% (T2V, MotionBench), significantly higher than all baselines.
- Subject Accuracy and Consistency: 97.67% and 98.28% (I2V), surpassing other adaptive and subject-centric video models.
- Dynamic Degree and Imaging Quality: 88.24%, 69.47% (T2V), showing high-quality, temporally coherent outputs.
- Aesthetics and Generalization: Outperforms models like MotionInversion, DreamBooth, and CogVideoX-I2V in both metrics and generalization, including rare cross-domain subject-motion pairs.
Ablation experiments confirm that each module (dual-embedding, refiner, adapter, alternate training) substantially contributes to overall performance. The model is robust to out-of-distribution queries and supports direct I2V motion transfer, generating dynamic, realistic motion sequences from a single template image.
7. Significance and Position within the Field
SynMotion establishes a comprehensive standard and methodology for motion-specific video synthesis:
| Contribution | Description | Impact |
|---|---|---|
| Dual-embedding decomposition | Disentangles motion and subject semantics | Supports precise transfer and generality |
| Visual adapters | Lightweight, parameter-efficient visual fine-tuning | Enhances fidelity/coherence, fast tuning |
| Alternate embedding training | Regularizes and balances subject-motion specificity | Prevents semantic/visual interference |
| MotionBench | Standardized, realistic, and challenging benchmark | Supports reproducible evaluation |
| SOTA T2V and I2V performance | Quantitative and qualitative advances over baselines | Sets new research/industry benchmark |
This work provides a rigorous and scalable approach for composing, transferring, and evaluating arbitrary subject-motion pairs in video synthesis, and positions semantic-visual adaptation as a key paradigm in motion-customized generative video models.