Semantic Motion Classes for Video Synthesis
- Semantic motion classes are structured latent representations that capture high-level motion intents (e.g., 'wave', 'punch') and facilitate cross-domain transfer.
- A dual-embedding mechanism disentangles subject identity from motion semantics, enabling precise motion transfer and adaptable video synthesis.
- The training process integrates exemplar-based optimization with visual adapters to enhance fidelity, diversity, and compositional motion generation.
Semantic motion classes are structured, often latent, representations that capture the meaningful, transferable essence of motion events or actions across subjects, objects, or domains. They formalize the concept that high-level motion semantics—such as "clap", "wave", or "punch"—can be learned, encoded, and applied to diverse video generation and analysis tasks. In the context of motion-customized video generation, semantic motion classes enable precise motion transfer and synthesis by establishing clear, disentangled boundaries between the semantics of "what is moving" (the subject) and "how it moves" (the motion class), a distinction critical for generalization and fidelity in generative models.
1. Formalization and Role of Semantic Motion Classes
In SynMotion (Tan et al., 30 Jun 2025), semantic motion classes are implemented as motion-specific embeddings that are learned from a combination of a few exemplar videos and their associated textual prompts. These classes function as structured latent vectors, encapsulating motion concepts at a level that is transferable beyond the training set.
Given a text prompt that pairs a subject with a motion (for example, "a cat dances"), a Multimodal LLM (MLLM) encodes the prompt into a joint semantic embedding. This embedding is then decomposed to represent, separately, the intended motion class and the subject identity, providing the foundation for downstream generation and adaptation.
Semantic motion classes serve three main purposes in this context:
- Enabling transferability: Motions such as "clap" can be learned from one subject (e.g., a human) and transferred to others (e.g., a dog).
- Supporting compositionality: Classes can be recombined arbitrarily with diverse subject representations to generate novel, semantically consistent videos.
- Grounding visual complexity: By associating these classes with learnable embeddings, the system anchors high-level motion intent in precise, model-driven patterns that inform the generative process.
2. Dual-Embedding Mechanism for Disentanglement
A central technical innovation is the dual-embedding semantic comprehension mechanism, which prevents interference and entanglement between subject and motion semantics. This mechanism involves:
- Decomposition: The semantic representation from the MLLM is split into:
  - Subject embedding ($e_{\text{sub}}$): encodes subject identity.
  - Motion embedding ($e_{\text{mot}}$): encodes the semantic motion class.
- Zero-Conv residual augmentation: Both embeddings are augmented with learnable residuals ($\Delta_{\text{sub}}$, $\Delta_{\text{mot}}$) produced by zero-initialized layers, allowing the model to specialize:
  - $e_{\text{mot}}$: motion latent, initialized from the phrase embedding.
  - $e_{\text{sub}}$: subject latent, randomly initialized to promote generalization.
- Fusion and refinement: An Embedding Refiner module $\mathcal{R}$ processes the concatenated, augmented embeddings and produces refined subject and motion embeddings, $(\tilde{e}_{\text{sub}}, \tilde{e}_{\text{mot}}) = \mathcal{R}([e_{\text{sub}} + \Delta_{\text{sub}};\ e_{\text{mot}} + \Delta_{\text{mot}}])$.
This strategy effectively disentangles subject and motion, ensuring that the model can specialize embeddings for complex motion classes (e.g., multi-limb, high-frequency actions) while maintaining broad subject coverage for generalization.
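A minimal PyTorch sketch of the dual-embedding pattern is given below. The layer sizes, the use of zero-initialized `nn.Linear` projections as the "zero-conv" residuals, and the two-layer refiner are illustrative assumptions, not SynMotion's actual architecture; the sketch only demonstrates the split, augment, and refine steps described above.

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """Illustrative dual-embedding module: takes the subject and motion embeddings
    from the MLLM decomposition, adds zero-initialized learnable residuals, and
    fuses them with a small refiner. Dimensions and layer choices are assumptions."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Zero-initialized projections ("zero-conv" style): they output 0 at the
        # start of training, so optimization begins from the raw MLLM embeddings.
        self.res_subject = nn.Linear(dim, dim)
        self.res_motion = nn.Linear(dim, dim)
        for layer in (self.res_subject, self.res_motion):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)
        # Embedding Refiner R: processes the concatenated, augmented embeddings.
        self.refiner = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim),
            nn.GELU(),
            nn.Linear(2 * dim, 2 * dim),
        )

    def forward(self, e_sub: torch.Tensor, e_mot: torch.Tensor):
        # Augment each embedding with its learnable residual.
        e_sub_aug = e_sub + self.res_subject(e_sub)
        e_mot_aug = e_mot + self.res_motion(e_mot)
        # Concatenate, refine, and split back into refined subject/motion embeddings.
        fused = self.refiner(torch.cat([e_sub_aug, e_mot_aug], dim=-1))
        e_sub_ref, e_mot_ref = fused.chunk(2, dim=-1)
        return e_sub_ref, e_mot_ref

# Toy usage with placeholder MLLM outputs (batch of 2, dim 1024).
e_sub = torch.randn(2, 1024)   # subject embedding from the MLLM decomposition
e_mot = torch.randn(2, 1024)   # motion embedding from the MLLM decomposition
refined_sub, refined_mot = DualEmbedding()(e_sub, e_mot)
```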
3. Learning Customized Motion Features
Customized motion features are acquired using an alternating, embedding-specific training scheme:
- Initialization: The motion embedding $e_{\text{mot}}$ is grounded in the phrase embedding; the subject embedding $e_{\text{sub}}$ is noise-initialized.
- Alternating optimization:
- With user-provided exemplars, both subject and motion embeddings are jointly updated to fit the desired semantics.
- With auxiliary Subject Prior Videos (SPV, a manually constructed training set of diverse subjects performing common motions), only the subject embedding is updated while the motion embedding is frozen. This ensures that motion embeddings remain motion-specific, while subject embeddings become generalized.
- Sampling strategy: A mixing-ratio hyperparameter regulates the proportion of customization exemplars versus SPV samples in each batch, controlling the balance between specificity and generalization.
This regime allows the motion embeddings to acquire fine-grained, motion-specific attributes while avoiding overfitting to particular subjects.
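The toy loop below illustrates this regime under stated assumptions: the backbone, data, and loss are replaced by stand-ins, and the mixing ratio of 0.5 is a placeholder rather than SynMotion's setting. What it demonstrates is that the motion embedding receives gradients only on exemplar batches, while SPV batches update the subject embedding alone.

```python
import random
import torch

# Stand-ins for components defined elsewhere (assumptions for illustration).
dim = 1024
phrase_embedding = torch.randn(dim)       # text embedding of the motion phrase
backbone = torch.nn.Linear(2 * dim, dim)  # toy proxy for the video model (not optimized here)

def diffusion_loss(batch, e_sub, e_mot):
    # Placeholder objective that conditions the toy backbone on both embeddings.
    cond = torch.cat([e_sub, e_mot], dim=-1)
    return ((backbone(cond) - batch) ** 2).mean()

e_mot = torch.nn.Parameter(phrase_embedding.clone())  # grounded in the phrase embedding
e_sub = torch.nn.Parameter(torch.randn(dim) * 0.02)   # noise-initialized subject latent
optimizer = torch.optim.AdamW([e_mot, e_sub], lr=1e-4)

mix_ratio = 0.5  # assumed fraction of steps drawn from Subject Prior Videos (SPV)

for step in range(100):
    use_spv = random.random() < mix_ratio
    batch = torch.randn(dim)  # stand-in for an SPV or customization exemplar batch

    # On SPV batches only the subject embedding adapts; the motion embedding is
    # frozen (no gradient) so it stays motion-specific.
    e_mot.requires_grad_(not use_spv)

    loss = diffusion_loss(batch, e_sub, e_mot)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```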
4. Integration of Semantic Guidance with Visual Adaptation
The semantic embeddings act as high-level guidance, but visual fidelity is enhanced through a dedicated visual adaptation module:
- Motion-Aware Adapter: Each attention layer in the video generation backbone (MM-DiT) is augmented with a parameter-efficient, low-rank adaptation module, $W'_p = W_p + B_p A_p$, where $p \in \{Q, K, V\}$ indexes the query/key/value projections, and $A_p \in \mathbb{R}^{r \times d}$, $B_p \in \mathbb{R}^{d \times r}$ with rank $r \ll d$. This enables the model to encode complex motion details in the visual stream without altering the generative distribution of the underlying backbone (see the adapter sketch after this list).
- Pipeline: The combined embedding is injected into the denoising diffusion process alongside the motion-aware adapted features, ensuring that generation is conditioned on both semantic intent (motion class, subject) and visually accurate motion cues, promoting both motion fidelity and subject diversity.
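A minimal sketch of a LoRA-style adapter wrapped around the query/key/value projections of an attention block follows, assuming a standard low-rank parameterization with a zero-initialized up-projection; dimensions, rank, and module names are illustrative and not taken from the SynMotion implementation.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Illustrative low-rank adapter for one attention projection:
    W' x = W x + B A x with rank r << d. B is zero-initialized so the adapted
    layer starts out identical to the frozen backbone projection."""

    def __init__(self, base_proj: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base_proj
        for p in self.base.parameters():
            p.requires_grad_(False)  # backbone weights stay frozen
        d_in, d_out = base_proj.in_features, base_proj.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # zero-init up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.t() @ self.B.t()

# Wrapping the Q/K/V projections of a toy attention block.
d = 512
attn_projs = {p: nn.Linear(d, d) for p in ("q", "k", "v")}
adapted = {p: LowRankAdapter(proj, rank=8) for p, proj in attn_projs.items()}

x = torch.randn(2, 16, d)   # (batch, tokens, dim)
q = adapted["q"](x)         # low-rank-adapted query projection
```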
5. Mathematical Characterization and Objective
The entire training process is driven by a denoising diffusion objective, $\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\big]$, where $c$ denotes the fused semantic embeddings (motion class and subject). Adapter updates and embedding refinement are incorporated in both the forward and backward passes, maintaining separation of concerns and supporting stable optimization.
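A compact sketch of this objective, assuming the standard noise-prediction formulation, is given below; the noise predictor's call signature and the toy inputs are placeholders for illustration.

```python
import torch

def diffusion_training_loss(eps_model, x0, cond, alphas_cumprod):
    """Denoising objective sketch: predict the added noise from the noised
    latent x_t, the timestep t, and the fused semantic conditioning `cond`."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_t = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps   # forward noising
    eps_pred = eps_model(x_t, t, cond)               # conditioned on c
    return torch.mean((eps - eps_pred) ** 2)

# Toy usage with a stand-in noise predictor that ignores t and cond.
x0 = torch.randn(2, 4, 8, 8)                       # placeholder video latents
cond = torch.randn(2, 1024)                        # fused semantic embeddings
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
loss = diffusion_training_loss(lambda x_t, t, c: torch.zeros_like(x_t), x0, cond, alphas_cumprod)
```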
6. Benchmarking and Evaluation with MotionBench
MotionBench is a standardized benchmark constructed to rigorously evaluate the capacity of video generation systems to learn and synthesize semantic motion classes. It features:
- 16 motion categories with 6–10 exemplar videos per class, spanning humans, animals, and objects.
- Dual roles: Used as both a training (for SPV) and an evaluation set for text-to-video (T2V) and image-to-video (I2V) setups.
- Metrics: Evaluation covers motion accuracy/consistency (does the generated behavior match the target motion class?), subject accuracy/consistency (is the requested subject preserved?), and dynamic degree/background consistency (quality of motion and absence of spurious dynamics or background disruption); illustrative metric proxies are sketched after this list.
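As a rough illustration of how such metrics can be computed, the sketch below uses two simple proxies: mean cosine similarity between consecutive frame embeddings for subject/background consistency, and classification accuracy from an action classifier for motion accuracy. These are illustrative proxies, not MotionBench's exact metric definitions, and the encoder/classifier outputs are random stand-ins.

```python
import torch
import torch.nn.functional as F

def consistency_score(frame_embeds: torch.Tensor) -> float:
    """Mean cosine similarity between consecutive frame embeddings.
    `frame_embeds` is (num_frames, dim), e.g. from any image encoder."""
    f = F.normalize(frame_embeds, dim=-1)
    return (f[:-1] * f[1:]).sum(-1).mean().item()

def motion_accuracy(pred_logits: torch.Tensor, target_class: int) -> float:
    """Fraction of generated videos whose action-classifier prediction
    matches the target motion class. `pred_logits` is (num_videos, num_classes)."""
    return (pred_logits.argmax(dim=-1) == target_class).float().mean().item()

# Toy usage with random stand-ins for encoder and classifier outputs.
print(consistency_score(torch.randn(16, 512)))               # per-video consistency
print(motion_accuracy(torch.randn(8, 16), target_class=3))   # accuracy over 8 videos
```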
Empirically, SynMotion demonstrates improved fidelity, diversity, and semantic alignment over baselines, particularly in transferring complex motions to novel and visually diverse subjects.
Summary Table: SynMotion's Semantic Motion Class Framework
| Component | Role | Mathematical Detail |
|---|---|---|
| Semantic Motion Classes | Transferable latent motion concepts | Motion embedding $e_{\text{mot}}$, prompt-aware initialization |
| Dual-Embedding Mechanism | Disentangles subject and motion semantics | $e_{\text{sub}}$, $e_{\text{mot}}$ with zero-initialized residuals $\Delta_{\text{sub}}$, $\Delta_{\text{mot}}$ |
| Embedding Refiner | Fuses subject and motion for joint alignment | $(\tilde{e}_{\text{sub}}, \tilde{e}_{\text{mot}}) = \mathcal{R}([e_{\text{sub}} + \Delta_{\text{sub}};\ e_{\text{mot}} + \Delta_{\text{mot}}])$ |
| Visual Motion Adapter | Parameter-efficient visual adaptation for motion fidelity | $W'_p = W_p + B_p A_p$, rank $r \ll d$ |
| Embedding-Specific Training | Alternates between motion specialization and subject generalization | Mixing-ratio-controlled sampling, SPV regularization |
| MotionBench | Validates motion class transfer, subject diversity, fidelity | 16 motions × 6–10 exemplars per class |
7. Impact and Significance
The introduction of deeply structured, disentangled semantic motion classes in SynMotion establishes a new paradigm for motion transfer in video generation. By simultaneously supporting subject generalization and motion specificity, the model achieves controllable synthesis across arbitrary subject-motion combinations, facilitating applications in creative media, content editing, and AI-driven video design. The rigorous evaluation on MotionBench offers strong empirical evidence that explicit semantic motion class separation—implemented as dual embeddings and supported by efficient adapters and regularized training—is essential for state-of-the-art performance in customizable video motion generation.