Motion Transfer: Techniques and Applications

Updated 3 July 2026

Motion transfer is the process of extracting motion patterns from a source video and applying them to a target subject while preserving inherent appearance characteristics.
Modern systems leverage attention-based motion fields, optical flow, and dense feature correspondences to ensure smooth temporal dynamics and context-aware adaptation.
Integrated into video synthesis pipelines, motion transfer enables precise motion replication and advanced applications in animation and data augmentation.

Motion transfer (MT) extracts a motion pattern from a reference video or other input, decouples it from the source’s appearance, and reapplies it, potentially with adaptation, to generate a new video or sequence in which a target subject or object performs that motion while preserving its own appearance. MT is a foundational capability in controllable video synthesis, video-driven animation, and advanced data augmentation pipelines. Modern MT systems use explicit or implicit motion representations, attention mechanisms, and content-adaptive mappings to transfer complex, context-dependent motions across substantial semantic gaps.

1. Fundamental Principles of Motion Transfer

The central goal of motion transfer is to apply the temporal dynamics of a reference (the “source”)—such as the joint motion in a human dance clip or articulated object deformation—to a target (often distinct in semantics, appearance, or structure), producing new content whose appearance is governed by the target but whose dynamics follow the reference motion. Key principles are:

Motion–Appearance Disentanglement: Strict separation of motion (what changes over time) from static or slowly varying appearance (color, geometry, texture).
Content-Aware Adaptation: The transferred motion must be semantically compatible with the target; direct application of reference motion fields can result in geometric or semantic artifacts if not adapted to the target’s structure.
Temporal Consistency: Generated sequences must exhibit temporally smooth, physically plausible motion trajectories to avoid flicker or other artifacts.
Editability and Compositionality: Explicit representations such as attention-derived motion fields, warps, or velocity fields enable targeted editing, mixing, or inversion of motion patterns.

In recent text-to-video diffusion transformer frameworks, these principles are operationalized by directly manipulating cross-frame attention distributions, warping motion fields via learned correspondences, or constructing explicit velocity priors in a unified latent space (Zhang et al., 5 Jan 2026).

2. Motion Extraction and Representation

State-of-the-art MT frameworks adopt various strategies to extract motion signals:

Attention-Based Motion Fields: In transformer-based diffusion models, 3D spatio-temporal self-attention maps are parsed by slicing out the cross-frame subblocks (excluding intra-frame components tied to appearance). For a given attention tensor $A^{(\ell)}_{n\to m}$ at layer $\ell$, the principal motion per spatial position is computed as the vector from a pixel’s location in one frame to the attention-weighted mean location in another frame. Collecting these for all pixels yields a dense 2D motion field $M^{(\ell)}_{n\to m}$ (Zhang et al., 5 Jan 2026).
Optical Flow and Patch Trajectories: Classical and recent optical-flow models (e.g., GMFlow) are used to build framewise displacement maps, which are then downsampled to the latent space. Patchwise trajectories are constructed by integrating these flows, producing global or per-object motion priors (Teodoro et al., 1 Apr 2026).
Dense Correspondence and Shape Warping: For objects with shape or structure variations across domains, motion transfer involves first aligning object parts semantically or morphologically using feature-matching (via DINO features, U-Net embeddings), followed by explicit warps such as Thin Plate Spline (TPS) mappings (Liu et al., 22 Jul 2025).
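
The attention-based extraction described in this list can be illustrated with a minimal NumPy sketch. It assumes a single cross-frame attention block has already been sliced out and flattened to a (positions × positions) matrix in row-major order; the function name `attention_motion_field` and the toy attention matrix are hypothetical and not taken from the cited work.

```python
import numpy as np


def attention_motion_field(attn, height, width):
    """Convert one cross-frame attention block into a dense 2D motion field.

    attn: (height*width, height*width) non-negative weights; row i holds the
          attention from position i in frame n to every position in frame m.
    Returns an array of shape (height, width, 2) with (dx, dy) per position.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Row-major (x, y) coordinates of every spatial position.
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    # Normalise rows so each query's weights sum to one.
    weights = attn / attn.sum(axis=1, keepdims=True)
    # Attention-weighted mean location in frame m for each query in frame n.
    expected = weights @ coords
    # Motion is the vector from the query location to that mean location.
    return (expected - coords).reshape(height, width, 2)


# Toy input: every position attends only to its right-hand neighbour
# (clamped at the image border), i.e. a one-pixel shift to the right.
H, W = 3, 4
attn = np.zeros((H * W, H * W))
for y in range(H):
    for x in range(W):
        attn[y * W + x, y * W + min(x + 1, W - 1)] = 1.0
field = attention_motion_field(attn, H, W)
```

In this toy case the recovered field points one pixel to the right everywhere except the last column, where the clamped attention yields zero motion. Real attention maps are soft, so the mean location is a blend of several candidate positions.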

The choice of layers and diffusion time steps for attention-based motion extraction matters: empirically, late layers combined with early diffusion steps often align best with ground-truth flow (Zhang et al., 5 Jan 2026).

3. Content-Aware Motion Adaptation

Directly applying reference motion to a target with different semantic content often produces geometric or semantic artifacts. Motion transfer frameworks therefore employ content-aware customization:

Foreground–Background Decomposition: Segmentation masks (e.g., via Lang-SAM) allow separating the reference motion field into foreground (object) and background components, which can be adapted independently to better suit the target's geometry.
Dense Feature Correspondence: DINO feature extractors (or other deep visual backbones) are used to build high-dimensional feature maps of reference and target frames. Spatial correspondences are then established (e.g., by nearest-neighbor or via the Hungarian algorithm for bijective assignment), permitting finer-grained warping of the reference motion field to the target (Zhang et al., 5 Jan 2026, Liu et al., 22 Jul 2025).
Morphological Adaptation: Global similarity transforms and TPS warping are used to retarget and deform reference motions wherever necessary to match target part proportions or topology (Liu et al., 22 Jul 2025).
Smoothing and Inpainting: Post-warp inpainting (to handle undefined or occluded regions) and Gaussian or nearest-neighbor smoothing alleviate spatial artifacts introduced during the adaptation process.

The customization pipeline can be formalized as $M_\mathrm{final}\,=\,f_\mathrm{custom}(M_\mathrm{ref},\,C)$, where $M_\mathrm{ref}$ is the reference motion field and $C$ is the dense correspondence field between reference and target.
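
As a rough illustration of the correspondence step, the sketch below builds $C$ by cosine nearest-neighbor matching of per-location features and then gathers reference motion vectors for each target location. It covers only that step; segmentation, TPS warping, inpainting, and smoothing are omitted. The function names and the random stand-in features are hypothetical, and a real pipeline would take the features from a backbone such as DINO.

```python
import numpy as np


def nearest_neighbor_correspondence(feat_tgt, feat_ref, eps=1e-8):
    """For each target location, return the index of the most similar
    reference location under cosine similarity.

    feat_tgt: (N_tgt, D) target features; feat_ref: (N_ref, D) reference features.
    """
    t = feat_tgt / (np.linalg.norm(feat_tgt, axis=1, keepdims=True) + eps)
    r = feat_ref / (np.linalg.norm(feat_ref, axis=1, keepdims=True) + eps)
    return np.argmax(t @ r.T, axis=1)


def customize_motion(motion_ref, correspondence):
    """Gather step of f_custom: M_final[i] = M_ref[C[i]].

    motion_ref: (N_ref, 2) flattened reference motion field.
    correspondence: (N_tgt,) indices into the reference locations.
    """
    return motion_ref[correspondence]


# Stand-in features: the target is a rescaled permutation of the reference.
rng = np.random.default_rng(0)
feat_ref = rng.normal(size=(6, 16))
perm = np.array([3, 0, 5, 1, 4, 2])
feat_tgt = 2.0 * feat_ref[perm]
motion_ref = np.arange(12, dtype=np.float64).reshape(6, 2)

C = nearest_neighbor_correspondence(feat_tgt, feat_ref)
motion_final = customize_motion(motion_ref, C)
```

Nearest-neighbor matching can map several target locations to the same reference location; the bijective alternative mentioned above (Hungarian assignment) trades that flexibility for one-to-one matches.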

4. Integration into Video Generation and Diffusion Models

Motion-adapted fields or priors are injected into the generative process via additional constraint terms or guidance steps in the diffusion loop. Notable formulations include:

Gradient-Based Motion Loss: At each (early) diffusion step $t$ in the DiT denoising process, an extra motion discrepancy term is minimized:

$z_t^* = z_t - \eta\,\nabla_{z_t}\,\|\,M_\mathrm{tgt}(z_t) - M_\mathrm{final}\|_2^2$

where $M_\mathrm{tgt}(z_t)$ is the motion derived from the current latent, $M_\mathrm{final}$ is the customized motion field, and $\eta$ is the guidance step size. The updated latent $z_t^*$ is then used for the subsequent denoising step (Zhang et al., 5 Jan 2026).

Temporal Attention Guidance: In diffusion U-Nets, temporal attention maps can be sparsified and enforced via auxiliary energy terms that penalize deviation from the reference attention structure, ensuring the transferred motion is realized at the attention level (Liu et al., 22 Jul 2025).
Explicit Flow-Based Priors: Motion priors derived from source optical flow are imposed as soft constraints on the model’s own prediction of motion, using loss terms that penalize the discrepancy between the flow-derived prior and DISP, the attention-weighted displacement between patches in successive frames (Teodoro et al., 1 Apr 2026).

Attention Masking for Multi-Object Multi-Motion: For compositional scenarios, object-specific motion and text tokens are constrained to their spatial regions in the transformer via mask-based softmax gating of attention connections, enforced by custom mask-propagation mechanisms (Li et al., 1 Mar 2026).
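
A minimal sketch of mask-based softmax gating is given below. It assumes each spatial token carries an integer object id and each motion or text token is either bound to one object id or marked as shared with `-1`. The names `region_mask` and `masked_softmax` and the `-1` convention are hypothetical; the mask-propagation mechanism of the cited work is not reproduced here.

```python
import numpy as np


def region_mask(query_region, key_object):
    """Boolean (Q, K) mask: a spatial query may attend to a conditioning
    token if the token belongs to the same object or is shared (-1)."""
    same = query_region[:, None] == key_object[None, :]
    shared = (key_object == -1)[None, :]
    return same | shared


def masked_softmax(scores, allowed):
    """Row-wise softmax over attention logits with disallowed links removed.

    scores: (Q, K) logits; allowed: (Q, K) boolean mask.
    """
    if not allowed.any(axis=1).all():
        raise ValueError("every query needs at least one allowed key")
    masked = np.where(allowed, scores, -np.inf)
    # Subtract the row maximum for numerical stability.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)  # exp(-inf) evaluates to 0 for gated links
    return weights / weights.sum(axis=1, keepdims=True)


# Three spatial tokens (two in object 0, one in object 1) and three
# conditioning tokens (object 0, object 1, shared).
query_region = np.array([0, 0, 1])
key_object = np.array([0, 1, -1])
allowed = region_mask(query_region, key_object)
weights = masked_softmax(np.zeros((3, 3)), allowed)
```

With uniform logits, each query splits its attention evenly between its own object's token and the shared token, and assigns zero weight to the other object's token.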

These mechanisms operate without modifying the pretrained transformer backbone parameters; instead, they adapt latents or attention masks at inference.
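
The latent-adaptation idea behind the gradient-based motion loss can be sketched with a toy example. It assumes a linear stand-in $M_\mathrm{tgt}(z) = Wz$ for the motion extractor so that the gradient has a closed form; in an actual DiT the motion would be read from attention maps and the gradient obtained by automatic differentiation. The matrix `W`, the function names, and the step size are illustrative assumptions only.

```python
import numpy as np


def motion_loss(z, W, motion_final):
    """Squared L2 discrepancy between the toy motion W @ z and M_final."""
    return float(np.sum((W @ z - motion_final) ** 2))


def guided_latent(z_t, W, motion_final, eta):
    """One guidance update: z_t - eta * grad_z ||W z_t - M_final||_2^2.

    For the linear stand-in extractor the gradient is 2 * W^T (W z_t - M_final).
    Only the latent changes; the (toy) model parameters W stay fixed.
    """
    residual = W @ z_t - motion_final
    grad = 2.0 * (W.T @ residual)
    return z_t - eta * grad


W = np.eye(3)               # toy motion extractor (assumption)
z_t = np.zeros(3)           # current latent
motion_final = np.ones(3)   # customized target motion
z_star = guided_latent(z_t, W, motion_final, eta=0.25)

loss_before = motion_loss(z_t, W, motion_final)
loss_after = motion_loss(z_star, W, motion_final)
```

In this toy setting a single step moves the latent halfway toward the target motion and lowers the loss. In practice the step size and the number of guided diffusion steps are tuning choices.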

5. Disentanglement and Appearance Preservation

Preserving appearance while transferring only motion is nontrivial, particularly in transformer-based and latent-diffusion settings. Frameworks ensure appearance–motion disentanglement by:

Extracting motion exclusively from cross-frame attention (not intra-frame), omitting or zeroing appearance tokens during motion field computation (Zhang et al., 5 Jan 2026).
Regularizing the motion learning process to prevent leakage of appearance cues, for example with appearance injection modules that isolate spatial (appearance) encoding from temporal (motion) pathways (Li et al., 2024).
Employing separate learning phases: appearance modeling (driven by expanded prompts or enhanced spatial attention) and motion modeling (conditioned on motion-specific representations, with appearance fixed) (Li et al., 2024).
Applying cycle-consistency, global and patch-based adversarial, and perceptual losses to further enforce that synthesized outputs maintain target appearance while correctly integrating the transferred motion (Xu et al., 2022).

Maintaining this disentanglement is critical for high motion fidelity without appearance drift or “source-bleed.”

6. Evaluation Protocols and Empirical Results

Recent MT benchmarks evaluate:

Motion Fidelity (MF): Quantitative alignment of generated and reference motion trajectories (often via CLIP-based comparisons, optical flow matching, or trajectory distances).
Temporal Consistency: Smoothness of the generated motion, measured for example with CoTracker or intra-frame LPIPS.
Semantic/Textual Alignment: Via CLIPScore or prompt-image similarity, ensuring that the generated sequence content matches the target prompt.
Ablation and User Studies: Removal of any module (motion extraction, adaptation/refinement, attention masking) in content-aware pipelines leads to consistent drops in MF and user preference, underscoring each module’s necessity (Zhang et al., 5 Jan 2026, Liu et al., 22 Jul 2025, Teodoro et al., 1 Apr 2026).
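
To make the idea of a trajectory-based motion fidelity score concrete, the sketch below compares per-step displacements of tracked points in a generated and a reference clip using cosine similarity. This is one simple choice made for illustration and is not the definition used by any of the cited benchmarks; the function name and the synthetic tracks are hypothetical, and real tracks would come from a point tracker such as CoTracker.

```python
import numpy as np


def trajectory_fidelity(traj_gen, traj_ref, eps=1e-8):
    """Mean cosine similarity between per-step displacement vectors.

    traj_gen, traj_ref: (T, P, 2) arrays of P tracked points over T frames.
    Returns a value near 1 for matching motion directions and near -1 for
    opposite directions. Absolute position offsets are ignored.
    """
    d_gen = np.diff(traj_gen, axis=0)  # (T-1, P, 2) frame-to-frame motion
    d_ref = np.diff(traj_ref, axis=0)
    dot = (d_gen * d_ref).sum(axis=-1)
    norms = np.linalg.norm(d_gen, axis=-1) * np.linalg.norm(d_ref, axis=-1)
    return float(np.mean(dot / (norms + eps)))


# Synthetic tracks: three points moving one pixel to the right per frame.
t = np.arange(5, dtype=np.float64)
single = np.stack([t, np.zeros_like(t)], axis=-1)          # (5, 2)
traj_ref = np.repeat(single[:, None, :], 3, axis=1)        # (5, 3, 2)

same_motion = trajectory_fidelity(traj_ref + np.array([10.0, 5.0]), traj_ref)
reversed_motion = trajectory_fidelity(traj_ref[::-1], traj_ref)
```

Because the score depends only on displacements, a generated subject that sits elsewhere in the frame but moves the same way still scores close to 1, which matches the goal of comparing motion rather than appearance or position.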

Comparative experiments on multi-prompt, multi-difficulty benchmarks (e.g., DAVIS, MTBench) establish that explicit, content-aware, attention-based pipelines such as MotionAdapter (Zhang et al., 5 Jan 2026), MotionShot (Liu et al., 22 Jul 2025), and MotionGrounder (Teodoro et al., 1 Apr 2026) produce state-of-the-art results in both standard and challenging semantic-gap scenarios.

7. Extensions, Limitations, and Future Directions

Extensions:

Complex Editing: Direct manipulation of explicit motion fields enables operators such as scaling (for zoom), multi-reference merging (blend/fusion of multiple reference motions), and robust handling of occlusions (Zhang et al., 5 Jan 2026).
Multi-Object, Multi-Motion Transfer: MDMA-style attention masking and mask-propagation allow compositional animation of scenes containing independent, simultaneously-moving objects (Li et al., 1 Mar 2026, Teodoro et al., 1 Apr 2026).
Application-Specific Adaptations: Variants have been developed for biometrics (Huang et al., 2024), physiological video augmentation (Paruchuri et al., 2023), frequency-domain stabilization (Yang et al., 2022), and cross-domain category transfer (e.g., animal motion with habit preservation (Zhang et al., 10 Jul 2025)).
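
Because the motion field is explicit, simple edits reduce to array arithmetic. The sketch below takes the most literal reading of two operators named above: scaling as multiplication of displacement vectors, and multi-reference merging as a normalized weighted average. How the cited work implements these operators is not specified here, so the function names and both choices are assumptions.

```python
import numpy as np


def scale_motion(field, factor):
    """Amplify or attenuate every displacement vector by a constant factor."""
    return factor * field


def blend_motions(fields, weights):
    """Weighted average of several motion fields of identical shape.

    fields: sequence of (H, W, 2) arrays; weights: one non-negative weight per
    field, normalised here so that they sum to one.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    # Contract the reference axis: (R,) x (R, H, W, 2) -> (H, W, 2).
    return np.tensordot(w, np.stack(fields), axes=1)


rightward = np.zeros((2, 2, 2))
rightward[..., 0] = 1.0          # one pixel to the right everywhere
downward = np.zeros((2, 2, 2))
downward[..., 1] = 1.0           # one pixel down everywhere

doubled = scale_motion(rightward, 2.0)
diagonal = blend_motions([rightward, downward], weights=[1.0, 1.0])
```

An equal-weight blend of the two toy fields yields a diagonal motion of half a pixel along each axis, and scaling by 2 doubles the rightward displacement.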

Limitations:

Feature Correspondence Accuracy: Reliance on DINO or other feature-based matching introduces sensitivity to failure modes in correspondence.
Segmentation Mask Quality: Imperfect foreground–background separation can degrade adaptation, especially in occlusion-rich scenes.
Long-Term Coherence: Current systems often target short clips; scaling to high-resolution, long-duration, or highly nonrigid motion sequences remains an open challenge.

Future directions include learned correspondence modules, joint mask and motion optimization, hierarchical or windowed diffusion to support long videos, domain generalization, and integration of 3D structure or affordance priors to improve cross-category robustness (Zhang et al., 5 Jan 2026, Li et al., 2024, Teodoro et al., 1 Apr 2026).

Key Citations: