MotionPair-60K: Synthetic Motion Transfer Dataset

Updated 1 July 2026

MotionPair-60K is a large-scale, synthetic dataset designed to enable end-to-end motion transfer for character animation and environment replacement.
It uses an agentic synthetic-loop generation process to create paired video sequences, eliminating the need for explicit pose extraction and background masking.
The dataset supports unified learning of cross-identity motion imitation and environment transfer, with performance validated using external real-video benchmarks.

MotionPair-60K is a large-scale, synthetic, end-to-end motion-transfer dataset designed for training and evaluating deep models in heterogeneous character animation and environment replacement tasks. Developed in the context of SCAIL-2, it is distinguished by its avoidance of explicit intermediate pose skeletons or masked backgrounds, facilitating direct visual information transfer between video sequences. The dataset enables unified learning of motion imitation (cross-identity transfer) and environment transfer (replacement) without reliance on post hoc skeleton or background extraction, thereby supporting end-to-end modeling of controlled character animation (Yan et al., 9 Jun 2026).

1. Dataset Composition

MotionPair-60K comprises 59,376 paired video sequences, systematically stratified into two primary modes corresponding to distinct animation sub-tasks:

Mode	Pairs	Source Generators
Animation	45,742	SCAIL, Wan-Animate
Replacement	13,634	MoCha

Animation Mode (≈77%): Encompasses cross-identity motion imitation, subdivided into 31,895 SCAIL-generated single-character animation pairs (targeting large shape gaps and complex motions) and 13,847 Wan-Animate-generated pairs (optimized for close-ups and slow motion).
Replacement Mode (≈23%): Encodes environment transfer, partitioned into MoCha single-character (9,249) and MoCha multi-character (4,385) replacement pairs.
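
The subset counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section; the dictionary keys are illustrative labels chosen for this example, not official subset names.

```python
# Pair counts as quoted in this section; key names are illustrative labels.
counts = {
    "animation/scail": 31_895,
    "animation/wan_animate": 13_847,
    "replacement/mocha_single": 9_249,
    "replacement/mocha_multi": 4_385,
}


def mode_totals(counts):
    """Sum pair counts per mode (the part of each key before the slash)."""
    totals = {}
    for key, n in counts.items():
        mode = key.split("/")[0]
        totals[mode] = totals.get(mode, 0) + n
    return totals


totals = mode_totals(counts)
grand_total = sum(totals.values())
shares = {mode: n / grand_total for mode, n in totals.items()}

print(totals)       # {'animation': 45742, 'replacement': 13634}
print(grand_total)  # 59376
print({mode: round(100 * s, 1) for mode, s in shares.items()})
```

The totals agree with the table (45,742 and 13,634 pairs, 59,376 overall), and the shares round to the stated 77% and 23%.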

Additionally, during model training, an auxiliary pool of approximately 100,000 pose-driven pairs—formatted in SCAIL’s public pose schema—augments pretraining diversity.

No explicit train/valid/test split is defined within MotionPair-60K; the complete set serves pretraining, with evaluation conducted using external real-video benchmarks such as Studio-Bench and X-Dance.

2. Data Generation Pipeline

Creation of each pair in MotionPair-60K follows a four-stage "agentic synthetic loop":

Driving Video Sampling: Real-world driving clips $y$ are randomly selected from extensive internal and HUMO datasets.
Reference Candidate Generation:
- A Candidate Selector identifies reference character crops $\{I_1,\ldots,I_n\}$ compatible with the action type of $y$.
- Prompt Weaver generates text prompts encoding scene and posture.
- Image Model ($M$; e.g., Nano-Banana via Gemini API) synthesizes environment-matched character renders, leveraging the first frame of $y$ for posture alignment.
- Quality Checker applies CLIP- and VLM-guided filters and, if necessary, local environment editing to suppress background-foreground leakage.
Animation / Replacement Synthesis:
- Animation Mode: Generator $G_1$ (SCAIL) or $G_2$ (Wan-Animate) animates the reference $I$ within the driving environment $y$, producing $\tilde{y} = G_A(y, I)$.
- Replacement Mode: The MoCha generator inpaints the driving background using the reference character, generating single- or multi-character replacements.
Reverse-Driving Pair Assembly:
- Each synthetic video $\tilde{y}$ is repurposed as the "driving" input, paired with the original real video $y$ as the supervised target.
- A reference frame is sampled from the real video for maximal realism.
- The final triplet consists of the synthetic driving video $\tilde{y}$, the sampled reference frame, and the real target video $y$.
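
The reverse-driving assembly step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TrainingTriplet`, `assemble_reverse_pair`, and `toy_generator` are hypothetical names, videos are stood in for by lists of frame identifiers, and the reference frame is assumed to be drawn uniformly at random from the real clip.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Video = List[str]  # stand-in: a video is a list of frame identifiers


@dataclass
class TrainingTriplet:
    driving: Video    # synthetic clip, used as the driving input
    reference: str    # reference frame (assumed sampled from the real clip)
    target: Video     # original real clip, used as the supervised target


def assemble_reverse_pair(real_video: Video, reference_image: str,
                          generator: Callable[[Video, str], Video],
                          rng: random.Random) -> TrainingTriplet:
    """Synthesize a clip from the real video, then swap the roles of the two."""
    synthetic = generator(real_video, reference_image)  # forward synthesis
    reference = rng.choice(real_video)                  # assumed sampling rule
    return TrainingTriplet(driving=synthetic, reference=reference,
                           target=list(real_video))


def toy_generator(video: Video, image: str) -> Video:
    """Hypothetical stand-in for SCAIL, Wan-Animate, or MoCha."""
    return [f"{image}@{frame}" for frame in video]


real = ["f0", "f1", "f2"]
triplet = assemble_reverse_pair(real, "ref_char", toy_generator, random.Random(0))
print(triplet.driving)  # ['ref_char@f0', 'ref_char@f1', 'ref_char@f2']
```

The point of the role swap is that the supervised target is always a real video, so any artifacts in the synthetic clip appear only on the input side.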

Sampling ratios in pretraining are approximately 60% Animation Mode, 20% Replacement Mode, and 20% auxiliary pose-driven pairs. Equivalently, each training pair is drawn from a mixture over these three data sources with weights of roughly 0.6, 0.2, and 0.2, which sum to 1.
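
A minimal sketch of such a mixture sampler is shown below, assuming the approximate 60/20/20 ratios quoted above. The source names, the `sample_sources` helper, and the fixed seed are illustrative choices for this example only.

```python
import random
from collections import Counter

# Approximate pretraining mixture quoted in the text; names are illustrative.
MIXTURE = {"animation": 0.6, "replacement": 0.2, "pose_driven": 0.2}


def sample_sources(n, weights=MIXTURE, seed=0):
    """Draw n data-source labels according to the mixture weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


draws = sample_sources(10_000)
freq = {name: count / len(draws) for name, count in Counter(draws).items()}
print(freq)  # empirical frequencies, close to the mixture weights
```

In a real training loop each drawn label would select which pool the next pair is read from; here only the label sampling is shown.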

3. Conditional Structures for Unified Modeling

To permit joint learning across animation and replacement, SCAIL-2 attaches two complementary conditional signals to VAE latents:

In-Context Mask Conditioning:
- An "environment switch" channel encodes pixelwise provenance (reference image vs. driving sequence).
- Six additional "binding slot" channels group driving characters by identity, as determined by a binding map. For each driving character, the bound slot indicates both mask membership and identity binding.
- Masks are produced by SAM3 per frame, downsampled, and temporally concatenated to form the extra latent channels.
Mode-Specific Shifted RoPE:
- Each VAE token receives a triple of RoPE position indices.
- Animation Mode: one group of tokens keeps fixed indices, while another group is spatially offset.
- Replacement Mode: one group of tokens receives a shift along a single axis, while another shares the same layout (no temporal gap).
- This conditional scheme disentangles conflicting optimizations across animation and environment weaving in positional embeddings.
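
The in-context mask conditioning can be illustrated with a small sketch that builds one environment-switch channel and six binding-slot channels for a single frame. Everything here is an assumption made for illustration: the function names, the nested-list mask format, the convention that each character's mask is written into the slot given by the binding map, and the way the environment-switch channel is supplied by the caller. Downsampling and temporal concatenation are omitted.

```python
from typing import Dict, List

Mask = List[List[int]]  # H x W binary map for one frame

NUM_SLOTS = 6  # number of binding-slot channels stated in the text


def binding_slot_channels(char_masks: Dict[str, Mask], binding: Dict[str, int],
                          height: int, width: int,
                          num_slots: int = NUM_SLOTS) -> List[Mask]:
    """Union each driving character's mask into the slot its identity is bound to."""
    slots = [[[0] * width for _ in range(height)] for _ in range(num_slots)]
    for char_id, mask in char_masks.items():
        slot = binding[char_id]
        if not 0 <= slot < num_slots:
            raise ValueError(f"slot {slot} out of range for {char_id}")
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    slots[slot][r][c] = 1
    return slots


def frame_condition(env_switch: Mask, char_masks: Dict[str, Mask],
                    binding: Dict[str, int], height: int, width: int) -> List[Mask]:
    """Stack the environment-switch channel with the binding-slot channels."""
    return [env_switch] + binding_slot_channels(char_masks, binding, height, width)


# Toy 2 x 3 frame with two characters bound to slots 0 and 2.
masks = {"dancer_a": [[1, 1, 0], [0, 0, 0]], "dancer_b": [[0, 0, 0], [0, 1, 1]]}
binding = {"dancer_a": 0, "dancer_b": 2}
env = [[0, 0, 0], [0, 0, 0]]
cond = frame_condition(env, masks, binding, height=2, width=3)
print(len(cond))  # 7 channels: 1 environment switch + 6 binding slots
```

Characters bound to the same slot would share a channel under this convention, which is one way to read "group driving characters by identity".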

4. Data Modalities and Formats

MotionPair-60K supports multiple modalities, processed for efficient modeling:

Raw video: MP4 or frame folders, 24–30 fps, 8–16 frames per sequence.
VAE latents: Floating-point tensors in the latent layout used by Wan2.1-14B-I2V.
Segmentation masks: Per-frame masks (SAM3 output), stored as PNG and downsampled to binary maps.
Auxiliary pose data: JSON or NPZ files containing 2D keypoints (17 joints) or body-mesh parameters as needed for pose-driven training augmentation.

5. Dataset Characteristics and Controls

Category	Details
Sub-task Dist.	Animation Mode ≈77%; Replacement Mode ≈23%
Replacement	Single-/Multi-character: 68% / 32% (within Replacement Mode)
Diversity	Age, clothing, silhouette (internal + HUMO character pool)
Motions	Sports, dance, object interaction, multi-person scenes
Environments	Indoor studios, outdoor streets, sports arenas

Quality Control: Gemini VLM filters exclude low-quality synthetics; multi-turn prompt editing and background environment masking further enhance coherence.
Benchmarking: No per-pair realism metrics are provided for MotionPair-60K itself; instead, downstream impact is measured using external sets (Studio-Bench, X-Dance), where SCAIL-2-trained models exhibit generalization to real cross-identity tasks, achieving top human-evaluation scores (Yan et al., 9 Jun 2026).

6. Significance and Context

MotionPair-60K represents the first synthetic motion-transfer dataset of its scale to unify heterogeneous, end-to-end character animation tasks—spanning both single-/multi-character animation and environment replacement—while eschewing reliance on explicit pose skeletons or masked backgrounds as intermediates. Its agentic synthetic loop generation process, balanced sub-task stratification, and decoupled conditional injection advance the feasibility of unified diffusion modeling for animation and scene transfer workflows.

A plausible implication is that such heterogeneous, visually grounded datasets facilitate training of models capable of generalizing to real-world cross-identity motion transfer without the representational bottlenecks imposed by skeletonization or mask-based preprocessing, thereby supporting progress toward fully end-to-end video-based animation control (Yan et al., 9 Jun 2026).


References (1)

Yan et al. SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning (9 Jun 2026).
