MotionPair-60K: Synthetic Motion Transfer Dataset
- MotionPair-60K is a large-scale, synthetic dataset designed to enable end-to-end motion transfer for character animation and environment replacement.
- It utilizes an agentic synthetic loop generation process to create paired video sequences, eliminating the need for explicit pose extraction and background masking.
- The dataset supports unified learning of cross-identity motion imitation and environment transfer, with performance validated using external real-video benchmarks.
MotionPair-60K is a large-scale, synthetic, end-to-end motion-transfer dataset designed for training and evaluating deep models in heterogeneous character animation and environment replacement tasks. Developed in the context of SCAIL-2, it is distinguished by its avoidance of explicit intermediate pose skeletons or masked backgrounds, facilitating direct visual information transfer between video sequences. The dataset enables unified learning of motion imitation (cross-identity transfer) and environment transfer (replacement) without reliance on post hoc skeleton or background extraction, thereby supporting end-to-end modeling of controlled character animation (Yan et al., 9 Jun 2026).
1. Dataset Composition
MotionPair-60K comprises 59,376 paired video sequences, systematically stratified into two primary modes corresponding to distinct animation sub-tasks:
| Mode | Pairs | Source Generators |
|---|---|---|
| Animation | 45,742 | SCAIL, Wan-Animate |
| Replacement | 13,634 | MoCha |
- Animation Mode (≈77%): Encompasses cross-identity motion imitation, subdivided into 31,895 SCAIL-generated single-character animation pairs (targeting large shape gaps and complex motions) and 13,847 Wan-Animate-generated pairs (optimized for close-ups and slow motion).
- Replacement Mode (≈23%): Encodes environment transfer, partitioned into MoCha single-character (9,249) and MoCha multi-character (4,385) replacement pairs.
Additionally, during model training, an auxiliary pool of approximately 100,000 pose-driven pairs—formatted in SCAIL’s public pose schema—augments pretraining diversity.
No explicit train/valid/test split is defined within MotionPair-60K; the complete set serves pretraining, with evaluation conducted using external real-video benchmarks such as Studio-Bench and X-Dance.
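The reported totals and percentages can be cross-checked directly from the four per-source counts. The short sketch below does only that arithmetic; the dictionary keys and the helper name `mode_totals` are illustrative labels, not identifiers from the dataset release.

```python
# Pair counts as reported in the composition breakdown above.
# Keys are illustrative labels of the form "<mode>/<source>".
counts = {
    "animation/scail": 31_895,
    "animation/wan_animate": 13_847,
    "replacement/mocha_single": 9_249,
    "replacement/mocha_multi": 4_385,
}


def mode_totals(counts):
    """Sum per-source counts into per-mode totals (mode = text before '/')."""
    totals = {}
    for key, n in counts.items():
        mode = key.split("/")[0]
        totals[mode] = totals.get(mode, 0) + n
    return totals


totals = mode_totals(counts)
grand_total = sum(totals.values())
shares = {mode: n / grand_total for mode, n in totals.items()}
single_share = counts["replacement/mocha_single"] / totals["replacement"]

print(totals)                                           # {'animation': 45742, 'replacement': 13634}
print(grand_total)                                      # 59376
print({m: round(100 * s) for m, s in shares.items()})   # {'animation': 77, 'replacement': 23}
print(round(100 * single_share))                        # 68
```

The figures agree with the stated 59,376 pairs, the roughly 77% / 23% mode split, and the 68% / 32% single- versus multi-character split within Replacement Mode.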
2. Data Generation Pipeline
Creation of each pair in MotionPair-60K follows a four-stage "agentic synthetic loop":
- Driving Video Sampling: Real-world driving clips are randomly selected from extensive internal and HUMO datasets.
- Reference Candidate Generation:
  - A Candidate Selector identifies reference character crops compatible with the driving clip's action type.
  - A Prompt Weaver generates text prompts encoding scene and posture.
  - An image model (e.g., Nano-Banana via the Gemini API) synthesizes environment-matched character renders, using the first frame of the driving clip for posture alignment.
  - A Quality Checker applies CLIP- and VLM-guided filters and, if necessary, local environment editing to suppress background-foreground leakage.
- Animation / Replacement Synthesis:
  - Animation Mode: A generator (SCAIL or Wan-Animate) animates the reference character within the driving environment, producing a synthetic video.
  - Replacement Mode: The MoCha generator inpaints the driving background using the reference character, producing single- or multi-character replacements.
- Reverse-Driving Pair Assembly:
  - Each synthetic video is repurposed as the "driving" input and paired with the original real video as the supervised target.
  - A reference frame is sampled from the real video for maximal realism.
  - The final triplet is (synthetic driving video, reference frame, real target video).
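The reverse-driving step can be sketched as a small data-assembly routine. This is a minimal illustration, not the authors' implementation: the `Triplet` class, the function name, the use of frame identifiers in place of image data, and the equal-length check are all assumptions, and the choice to sample the reference frame from the real video is inferred from the description rather than stated in code by the source.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Triplet:
    """One training example produced by reverse-driving (hypothetical layout)."""
    driving: List[str]   # synthetic video frames, used as the driving input
    reference: str       # a single frame sampled from the real video
    target: List[str]    # original real video, used as the supervised target


def assemble_reverse_pair(
    real_frames: Sequence[str],
    synthetic_frames: Sequence[str],
    rng: random.Random,
) -> Triplet:
    """Swap roles: the synthetic clip drives, the real clip is the target."""
    # Assumption: both clips are frame-aligned and have the same length.
    if len(real_frames) != len(synthetic_frames):
        raise ValueError("real and synthetic clips must have the same length")
    reference = real_frames[rng.randrange(len(real_frames))]
    return Triplet(
        driving=list(synthetic_frames),
        reference=reference,
        target=list(real_frames),
    )


rng = random.Random(0)
real = [f"real_{i:02d}" for i in range(8)]
synthetic = [f"synth_{i:02d}" for i in range(8)]
triplet = assemble_reverse_pair(real, synthetic, rng)
print(triplet.reference in real)  # True
```

Because the supervised target is always a real video, the model is never asked to reproduce generator artifacts; those appear only on the input side.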
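Sampling ratios in pretraining are approximately 60% Animation Mode, 20% Replacement Mode, and 20% auxiliary pose-driven pairs. Mathematically, with data sources $\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3$ (Animation, Replacement, and auxiliary pose-driven) and mixture weights $w \approx (0.6, 0.2, 0.2)$, each training pair is drawn from $\mathcal{D}_i$ with probability $w_i$, where $\sum_i w_i = 1$.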
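As a rough illustration of the pretraining mixture, the sketch below draws source labels with the approximate 60/20/20 weights and checks the empirical frequencies. The label names, the `sample_sources` helper, and the fixed seed are assumptions made for the example; the source does not describe how the sampler is implemented.

```python
import random
from collections import Counter

# Approximate mixture weights stated in the text; the keys are illustrative.
MIXTURE = {"animation": 0.6, "replacement": 0.2, "pose_driven": 0.2}


def sample_sources(n, weights, rng):
    """Draw n source labels i.i.d. according to the mixture weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


rng = random.Random(0)
draws = sample_sources(10_000, MIXTURE, rng)
freq = {name: count / len(draws) for name, count in Counter(draws).items()}
print(freq)  # empirical frequencies, close to 0.6 / 0.2 / 0.2
```

In a real data loader, each drawn label would select which pool the next training pair is read from.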
3. Conditional Structures for Unified Modeling
To permit joint learning across animation and replacement, SCAIL-2 attaches two complementary conditional signals to VAE latents:
- In-Context Mask Conditioning:
  - An "environment switch" channel encodes pixelwise provenance (reference image vs. driving sequence).
  - Six additional "binding slot" channels group driving characters by identity, as determined by a binding map. For each driving character, the bound slot channel indicates mask membership and identity binding.
  - Masks are produced by SAM3 per frame, downsampled, and temporally concatenated, yielding the extra latent channels (one environment-switch channel plus six binding-slot channels).
- Mode-Specific Shifted RoPE:
  - Each VAE token receives a triple of position indices.
  - Animation Mode: […] tokens have […]; […] use […], with […] spatially offset.
  - Replacement Mode: […] receives a shift […] along […], with […] sharing […] layout (no temporal gap).
- This conditioning scheme separates the conflicting optimization objectives of animation and environment weaving at the level of positional embeddings.
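The mask-conditioning layout described above (one environment-switch channel plus six binding-slot channels) can be sketched for a single frame as follows. This is a dependency-free illustration using nested lists; a real pipeline would operate on tensors. The function name, the input format, the channel ordering, and the polarity of the environment-switch channel (1 = reference provenance) are assumptions, not details confirmed by the source.

```python
NUM_SLOTS = 6  # six binding-slot channels, as described in the text


def build_mask_channels(env_switch, instance_masks, binding):
    """Stack one environment-switch channel and NUM_SLOTS binding channels.

    env_switch:     H x W grid of 0/1 (assumed: 1 = pixel comes from the reference)
    instance_masks: dict of character id -> H x W grid of 0/1
    binding:        dict of character id -> slot index in [0, NUM_SLOTS)
    Returns a list of 1 + NUM_SLOTS grids, each H x W.
    """
    h, w = len(env_switch), len(env_switch[0])
    channels = [[[0] * w for _ in range(h)] for _ in range(1 + NUM_SLOTS)]
    channels[0] = [[int(bool(v)) for v in row] for row in env_switch]
    for char_id, mask in instance_masks.items():
        slot = binding[char_id]
        if not 0 <= slot < NUM_SLOTS:
            raise ValueError(f"slot {slot} out of range for {char_id!r}")
        target = channels[1 + slot]
        # Characters bound to the same slot are merged with a logical OR.
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    target[y][x] = 1
    return channels


env = [[1, 1, 0, 0],
       [1, 1, 0, 0]]
masks = {
    "dancer_a": [[0, 0, 1, 0], [0, 0, 1, 0]],
    "dancer_b": [[0, 0, 0, 1], [0, 0, 0, 0]],
}
channels = build_mask_channels(env, masks, {"dancer_a": 0, "dancer_b": 2})
print(len(channels))  # 7
```

Slots with no bound character stay all-zero, so a fixed channel count can represent scenes with anywhere from zero to six identities.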
4. Data Modalities and Formats
MotionPair-60K supports multiple modalities, processed for efficient modeling:
- Raw video: MP4 or frame folders, […] px, 24–30 fps, 8–16 frames per sequence.
- VAE latents: Floating-point tensors of shape […], commonly […], […] for Wan2.1-14B-I2V.
- Segmentation masks: Per-frame masks (SAM3 output), stored as PNG and downsampled to […] binary maps.
- Auxiliary pose data: JSON or NPZ files containing 2D keypoints (17 joints) or body-mesh parameters as needed for pose-driven training augmentation.
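One simple way to downsample a per-frame binary mask while keeping it binary is block-wise max pooling, sketched below. The source does not say which downsampling rule or factor is used, so both the pooling rule and the `factor` parameter here are assumptions chosen for illustration.

```python
def downsample_mask(mask, factor):
    """Downsample a binary H x W mask by block-wise max pooling.

    An output cell is 1 if any pixel in its factor x factor block is set,
    so thin foreground structures are not lost when resolution drops.
    """
    h, w = len(mask), len(mask[0])
    if h % factor or w % factor:
        raise ValueError("mask size must be divisible by the factor")
    return [
        [
            int(any(
                mask[y][x]
                for y in range(by, by + factor)
                for x in range(bx, bx + factor)
            ))
            for bx in range(0, w, factor)
        ]
        for by in range(0, h, factor)
    ]


full = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
print(downsample_mask(full, 2))  # [[1, 0], [0, 1]]
```

Nearest-neighbor or area-threshold downsampling would be equally plausible alternatives; max pooling is used here only because it is the simplest rule that cannot erase a small character mask.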
5. Dataset Characteristics and Controls
| Category | Details |
|---|---|
| Sub-task Dist. | Animation Mode ≈77%; Replacement Mode ≈23% |
| Replacement | Single-/Multi-character: 68% / 32% (within Replacement Mode) |
| Diversity | Age, clothing, silhouette (internal + HUMO character pool) |
| Motions | Sports, dance, object interaction, multi-person scenes |
| Environments | Indoor studios, outdoor streets, sports arenas |
- Quality Control: Gemini VLM filters exclude <[…] of low-quality synthetics; multi-turn prompt editing and background environment masking further enhance coherence.
- Benchmarking: No per-pair realism metrics are provided for MotionPair-60K itself; instead, downstream impact is measured using external sets (Studio-Bench, X-Dance), where SCAIL-2-trained models exhibit generalization to real cross-identity tasks, achieving top human-evaluation scores (Yan et al., 9 Jun 2026).
6. Significance and Context
MotionPair-60K represents the first synthetic motion-transfer dataset of its scale to unify heterogeneous, end-to-end character animation tasks—spanning both single-/multi-character animation and environment replacement—while eschewing reliance on explicit pose skeletons or masked backgrounds as intermediates. Its agentic synthetic loop generation process, balanced sub-task stratification, and decoupled conditional injection advance the feasibility of unified diffusion modeling for animation and scene transfer workflows.
A plausible implication is that such heterogeneous, visually grounded datasets facilitate training of models capable of generalizing to real-world cross-identity motion transfer without the representational bottlenecks imposed by skeletonization or mask-based preprocessing, thereby supporting progress toward fully end-to-end video-based animation control (Yan et al., 9 Jun 2026).