
MotionPair-60K: Synthetic Motion Transfer Dataset

Updated 1 July 2026
  • MotionPair-60K is a large-scale, synthetic dataset designed to enable end-to-end motion transfer for character animation and environment replacement.
  • It utilizes an agentic synthetic loop generation process to create paired video sequences, eliminating the need for explicit pose extraction and background masking.
  • The dataset supports unified learning of cross-identity motion imitation and environment transfer, with performance validated using external real-video benchmarks.

MotionPair-60K is a large-scale, synthetic, end-to-end motion-transfer dataset designed for training and evaluating deep models in heterogeneous character animation and environment replacement tasks. Developed in the context of SCAIL-2, it is distinguished by its avoidance of explicit intermediate pose skeletons or masked backgrounds, facilitating direct visual information transfer between video sequences. The dataset enables unified learning of motion imitation (cross-identity transfer) and environment transfer (replacement) without reliance on post hoc skeleton or background extraction, thereby supporting end-to-end modeling of controlled character animation (Yan et al., 9 Jun 2026).

1. Dataset Composition

MotionPair-60K comprises 59,376 paired video sequences, systematically stratified into two primary modes corresponding to distinct animation sub-tasks:

| Mode | Pairs | Source Generators |
| --- | --- | --- |
| Animation | 45,742 | SCAIL, Wan-Animate |
| Replacement | 13,634 | MoCha |

  • Animation Mode (≈77%): Encompasses cross-identity motion imitation, subdivided into 31,895 SCAIL-generated single-character animation pairs (targeting large shape gaps and complex motions) and 13,847 Wan-Animate-generated pairs (optimized for close-ups and slow motion).
  • Replacement Mode (≈23%): Encodes environment transfer, partitioned into MoCha single-character (9,249) and MoCha multi-character (4,385) replacement pairs.
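
The reported counts can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the figures quoted above; the dictionary layout and function names are illustrative choices, not part of any released tooling.

```python
# Pair counts as quoted in the composition breakdown above.
COUNTS = {
    ("Animation", "SCAIL"): 31_895,
    ("Animation", "Wan-Animate"): 13_847,
    ("Replacement", "MoCha single-character"): 9_249,
    ("Replacement", "MoCha multi-character"): 4_385,
}


def mode_totals(counts):
    """Sum pair counts per mode."""
    totals = {}
    for (mode, _source), n in counts.items():
        totals[mode] = totals.get(mode, 0) + n
    return totals


def mode_shares(counts):
    """Return each mode's fraction of all pairs."""
    totals = mode_totals(counts)
    grand_total = sum(totals.values())
    return {mode: n / grand_total for mode, n in totals.items()}


totals = mode_totals(COUNTS)
shares = mode_shares(COUNTS)
print(totals, sum(totals.values()))
print({mode: round(100 * share, 1) for mode, share in shares.items()})
```

The per-mode sums match the table (45,742 and 13,634), the grand total matches the stated 59,376 pairs, and the shares round to the quoted ≈77% and ≈23%.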

Additionally, during model training, an auxiliary pool of approximately 100,000 pose-driven pairs—formatted in SCAIL’s public pose schema—augments pretraining diversity.

No explicit train/valid/test split is defined within MotionPair-60K; the complete set serves pretraining, with evaluation conducted using external real-video benchmarks such as Studio-Bench and X-Dance.

2. Data Generation Pipeline

Creation of each pair in MotionPair-60K follows a four-stage "agentic synthetic loop":

  1. Driving Video Sampling: Real-world driving clips $y$ are randomly selected from extensive internal and HUMO datasets.
  2. Reference Candidate Generation:
    • A Candidate Selector identifies reference character crops $\{I_1,\ldots,I_n\}$ compatible with $y$'s action type.
    • Prompt Weaver generates text prompts encoding scene and posture.
    • Image Model ($M$; e.g., Nano-Banana via Gemini API) synthesizes environment-matched character renders, leveraging the first frame of $y$ for posture alignment.
    • Quality Checker applies CLIP- and VLM-guided filters and, if necessary, local environment editing to suppress background-foreground leakage.
  3. Animation / Replacement Synthesis:
    • Animation Mode: Generator $G_1$ (SCAIL) or $G_2$ (Wan-Animate) animates the reference $I$ within the driving environment of $y$, producing $\tilde{y} = G_A(y, I)$, where $G_A$ denotes the animation generator used.
    • Replacement Mode: The MoCha generator inpaints the driving background using the reference character, generating single- or multi-character replacements.
  4. Reverse-Driving Pair Assembly:
    • Each synthetic $\tilde{y}$ is repurposed as the "driving" input, paired with the original real video $y$ as the supervised target.
    • A reference frame is sampled from the real video $y$ for maximal realism.
    • The final triplet consists of the synthetic driving video, the sampled reference frame, and the real target video.
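
The reverse-driving step (stage 4) can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding: frames are stand-in strings, `toy_generator` is a placeholder for an animation or replacement generator, and the sketch assumes the reference frame is drawn uniformly from the real clip. It is not the SCAIL-2 implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Frame = str          # stand-in for an image; a real pipeline would use arrays
Video = List[Frame]  # a clip is an ordered list of frames


@dataclass(frozen=True)
class TrainingTriplet:
    driving: Video    # synthetic clip, used as the driving input
    reference: Frame  # frame sampled from the real clip (assumption)
    target: Video     # original real clip, used as supervision


def assemble_reverse_pair(
    real_clip: Video,
    reference_image: Frame,
    generator: Callable[[Video, Frame], Video],
    rng: random.Random,
) -> TrainingTriplet:
    """Synthesize a clip from the real one, then swap the roles."""
    if not real_clip:
        raise ValueError("real_clip must contain at least one frame")
    synthetic = generator(real_clip, reference_image)
    if len(synthetic) != len(real_clip):
        raise ValueError("generator must preserve the frame count")
    reference = rng.choice(real_clip)
    return TrainingTriplet(driving=synthetic, reference=reference,
                           target=list(real_clip))


def toy_generator(clip: Video, image: Frame) -> Video:
    """Hypothetical generator: tags each frame with the reference image."""
    return [f"{image}|{frame}" for frame in clip]


real = [f"real_{i}" for i in range(8)]
triplet = assemble_reverse_pair(real, "ref_character", toy_generator,
                                random.Random(0))
print(triplet.reference, len(triplet.driving), len(triplet.target))
```

The point of the role swap is that the supervised target is always a real video, so synthesis artifacts appear only on the input side of each training pair.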

Sampling ratios in pretraining are approximately 60% Animation Mode, 20% Replacement Mode, and 20% auxiliary pose-driven pairs. Equivalently, each training pair is drawn from a mixture over these three data sources whose weights are approximately 0.6, 0.2, and 0.2 and sum to one.
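
A mixture like this can be realized by drawing the data source for each training example with the stated probabilities. The sketch below uses only the Python standard library; the source names and the `sample_sources` helper are illustrative assumptions, and the 0.6 / 0.2 / 0.2 weights are the approximate ratios quoted above.

```python
import random
from collections import Counter

# Approximate pretraining mixture weights quoted in the text.
MIXTURE = {"animation": 0.6, "replacement": 0.2, "pose_driven": 0.2}


def sample_sources(n, rng, mixture=MIXTURE):
    """Draw n data-source names according to the mixture weights."""
    names = list(mixture)
    weights = [mixture[name] for name in names]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    return rng.choices(names, weights=weights, k=n)


rng = random.Random(0)
draws = sample_sources(10_000, rng)
freq = {name: count / len(draws) for name, count in Counter(draws).items()}
print(freq)
```

With 10,000 draws the empirical frequencies land close to the target weights; a real data loader would then pick a concrete pair uniformly within the chosen source.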

3. Conditional Structures for Unified Modeling

To permit joint learning across animation and replacement, SCAIL-2 attaches two complementary conditional signals to VAE latents:

  • In-Context Mask Conditioning:
    • An "environment switch" channel encodes pixelwise provenance (reference image vs. driving sequence).
    • Six additional "binding slot" channels group driving characters by identity, as determined by a binding map. For each driving character, the slot channels indicate mask membership and identity binding.
    • Masks are produced by SAM3 per frame, downsampled, and temporally concatenated to form the extra latent channels.
  • Mode-Specific Shifted RoPE:
    • Each VAE token receives a triple of positional indices.
    • Animation Mode: The token groups are assigned separate index ranges, with one group spatially offset.
    • Replacement Mode: One token group is shifted along a single axis, while another shares the same layout (no temporal gap).
    • This conditional scheme disentangles conflicting optimizations across animation and environment weaving in positional embeddings.
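
To make the mask conditioning concrete, the following sketch builds the per-frame channel stack described above: one environment-switch channel followed by six binding-slot channels. The encoding is an assumption made for illustration (1 marks pixels taken from the driving sequence, and each character's mask is written into the slot chosen by the binding map); downsampling and temporal concatenation are omitted, and masks are plain nested lists rather than SAM3 outputs.

```python
NUM_SLOTS = 6  # number of binding-slot channels named in the text


def build_condition_channels(char_masks, binding, env_from_driving):
    """Return [environment switch] + NUM_SLOTS slot channels for one frame.

    char_masks:       dict of character id -> H x W nested list of 0/1
    binding:          dict of character id -> slot index in [0, NUM_SLOTS)
    env_from_driving: H x W nested list, 1 where the pixel comes from the
                      driving sequence, 0 where it comes from the reference
                      (assumed encoding)
    """
    height = len(env_from_driving)
    width = len(env_from_driving[0])
    slots = [[[0] * width for _ in range(height)] for _ in range(NUM_SLOTS)]
    for char_id, mask in char_masks.items():
        slot = binding[char_id]
        if not 0 <= slot < NUM_SLOTS:
            raise ValueError(f"slot {slot} out of range for {char_id!r}")
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    slots[slot][r][c] = 1  # characters sharing a slot merge
    env = [row[:] for row in env_from_driving]
    return [env] + slots


char_masks = {"a": [[1, 0], [0, 0]], "b": [[0, 0], [0, 1]]}
binding = {"a": 0, "b": 3}
env = [[1, 1], [1, 1]]
channels = build_condition_channels(char_masks, binding, env)
print(len(channels))
```

Characters bound to the same slot end up in one channel, which is one simple way to express "group driving characters by identity"; the actual binding scheme may differ.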

4. Data Modalities and Formats

MotionPair-60K supports multiple modalities, processed for efficient modeling:

  • Raw video: MP4 or frame folders, 24–30 fps, 8–16 frames per sequence.
  • VAE latents: Floating-point tensors in the latent layout used for Wan2.1-14B-I2V.
  • Segmentation masks: Per-frame masks (SAM3 output), stored as PNG and downsampled to binary maps.
  • Auxiliary pose data: JSON or NPZ files containing 2D keypoints (17 joints) or body-mesh parameters as needed for pose-driven training augmentation.
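
One way to keep these modalities together is a small per-pair manifest record with basic sanity checks. The sketch below is hypothetical: the field names, the `validate` helper, and the file paths are invented for illustration, and only the numeric ranges (24–30 fps, 8–16 frames) and the JSON/NPZ pose formats come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PairRecord:
    """Hypothetical manifest entry for one training pair."""
    mode: str                        # "animation" or "replacement"
    driving_path: str                # synthetic driving clip
    target_path: str                 # real target clip
    reference_path: str              # reference frame
    fps: float
    num_frames: int
    mask_dir: Optional[str] = None   # per-frame PNG masks, if present
    pose_path: Optional[str] = None  # JSON or NPZ pose file, if present


def validate(record: PairRecord) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if record.mode not in {"animation", "replacement"}:
        problems.append(f"unknown mode: {record.mode}")
    if not 24 <= record.fps <= 30:
        problems.append(f"fps {record.fps} outside 24-30")
    if not 8 <= record.num_frames <= 16:
        problems.append(f"{record.num_frames} frames outside 8-16")
    if record.pose_path and not record.pose_path.endswith((".json", ".npz")):
        problems.append("pose file must be JSON or NPZ")
    return problems


good = PairRecord("animation", "drv.mp4", "tgt.mp4", "ref.png", 24.0, 16)
bad = PairRecord("animation", "drv.mp4", "tgt.mp4", "ref.png", 60.0, 4,
                 pose_path="pose.txt")
print(validate(good))
print(validate(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole manifest in a single pass.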

5. Dataset Characteristics and Controls

| Category | Details |
| --- | --- |
| Sub-task distribution | Animation Mode ≈77%; Replacement Mode ≈23% |
| Replacement | Single-/multi-character: 68% / 32% (within Replacement Mode) |
| Diversity | Age, clothing, silhouette (internal + HUMO character pool) |
| Motions | Sports, dance, object interaction, multi-person scenes |
| Environments | Indoor studios, outdoor streets, sports arenas |

  • Quality Control: Gemini VLM filters exclude low-quality synthetics; multi-turn prompt editing and background environment masking further enhance coherence.
  • Benchmarking: No per-pair realism metrics are provided for MotionPair-60K itself; instead, downstream impact is measured using external sets (Studio-Bench, X-Dance), where SCAIL-2-trained models exhibit generalization to real cross-identity tasks, achieving top human-evaluation scores (Yan et al., 9 Jun 2026).

6. Significance and Context

MotionPair-60K represents the first synthetic motion-transfer dataset of its scale to unify heterogeneous, end-to-end character animation tasks—spanning both single-/multi-character animation and environment replacement—while eschewing reliance on explicit pose skeletons or masked backgrounds as intermediates. Its agentic synthetic loop generation process, balanced sub-task stratification, and decoupled conditional injection advance the feasibility of unified diffusion modeling for animation and scene transfer workflows.

A plausible implication is that such heterogeneous, visually grounded datasets facilitate training of models capable of generalizing to real-world cross-identity motion transfer without the representational bottlenecks imposed by skeletonization or mask-based preprocessing, thereby supporting progress toward fully end-to-end video-based animation control (Yan et al., 9 Jun 2026).
