MotionPair-60K: End-to-End Motion Dataset
- MotionPair-60K is a large-scale dataset offering raw paired video triads for end-to-end motion transfer without relying on intermediate skeletons or masks.
- It unifies diverse animation tasks—including single and multi-character animation and replacement—through soft-guidance signals and in-context conditioning.
- The dataset supports robust benchmarking and bias-mitigated modeling, advancing scalable image-to-video generative research in controlled character animation.
MotionPair-60K is a large-scale, end-to-end motion transfer dataset designed to advance research in controlled character animation without reliance on intermediate representations such as skeletons or background masks. Unlike legacy datasets that depend on pose or mask-based parsing—thereby discarding complex visual cues—MotionPair-60K provides paired driving/reference/target video triads spanning heterogeneous animation tasks, with all visual information preserved and soft-guidance signals supporting multi-task training. The dataset was introduced in the context of the SCAIL-2 framework for unified, in-context-conditioned animation, aiming to address fragmentation across sub-tasks and to support robust, scalable, and bias-mitigated modeling for both single- and multi-character motion transfer (Yan et al., 9 Jun 2026).
1. Motivation and Design Rationale
Traditional character animation pipelines rely on intermediate skeleton representations or masked backgrounds, which discard much of the visual structure related to occlusions, object interactions, and environmental context. Additionally, prior end-to-end video diffusion approaches require large-scale paired data, that is, videos of the same motion performed by different characters in possibly disparate settings. Such data has been lacking owing to the compositional complexity of animation tasks.
MotionPair-60K was thus curated to fulfill several key objectives:
- Enable paired, end-to-end learning where the network sees all raw visual input and generates the full output video, with no explicit intermediate mask or skeleton as the target.
- Unify the training of single-character animation, multi-character animation, and character/environment replacement tasks under a single, mask-guided conditioning scheme using “soft guidance” (in-context masks and mode-specific RoPE).
- Provide scale, task diversity, and annotation fidelity sufficient for benchmarking and advancing I2V (image-to-video) generative models in broad character animation scenarios.
- Facilitate post-training bias mitigation, especially for synthetic artifacts in fine articulated regions, via a dedicated “Bias-Aware DPO” protocol (Yan et al., 9 Jun 2026).
2. Data Synthesis Pipeline
The assembly of MotionPair-60K follows an agentic synthetic loop structured around high-fidelity pair and triplet generation. Central steps include:
- Animation Pair Synthesis: Real driving videos are sampled from large human-centric video pools. Candidate character images are selected to match the pose distribution of the driving sequence, and a prompt-driven generator (e.g., Gemini API) fabricates reference images congruent with the target character’s anticipated style and posture. Pose-driven animation generators (SCAIL; Wan-Animate for close-ups) are employed to create the reanimated video, which is then quality-checked via a lightweight vision-LLM (reject rate < 30%).
- Replacement Generation: For character replacement tasks, the MoCha renderer generates both single- and multi-character replacement pairs, manipulating the environment and identity bindings to exercise occlusion and environmental inference.
- Reverse Driving: Triplets are constructed by reversing the animation process. Given a real sequence and its synthetic reanimation, the triplet (driving, reference, target) enables supervision where the driving input transmits only motion, and the target encompasses all environmental realism and object interactions.
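The role swap behind reverse driving can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline code: the `Triplet` record, the `reverse_driving_triplet` helper, and the file names are hypothetical, and the code assumes that the synthetic reanimation serves as the driving input while the real sequence serves as the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """Hypothetical record for one (driving, reference, target) sample."""
    driving: str    # video that should transmit only motion
    reference: str  # image fixing the appearance of the target character
    target: str     # video the model is trained to reproduce


def reverse_driving_triplet(real_video: str,
                            synthetic_video: str,
                            reference_image: str) -> Triplet:
    """Reverse the forward synthesis step (assumed convention).

    Forward synthesis: real video -> synthetic reanimation.
    Reversed supervision: the synthetic reanimation drives the motion and
    the real video, with its environment and object interactions, is the
    target.
    """
    return Triplet(driving=synthetic_video,
                   reference=reference_image,
                   target=real_video)


# Illustrative file names only.
triplet = reverse_driving_triplet("real.mp4", "reanimated.mp4", "ref.png")
```

Under this convention the model never has to reproduce synthetic artifacts as ground truth, since every target frame comes from real footage.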
Key underlying generative operators are defined as:
- Forward diffusion in VAE latent space:
- Denoising loss:
- Core animation generator:
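For orientation, the first two operators are commonly written as below in latent diffusion models. This is a generic DDPM-style formulation stated as an assumption, not the exact parameterization of SCAIL-2, which may differ (for example, a flow-matching objective). All symbols are introduced here for illustration: $z_0$ is the VAE latent of a clean video, $\bar{\alpha}_t$ the cumulative noise schedule, $\epsilon_\theta$ the denoiser, and $c$ the conditioning (reference, driving, and mask inputs).

```latex
\begin{aligned}
z_t &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I) \\
\mathcal{L} &= \mathbb{E}_{z_0,\, \epsilon,\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right]
\end{aligned}
```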
3. Dataset Composition and Statistics
MotionPair-60K comprises 59,376 end-to-end motion transfer pairs. The distribution of samples by generator/model and task is as follows:
| Task Type | Generator | Pair Count | Percentage |
|---|---|---|---|
| Single-character animation | SCAIL | 31,895 | 53.7% |
| Single-character animation | Wan-Animate | 13,847 | 23.3% |
| Single-character replacement | MoCha | 9,249 | 15.6% |
| Multi-character replacement | MoCha | 4,385 | 7.4% |
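The table's total and percentages can be checked directly from the pair counts. The short script below only re-derives figures already stated in the table; the dictionary layout is an illustrative choice.

```python
# Pair counts copied from the table above, keyed by (task type, generator).
counts = {
    ("Single-character animation", "SCAIL"): 31_895,
    ("Single-character animation", "Wan-Animate"): 13_847,
    ("Single-character replacement", "MoCha"): 9_249,
    ("Multi-character replacement", "MoCha"): 4_385,
}

total = sum(counts.values())

# Share of each row, in percent, rounded to one decimal place.
shares = {key: round(100 * n / total, 1) for key, n in counts.items()}

for (task, generator), pct in shares.items():
    print(f"{task:30s} {generator:12s} {pct:5.1f}%")
print("total pairs:", total)
```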
- Training sampling ratios: 60% end-to-end animation, 20% replacement, 20% additional pose-driven skeleton/video pairs (~100,000 samples; note these are for pretraining only).
- Each video is stored at 512×512 pixel resolution, typically with 16–32 frames.
- All videos are encoded with a pretrained VAE, mapping each frame to a latent grid (commonly 16×16 per spatial axis), facilitating efficient downstream modeling.
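The 60/20/20 sampling ratios can be realized by drawing the data source for each training sample from a weighted categorical distribution. The sketch below uses only the Python standard library; the source names and the `sample_sources` helper are hypothetical, and the text states that the pose-driven share applies to pretraining only.

```python
import random
from collections import Counter

# Mixture weights from the text; the key names are illustrative.
SOURCE_WEIGHTS = {
    "animation": 0.6,    # end-to-end animation pairs
    "replacement": 0.2,  # replacement pairs
    "pose_driven": 0.2,  # additional skeleton/video pairs (pretraining only)
}


def sample_sources(n: int, seed: int = 0) -> list:
    """Draw the data source for each of n training samples."""
    rng = random.Random(seed)
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Empirical frequencies approach the target weights as n grows.
draws = Counter(sample_sources(10_000))
```

In a PyTorch pipeline the same effect could be obtained with per-sample weights passed to `torch.utils.data.WeightedRandomSampler`.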
4. Data Structure, Modalities, and Annotation
MotionPair-60K provides both raw visual data and soft-guidance annotation:
- Modalities:
- Driving video (input, mp4)
- Target video (output, mp4)
- Reference image (single frame, png)
- In-context masks: environment switch mask plus character-binding masks (npy/npz, stacked)
- Optional pose-skeleton annotation (json) for pose-driven samples
- Mask Encoding:
- 6 binding slots and 1 environment channel, yielding mask channels after spatiotemporal aggregation.
- Masks are spatially downsampled to match VAE latent resolution and stacked along channel/time axes.
- Metadata (metadata.json):
- Task type (Animation/Replacement)
- Generator used (“SCAIL”, “Wan-Animate”, “MoCha”)
- Binding map (driving to reference character ID mapping)
- Source of environment
- Frame count, resolution, sample ID
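The structure above can be mirrored by a small typed record for the metadata and a helper that downsamples a mask stack to latent resolution. This is a sketch under stated assumptions: the JSON key names, the `SampleMeta` class, the `(T, C, H, W)` mask layout with one raw channel per slot, and the 64×64 latent size in the usage lines are all hypothetical, and no claim is made about the channel count after spatiotemporal aggregation.

```python
from dataclasses import dataclass

import numpy as np

NUM_BINDING_SLOTS = 6  # from the text
NUM_ENV_CHANNELS = 1   # from the text


@dataclass
class SampleMeta:
    """Hypothetical typed view of one metadata.json record."""
    sample_id: str
    task_type: str     # "Animation" or "Replacement"
    generator: str     # "SCAIL", "Wan-Animate", or "MoCha"
    binding_map: dict  # driving character ID -> reference character ID
    env_source: str
    num_frames: int
    resolution: tuple

    @classmethod
    def from_dict(cls, d: dict) -> "SampleMeta":
        # Key names are assumptions; adapt them to the released schema.
        return cls(
            sample_id=d["sample_id"],
            task_type=d["task_type"],
            generator=d["generator"],
            binding_map=dict(d["binding_map"]),
            env_source=d["env_source"],
            num_frames=int(d["num_frames"]),
            resolution=tuple(d["resolution"]),
        )


def downsample_masks(masks: np.ndarray, latent_hw: tuple) -> np.ndarray:
    """Nearest-neighbour downsample of a (..., H, W) mask stack."""
    h, w = latent_hw
    H, W = masks.shape[-2:]
    rows = (np.arange(h) * H) // h  # source row for each latent row
    cols = (np.arange(w) * W) // w  # source column for each latent column
    return masks[..., rows[:, None], cols[None, :]]


# Usage: 4 frames, 6 binding slots + 1 environment channel, 512x512 pixels.
masks = np.zeros((4, NUM_BINDING_SLOTS + NUM_ENV_CHANNELS, 512, 512), dtype=np.float32)
masks[:, 0, :256, :] = 1.0  # first binding slot covers the top half
latent_masks = downsample_masks(masks, (64, 64))

meta = SampleMeta.from_dict({
    "sample_id": "000001", "task_type": "Replacement", "generator": "MoCha",
    "binding_map": {"0": "1"}, "env_source": "driving",
    "num_frames": 16, "resolution": [512, 512],
})
```

Nearest-neighbour sampling keeps the masks binary, which matters if they are later used as hard binding indicators rather than soft weights.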
5. Splits, Cross-Domain Generalization, and Evaluation Protocol
The recommended split is 80% training (~47,500), 10% validation (~5,900), and 10% test (~5,900). The test set is curated to contain held-out character identities, novel backgrounds, and multi-character interactions to robustly evaluate cross-domain and cross-identity generalization.
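One way to obtain held-out identities is to split at the level of character identities rather than individual samples, so that no identity appears in more than one partition. The sketch below is an assumption about how such a split could be reproduced, not the authors' curation procedure; `split_by_identity` and its input mapping are hypothetical, and the ratios apply to identities, so per-sample proportions are only approximate when identities have unequal sample counts.

```python
import random


def split_by_identity(sample_to_identity: dict,
                      ratios=(0.8, 0.1, 0.1),
                      seed: int = 0) -> dict:
    """Assign whole identities to train/val/test so that none is shared."""
    identities = sorted(set(sample_to_identity.values()))
    random.Random(seed).shuffle(identities)

    n_train = round(ratios[0] * len(identities))
    n_val = round(ratios[1] * len(identities))

    # Map each identity to exactly one partition.
    bucket = {}
    for i, ident in enumerate(identities):
        if i < n_train:
            bucket[ident] = "train"
        elif i < n_train + n_val:
            bucket[ident] = "val"
        else:
            bucket[ident] = "test"

    # Every sample follows its identity.
    splits = {"train": [], "val": [], "test": []}
    for sample_id, ident in sample_to_identity.items():
        splits[bucket[ident]].append(sample_id)
    return splits


# Toy example: 100 samples spread evenly over 20 identities.
toy = {f"s{i:03d}": f"id{i % 20:02d}" for i in range(100)}
splits = split_by_identity(toy)
```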
Task domains incorporate:
- Diverse driving activities (human, prop interactions, select non-human/cartoon references)
- Replacement scenarios demanding integration of characters and backgrounds from disparate scenes
A plausible implication is that the dataset structure and split strategies encourage robust model evaluation on out-of-distribution animation phenomena.
6. Licensing, Use Cases, and Known Limitations
MotionPair-60K—along with SCAIL-2 model weights—is to be released under a CC-BY-NC 4.0 license. Major envisioned uses include:
- Large-scale benchmarking for end-to-end character animation, especially with cross-identity motion transfer and environment recomposition
- Production workflows for film/game where reanimation or character replacement is desired without intermediate parsing
- Testing and development of in-context video conditioning or video diffusion architectures
Documented limitations and bias sources are:
- Synthetic region fidelity is limited by the renderers and pose estimators, with persistent errors in fine finger articulation and close-up viewpoints.
- Non-human and animal references remain underrepresented.
- Domain gaps can emerge when transferring from synthetic reanimation to real video, though “Bias-Aware DPO” post-training modestly mitigates hand/finger biases.
7. Comparison with Prior Animation Datasets
MotionPair-60K differs fundamentally from previously established datasets:
| Dataset | Representation | Paired Cross-Identity Video | Heterogeneous Tasks |
|---|---|---|---|
| Mixamo, CMU MoCap | Skeleton only | No | No |
| iPER, VoxCelebReenact. | Video, single identity | No | No |
| Unreal/MoCha | Video, synthetic only | Yes, but limited diversity | No |
| MotionPair-60K | End-to-end video | Yes | Yes |
MotionPair-60K is the first dataset to provide large-scale, cross-identity paired video covering both animation and replacement, unified by mask conditioning and mode-specific RoPE (Yan et al., 9 Jun 2026).
8. Code Example: Data Loading and Training
The following code sketches a data loading and training routine for MotionPair-60K. The `video_diffusion` module and its `VAE` and `I2V_Denoiser` classes are treated here as placeholder interfaces, and the per-sample file names are those assumed by the loader:

```python
import json
import os

import numpy as np
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from torch.utils.data import Dataset, DataLoader

# Placeholder module: VAE and I2V_Denoiser stand in for the actual model code.
from video_diffusion import VAE, I2V_Denoiser


class MotionPairDataset(Dataset):
    def __init__(self, root, split='train'):
        self.root = os.path.join(root, split)
        self.samples = sorted(os.listdir(self.root))  # one directory per sample
        self.to_tensor = T.Compose([
            T.Resize((512, 512)),
            T.ToTensor(),
            T.Normalize([0.5] * 3, [0.5] * 3),  # map [0, 1] to [-1, 1]
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sid = self.samples[idx]
        base = os.path.join(self.root, sid)
        # Videos are read as pre-decoded tensors (driving.pt / target.pt).
        driving = torch.load(os.path.join(base, 'driving.pt'))
        target = torch.load(os.path.join(base, 'target.pt'))
        # PIL opens the image; torchvision.transforms has no Image loader.
        ref_path = os.path.join(base, 'ref.png')
        ref_img = self.to_tensor(Image.open(ref_path).convert('RGB'))
        masks = np.load(os.path.join(base, 'masks.npz'))['masks']
        masks = torch.from_numpy(masks).float()
        # File name follows Section 4 (metadata.json).
        with open(os.path.join(base, 'metadata.json')) as f:
            meta = json.load(f)
        return {
            'driving': driving,
            'target': target,
            'ref_img': ref_img,
            'masks': masks,
            'meta': meta,
        }


dataset = MotionPairDataset('/path/to/MotionPair60K', split='train')
# Note: ragged metadata fields (e.g., the binding map) may need a custom collate_fn.
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)

vae = VAE(pretrained='wan2.1-14B').to('cuda')
denoiser = I2V_Denoiser().to('cuda')
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-5)

for epoch in range(10):
    for batch in loader:
        driving = batch['driving'].cuda()
        target = batch['target'].cuda()
        ref = batch['ref_img'].cuda()
        masks = batch['masks'].cuda()

        # Only the denoiser is optimized, so VAE encoding needs no gradients.
        with torch.no_grad():
            z_driv = vae.encode_video(driving)
            z_tgt = vae.encode_video(target)
            z_ref = vae.encode_image(ref).unsqueeze(1)
        z_input = torch.cat([z_ref, z_tgt, z_driv], dim=1)

        # Resize masks to the latent grid. F.interpolate needs (N, C, H, W)
        # input for 2D resizing, so all leading axes are folded into N.
        lat_hw = tuple(z_ref.shape[-2:])
        mask_lat = F.interpolate(
            masks.flatten(0, -3).unsqueeze(1), size=lat_hw, mode='nearest'
        ).reshape(*masks.shape[:-2], *lat_hw)

        loss = denoiser.compute_loss(z_input, mask_lat)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch} loss {loss.item():.4f}")
```
Utility notes:
- The dataset loader returns raw RGB tensors for videos; latent conversion is accomplished via the provided VAE.
- Masks are spatially and temporally aligned to latent representations and concatenated along prescribed axes.
- Metadata can be used to guide mode-specific conditioning and to implement “Bias-Aware DPO” post-training for bias correction in articulated regions.
Further details and release schedules are available at the project site: https://teal024.github.io/SCAIL-2/ (Yan et al., 9 Jun 2026).