MotionPair-60K: End-to-End Motion Dataset

Updated 14 June 2026

MotionPair-60K is a large-scale dataset offering raw paired video triads for end-to-end motion transfer without relying on intermediate skeletons or masks.
It unifies diverse animation tasks, including single- and multi-character animation and replacement, through soft-guidance signals and in-context conditioning.
The dataset supports robust benchmarking and bias-mitigated modeling, advancing scalable image-to-video generative research in controlled character animation.

MotionPair-60K is a large-scale, end-to-end motion transfer dataset designed to advance research in controlled character animation without intermediate representations such as skeletons or background masks. Legacy datasets depend on pose- or mask-based parsing and thereby discard complex visual cues. MotionPair-60K instead provides paired driving/reference/target video triads spanning heterogeneous animation tasks, preserving all visual information and adding soft-guidance signals that support multi-task training. The dataset was introduced alongside the SCAIL-2 framework for unified, in-context-conditioned animation. It aims to address fragmentation across sub-tasks and to support robust, scalable, bias-mitigated modeling for both single- and multi-character motion transfer (Yan et al., 9 Jun 2026).

1. Motivation and Design Rationale

Traditional character animation pipelines rely on intermediate skeleton representations or masked backgrounds, which discards much of the visual structure related to occlusions, object interactions, and environmental context. Prior video diffusion approaches that operate end to end, in turn, require large-scale paired data: videos of the same motion performed by different characters in possibly disparate settings. Such data has been scarce because of the compositional complexity of animation tasks.

MotionPair-60K was thus curated to fulfill several key objectives:

- Enable paired, end-to-end learning where the network sees all raw visual input and generates the full output video, with no explicit intermediate mask or skeleton as the target.
- Unify the training of single-character animation, multi-character animation, and character/environment replacement tasks under a single, mask-guided conditioning scheme using “soft guidance” (in-context masks and mode-specific RoPE).
- Provide scale, task diversity, and annotation fidelity sufficient for benchmarking and advancing I2V (image-to-video) generative models in broad character animation scenarios.
- Facilitate post-training bias mitigation, especially for synthetic artifacts in fine articulated regions, via a dedicated “Bias-Aware DPO” protocol (Yan et al., 9 Jun 2026).

2. Data Synthesis Pipeline

The assembly of MotionPair-60K follows an agentic synthetic loop structured around high-fidelity pair and triplet generation. Central steps include:

- Animation Pair Synthesis: Real driving videos are sampled from large human-centric video pools. Candidate character images are selected to match the pose distribution of the driving sequence, and a prompt-driven generator (e.g., Gemini API) fabricates reference images congruent with the target character’s anticipated style and posture. Pose-driven animation generators (SCAIL; Wan-Animate for close-ups) are employed to create the reanimated video, which is then quality-checked via a lightweight vision-LLM (reject rate < 30%).
- Replacement Generation: For character replacement tasks, the MoCha renderer generates both single- and multi-character replacement pairs, manipulating the environment and identity bindings to exercise occlusion and environmental inference.
- Reverse Driving: Triplets are constructed by reversing the animation process. Given a real sequence and its synthetic reanimation, the triplet (driving, reference, target) enables supervision where the driving input transmits only motion, and the target encompasses all environmental realism and object interactions.
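The reverse-driving step can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes the synthetic reanimation plays the role of the driving clip and the real sequence plays the role of the target, which is one reading of the description above. The `Triplet` class, the `build_reverse_triplet` helper, the file names, and the boolean quality flag are all hypothetical, and the choice of reference image is left to the caller because the text does not specify it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triplet:
    driving: str    # clip that should carry motion only
    reference: str  # image defining the character's appearance
    target: str     # clip the model is trained to reproduce


def build_reverse_triplet(real_clip: str, reanimated_clip: str,
                          reference_image: str,
                          passed_quality_check: bool) -> Optional[Triplet]:
    """Return a training triplet, or None if the reanimation was rejected.

    Assumption: the synthetic reanimation becomes the driving input and
    the real clip becomes the target, so the target keeps the real
    environment and object interactions.
    """
    if not passed_quality_check:
        return None
    return Triplet(driving=reanimated_clip,
                   reference=reference_image,
                   target=real_clip)


# Hypothetical candidates: (real clip, reanimation, reference, quality flag)
candidates = [
    ("real_0001.mp4", "reanim_0001.mp4", "ref_0001.png", True),
    ("real_0002.mp4", "reanim_0002.mp4", "ref_0002.png", False),
]
triplets = [t for t in (build_reverse_triplet(*c) for c in candidates)
            if t is not None]
```

Only the first candidate survives the filter here, since the second one is flagged as rejected by the (stand-in) quality check.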

Key underlying generative operators are defined as:

Forward diffusion in VAE latent space:

$q(z_t \mid z_{t-1}) = \mathcal{N}(z_t; \sqrt{1-\beta_t}\; z_{t-1}, \beta_t I)$

Denoising loss:

$\mathcal{L} = \mathbb{E}_{z_t,\epsilon \sim \mathcal{N}(0,I)} \left[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\right]$

Core animation generator:

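$\tilde{y} = \mathcal{G}(y, I)$

Read against the pipeline above, $y$ is the real driving video, $I$ the reference image, and $\tilde{y}$ the synthetic reanimation.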
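The forward process and denoising loss above can be made concrete with a small standard-library sketch. Iterating the one-step kernel gives the well-known closed form $q(z_t \mid z_0) = \mathcal{N}(z_t; \sqrt{\bar\alpha_t}\, z_0, (1-\bar\alpha_t) I)$ with $\bar\alpha_t = \prod_{s \le t} (1-\beta_s)$, which the code uses to draw $z_t$ directly. The linear schedule, the number of steps, and the toy latent are assumptions chosen for illustration; the source does not state which schedule is used, and the noise predictor $\epsilon_\theta$ is replaced by placeholder predictions.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Assumed linear noise schedule (not specified in the source)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(z0, t, abar, rng):
    """Draw z_t from q(z_t | z_0) in closed form; return (z_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal, noise = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    z_t = [signal * z + noise * e for z, e in zip(z0, eps)]
    return z_t, eps


def denoising_loss(eps, eps_pred):
    """Squared L2 error between true and predicted noise for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
z0 = [0.5, -1.0, 2.0]                      # toy 3-element "latent"
z_t, eps = q_sample(z0, t=999, abar=abar, rng=rng)

perfect = denoising_loss(eps, eps)          # an oracle predictor
zero_guess = denoising_loss(eps, [0.0] * len(eps))  # always predicts zero
```

The training objective is the expectation of `denoising_loss` over latents, noise draws, and conditioning; the two calls at the end only bracket the range between an oracle and a trivial predictor.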

3. Dataset Composition and Statistics

MotionPair-60K comprises 59,376 end-to-end motion transfer pairs. The distribution of samples by generator/model and task is as follows:

Task Type	Generator	Pair Count	Percentage
Single-character animation	SCAIL	31,895	53.7%
Single-character animation	Wan-Animate	13,847	23.3%
Single-character replacement	MoCha	9,249	15.6%
Multi-character replacement	MoCha	4,385	7.4%
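The totals and percentages in the table can be checked directly from the pair counts. The snippet below only uses numbers taken from the table; the dictionary layout is an illustrative choice.

```python
# Pair counts copied from the table above, keyed by (task type, generator).
pair_counts = {
    ("Single-character animation", "SCAIL"): 31_895,
    ("Single-character animation", "Wan-Animate"): 13_847,
    ("Single-character replacement", "MoCha"): 9_249,
    ("Multi-character replacement", "MoCha"): 4_385,
}

total = sum(pair_counts.values())  # 59376

# Share of each row, rounded to one decimal place as in the table.
shares = {key: round(100 * n / total, 1) for key, n in pair_counts.items()}

# Aggregate by broad task family.
animation_total = sum(n for (task, _), n in pair_counts.items()
                      if task.endswith("animation"))
replacement_total = total - animation_total

for (task, generator), share in shares.items():
    print(f"{task:30s} {generator:12s} {share:5.1f}%")
```

Animation pairs account for 45,742 of the 59,376 samples and replacement pairs for the remaining 13,634.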

- Training sampling ratios: 60% end-to-end animation, 20% replacement, and 20% additional pose-driven skeleton/video pairs (~100,000 samples, used for pretraining only).
- Each video is stored at 512×512 pixel resolution, typically with 16–32 frames.
- All videos are encoded with a pretrained VAE that maps each frame to a latent grid (commonly 16×16), which keeps downstream modeling efficient.
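One simple way to realize the 60/20/20 sampling ratios is to pick the data source for each training sample with weighted random choice. This is a sketch under assumptions: the source names, the per-sample (rather than per-batch) granularity, and the helper `sample_sources` are illustrative and not taken from the source.

```python
import random
from collections import Counter

# Mixing weights from the text; the key names are hypothetical labels.
SOURCE_WEIGHTS = {
    "end_to_end_animation": 0.6,
    "replacement": 0.2,
    "pose_driven": 0.2,
}


def sample_sources(num_samples, rng):
    """Draw a data source for each of num_samples training samples."""
    names = list(SOURCE_WEIGHTS)
    weights = list(SOURCE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=num_samples)


rng = random.Random(0)
draws = sample_sources(10_000, rng)
freq = {name: count / len(draws) for name, count in Counter(draws).items()}
```

Over many draws the empirical frequencies in `freq` approach the configured weights, so each source contributes its intended share regardless of how many samples it contains.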

4. Data Structure, Modalities, and Annotation

MotionPair-60K provides both raw visual data and soft-guidance annotation:

Modalities:
- Driving video (input, mp4)
- Target video (output, mp4)
- Reference image (single frame, png)
- In-context masks: environment switch mask plus $K=6$ character-binding masks (npy/npz, stacked)
- Optional pose-skeleton annotation (json) for pose-driven samples
Mask Encoding:
- 6 binding slots and 1 environment channel, yielding $4 \times (K+1)$ mask channels after spatiotemporal aggregation.
- Masks are spatially downsampled to match VAE latent resolution and stacked along channel/time axes.
Metadata (metadata.json):
- Task type (Animation/Replacement)
- Generator used (“SCAIL”, “Wan-Animate”, “MoCha”)
- Binding map $\pi$ (driving to reference character ID mapping)
- Source of environment
- Frame count, resolution, sample ID
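The mask arithmetic above can be illustrated with plain Python. With $K=6$ binding slots and one environment channel, $4 \times (K+1)$ gives 28 mask channels. The sketch also shows strided nearest-neighbour downsampling of a toy mask and a basic sanity check on a binding map. The dictionary format of the binding map and the helper names are assumptions; the source only says that $\pi$ maps driving characters to reference character IDs.

```python
K = 6                   # character-binding slots, as stated in the text
AGGREGATION_FACTOR = 4  # the factor of 4 in 4 x (K + 1), as stated in the text


def mask_channels(num_slots=K, factor=AGGREGATION_FACTOR):
    """Channels after aggregation: K binding slots plus one environment channel."""
    return factor * (num_slots + 1)


def downsample_nearest(mask, factor):
    """Strided nearest-neighbour downsampling of a 2D mask (list of rows)."""
    return [row[::factor] for row in mask[::factor]]


def is_valid_binding(binding, num_slots=K):
    """Check a hypothetical binding map {driving_id: reference_slot}."""
    slots = list(binding.values())
    return (len(slots) <= num_slots
            and all(isinstance(s, int) and 0 <= s < num_slots for s in slots))


# Toy 8x8 binary mask with the top-left 4x4 block set to 1.
mask = [[1 if r < 4 and c < 4 else 0 for c in range(8)] for r in range(8)]
small = downsample_nearest(mask, 4)   # 2x2 result: [[1, 0], [0, 0]]

n_channels = mask_channels()          # 28
ok = is_valid_binding({"driver_a": 0, "driver_b": 3})
bad = is_valid_binding({"driver_a": 6})  # slot index out of range for K = 6
```

Nearest-neighbour sampling keeps the masks binary, which is why it is a natural choice when resizing them to the latent resolution.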

5. Splits, Cross-Domain Generalization, and Evaluation Protocol

The recommended split is 80% training (~47,500), 10% validation (~5,900), and 10% test (~5,900). The test set is curated to contain held-out character identities, novel backgrounds, and multi-character interactions to robustly evaluate cross-domain and cross-identity generalization.
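The split sizes follow from the total of 59,376 pairs, and held-out identities can be enforced by assigning whole identities, rather than individual clips, to a split. The hash-based assignment below is one common way to do this and is an assumption for illustration; the source does not describe how the test set was actually curated, and the identity strings are hypothetical.

```python
import hashlib

TOTAL_PAIRS = 59_376  # from Section 3


def split_sizes(total, train_pct=80, val_pct=10):
    """Integer split sizes; the test split takes the remainder."""
    n_train = total * train_pct // 100
    n_val = total * val_pct // 100
    return n_train, n_val, total - n_train - n_val


def assign_split(identity_id, train_pct=80, val_pct=10):
    """Map a character identity to a split so all of its clips stay together."""
    digest = hashlib.sha256(identity_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + val_pct:
        return "val"
    return "test"


sizes = split_sizes(TOTAL_PAIRS)  # (47500, 5937, 5939)

# Every clip of the same (hypothetical) identity lands in the same split.
clips = [("char_017", "clip_a"), ("char_017", "clip_b"), ("char_203", "clip_c")]
assignment = {clip: assign_split(identity) for identity, clip in clips}
```

Because the assignment depends only on the identity string, it is deterministic across runs and cannot leak a test identity into training.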

Task domains incorporate:

- Diverse driving activities (human, prop interactions, select non-human/cartoon references)
- Replacement scenarios demanding integration of characters and backgrounds from disparate scenes

A plausible implication is that the dataset structure and split strategies encourage robust model evaluation on out-of-distribution animation phenomena.

6. Licensing, Use Cases, and Known Limitations

MotionPair-60K, along with the SCAIL-2 model weights, is to be released under a CC BY-NC 4.0 license. Major envisioned uses include:

- Large-scale benchmarking for end-to-end character animation, especially with cross-identity motion transfer and environment recomposition
- Production workflows for film/game where reanimation or character replacement is desired without intermediate parsing
- Testing and development of in-context video conditioning or video diffusion architectures

Documented limitations and bias sources are:

- Synthetic region fidelity is limited by the renderers and pose estimators, with persistent errors in fine finger articulation and close-up viewpoints.
- Non-human and animal references remain underrepresented.
- Domain gaps can emerge when transferring from synthetic reanimation to real video, though “Bias-Aware DPO” post-training modestly mitigates hand/finger biases.

7. Comparison with Prior Animation Datasets

MotionPair-60K differs fundamentally from previously established datasets:

Dataset	Representation	Paired Cross-Identity Video	Heterogeneous Tasks
Mixamo, CMU MoCap	Skeleton only	No	No
iPER, VoxCelebReenact.	Video, single identity	No	No
Unreal/MoCha	Video, synthetic only	Yes, but limited diversity	No
MotionPair-60K	End-to-end video	Yes	Yes

MotionPair-60K is the first dataset to provide large-scale, cross-identity paired video covering both animation and replacement, unified by mask conditioning and mode-specific RoPE (Yan et al., 9 Jun 2026).

8. Code Example: Data Loading and Training

The following code sketches a data loading and training routine for MotionPair-60K. The `video_diffusion` module and its `VAE` and `I2V_Denoiser` classes stand in for the user's own model code:

import json
import os

import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torch.utils.data import Dataset, DataLoader
# Placeholder module: VAE and I2V_Denoiser stand in for the user's own model code.
from video_diffusion import VAE, I2V_Denoiser

class MotionPairDataset(Dataset):
    def __init__(self, root, split='train'):
        self.root = os.path.join(root, split)
        # Sort so that sample order is reproducible across runs and machines.
        self.samples = sorted(os.listdir(self.root))
        self.to_tensor = T.Compose([
            T.Resize((512, 512)), T.ToTensor(), T.Normalize([0.5] * 3, [0.5] * 3)
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sid = self.samples[idx]
        base = os.path.join(self.root, sid)
        # Videos are assumed to be pre-decoded RGB tensors saved with torch.save.
        driving = torch.load(os.path.join(base, 'driving.pt'))
        target  = torch.load(os.path.join(base, 'target.pt'))
        # Image comes from PIL (torchvision.transforms has no Image); force 3 channels.
        ref_img = self.to_tensor(Image.open(os.path.join(base, 'ref.png')).convert('RGB'))
        masks = np.load(os.path.join(base, 'masks.npz'))['masks']
        masks = torch.from_numpy(masks).float()
        # File name follows Section 4 (metadata.json); close the handle after reading.
        with open(os.path.join(base, 'metadata.json')) as f:
            meta = json.load(f)
        return {
            'driving': driving,
            'target':  target,
            'ref_img': ref_img,
            'masks':   masks,
            'meta':    meta
        }

dataset = MotionPairDataset('/path/to/MotionPair60K', split='train')
loader  = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)
vae     = VAE(pretrained='wan2.1-14B').to('cuda')
denoiser = I2V_Denoiser().to('cuda')
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-5)

for epoch in range(10):
    for batch in loader:
        driving = batch['driving'].cuda()
        target  = batch['target'].cuda()
        ref     = batch['ref_img'].cuda()
        masks   = batch['masks'].cuda()
        # Only the denoiser is optimized, so the VAE encodes without gradients.
        with torch.no_grad():
            z_driv = vae.encode_video(driving)
            z_tgt  = vae.encode_video(target)
            z_ref  = vae.encode_image(ref).unsqueeze(1)
        # Concatenate reference, target, and driving latents along dim 1.
        z_input = torch.cat([z_ref, z_tgt, z_driv], dim=1)
        # Assumes masks are (B, M, H, W): interpolate needs 4D input to resize H and W.
        mask_lat = torch.nn.functional.interpolate(
            masks, size=z_ref.shape[-2:], mode='nearest'
        )
        loss = denoiser.compute_loss(z_input, mask_lat)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Reports the loss of the last batch in the epoch.
    print(f"Epoch {epoch} loss {loss.item():.4f}")

Utility notes:

- The dataset loader returns raw RGB tensors for videos; latent conversion is accomplished via the provided VAE.
- Masks are spatially and temporally aligned to latent representations and concatenated along prescribed axes.
- Metadata can be used to guide mode-specific conditioning and to implement “Bias-Aware DPO” post-training for bias correction in articulated regions.

Further details and release schedules are available at the project site: https://teal024.github.io/SCAIL-2/ (Yan et al., 9 Jun 2026).


References (1)

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning (2026)
