
MotionEdit Dataset Overview

Updated 23 January 2026
  • "MotionEdit Dataset" refers to a set of closely related high-fidelity datasets that enable motion-centric editing across image editing, text-to-motion, and style-transfer tasks, each with comprehensive annotations.
  • The datasets employ methods such as MLLM-based filtering and dense optical-flow estimation to ensure precise motion changes and physically plausible edits.
  • Benchmark results indicate significant performance improvements in generative quality and motion alignment metrics, fostering reproducible research in motion editing.

MotionEdit Dataset

The term "MotionEdit Dataset" refers to distinct but closely related datasets supporting the study and benchmarking of motion-centric editing tasks in both visual and motion-capture modalities. These include datasets specifically designed for (1) motion-centric image editing, (2) fine-grained text-to-motion generation and editing, and (3) motion style transfer using high-fidelity mocap data. Each instantiation incorporates rigorous annotation, comprehensive task and metric definitions, and is accompanied by open-source or commercial codebases to support reproducibility and downstream research.

1. MotionEdit for Motion-Centric Image Editing

The MotionEdit dataset (Wan et al., 11 Dec 2025) is the first large-scale dataset dedicated to motion-centric image editing, a task defined as modifying subject actions and interactions in an image while preserving scene identity, structural integrity, and physical plausibility. The dataset comprises 10,157 image pairs sampled directly from continuous high-fidelity Text-to-Video (T2V) corpora (ShareVeo3 and KlingAI), where each example consists of an input image, a text instruction (derived from the actually observed action), and a real target (post-edit) image.

MotionEdit leverages an automatic filtering pipeline using Google Gemini, a multimodal LLM (MLLM), to ensure setting consistency, a non-trivial and describable motion change, and subject integrity, discarding any low-quality samples. The MLLM also generates the edit instructions in imperative style, yielding fully automatic pseudo-ground-truth annotation without manual intervention. The dataset spans six motion categories: pose/posture, locomotion/distance, object state/formation, orientation/viewpoint, subject–object interaction, and inter-subject interaction. Its mean optical-flow magnitude (0.19, normalized) significantly exceeds that of previous editing datasets.
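As a rough illustration of how such an MLLM filtering stage can be wired, the sketch below checks setting consistency, motion non-triviality, and subject integrity and requests an imperative instruction in one call. The prompt wording, the `call_mllm` helper, and the JSON reply fields are illustrative assumptions, not the pipeline's actual implementation.

```python
import json

# Hypothetical helper that sends two frames plus a prompt to a multimodal LLM
# (e.g., Gemini) and returns its raw text reply; wire it to your own endpoint.
def call_mllm(prompt: str, images: list) -> str:
    raise NotImplementedError

FILTER_PROMPT = """You are given two frames from the same video.
Reply in JSON with keys:
  same_setting      (bool) - identical scene, camera, and subjects
  motion_nontrivial (bool) - the subject's action visibly and describably changes
  subject_intact    (bool) - no subject appears, disappears, or deforms
  instruction       (str)  - an imperative edit instruction for the motion change
"""

def filter_and_annotate(frame_before, frame_after):
    """Return an imperative edit instruction if the pair passes all checks, else None."""
    reply = json.loads(call_mllm(FILTER_PROMPT, [frame_before, frame_after]))
    if reply["same_setting"] and reply["motion_nontrivial"] and reply["subject_intact"]:
        return reply["instruction"]
    return None  # discarded as a low-quality sample
```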

2. Dataset Structure and Motion Representation

Each MotionEdit entry consists of the following fields (a minimal record sketch follows the list):

  • Input Image ($I_{orig}$)
  • Imperative Text Instruction (e.g., "Make the woman raise her right hand.")
  • Target Image ($I_{gt}$), sampled as the last frame of a 3-second T2V segment
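A minimal record layout for one such entry could look as follows; the field names and paths are hypothetical, not the release schema.

```python
from dataclasses import dataclass

@dataclass
class MotionEditSample:
    """One MotionEdit entry (field names are illustrative, not the release schema)."""
    input_image: str      # path to I_orig
    instruction: str      # imperative edit text
    target_image: str     # path to I_gt, last frame of the 3-second T2V segment
    motion_category: str  # one of the six categories, e.g. "pose/posture"

sample = MotionEditSample(
    input_image="images/000123_orig.png",
    instruction="Make the woman raise her right hand.",
    target_image="images/000123_gt.png",
    motion_category="pose/posture",
)
```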

Motion transformations are quantified using dense optical-flow fields between pre-edit and post-edit images calculated with a high-accuracy pretrained estimator (e.g., UniMatch). Three key motion consistency terms guide both evaluation and learning:

  • Magnitude consistency $D_{mag}$
  • Direction consistency $D_{dir}$, with the direction error weighted by the ground-truth motion magnitude
  • Movement regularization $M_{move}$, penalizing insufficient displacement

These are combined into a composite distance $D_{comb} = \alpha D_{mag} + \beta D_{dir} + \lambda_{move} M_{move}$ (with $\alpha = 0.7$, $\beta = 0.2$, $\lambda_{move} = 0.1$) and a normalized motion alignment reward. This explicit motion linkage enables precise evaluation of model capability for motion edits.
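The sketch below gives one plausible instantiation of these terms over dense $(H, W, 2)$ flow fields (e.g., from UniMatch), using the stated weights; the exact normalizations and the mapping to the motion alignment reward may differ from the paper's definitions.

```python
import numpy as np

def motion_distance(flow_pred, flow_gt, alpha=0.7, beta=0.2, lam_move=0.1,
                    eps=1e-6, min_disp=1.0):
    """Plausible instantiation of D_comb = alpha*D_mag + beta*D_dir + lam_move*M_move.

    flow_pred, flow_gt: (H, W, 2) dense optical-flow fields between the input
    image and the edited / ground-truth target image.
    """
    mag_pred = np.linalg.norm(flow_pred, axis=-1)
    mag_gt = np.linalg.norm(flow_gt, axis=-1)

    # Magnitude consistency: mean absolute difference of flow magnitudes.
    d_mag = np.mean(np.abs(mag_pred - mag_gt))

    # Direction consistency: angular error (1 - cosine similarity),
    # weighted by ground-truth magnitude so static pixels contribute little.
    cos = np.sum(flow_pred * flow_gt, axis=-1) / (mag_pred * mag_gt + eps)
    d_dir = np.sum((1.0 - cos) * mag_gt) / (np.sum(mag_gt) + eps)

    # Movement regularization: penalize edits with too little displacement
    # (min_disp is an assumed threshold, not a value from the paper).
    m_move = max(0.0, min_disp - float(np.mean(mag_pred)))

    return alpha * d_mag + beta * d_dir + lam_move * m_move
```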

3. Benchmark, Metrics, and Baseline Results

The accompanying MotionEdit-Bench benchmark centralizes quantitative and preference-based evaluation of motion-centric editing:

  • Generative metrics (evaluated by Gemini MLLM): Fidelity, Preservation, Coherence, Overall Quality (0–5 scale)
  • Discriminative metric: Motion Alignment Score (MAS), derived from combined optical-flow components
  • Preference metric: Win Rate (MLLM-judged pairwise output preference); a minimal aggregation sketch follows this list
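For the preference metric, the aggregation itself is simple; the sketch below turns per-pair MLLM judgements into a Win Rate percentage. Treating ties as non-wins is an assumption here rather than the benchmark's documented convention.

```python
def win_rate(judgements):
    """Aggregate MLLM pairwise preferences ("win" / "tie" / "loss") into a percentage."""
    if not judgements:
        return 0.0
    wins = sum(1 for j in judgements if j == "win")
    return 100.0 * wins / len(judgements)

print(win_rate(["win", "loss", "win", "tie"]))  # 50.0
```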

A representative metric table:

Model                         Overall↑   MAS↑    Win Rate↑
AnyEdit                       1.06       18.46   13.3%
UltraEdit                     2.42       47.18   28.8%
Qwen-Image-Edit               4.65       56.46   73.0%
Qwen-Image-Edit + MotionNFT   4.72       57.23   73.9%

Existing diffusion-based models (e.g., AnyEdit, MagicBrush) perform poorly on motion-centric edits (MAS < 45), while Qwen-Image-Edit and FLUX.1 Kontext provide competitive baselines that are further improved by the MotionNFT reward-aligned finetuning protocol (+0.4–0.5 in generative score, +1–2 in MAS) (Wan et al., 11 Dec 2025). This highlights the persistent gap between general-purpose editing and precise motion-centric transformation.
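MotionNFT's exact objective is not reproduced here; the sketch below shows a generic reward-aligned finetuning step in which a motion-alignment reward (e.g., a normalized score derived from $D_{comb}$) up-weights generated edits whose motion matches the ground truth. The `model.generate` and `model.log_prob` interfaces are placeholders.

```python
import torch

def reward_weighted_step(model, batch, optimizer, motion_reward_fn):
    """Generic reward-weighted finetuning step (illustrative, not MotionNFT itself)."""
    # Placeholder interfaces: generate edits and score their log-likelihood.
    edits = model.generate(batch["input_image"], batch["instruction"])
    rewards = motion_reward_fn(edits, batch["target_image"])                   # (B,)
    logp = model.log_prob(edits, batch["input_image"], batch["instruction"])   # (B,)

    # Up-weight samples whose motion aligns with the ground-truth edit.
    loss = -(rewards.detach() * logp).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```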

4. FineMotion: Fine-Grained Text-to-Motion Generation and Editing

The FineMotion dataset (also referred to as "MotionEdit" in some contexts) (Wu et al., 26 Jul 2025) targets fine-grained spatial and temporal annotation for human motion generation and editing. It consists of 29,232 full motion sequences (sourced from AMASS and HumanAct12, preprocessed to 20 FPS, with ≤10s duration and retargeting to a canonical SMPL skeleton; split 80/5/15% train/val/test) and 442,314 0.5s-motion "snippets." Snippet-level body-part movement descriptions (BPMSD) and sequence-level paragraph descriptions (BPMP) provide dense, temporally aligned language annotation: 95% are automatically generated, 5% human-checked.
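At 20 FPS, each 0.5 s snippet corresponds to 10 frames. The sketch below splits a preprocessed sequence into non-overlapping snippets; whether the official snippets overlap or keep the trailing remainder is an assumption here.

```python
import numpy as np

def split_into_snippets(motion, fps=20, snippet_sec=0.5):
    """Split a (T, D) motion sequence into non-overlapping 0.5 s snippets."""
    frames_per_snippet = int(fps * snippet_sec)  # 10 frames at 20 FPS
    n = motion.shape[0] // frames_per_snippet
    return [motion[i * frames_per_snippet:(i + 1) * frames_per_snippet]
            for i in range(n)]

seq = np.zeros((137, 263))            # a 6.85 s sequence in the d=263 representation
print(len(split_into_snippets(seq)))  # 13 snippets; the trailing 7 frames are dropped
```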

Annotations refer to explicit SMPL body parts but do not expose joint-level numeric values in text. The directory structure separates per-sequence motion NPZ files (SMPL pose + shape; d=263), snippet captions (JSON), and paragraph descriptions (JSON), facilitating ready integration into 3D motion generation pipelines.
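A loader for this layout could look like the following; the directory names, file naming, and JSON keys are illustrative assumptions rather than the exact release structure.

```python
import json
import numpy as np

def load_sequence(seq_id, root="FineMotion"):
    """Load one sequence's motion array plus its snippet- and sequence-level text."""
    motion = np.load(f"{root}/motions/{seq_id}.npz")["poses"]   # (T, 263) SMPL pose + shape features
    with open(f"{root}/snippets/{seq_id}.json") as f:
        bpmsd = json.load(f)                                    # per-0.5s body-part movement descriptions
    with open(f"{root}/paragraphs/{seq_id}.json") as f:
        bpmp = json.load(f)                                     # sequence-level paragraph description
    return motion, bpmsd, bpmp
```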

5. Supported Tasks, Evaluation, and Quantitative Results

FineMotion enables the following core tasks:

  • Text-driven fine-grained motion generation: Given a coarse caption and fine-grained detail (both pooled by T5 into 768-D vectors; see the pooling sketch after this list), synthesize a full 3D motion sequence. Evaluation employs Retrieval-Precision@K (R@K), Fréchet Inception Distance (FID), multimodal distance, diversity, and multimodality.
  • Zero-shot fine-grained motion editing: An interactive pipeline allowing spatial/temporal edits by modifying the BPMSD text and synthesizing the edited motion with a fine-tuned T2M-GPT model conditioned on the concatenated text, with no retraining required at inference. User studies (30 viewers, 9 cases) demonstrate a superior edit-meeting rate and naturalness, and parity in similarity to the original, compared to FLAME and T2M-GPT.
  • Benchmark splits mirror HumanML3D, ensuring non-overlapping sequence IDs.
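The sketch below illustrates pooling a caption into a 768-D vector with a T5 encoder (t5-base has a 768-D hidden size); mean pooling over encoder states and simple concatenation of the coarse and fine-grained vectors are assumptions here, not necessarily the conditioning used in the paper.

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

def pool_text(text: str) -> torch.Tensor:
    """Mean-pool T5 encoder states into a single 768-D vector."""
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**tokens).last_hidden_state                 # (1, L, 768)
    mask = tokens.attention_mask.unsqueeze(-1).to(hidden.dtype)      # (1, L, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)              # (1, 768)

coarse = pool_text("a person waves")
detail = pool_text("the right arm lifts above the shoulder and the hand waves twice")
condition = torch.cat([coarse, detail], dim=-1)                      # concatenated conditioning
```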

Excerpts of key quantitative results:

Model            R@3 ↑   FID ↓
MDM (T only)     0.606   3.137
(T∥BPMSD)-MDM    0.745   0.756

The improvements are +13.9 percentage points in R@3 and roughly a 75% reduction in FID. Similar lifts are observed for T2M-GPT and MoMask; the average Top-3 accuracy improvement is 10–15 percentage points, with MDM gaining +15.3% as reported (Wu et al., 26 Jul 2025).

6. Bandai-Namco-Research-Motiondataset for Motion Style Transfer

The Bandai-Namco-Research-Motiondataset (grouped here under the "MotionEdit Dataset" umbrella but generally cited as BNR-Motion) (Kobayashi et al., 2023) comprises two open datasets (Ours-1: 36,665 frames, 17 content types × 15 styles; Ours-2: 384,931 frames, 10 content types × 7 styles) with industry-standard 21-joint humanoid rigs and clearly delimited style labels (e.g., normal, active, exhausted, youthful). Motion actors followed strict consistency and differentiation guidelines. Data is provided as BVH/FBX at 30 FPS in Y-up, right-handed coordinates and is ready for use in Unity/Unreal pipelines.
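A quick sanity check on the released BVH files can read the MOTION header directly; the sketch below extracts the frame count and FPS (expected: 30). The example path is illustrative.

```python
def bvh_frame_info(path):
    """Return (frame_count, fps) parsed from a standard BVH MOTION header."""
    frames, frame_time = None, None
    with open(path) as f:
        for line in f:
            if line.startswith("Frames:"):
                frames = int(line.split()[1])
            elif line.startswith("Frame Time:"):
                frame_time = float(line.split()[2])
            if frames is not None and frame_time is not None:
                break
    return frames, round(1.0 / frame_time) if frame_time else None

# frames, fps = bvh_frame_info("dataset/data/walk_active_01.bvh")  # illustrative path
```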

The dataset's evaluation adopts the Motion Puzzle protocol (style/content disentanglement) and demonstrates strong qualitative results in style preservation, smooth style interpolation, and seamless retargeting to various character rigs. Baseline content-style transfer models (notably latent-space mixing and pooling modifications near the shoulders) are supported, and no numeric metric for style distance is defined in the paper (Kobayashi et al., 2023).
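As a generic illustration of latent-space content-style mixing (not the specific model evaluated under the Motion Puzzle protocol), an AdaIN-style statistic swap over per-channel features can be sketched as follows.

```python
import torch

def adain_mix(content_latent, style_latent, eps=1e-5):
    """Re-normalize content features to the per-channel statistics of the style features.

    content_latent, style_latent: (C, T) temporal feature maps from a motion encoder.
    This is an illustrative mixing scheme, not the paper's baseline architecture.
    """
    c_mean, c_std = content_latent.mean(-1, keepdim=True), content_latent.std(-1, keepdim=True)
    s_mean, s_std = style_latent.mean(-1, keepdim=True), style_latent.std(-1, keepdim=True)
    normalized = (content_latent - c_mean) / (c_std + eps)
    return normalized * s_std + s_mean
```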

7. Availability, Licensing, and Usage

Both BNR-Motion datasets are distributed openly in BVH/FBX form, and the MotionEdit and FineMotion releases are accompanied by codebases and benchmark splits. Together, these resources serve complementary needs: high-fidelity real-image motion edits, text-grounded fine-grained motion control, and detailed style transfer, all grouped under the "MotionEdit Dataset" umbrella.


References:

  • (Wan et al., 11 Dec 2025) MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
  • (Wu et al., 26 Jul 2025) FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing
  • (Kobayashi et al., 2023) Motion Capture Dataset for Practical Use of AI-based Motion Editing and Stylization
