FrankenMotion: Diffusion-Based Motion Synthesis
- The paper introduces FrankenMotion, a diffusion-based framework that uses the novel FrankenStein dataset with atomic, asynchronous, part-level motion annotations.
- It employs a transformer-based DDPM conditioned on global, atomic action, and frame-wise part prompts to generate coherent, anatomically realistic full-body sequences.
- Quantitative results demonstrate significant improvements in semantic correctness and realism over state-of-the-art methods, supporting compositional generalization from unseen part-action pairings.
FrankenMotion is a diffusion-based part-aware human motion generation framework designed to address limitations in spatial and temporal controllability of text-driven motion synthesis. It introduces a novel hierarchically annotated motion dataset—FrankenStein—with atomic, asynchronous, temporally precise part-level labels, and a transformer diffusion model capable of composing full-body sequences with granular, multi-stream textual conditioning. The framework demonstrates significant advances over prior state-of-the-art methods in semantic correctness and realism, particularly in generating coherent motions from unseen combinations of part and action prompts (Li et al., 15 Jan 2026).
1. Hierarchical Motion Data Construction
Central to FrankenMotion is the FrankenStein dataset, comprising 39 hours of human motion and 138.5K annotations spanning three levels of granularity: sequence, atomic action, and body part. Each annotation element is defined as a pair $(c, [t_s, t_e])$, with $c$ representing a short text label and $[t_s, t_e]$ denoting its temporal interval. Sequence-level annotations summarize entire clips, atomic actions partition the time interval into non-overlapping windows, and part-level annotations provide fine-grained body part motions (e.g., "left arm swings forward" from frame 12–24).
To generate detailed, asynchronous part-level labels, the framework employs an LLM (DeepSeek-R1), referred to as “FrankenAgent.” This agent ingests existing sequence and action labels from established datasets (KIT-ML, BABEL, HumanML3D), decomposes each action into constituent part movements, aligns them to precise temporal intervals, and outputs “unknown” in cases of uncertainty. The annotation pipeline achieves 93.08% correctness in human evaluations on random samples (Gwet’s AC₁ = 0.91). Unlike prior datasets offering either global descriptions (HumanML3D, KIT-ML) or stage-based synchronized part captions (FineMoGen), FrankenStein provides asynchronous, semantically distinct atomic labels for each body part at fine temporal resolution (mean segment length 4.8s).
| Dataset | Label Type | Temporal Resolution | Synchronization |
|---|---|---|---|
| KIT-ML | Global Sequence | Coarse | Synchronized |
| FineMoGen | Stage-based Part | Coarse intervals | Synchronized |
| FrankenStein | Atomic Part-level | Fine-grained (4.8s) | Asynchronous |
This hierarchical structuring is foundational for enabling compositional, flexible generation of elementary motion elements.
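The three-level annotation hierarchy described above can be sketched as a minimal data structure. This is an illustrative reading of the dataset layout, not the paper's released format; the field and class names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One annotation element (c, [t_s, t_e]): a text label and its frame interval."""
    label: str   # short text label c, or "unknown" when the agent is uncertain
    start: int   # t_s, inclusive start frame
    end: int     # t_e, inclusive end frame

@dataclass
class AnnotatedClip:
    sequence: Annotation                      # sequence level: summarizes the whole clip
    atomic_actions: list[Annotation]          # non-overlapping temporal windows
    parts: dict[str, list[Annotation]]        # per body part, asynchronous intervals

# Example mirroring the labels used in the text:
clip = AnnotatedClip(
    sequence=Annotation("sit down and wave", 0, 120),
    atomic_actions=[Annotation("bend knees", 0, 60), Annotation("wave", 60, 120)],
    parts={"left_arm": [Annotation("left arm swings forward", 12, 24)]},
)
```

Because part-level intervals live in their own per-part lists, nothing forces them to align with atomic-action windows, which is exactly the asynchrony the dataset is built around.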
2. Diffusion-based Framework Architecture
FrankenMotion’s core is a transformer-based DDPM (Denoising Diffusion Probabilistic Model) conditioned on multi-granularity textual prompts. Three control streams govern generation:
- Sequence Prompt $p^{\text{seq}}$: Global description (e.g., "sit down and wave").
- Atomic Action Prompts $\{p^{\text{act}}_j\}$: Windowed, action-specific texts (e.g., "bend knees" on frames 1–20).
- Frame-wise Part Prompts $\{p^{\text{part}}_{i,k}\}$: Per-frame, per-part cues, for up to $K$ body parts per frame.
Pose representation adopts SMPL-based features $x = (h, v_x, v_z, \dot{\psi}, \theta, j)$, where $h$ is pelvis height, $(v_x, v_z)$ are horizontal velocities, $\dot{\psi}$ is yaw-rate, $\theta$ encodes the 6D SMPL pose, and $j$ are local joint positions. The DDPM forward process is

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right).$$

The reverse denoiser is learned as

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t, c),\; \Sigma_t\right),$$

with conditioning $c = (p^{\text{seq}}, \{p^{\text{act}}_j\}, \{p^{\text{part}}_{i,k}\})$. The model employs $x_0$-prediction, training with the simplified loss

$$\mathcal{L} = \mathbb{E}_{x_0,\, t,\, c}\left[\left\| x_0 - \hat{x}_\theta(x_t, t, c)\right\|_2^2\right].$$
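The forward-noising step and the simplified $x_0$-prediction loss can be sketched in a few lines of PyTorch. This is a generic DDPM training step under the stated objective, not FrankenMotion's actual code; the `denoiser` signature and tensor shapes are assumptions.

```python
import torch

def ddpm_x0_loss(denoiser, x0, t, cond, alpha_bar):
    """Simplified x0-prediction DDPM loss for motion tensors x0 of shape (B, T, D).

    alpha_bar: (num_steps,) cumulative product of (1 - beta_t);
    t: (B,) integer timesteps; cond: conditioning passed through to the model.
    """
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1)                  # (B, 1, 1) for broadcasting
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise   # forward process q(x_t | x_0)
    x0_hat = denoiser(x_t, t, cond)                  # network predicts the clean motion
    return ((x0 - x0_hat) ** 2).mean()               # simplified MSE objective
```

At sampling time the same $\hat{x}_0$ estimate is plugged into the posterior mean $\mu_\theta$ to take each reverse step.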
Spatio-temporal Prompt Embeddings
Textual prompts are CLIP-encoded; action and part features are PCA-reduced to a common dimension $d$. Action embeddings and part embeddings are concatenated with the noisy motion and projected into a fusion space. Sequence-prompt and diffusion-timestep embeddings are prepended as prefix tokens, yielding an input of $T+2$ tokens to the transformer backbone. The network’s self-attention facilitates joint reasoning over spatial (body part) and temporal (action) conditions.
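The fusion step just described can be sketched as follows. This is a schematic reconstruction under the stated token layout (two prefix tokens plus one token per frame); the function name, `proj` module, and exact shapes are assumptions.

```python
import torch

def fuse_inputs(x_noisy, act_emb, part_emb, seq_emb, t_emb, proj):
    """Fuse multi-granularity conditioning into one transformer input sequence.

    x_noisy:  (B, T, D_motion)  noisy motion features
    act_emb:  (B, T, d)         per-frame atomic-action embeddings
    part_emb: (B, T, d)         per-frame (pooled over parts) part embeddings
    seq_emb, t_emb: (B, d_model) sequence-prompt and timestep embeddings
    proj: projection into the fusion space, e.g. nn.Linear(D_motion + 2*d, d_model)
    """
    frames = torch.cat([x_noisy, act_emb, part_emb], dim=-1)  # per-frame concat
    tokens = proj(frames)                                     # (B, T, d_model)
    prefix = torch.stack([seq_emb, t_emb], dim=1)             # (B, 2, d_model)
    return torch.cat([prefix, tokens], dim=1)                 # (B, T + 2, d_model)
```

Self-attention over this joint sequence is what lets global, action-level, and part-level cues constrain every frame simultaneously.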
For handling sparsity (e.g., “unknown” prompts), a stochastic masking strategy akin to dropout is applied: each non-“unknown” part prompt is randomly zeroed with a masking probability $p_{\text{mask}}$, promoting robustness to partial conditioning.
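A minimal sketch of that masking step, assuming part embeddings of shape `(B, T, K, d)` and a boolean map of which prompts are “unknown” (both shapes and names are assumptions):

```python
import torch

def mask_part_embeddings(part_emb, unknown_mask, p_mask=0.1):
    """Zero 'unknown' part prompts always, and known ones with probability p_mask.

    part_emb:     (B, T, K, d) per-frame, per-part prompt embeddings
    unknown_mask: (B, T, K) bool, True where the prompt is 'unknown'
    """
    drop = torch.rand(part_emb.shape[:-1]) < p_mask  # Bernoulli(p_mask) per prompt
    zero = (drop | unknown_mask).unsqueeze(-1)       # broadcast over embedding dim
    return part_emb.masked_fill(zero, 0.0)
```

Training under random zeroing teaches the denoiser to produce coherent motion even when only a subset of part prompts is supplied at inference.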
Implementation uses a cosine noise schedule (100 steps) and AdamW (lr = 2e-4, batch size = 32), and requires approximately 47 hours of training on NVIDIA H100 GPUs.
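For reference, the standard cosine schedule (Nichol & Dhariwal) over 100 steps can be computed as below; the paper does not give its exact schedule parameters, so the offset `s = 0.008` is the conventional default, not a confirmed choice.

```python
import math
import torch

def cosine_alpha_bar(num_steps=100, s=0.008):
    """Cumulative alpha_bar for a cosine noise schedule over num_steps diffusion steps."""
    t = torch.linspace(0, 1, num_steps + 1)              # normalized step index
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2  # squared-cosine envelope
    return f / f[0]                                       # normalize so alpha_bar[0] = 1
```

The resulting `alpha_bar` decays smoothly from 1 toward 0, destroying signal more gently at the start and end of the chain than a linear beta schedule.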
3. Compositional Motion Generation and Generalization
Leveraging atomic part-level motion annotations and their temporal alignment with higher-level actions, FrankenMotion enables composition of novel, previously unseen motion combinations. For example, generating a clip with the global sequence prompt "sit down" together with concurrent atomic part cues such as "bend left knee" (frames 1–10) and "raise right arm" (frames 5–15)—which do not co-occur in training—yields coherent composed motion. The formal generation process is
$$\hat{x}_0 \sim p_\theta\!\left(x_0 \mid p^{\text{seq}},\, \{p^{\text{act}}_j\},\, \{p^{\text{part}}_{i,k}\}\right),$$

where $\{p^{\text{part}}_{i,k}\}$ supplies time-indexed part prompts and $p^{\text{seq}}$, $\{p^{\text{act}}_j\}$ provide broader guidance. Model self-attention ensures spatial consistency of body-part sub-motions (e.g., anatomical realism) and temporal smoothness (avoiding abrupt transitions).
A plausible implication is improved flexibility in animation pipelines and multibody control scenarios, as the architecture generalizes to unseen part-action pairings without requiring retraining or post-hoc stitching.
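The compositional conditioning described above amounts to expanding interval-based part prompts into a frame-indexed grid before encoding, with unspecified slots left as “unknown.” A sketch under that assumption (the tuple format and helper name are illustrative):

```python
def build_part_grid(num_frames, num_parts, part_prompts):
    """Expand interval-based part prompts into a (frame, part) grid of labels.

    part_prompts: list of (part_index, label, start_frame, end_frame) tuples,
    with inclusive frame intervals; unfilled slots stay 'unknown'.
    """
    grid = [["unknown"] * num_parts for _ in range(num_frames)]
    for part, label, start, end in part_prompts:
        for f in range(start, min(end + 1, num_frames)):
            grid[f][part] = label
    return grid

# Compose two part cues that never co-occur in training:
composed = build_part_grid(20, 2, [
    (0, "bend left knee", 0, 9),    # part 0, frames 0-9
    (1, "raise right arm", 4, 14),  # part 1, frames 4-14
])
```

Frames 4–9 carry both cues simultaneously, so the denoiser must reconcile them into one anatomically consistent pose, which is exactly the compositional case the text describes.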
4. Comparative Evaluation and Quantitative Results
FrankenMotion’s performance is benchmarked against three state-of-the-art models retrained on FrankenStein’s labels: STMC (post-hoc MDM composition), DART (autoregressive diffusion), and UniMotion (hierarchical diffusion). Evaluation uses a held-out 10% test set and measures semantic correctness (R-Precision@1,3, M2T) and realism (FID, Diversity) at three levels: part, action, and sequence, scored by pretrained evaluator models following TMR protocols.
Quantitative outcomes:
| Method | R@1 (Part) | FID (Sequence) |
|---|---|---|
| FrankenMotion | 47.21% | 0.06 |
| UniMotion | 45.72% | 0.08 |
| STMC | 40.67% | 0.20 |
| DART | 38.67% | — |
Ablation experiments dissect the impact of hierarchical conditioning:
- Part-only: R@3 = 56.34%, M2T = 0.69, FID = 0.08
- +Actions: R@3 = 57.74%, FID = 0.07
- +Sequence: R@3 = 58.97%, FID = 0.05 (approaching ground-truth upper bound)
Qualitative results indicate that STMC stitching yields jerky transitions, DART autoregressive generation repeats sub-motions, and UniMotion misses fine part details. FrankenMotion, in contrast, adheres to part-level cues, achieves smooth transitions under supervisory atomic and sequence prompts, and successfully composes motions excluded from training data.
5. Significance, Contributions, and Release
FrankenMotion’s contributions are three-fold:
- FrankenStein Dataset: First of its kind with atomic, temporally aware part-level motion annotations, establishing a new standard for annotation granularity and temporal asynchrony.
- Transformer Diffusion Model: Capable of unified spatial (per-part) and temporal (per-action) control through joint embedding and self-attention, supporting multi-stream conditioning efficiently.
- Compositional Generalization: Demonstrates capacity to synthesize realistic full-body sequences from unseen combinations of part and action inputs, with performance validated via advanced quantitative and human evaluations.
Code and dataset will be released upon publication (Li et al., 15 Jan 2026). This architecture suggests practical utility for controllable character animation, multi-agent systems, and human-centric simulation domains requiring interpretable, fine-grained, and temporally dynamic motion synthesis.