
FrankenMotion: Diffusion-Based Motion Synthesis

Updated 19 January 2026
  • The paper introduces FrankenMotion, a diffusion-based framework that uses the novel FrankenStein dataset with atomic, asynchronous, part-level motion annotations.
  • It employs a transformer-based DDPM conditioned on global, atomic action, and frame-wise part prompts to generate coherent, anatomically realistic full-body sequences.
  • Quantitative results demonstrate significant improvements in semantic correctness and realism over state-of-the-art methods, supporting compositional generalization from unseen part-action pairings.

FrankenMotion is a diffusion-based part-aware human motion generation framework designed to address limitations in spatial and temporal controllability of text-driven motion synthesis. It introduces a novel hierarchically annotated motion dataset—FrankenStein—with atomic, asynchronous, temporally precise part-level labels, and a transformer diffusion model capable of composing full-body sequences with granular, multi-stream textual conditioning. The framework demonstrates significant advances over prior state-of-the-art methods in semantic correctness and realism, particularly in generating coherent motions from unseen combinations of part and action prompts (Li et al., 15 Jan 2026).

1. Hierarchical Motion Data Construction

Central to FrankenMotion is the FrankenStein dataset, comprising 39 hours of human motion and 138.5K annotations spanning three levels of granularity: sequence, atomic action, and body part. Each annotation element is defined as $a = (L, t_s, t_e)$, with $L$ a short text label and $[t_s, t_e]$ its temporal interval. Sequence-level annotations $A_s$ summarize entire clips, atomic actions $A_a$ partition the time interval into non-overlapping windows, and part-level annotations $A_{p_k}$ provide fine-grained body-part motions (e.g., "left arm swings forward" from frame 12–24).
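The hierarchy above can be sketched as a small data model. This is illustrative only: the field names, the example labels, and the per-part dictionary layout are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One annotation element a = (L, t_s, t_e)."""
    label: str      # short text label L (may be "unknown")
    t_start: float  # segment start t_s, in seconds
    t_end: float    # segment end t_e, in seconds

    def duration(self) -> float:
        return self.t_end - self.t_start

# Three granularity levels for one hypothetical clip:
sequence_level = [Annotation("a person sits down and waves", 0.0, 6.0)]

# Atomic actions partition the clip into non-overlapping windows.
atomic_actions = [Annotation("sit down", 0.0, 3.0),
                  Annotation("wave", 3.0, 6.0)]

# Part-level annotations are asynchronous: each body part carries its own
# independently timed segments, with "unknown" for uncertain parts.
part_level = {
    "left_arm":  [Annotation("left arm swings forward", 0.5, 1.0)],
    "right_arm": [Annotation("unknown", 0.0, 6.0)],
}
```

Because part-level segments need not align with the atomic-action windows, the same clip can carry semantically distinct labels per part at fine temporal resolution.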

To generate detailed, asynchronous part-level labels, the framework employs an LLM (Deepseek-R1), referred to as “FrankenAgent.” This agent ingests existing sequence and action labels from established datasets (KIT-ML, BABEL, HumanML3D), decomposes each action into constituent part movements, aligns them to precise temporal intervals, and outputs “unknown” in cases of uncertainty. The annotation pipeline achieves 93.08% correctness in human evaluations on random samples (Gwet’s AC₁ = 0.91). Unlike prior datasets offering either global descriptions (HumanML3D, KIT-ML) or stage-based synchronized part captions (FineMoGen), FrankenStein provides asynchronous, semantically distinct atomic labels for each body part at fine temporal resolution (mean segment length 4.8 s).

| Dataset | Label Type | Temporal Resolution | Synchronization |
| --- | --- | --- | --- |
| KIT-ML | Global sequence | Coarse | Synchronized |
| FineMoGen | Stage-based part | Coarse intervals | Synchronized |
| FrankenStein | Atomic part-level | Fine-grained (mean 4.8 s) | Asynchronous |

This hierarchical structuring is foundational for enabling compositional, flexible generation of elementary motion elements.

2. Diffusion-based Framework Architecture

FrankenMotion’s core is a transformer-based DDPM (Denoising Diffusion Probabilistic Model) conditioned on multi-granularity textual prompts. Three control streams govern generation:

  • Sequence prompt $L_s$: global description (e.g., "sit down and wave").
  • Atomic action prompts $L_a$: windowed action-specific texts (e.g., "bend knees" on frames 1–20).
  • Frame-wise part prompts $L_p$: per-frame, per-part cues, up to $K$ body parts per frame.

Pose representation adopts SMPL-based features: $x = [r_z, \dot{x}, \dot{y}, \omega, \theta, j]$, where $r_z$ is the pelvis height, $(\dot{x}, \dot{y})$ are horizontal velocities, $\omega$ is the yaw rate, $\theta$ encodes the 6D SMPL pose, and $j$ are local joint positions. The DDPM forward process is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

The reverse denoiser is learned as:

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, c),\ \Sigma_t I\big)$$

with conditioning $c = (L_s, L_a, L_p)$. Rather than predicting the added noise, the denoiser is trained to predict the clean sample directly, learning

$$f_\theta : (x_\sigma, \sigma, c) \to \hat{x}_0$$

using the simplified loss

$$\mathcal{L} = \mathbb{E}_{x_0, \sigma, \varepsilon}\left[\,\|f_\theta(x_\sigma, \sigma, c) - x_0\|_2^2\,\right].$$
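A single training step of this objective can be sketched as follows. The closed-form noising $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$ is the standard DDPM identity; the denoiser here is a zero-returning placeholder, and the motion dimensions and schedule values are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
T_frames, D_pose = 60, 135                 # assumed motion length / feature size
betas = np.linspace(1e-4, 0.02, 100)       # stand-in schedule (paper uses cosine)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in closed form."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

# One training step: noise a clean motion, predict x0, take the L2 loss.
x0 = rng.standard_normal((T_frames, D_pose))
t = 50
eps = rng.standard_normal(x0.shape)
x_t = q_sample(x0, t, eps)

f_theta = lambda x_t, t, cond: np.zeros_like(x_t)   # placeholder denoiser
loss = np.mean((f_theta(x_t, t, None) - x0) ** 2)   # simplified x0 loss
```

In training, `cond` would carry the tuple $(L_s, L_a, L_p)$ and `f_theta` would be the conditioned transformer.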

Spatio-temporal Prompt Embeddings

Textual prompts are CLIP-encoded; action and part features are PCA-reduced to $D = 50$ dimensions. Action embeddings $F_a \in \mathbb{R}^{W \times D}$ and part embeddings $F_p \in \mathbb{R}^{T \times (K \times D)}$ are concatenated with the noisy motion and projected into a fusion space. Sequence-prompt and diffusion-timestep embeddings are prepended, yielding an input of size $(T+2) \times D_{m+t}$ to the transformer backbone. The network’s self-attention facilitates joint reasoning over spatial (body part) and temporal (action) conditions.
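The shapes of this fusion can be traced concretely. The sketch below assumes a model width, a random projection in place of the learned one, and an even assignment of frames to action windows; only the $(T+2)$ token layout and the $D = 50$ reduction come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, W, K, D = 60, 4, 5, 50       # frames, action windows, parts, reduced dim
D_model = 256                   # assumed transformer width

x_noisy = rng.standard_normal((T, 135))   # noisy motion features
F_a = rng.standard_normal((W, D))         # one action embedding per window
F_p = rng.standard_normal((T, K * D))     # frame-wise part embeddings

# Broadcast each window's action embedding to its frames, then fuse per frame.
window_of_frame = np.repeat(np.arange(W), T // W)          # (T,)
per_frame = np.concatenate([x_noisy, F_a[window_of_frame], F_p], axis=1)

W_proj = rng.standard_normal((per_frame.shape[1], D_model)) * 0.01
tokens = per_frame @ W_proj                                # (T, D_model)

# Prepend sequence-prompt and timestep tokens -> (T + 2, D_model) input.
seq_tok = rng.standard_normal((1, D_model))
t_tok = rng.standard_normal((1, D_model))
inputs = np.concatenate([seq_tok, t_tok, tokens], axis=0)
```

Self-attention over these $T + 2$ tokens is what lets every frame attend jointly to global, action-level, and part-level conditions.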

For handling sparsity (e.g., “unknown” prompts), a stochastic masking strategy akin to $\beta$-Dropout is applied: for each non-unknown part prompt, a masking probability $p \sim \mathrm{Beta}(5r, 5(1-r))$ determines random zeroing of embeddings, promoting robustness to partial conditioning.
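A minimal sketch of that masking, assuming "unknown" prompts are represented as zeroed embeddings and that masking is applied per part prompt (the exact granularity in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_part_embeddings(F_p, labels, r=0.5):
    """F_p: (T, K, D) part embeddings; labels: K per-part label strings."""
    F_masked = F_p.copy()
    for k, label in enumerate(labels):
        if label == "unknown":
            F_masked[:, k, :] = 0.0        # unknown prompts carry no signal
            continue
        p = rng.beta(5 * r, 5 * (1 - r))   # per-prompt masking probability
        if rng.random() < p:
            F_masked[:, k, :] = 0.0        # randomly drop this condition
    return F_masked
```

Sampling $p$ from a Beta distribution rather than fixing it exposes the model to a wide range of conditioning densities, from nearly full to nearly empty part supervision.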

Implementation utilizes a cosine noise schedule (100 steps), AdamW (lr = 2e-4, batch size = 32), and requires approximately 47 hours of training on NVIDIA H100 GPUs.
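The text states a cosine schedule over 100 steps but not its exact parameterization; a common choice (the Nichol–Dhariwal form, assumed here) is:

```python
import numpy as np

def cosine_alpha_bar(num_steps=100, s=0.008):
    """Cumulative signal level alpha_bar[t] under a cosine schedule."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]                          # normalized so alpha_bar[0] = 1

alpha_bar = cosine_alpha_bar()
# Per-step betas recovered from consecutive alpha_bar ratios, clipped for stability.
betas = np.clip(1.0 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
```

Relative to a linear schedule, the cosine form destroys signal more gradually at early steps, which tends to help short, structured sequences like motion clips.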

3. Compositional Motion Generation and Generalization

Leveraging atomic part-level motion annotations and their temporal alignment with higher-level actions, FrankenMotion enables composition of novel, previously unseen motion combinations. For example, generating a clip with the global sequence prompt "sit down" together with concurrent atomic part cues such as "bend left knee" (frames 1–10) and "raise right arm" (frames 5–15)—which do not co-occur in training—yields coherent composed motion. The formal generation process is

$$\hat{x}_0 = f_\theta(x_T \to \cdots \to x_0 \mid L_s, L_a, L_p)$$

where LpL_p supplies time-indexed part prompts and LaL_a, LsL_s provide broader guidance. Model self-attention ensures spatial consistency of body-part sub-motions (e.g., anatomical realism) and temporal smoothness (avoiding abrupt transitions).
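The iterative chain $x_T \to \cdots \to x_0$ can be sketched with a standard DDPM sampling loop driven by an $x_0$-predicting denoiser. The posterior-mean coefficients are the textbook DDPM identities; the denoiser, schedule, and shapes are placeholders, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)       # stand-in schedule (paper uses cosine)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

def sample(f_theta, cond, shape):
    """Run the reverse chain x_T -> ... -> x_0 under conditioning `cond`."""
    x_t = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        x0_hat = f_theta(x_t, t, cond)                    # predict clean motion
        ab_prev = alphas_bar[t - 1] if t > 0 else 1.0
        # Posterior mean mu_theta(x_t, t, c), written in terms of x0_hat.
        coef0 = np.sqrt(ab_prev) * betas[t] / (1.0 - alphas_bar[t])
        coeft = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - alphas_bar[t])
        mean = coef0 * x0_hat + coeft * x_t
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x_t = mean + np.sqrt(betas[t]) * noise            # Sigma_t = beta_t I
    return x_t

# Placeholder denoiser; in FrankenMotion, cond would be (L_s, L_a, L_p).
motion = sample(lambda x, t, c: np.zeros_like(x), None, (60, 135))
```

Because the part prompts in `cond` are time-indexed, each denoising step sees the full spatio-temporal conditioning at once rather than stitching sub-motions after the fact.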

A plausible implication is improved flexibility in animation pipelines and multibody control scenarios, as the architecture generalizes to unseen part-action pairings without requiring retraining or post-hoc stitching.

4. Comparative Evaluation and Quantitative Results

FrankenMotion’s performance is benchmarked against three state-of-the-art models retrained on FrankenStein’s labels: STMC (post-hoc MDM composition), DART (autoregressive diffusion), and UniMotion (hierarchical diffusion). Evaluation uses a held-out 10% test set and measures semantic correctness (R-Precision@1,3, M2T) and realism (FID, Diversity) at three levels: part, action, and sequence, scored by pretrained evaluator models following TMR protocols.

Quantitative outcomes:

| Method | R@1 (Part) | FID (Sequence) |
| --- | --- | --- |
| FrankenMotion | 47.21% | 0.06 |
| UniMotion | 45.72% | 0.08 |
| STMC | 40.67% | 0.20 |
| DART | 38.67% | — |

Ablation experiments dissect the impact of hierarchical conditioning:

  • Part-only: R@3 = 56.34%, M2T = 0.69, FID = 0.08
  • +Actions: R@3 = 57.74%, FID = 0.07
  • +Sequence: R@3 = 58.97%, FID = 0.05 (approaching ground-truth upper bound)

Qualitative results indicate that STMC stitching yields jerky transitions, DART autoregressive generation repeats sub-motions, and UniMotion misses fine part details. FrankenMotion, in contrast, adheres to part-level cues, achieves smooth transitions under supervisory atomic and sequence prompts, and successfully composes motions excluded from training data.

5. Significance, Contributions, and Release

FrankenMotion’s contributions are three-fold:

  1. FrankenStein Dataset: First of its kind with atomic, temporally-aware part-level motion annotations, establishing a new annotation granularity and temporal asynchrony standard.
  2. Transformer Diffusion Model: Capable of unified spatial (per-part) and temporal (per-action) control through joint embedding and self-attention, supporting multi-stream conditioning efficiently.
  3. Compositional Generalization: Demonstrates capacity to synthesize realistic full-body sequences from unseen combinations of part and action inputs, with performance validated via advanced quantitative and human evaluations.

Code and dataset will be released upon publication (Li et al., 15 Jan 2026). This architecture suggests practical utility for controllable character animation, multi-agent systems, and human-centric simulation domains requiring interpretable, fine-grained, and temporally dynamic motion synthesis.
