Motion Inbetweening in Animation
- Motion inbetweening is a technique that synthesizes intermediate frames between user-specified keyframes to produce smooth, naturalistic character motion.
- It employs diverse architectures, from GANs and RNNs to diffusion and transformer models, to balance biomechanical constraints with global motion consistency.
- Applications span game development, VR, and motion editing while addressing challenges like constraint satisfaction, error accumulation, and style diversity.
Motion inbetweening is the process of generating intermediate frames or motion sequences that transition smoothly between sparse, user-specified keyframes, typically for kinematic character animation. The core objective is to synthesize temporally and spatially plausible motion that strictly adheres to the poses and timing of keyframes while filling in complex, detailed, and naturalistic motion between them. The task is central to computer animation, game development, and interactive virtual environments, driving the need for scalable, controllable, and physically consistent solutions.
1. Problem Definition and Principal Challenges
Motion inbetweening concerns automatic synthesis of full-body character motion between arbitrary, potentially sparse, and user-driven keyframes. The generation must reconstruct precise joint poses and root trajectories at keyframe anchors while interpolating naturally—balancing both micro-level biomechanical constraints (e.g., joint limits, bone lengths) and holistic motion dynamics (e.g., global path smoothness, style consistency) (Zhou et al., 2020). Major challenges include:
- Constraint satisfaction: Meeting exact keyframe positions and timings as specified, even in cases of extremely sparse constraints.
- Biomechanical and physical plausibility: Respecting joint rotation limits, preserving bone lengths, and enforcing foot contacts to avoid artifacts such as foot sliding, while choosing rotation representations that sidestep issues like gimbal lock.
- Accumulation of local errors: Small errors in joint-space transitions can amplify, leading to global drift.
- Motion style diversity and control: Generating a range of plausible in-between motions for the same keyframe inputs and enabling user intervention (e.g., to select styles or inject intermediate poses).
- Generalization: Adapting to different skeletal morphologies, diverse motion classes, noisy or imprecise keyframes, and complex scene constraints.
2. Algorithmic and Model Architectures
A number of algorithmic paradigms for motion inbetweening have emerged.
Two-Stage and Modular Architectures
Several methodologies separate motion generation into local and global stages. For example, one system first predicts per-joint rotations with a 1D convolutional network incorporating a biomechanically constrained forward-kinematics (FK) layer, then reconstructs the global root trajectory by predicting and integrating root displacements (Zhou et al., 2020).
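As a rough illustration of this two-stage idea, the sketch below runs a plain forward-kinematics pass over a parent-indexed skeleton and then integrates predicted root displacements into a trajectory. The skeleton layout, function names, and the absence of explicit biomechanical constraints are simplifications for the example, not the cited system.

```python
import numpy as np

def forward_kinematics(local_rot, offsets, parents):
    """Compute global joint positions from per-joint local rotations.

    local_rot: (J, 3, 3) rotation matrices, each relative to the parent joint.
    offsets:   (J, 3) bone offsets from each joint's parent (root offset is zero).
    parents:   length-J list with parents[0] == -1 for the root; assumes each
               parent index precedes its children (topological order).
    Returns (J, 3) joint positions expressed in the character's root frame.
    """
    J = len(parents)
    global_rot = np.zeros((J, 3, 3))
    global_pos = np.zeros((J, 3))
    for j in range(J):
        if parents[j] == -1:                      # root joint
            global_rot[j] = local_rot[j]
            global_pos[j] = offsets[j]
        else:
            p = parents[j]
            global_rot[j] = global_rot[p] @ local_rot[j]
            global_pos[j] = global_pos[p] + global_rot[p] @ offsets[j]
    return global_pos

def integrate_root(root_deltas, root_start):
    """Second stage: accumulate predicted per-frame root displacements (T, 3)
    into an absolute root trajectory."""
    return root_start + np.cumsum(root_deltas, axis=0)

# Example: a 3-joint chain (root -> elbow -> hand) with identity rotations.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [0.0, -0.3, 0.0], [0.0, -0.25, 0.0]])
rots = np.tile(np.eye(3), (3, 1, 1))
positions = forward_kinematics(rots, offsets, parents)   # (3, 3)
```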
GANs, RNNs, and Adversarial Training
Architectures employing (least-squares) GANs with LSTM or 1D-CNN generators are prevalent, often augmented with domain-specific losses (e.g., FK, contact, NPSS) and curriculum learning to support long-horizon transitions (Harvey et al., 2021). Adversarial critics inspect motion fragments at multiple temporal scales to ensure both local plausibility and global consistency.
Transformer-based Interpolators
Modern approaches employ Transformers in single-stage or two-stage designs. Non-autoregressive transformer encoders are used for masked motion modeling, conditioning on known start, end, and optional anchor poses with sequence-level or patch-level masking. Gains are reported from representing missing in-between frames directly with zeros and from using root-space rather than local-to-parent joint representations (Akhoundi et al., 9 Jun 2025).
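The PyTorch sketch below illustrates one plausible masked, non-autoregressive setup: unknown frames are zero-filled and a per-frame keyframe mask bit is concatenated to the pose input before encoding. The class name, layer sizes, and dimensions (`MaskedInbetweener`, `pose_dim`, etc.) are illustrative assumptions rather than the cited architecture.

```python
import torch
import torch.nn as nn

class MaskedInbetweener(nn.Module):
    """Non-autoregressive transformer encoder that fills in masked frames.

    Inputs:  poses (B, T, D) with unknown frames zero-filled,
             key_mask (B, T, 1) with 1 at keyframes/anchors, 0 elsewhere.
    Output:  reconstructed poses for all T frames.
    """
    def __init__(self, pose_dim=63, d_model=256, n_layers=4, n_heads=4, max_len=512):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + 1, d_model)   # pose features + mask bit
        self.pos_emb = nn.Parameter(torch.randn(1, max_len, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, poses, key_mask):
        x = torch.cat([poses, key_mask], dim=-1)          # (B, T, D+1)
        x = self.in_proj(x) + self.pos_emb[:, :x.size(1)]
        return self.out_proj(self.encoder(x))

# Usage: zero-fill the unknown in-between frames, keep start/end/anchor poses.
model = MaskedInbetweener()
poses = torch.zeros(2, 60, 63)        # unknown frames stay zero
key_mask = torch.zeros(2, 60, 1)
key_mask[:, [0, 30, 59]] = 1.0        # start, one anchor, and end are given
pred = model(poses, key_mask)         # (2, 60, 63)
```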
Residual (Delta) Estimators
Delta-based architectures, such as the “deep Δ-interpolator,” learn corrections relative to analytical interpolators (e.g., SLERP). The model operates in locally referenced coordinates (e.g., centered at the last keyframe) and outputs only corrections, which simplifies learning and improves robustness to out-of-distribution shifts (Oreshkin et al., 2022).
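A minimal sketch of the delta idea, assuming SLERP as the analytic baseline and an abstract `delta_net` callable standing in for the learned correction network:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0, q1 at t in [0, 1]."""
    dot = np.clip(np.dot(q0, q1), -1.0, 1.0)
    if dot < 0.0:                      # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly parallel: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def delta_inbetween(q_start, q_end, ts, delta_net):
    """Analytic SLERP baseline plus a learned residual; delta_net is any callable
    returning a per-frame quaternion correction of the same shape as its input."""
    base = np.stack([slerp(q_start, q_end, t) for t in ts])      # (T, 4)
    corrected = base + delta_net(base)                           # residual in quaternion space
    return corrected / np.linalg.norm(corrected, axis=-1, keepdims=True)

# Example with a zero correction (reduces to plain SLERP):
q0 = np.array([1.0, 0.0, 0.0, 0.0])
q1 = np.array([0.7071, 0.7071, 0.0, 0.0])
frames = delta_inbetween(q0, q1, np.linspace(0, 1, 10), lambda base: np.zeros_like(base))
```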
Patch Decomposition and Part-Wise Control
Task-independent, ViT-inspired architectures “patchify” the skeleton into body parts, enabling mask-aware and occlusion-robust motion synthesis. Explicit modeling of per-part phase information (via periodic autoencoders) and mixtures-of-experts allows for fine-grained, body-part-level style and motion control (Mascaro et al., 2023, Dai et al., 11 Mar 2025).
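A toy illustration of “patchifying” a pose sequence into body-part tokens follows; the joint groupings and feature layout are invented for the example and depend on the skeleton definition in practice.

```python
import torch

# Illustrative body-part groups (joint indices depend on the skeleton definition).
BODY_PARTS = {
    "torso":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6],
    "right_arm": [7, 8, 9],
    "left_leg":  [10, 11, 12],
    "right_leg": [13, 14, 15],
}

def patchify(pose_seq, parts=BODY_PARTS, feat_per_joint=3):
    """Split a pose sequence (B, T, J*feat) into per-part tokens.
    Returns a dict of part name -> (B, T, len(part)*feat) tensors, which can then
    be linearly projected into a shared token dimension for a ViT-style encoder."""
    B, T, _ = pose_seq.shape
    joints = pose_seq.view(B, T, -1, feat_per_joint)       # (B, T, J, feat)
    return {name: joints[:, :, idx].reshape(B, T, -1) for name, idx in parts.items()}

pose_seq = torch.randn(2, 60, 16 * 3)   # 16 joints, 3D positions per joint
tokens = patchify(pose_seq)             # e.g. tokens["left_arm"]: (2, 60, 9)
```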
Diffusion Models
Recent progress reframes inbetweening as a conditional denoising diffusion problem, where motion is gradually refined from a noise prior with hard or soft enforcement of keyframe (and possibly partial-joint) constraints. Conditioning can flexibly encode keyframes, text semantics, and even arbitrary scene geometry (Cohan et al., 17 May 2024, Hwang et al., 20 Mar 2025, Cho et al., 14 Oct 2025). Diffusion models’ ability to generate diverse, high-quality samples and to incorporate complex conditionings has led to state-of-the-art inbetweening under flexible constraints.
Example Formulation (Diffusion-based):
- Forward: $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right)$, with noise schedule $\{\beta_t\}_{t=1}^{T}$ applied to the motion sequence.
- Reverse: a trained denoiser predicts $\hat{\mathbf{x}}_0$ (or the noise $\boldsymbol{\epsilon}$) from $\mathbf{x}_t$ and the step $t$, re-injects the observed keyframes via masking at each step, and can use guidance for hard constraint satisfaction (a schematic sampler is sketched below).
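The sketch below shows a generic DDPM-style sampler with keyframe re-injection at every reverse step; the denoiser, noise schedule, and masking convention are placeholders, and real systems differ in how hard constraints and guidance are applied.

```python
import torch

@torch.no_grad()
def inbetween_ddpm_sample(denoiser, keyframes, key_mask, betas):
    """Reverse diffusion with keyframe re-injection ("imputation") at each step.

    keyframes: (B, T, D) motion tensor with observed frames filled in.
    key_mask:  (B, T, 1) with 1 where a frame is constrained, 0 where it is free.
    betas:     (S,) noise schedule; denoiser(x_t, t) predicts the added noise.
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(keyframes)                        # start from pure noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, torch.full((x.shape[0],), t))
        # Standard DDPM posterior mean estimate for x_{t-1}.
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
            # Re-inject observed keyframes, noised to the current level.
            ab_prev = alpha_bar[t - 1]
            noisy_keys = torch.sqrt(ab_prev) * keyframes + \
                         torch.sqrt(1 - ab_prev) * torch.randn_like(keyframes)
            x = key_mask * noisy_keys + (1 - key_mask) * x
        else:
            # Final step: clamp constrained frames exactly to the keyframes.
            x = key_mask * keyframes + (1 - key_mask) * mean
    return x
```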
3. Conditioning, Guidance, and Control Mechanisms
Motion DNA and User Controllability
Motion DNA refers to low-dimensional latent vectors encoding motion “style” extracted from representative frames and injected into generative networks, allowing animators to influence stylistic properties of generated motion (e.g., walk vs. martial arts vs. dance) for the same keyframe constraint (Zhou et al., 2020).
Semantic and Anchor Conditioning
Methods embed not only keyframes but also semantic tokens and intermediate “anchor” poses as conditionings. Semantic tokens can represent actions (e.g., “jump,” “run”), enabling higher-level control. Anchor-based methods introduce sampled intermediate poses—enforcing them as hard conditions during training, which improves diversity and local accuracy (Kim et al., 2022).
Time-to-Arrival and Noise Injection
In autoregressive RNN models, sinusoidal time-to-arrival embeddings signal the temporal distance to the next keyframe at each time step. Scheduled target-noise vectors diversify the possible transitions and improve robustness when keyframes are uncertain, with their effect annealed as the keyframe approaches (Harvey et al., 2021).
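A small sketch of a sinusoidal time-to-arrival embedding, analogous to transformer positional encodings; the embedding dimension and frequency scaling are arbitrary choices for the example.

```python
import numpy as np

def time_to_arrival_embedding(tta, dim=64, max_period=10000.0):
    """Sinusoidal embedding of the number of frames remaining until the next
    keyframe (the "time-to-arrival")."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = tta * freqs                       # (half,)
    return np.concatenate([np.sin(angles), np.cos(angles)])

# At each autoregressive step, concatenate this with the pose input, e.g.:
emb = time_to_arrival_embedding(tta=17)        # 17 frames until the target keyframe
```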
Scene and Interaction-Aware Conditioning
For inbetweening in 3D scenes, scene descriptors—such as ViT-encoded occupancy grids (global context) and BPS (Basis Point Set) features at keyframes (local context)—are fused into the generator. Cross-attention is leveraged in diffusion backbones to condition on scene geometry at per-frame granularity (Hwang et al., 20 Mar 2025, Cho et al., 14 Oct 2025).
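A minimal version of the Basis Point Set (BPS) encoding used for local scene context: each feature is the distance from a fixed basis point to the nearest scene point. The sampling of the basis and the point-cloud source below are assumptions for the example.

```python
import numpy as np

def basis_point_set_features(scene_points, basis_points):
    """Encode local scene geometry as the distance from each fixed basis point
    to its nearest scene point (the basic BPS encoding).

    scene_points: (N, 3) point cloud around the keyframe pose.
    basis_points: (K, 3) fixed, pre-sampled basis points; the same set must be
                  reused for every example so features are comparable.
    Returns a fixed-length (K,) feature vector.
    """
    dists = np.linalg.norm(scene_points[None, :, :] - basis_points[:, None, :], axis=-1)
    return dists.min(axis=1)

rng = np.random.default_rng(0)
basis = rng.uniform(-1.0, 1.0, size=(256, 3))       # fixed basis, sampled once
scene = rng.uniform(-1.0, 1.0, size=(5000, 3))      # scene points near a keyframe
feat = basis_point_set_features(scene, basis)       # (256,) local scene descriptor
```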
Bi-directional and Cross-Space Synthesis
For two-character or highly interactive scenarios, bi-directional synthesis generates intermediate motion both forward and backward, then stitches results in the overlap. Cross-space in-betweening further reasons about each character’s actions in both their own and their partner’s conditioning spaces, using FiLM and adversarial periodicity modeling to preserve interaction (Ren et al., 2023, Zhang et al., 30 Sep 2025).
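A simple sketch of stitching forward- and backward-generated segments with a linear cross-fade over their overlap; practical systems may blend in rotation space or learn the blend weights instead.

```python
import numpy as np

def stitch_bidirectional(forward_seq, backward_seq, overlap):
    """Blend a forward-generated and a backward-generated motion segment over
    their shared overlap with a linear cross-fade, then concatenate.

    forward_seq:  (Tf, D) motion generated from the start keyframe onward.
    backward_seq: (Tb, D) motion generated from the end keyframe backward,
                  already re-ordered to run forward in time.
    overlap:      number of frames shared by the two segments.
    """
    w = np.linspace(0.0, 1.0, overlap)[:, None]          # 0 -> forward, 1 -> backward
    blended = (1.0 - w) * forward_seq[-overlap:] + w * backward_seq[:overlap]
    return np.concatenate([forward_seq[:-overlap], blended, backward_seq[overlap:]], axis=0)

fwd = np.random.randn(40, 63)
bwd = np.random.randn(40, 63)
full = stitch_bidirectional(fwd, bwd, overlap=10)        # (70, 63)
```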
4. Quantitative Evaluation and Datasets
Typical datasets include CMU Motion Capture, LaFAN1 (transition-focused), Human3.6M, Anidance, and human-scene interaction recordings (e.g., TRUMANS, GIMO).
Main metrics:
| Metric | Description | What it assesses |
|---|---|---|
| L2P, L2Q | Mean Euclidean error on global joint positions (P) or local joint quaternions (Q) | Global and local reconstruction error |
| NPSS | Normalized Power Spectrum Similarity; correlates with human judgments of naturalness | Temporal frequency realism |
| FID | Fréchet Inception Distance between motion feature distributions | Distributional similarity to real motion |
| KPE, SLDE, ADE | Keypose or end-point errors | Constraint and target adherence |
| Foot sliding | Foot displacement during labelled ground contacts | Physical plausibility (foot-slip artifacts) |
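Hedged reference implementations of two of the metrics above (exact definitions vary across papers; the y-up axis convention and array layouts here are illustrative assumptions):

```python
import numpy as np

def l2p(pred, gt):
    """Mean Euclidean (L2) error over global joint positions.
    pred, gt: (T, J, 3) arrays of joint positions."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def foot_sliding(foot_pos, contact_mask, eps=1e-8):
    """Average horizontal displacement of foot joints during frames labelled as
    in ground contact; larger values indicate more foot slip.
    foot_pos: (T, F, 3) foot joint positions (y-up); contact_mask: (T, F) booleans."""
    vel = np.linalg.norm(np.diff(foot_pos[..., [0, 2]], axis=0), axis=-1)  # horizontal speed
    mask = contact_mask[1:]
    return float((vel * mask).sum() / (mask.sum() + eps))
```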
Qualitative and user studies measure perceptual plausibility and diversity, especially in creative and hard-constrained settings (Zhou et al., 2020, Ren et al., 2023, Zhang et al., 30 Sep 2025).
5. Applications, Generalization, and Limitations
Applications
- Animation pipelines: Reduction of manual keyframing, flexible and quick motion prototyping (Zhou et al., 2020).
- Game development and VR: Real-time character animation, consistent adaptation to sparse inputs (e.g., tracked hands/head for full-body inference) (Cohan et al., 17 May 2024).
- Motion editing and retargeting: Style transfer via latent control signals, robust motion rescaffolding of edited sequences (Kim et al., 2022, Dai et al., 11 Mar 2025).
- Human-scene interaction: Physics- and scene-aware inbetweening that respects obstacles, affordances, and object contact (Hwang et al., 20 Mar 2025, Cho et al., 14 Oct 2025).
Generalization
- Diffusion- and transformer-based inbetweening models generalize to arbitrary character skeletons, supported by canonicalization and physics-based retargeting stages (Qin, 13 Apr 2025).
- Some methods are robust to timing errors in keyframes, leveraging learned or predicted time-warp functions for flexible infilling (Goel et al., 2 Mar 2025).
- Scene conditioning and explicit occlusion/mask indicators improve handling of occluded/missing joints, supporting monocular video or sensor-based motion completion (Mascaro et al., 2023, Jang et al., 13 Nov 2024, Hwang et al., 20 Mar 2025).
Limitations and Open Issues
- Physics consistency (e.g., ground contacts, joint limits) is still not universally guaranteed; many frameworks rely on additional RL or simulation controllers to resolve violations (Qin, 13 Apr 2025).
- Handling extreme motion ambiguity (large temporal gaps, non-repetitive actions) still presents challenges, especially in interactive or densely populated scenes (Zhang et al., 30 Sep 2025).
- Computational complexity for diffusion models is nontrivial; runtime is generally higher than for feed-forward alternatives (Cohan et al., 17 May 2024).
- For real-world deployment (e.g., video-based motion infilling), methods must address noise and uncertainty in both pose and scene geometry (Jang et al., 13 Nov 2024, Hwang et al., 20 Mar 2025).
6. Trends, Future Directions, and Broader Implications
- Generative Diversity and Probabilistic Synthesis: Explicit probabilistic modeling (diffusion, CVAE, and delta-based frameworks) is preferred for generating multiple plausible in-betweens, which is critical for creative workflows and dynamic environments (Cohan et al., 17 May 2024, Ren et al., 2023).
- Scene and Context-Aware Synthesis: Cross-modal adaptation techniques inject scene- and semantics-awareness into unified architectures, often by leveraging proxy tasks like inbetweening to bridge dataset or modality gaps (Cho et al., 14 Oct 2025).
- Controllable Synthesis: Per-part phase representations and mixture-of-experts architectures allow for local, limb-level manipulation and control, unlocking expressive motion editing (Dai et al., 11 Mar 2025).
- Physics-Driven Adaptation: Two-stage pipelines combining diffusion-based generative models with physics-based adaptation extend scalable inbetweening to physically diverse and stylistically distinct characters (Qin, 13 Apr 2025).
- Benchmarking and Standardization: There is recognition of the need for more rigorous, standardized datasets, benchmarks, and physically grounded metrics to comprehensively evaluate and compare new approaches (Akhoundi et al., 9 Jun 2025).
Broader implications include the democratization of high-quality character animation for small studios and interactive applications, rapid content generation in virtual worlds, and improved generalization to diverse settings (from animation to robotics and AR/VR systems).
In summary, motion inbetweening has evolved from classical curve-based and motion-graph methods to advanced, data-driven diffusion models and transformer architectures capable of diverse, controllable, and physically plausible motion generation under sparse, noisy, or semantically complex keyframe constraints. Contemporary research emphasizes flexibility, user control, scene/context-awareness, and scalability across characters, morphologies, and sensor modalities, marking motion inbetweening as an area of intensive and multifaceted progress in computer animation and embodied AI.