
Robust Motion In-betweening

Updated 3 December 2025
  • Robust motion in-betweening is the process of generating smooth, physically plausible, and semantically coherent motion sequences between specified keyframes for articulated characters.
  • It leverages advanced generative models—including diffusion, transformer, adversarial, and neural-field frameworks—to manage diverse motion styles, sparse inputs, and noisy conditions.
  • Robustness is maintained through explicit keyframe conditioning, dynamic loss functions, and scene-aware descriptors, enabling realistic transitions in animation, simulation, and AI applications.

Robust motion in-betweening is the task of synthesizing temporally smooth, physically plausible, and semantically coherent sequences that connect sparse, user- or system-specified keyframes in articulated character motion. This process is fundamental to animation, simulation, and embodied AI, enabling both interactive editing and the interpolation of data-driven motion. Recent advances have focused on learning-based generative models—primarily diffusion, transformer, adversarial, and neural-field frameworks—that can robustly handle a wide range of contexts, motion styles, noisy keyframe inputs, and environmental constraints. Robustness here denotes consistent performance under diverse scene types, motion complexities, sparse keyframes, training-testing domain shifts, and input corruptions.

1. Problem Formulation, Motion Representation, and Key Challenges

The robust motion in-betweening problem is formalized as the conditional generation of a sequence $x_{1:T}$ given a sparse set of $L$ input keyframes $c = \{x_{i_1}, \dots, x_{i_L}\}$ at time indices $i_1 < \dots < i_L$, with $x_i \in \mathbb{R}^F$ representing a pose combining global translation and joint features. For each frame, the pose is most commonly parameterized as a tuple comprising the 3D root position and joint orientations in a continuous 6D representation (Adewole et al., 10 Sep 2024), though alternatives include axis-angle, quaternions, or full SMPL-X body models (Xu et al., 22 Oct 2025). The sequence is required to match the specified keyframes exactly at constrained timesteps and interpolate missing frames via plausible, temporally coherent, and stylistically appropriate transitions.
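
The continuous 6D rotation parameterization keeps per-joint features free of the discontinuities that affect Euler angles and quaternions; the full pose vector is then the 3D root position concatenated with one 6D block per joint ($F = 3 + 6J$ for $J$ joints). Below is a minimal NumPy sketch of the standard conversions (first two rotation-matrix columns out, Gram-Schmidt back in); the function names are illustrative, not from any cited codebase:

```python
import numpy as np

def rotmat_to_6d(R):
    """6D representation: concatenate the first two columns of the rotation matrix."""
    return np.concatenate([R[..., :, 0], R[..., :, 1]], axis=-1)

def sixd_to_rotmat(d6):
    """Recover a valid rotation matrix from 6D via Gram-Schmidt orthonormalization."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1   # remove b1 component
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)                                    # third column by cross product
    return np.stack([b1, b2, b3], axis=-1)                   # columns b1, b2, b3
```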

Key challenges include:

  • Sparsity and length of context: In practice, keyframes can be widely spaced (long gaps), or the available context may be as little as the two endpoints.
  • Physical realism: Outputs must maintain foot contact, avoid jitter or sliding, and respect kinematic and dynamic constraints.
  • Style and multimodality: For a given set of keyframes, multiple valid transitions may exist.
  • Generalization and robustness: Systems must operate consistently for unseen subjects, out-of-distribution motions, or when inputs are noisy or only partial (Hwang et al., 20 Mar 2025).

2. Generative Modeling Approaches for Motion In-betweening

Robust in-betweening architectures span a range of generative paradigms.

Diffusion Models

Recent advances have shown diffusion models with explicit conditioning mechanisms are highly effective. The standard Markov noising process is used, parameterized as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

with the reverse model trained to predict the clean signal or the added noise (Adewole et al., 10 Sep 2024, Cohan et al., 17 May 2024, Pinyoanuntapong et al., 2023, Xu et al., 22 Oct 2025). Conditioning is performed via "hard imputation" at each step by replacing masked (observed) context positions with the known keyframe values, and decoding is performed with either U-Net or Transformer-based denoisers.
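
For concreteness, the sketch below shows closed-form sampling of $x_t$ directly from $x_0$ (obtained by composing the Markov steps above) and the usual noise-prediction training objective; `model`, `cond`, and the schedule constants are generic placeholders rather than any one paper's settings:

```python
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # \bar{alpha}_t = prod_s (1 - beta_s)
    return betas, alpha_bar

def denoising_loss(model, x0, cond, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form and train the model to predict the noise."""
    B = x0.shape[0]                                  # x0: (batch, frames, features)
    t = torch.randint(0, alpha_bar.numel(), (B,))    # random diffusion step per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(B, 1, 1)                  # broadcast over (frames, features)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # closed form of the Markov chain
    return ((model(x_t, t, cond) - eps) ** 2).mean() # simple eps-prediction loss
```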

Transformer-based Feed-forward Models

Encoder-only Transformers use self-attention to propagate long-range dependencies within the motion sequence. SILK demonstrates that a single six-layer Transformer, using relative pose and velocity features together with root-aligned, 6D rotation representations, achieves state-of-the-art in-betweening without multi-stage or skeleton-aware modifications, provided sufficient training data and appropriate pose modeling (Akhoundi et al., 9 Jun 2025).

Latent and Neural-Field Models

Variational models (NeMF, NRMF) learn implicit motion manifolds:

  • Neural Motion Fields map continuous time $t$ and a latent vector $z$ to pose space, with $z$ optimized at inference to satisfy keyframe or style constraints, enabling robust interpolation even from very sparse observations (He et al., 2022); a minimal sketch of this test-time fitting follows the list below.
  • Neural Riemannian Motion Fields introduce neural distance fields for pose, velocity, and acceleration, enforcing proximity to a learned motion manifold at test time via projection and geometric integration (Yu et al., 11 Sep 2025). This regularizes higher-order dynamics, reducing drift and oversmoothing.
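
As referenced above, the neural-motion-field test-time fitting can be sketched as follows; `field(t, z)` and the plain MSE keyframe objective are illustrative stand-ins for the actual NeMF interface and losses:

```python
import torch

def fit_latent(field, keyframes, key_times, latent_dim, steps=500, lr=1e-2):
    """Optimize a latent code z so the motion field passes through sparse keyframes.

    field(t, z): maps continuous time and a latent code to a pose vector
    (hypothetical interface). keyframes: (L, F) tensor of target poses.
    """
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.stack([field(t, z) for t in key_times])
        loss = ((pred - keyframes) ** 2).mean()      # keyframe-fitting term
        loss.backward()
        opt.step()
    return z  # decode the full clip by evaluating field(t, z) at dense t
```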

Adversarial and Mixture-of-Experts Architectures

Adversarial recurrent networks combine LSTM generators with hybrid losses (reconstruction and LSGAN) and use explicit time-to-arrival and additive target noise embeddings for variable-length, robust transitions (Harvey et al., 2021). Mixture-of-experts with phase manifolds or gating on style/time-to-arrive improve diversity and controllability (Starke et al., 2023, Chu et al., 30 Sep 2024).

3. Conditioning and Robustness Mechanisms

Crucial to robustness is the strategy by which model inputs are mapped to desired constraints during both training and inference:

Keyframe and Mask Conditioning

Robust architectures use an explicit binary mask $m$ to signal which elements of the sequence are constrained at each timestep, with the masked values "imputed" at every denoising or prediction step (Cohan et al., 17 May 2024, Hwang et al., 20 Mar 2025). This guarantees exact keyframe reconstruction, independent of temporal density or spatial sparsity.
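
The imputation itself is a one-line masked blend, re-applied after every step; the reverse-loop placement shown in comments is schematic, and `denoise_step` is a hypothetical helper:

```python
def impute(x, keyframes, mask):
    """Overwrite constrained entries with the known keyframe values.

    mask is 1 where a value is observed and 0 elsewhere; re-applying this
    after every denoising step pins the constrained frames exactly.
    """
    return mask * keyframes + (1.0 - mask) * x

# schematic reverse-diffusion loop:
#   for t in reversed(range(T)):
#       x = denoise_step(model, x, t)     # hypothetical one-step denoiser
#       x = impute(x, keyframes, mask)    # re-impose keyframe constraints
```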

Domain- and Task-specific Conditioning

  • Scene-aware descriptors: For human-scene interaction, dual descriptors—coarse global (occupancy grid via ViT) and local per-keyframe (surface anchor offsets)—integrate environmental constraints directly into model inputs (Hwang et al., 20 Mar 2025).
  • Reference motion tokens: Whole clips or partial reference sequences (e.g., endpoints, trajectory samples) are encoded and concatenated in transformer-based systems to enforce long-range temporal structure (Xu et al., 22 Oct 2025).
  • Style, duration, and path: Temporal and stylistic control is implemented by directly injecting time-to-arrival (TTA) embeddings, one-hot or continuous style vectors, and explicit root-trajectory windows into MoE gating networks for versatile user control (Chu et al., 30 Sep 2024, Dai et al., 11 Mar 2025); a sketch of a sinusoidal TTA embedding follows below.
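
A sinusoidal TTA embedding in the positional-encoding style (in the spirit of Harvey et al., 2021) can be sketched as follows; the dimension and period constants are illustrative assumptions:

```python
import numpy as np

def tta_embedding(tta, dim=64, max_period=1000.0):
    """Sinusoidal time-to-arrival embedding, analogous to positional encodings.

    tta: number of frames remaining until the target keyframe.
    """
    freqs = np.exp(-np.log(max_period) * np.arange(0, dim, 2) / dim)
    angles = tta * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # length-dim vector
```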

Data and Loss Strategies

  • Delta-mode (Δ) architectures: Models operating in a local delta-regime—predicting only residuals with respect to a simple baseline (such as SLERP or zero-velocity)—are empirically more robust, generalizing well across global frame shifts or missing normalization steps (Oreshkin et al., 2022).
  • Higher-order loss terms: Imposing losses on velocities, accelerations, and global position power spectra (NPSS) reduces jitter, foot-sliding, and preserves long-term coherence (Yu et al., 11 Sep 2025, Dai et al., 11 Mar 2025).
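
In practice, the higher-order terms reduce to finite differences of the predicted and ground-truth sequences. A minimal sketch with illustrative weights (papers differ in exactly which quantities, local or global, they penalize):

```python
import torch

def higher_order_loss(pred, target, w_vel=1.0, w_acc=1.0):
    """Position + velocity + acceleration reconstruction loss.

    pred, target: (frames, features). Finite differences approximate the
    velocity and acceleration terms.
    """
    d = lambda x: x[1:] - x[:-1]                      # first temporal difference
    pos = ((pred - target) ** 2).mean()
    vel = ((d(pred) - d(target)) ** 2).mean()
    acc = ((d(d(pred)) - d(d(target))) ** 2).mean()
    return pos + w_vel * vel + w_acc * acc
```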

Handling Noisy or Partial Keyframes

Diffusion-based systems using imputation and denoising can robustly recover clean motions even when input keyframes are subjected to Gaussian noise, as in the SceneMI evaluation, which demonstrates substantial reduction in FID and mean point error with noise-aware sampling strategies (Hwang et al., 20 Mar 2025).

4. Quantitative Evaluation and Benchmarks

Evaluation protocols standardize on several quantitative metrics across datasets:

| Metric | Description |
| --- | --- |
| FID (Fréchet Distance) | Distance between embedding distributions of real vs. generated motions |
| Diversity | Mean pairwise distance among generated samples |
| Multimodality | Diversity among outputs for a fixed keyframe set |
| NPSS | Normalized Power Spectrum Similarity, a measure of temporal smoothness |
| L2Q, L2P | Mean L2 error in global joint rotations (quaternions) and in global 3D joint positions |
| Foot-skating | Percentage of frames with excessive foot slide while in contact |
| Keyframe error | Mean L2 error at constrained frames |
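
As an example of how such metrics are computed, below is a sketch of a foot-skating ratio; the horizontal-velocity threshold and the assumption of precomputed contact labels are illustrative, since benchmarks define the metric slightly differently:

```python
import numpy as np

def foot_skate_ratio(foot_pos, contact, thresh=0.025):
    """Fraction of in-contact frame transitions showing horizontal foot slide.

    foot_pos: (frames, feet, 3) world positions (y up);
    contact: (frames, feet) boolean contact labels.
    The 2.5 cm/frame threshold is an illustrative choice.
    """
    vel = np.linalg.norm(np.diff(foot_pos[..., [0, 2]], axis=0), axis=-1)  # xz speed
    in_contact = contact[1:] & contact[:-1]        # contact held across the step
    slide = (vel > thresh) & in_contact
    return slide.sum() / max(in_contact.sum(), 1)
```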

Key quantitative results include:

  • Diffusion-Transformer approach: FID as low as 0.392 (CMU, 20 keys), global diversity 2.99, and strong multimodality (up to 0.488) (Adewole et al., 10 Sep 2024).
  • Neural Motion Fields: lowest FID and foot-skating among neural-field and auto-regressive baselines for both short and long in-between intervals (He et al., 2022).
  • Conditional Diffusion (CondMDI): state-of-the-art FID (0.17–0.25), keyframe error ($0.08$–$0.37$ as keyframe count increases), and low foot-skating ($\sim 0.08$) (Cohan et al., 17 May 2024).
  • Scene-aware diffusion (SceneMI): on real IMU/video data, foot-skating reduced from $0.261$ to $0.163$, FID down to $0.118$ (vs. $3.136$ in strong baselines) (Hwang et al., 20 Mar 2025).
  • Mixture-of-Experts and Phase Models: L2P error 3.32 cm, robust foot-skating of 0.46 m/s across extreme transitions (Starke et al., 2023, Chu et al., 30 Sep 2024).

Practically, robustness is further validated by zero-shot subject/action generalization, handling of partial/VR-tracker inputs, smoothness in challenging kinematic contexts, and resistance to catastrophic failure under domain shift.

5. System Limitations and Failure Modes

Current methods exhibit several limitations:

  • Fixed output length: Many architectures (notably (Adewole et al., 10 Sep 2024)) assume a fixed sequence length due to model or data slicing, making arbitrary-duration in-betweening nontrivial without architectural modifications.
  • Sensitivity to keyframe sparsity: Performance degrades if the number of keyframes is extremely low ($|c| < 10$) or if the temporal gap is much larger than those seen in training.
  • Unrealistic/contradictory keys: Incoherent, physically implausible, or mutually inconsistent keyframes may yield implausible motion transitions.
  • Inference speed: Iterative diffusion steps (e.g., 300–1000 steps) can be slow, though DDIM sampling or progressive distillation offers acceleration at the cost of potential quality loss (Adewole et al., 10 Sep 2024).

Additional limitations in specific approaches include suboptimal handling of long chains in nonhuman skeletons (e.g., SinMDM and serpentine morphologies (Raab et al., 2023)) and lack of physical or biomechanical priors (e.g., MMM, (Pinyoanuntapong et al., 2023)).

6. Applications and Future Directions

Robust motion in-betweening now underpins interactive animation tools, zero-shot character retargeting, motion refinement from noisy sensors or video, and context-aware synthesis in AR/VR and robotics:

  • Interactive animation: Efficient in-betweeners integrated as plugins (e.g., MotionBuilder) can provide instant transition generation with user control over keyframes, style, and variation (Harvey et al., 2021, Chu et al., 30 Sep 2024).
  • Human-scene interaction: SceneMI leverages global/local descriptors to in-paint missing HSI data or reconstruct motion from monocular video (Hwang et al., 20 Mar 2025).
  • Nonhuman/creature retargeting: Single-motion and video-diffusion models enable high-quality interpolation for rigid, flexible, and exotic anatomies without large training datasets (Raab et al., 2023, Yun et al., 11 Mar 2025).

Future work aims at variable length/recursive synthesis via cascaded diffusion, multi-modal conditioning (e.g., combining text, spatial, and partial body controls), integration of explicit physics constraints for higher-order smoothness, and real-time inference acceleration. Hybrid models that combine the efficiency of shallow feed-forward architectures with the controllability of mask-conditioning and diffusion processes remain an active research direction.

7. Comparative Summary Table

| Methodology | Conditioning/Control | Robustness Mode | Sample Evaluation Result | Key Reference |
| --- | --- | --- | --- | --- |
| Diffusion + Transformer | Masked keys, indices | Generalizes across gaps, actions | FID 0.392, Diversity 2.99 | (Adewole et al., 10 Sep 2024) |
| Conditional Diffusion (CondMDI) | Mask + text, arbitrary masks | Partial/joint keys, multimodality | FID 0.17, keyframe err. 0.179 | (Cohan et al., 17 May 2024) |
| Phase MoE/DeepPhase | Phase, style, TTA, path | Extreme duration/style/path control | L2P 3.91 cm, real-time | (Chu et al., 30 Sep 2024, Starke et al., 2023) |
| Neural Motion Fields | Latent $z$, time, global root | Latent optimization for keys/styles | FID 0.024–0.365, FS 0.646–0.660 | (He et al., 2022) |
| Scene-aware Diffusion | Scene occupancy, anchors | Handles noisy keys, HSI refinement | FID 0.123, MPJPE 0.023 m | (Hwang et al., 20 Mar 2025) |
| Adversarial Recurrent | TTA, scheduled noise | Robust to context, stochastic paths | L2Q 0.17, real-time plugin demo | (Harvey et al., 2021) |
| Delta-mode Transformer | Residual to SLERP/ZV | Global shift/domain-robust | L2P 1.00 vs. SSMCT 1.10 | (Oreshkin et al., 2022) |
| Masked Token Transformer | [MASK] in-between | Fast, flexible editing | FID 0.071 (text), Diversity 9.579 | (Pinyoanuntapong et al., 2023) |
| Autoregressive GAN + DNA | Motion DNA, 2-stage GAN | Long-term, diverse, user-guided | <10 cm keyframe error, user study | (Zhou et al., 2020) |
| NDF/Riemannian Fields | Higher-order velocity/accel | Physically plausible, drift-free | FID 1.89 (vs. 2.04–3.42 baselines) | (Yu et al., 11 Sep 2025) |

These methods collectively delineate the state of the art in robust motion in-betweening, each offering architectural trade-offs for domain, latency, degree of explicit control, and breadth of robustness.
