
Robust Motion In-betweening

Updated 3 December 2025
  • Robust motion in-betweening is the process of generating smooth, physically plausible, and semantically coherent motion sequences between specified keyframes for articulated characters.
  • It leverages advanced generative models—including diffusion, transformer, adversarial, and neural-field frameworks—to manage diverse motion styles, sparse inputs, and noisy conditions.
  • Robustness is maintained through explicit keyframe conditioning, dynamic loss functions, and scene-aware descriptors, enabling realistic transitions in animation, simulation, and AI applications.

Robust motion in-betweening is the task of synthesizing temporally smooth, physically plausible, and semantically coherent sequences that connect sparse, user- or system-specified keyframes in articulated character motion. This process is fundamental to animation, simulation, and embodied AI, enabling both interactive editing and the interpolation of data-driven motion. Recent advances have focused on learning-based generative models—primarily diffusion, transformer, adversarial, and neural-field frameworks—that can robustly handle a wide range of contexts, motion styles, noisy keyframe inputs, and environmental constraints. Robustness here denotes consistent performance under diverse scene types, motion complexities, sparse keyframes, training-testing domain shifts, and input corruptions.

1. Problem Formulation, Motion Representation, and Key Challenges

The robust motion in-betweening problem is formalized as the conditional generation of a sequence $x_{1:T}$ given a sparse set of $L$ input keyframes $c = \{x_{i_1}, \dots, x_{i_L}\}$ at time indices $i_1 < \dots < i_L$, with $x_i \in \mathbb{R}^F$ representing a pose combining global translation and joint features. For each frame, the pose is most commonly parameterized as a tuple comprising the 3D root position and joint orientations in a continuous 6D representation (Adewole et al., 10 Sep 2024), though alternatives include axis-angle, quaternions, or full SMPL-X body models (Xu et al., 22 Oct 2025). The sequence is required to match the specified keyframes exactly at constrained timesteps and interpolate missing frames via plausible, temporally coherent, and stylistically appropriate transitions.
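
The continuous 6D rotation parameterization keeps per-joint features free of the discontinuities that affect Euler angles and quaternions; the full pose vector is then the 3D root position concatenated with one 6D block per joint ($F = 3 + 6J$ for $J$ joints). Below is a minimal NumPy sketch of the standard conversions (first two rotation-matrix columns out, Gram-Schmidt back in); the function names are illustrative, not from any cited codebase:

```python
import numpy as np

def rotmat_to_6d(R):
    """6D representation: concatenate the first two columns of the rotation matrix."""
    return np.concatenate([R[..., :, 0], R[..., :, 1]], axis=-1)

def sixd_to_rotmat(d6):
    """Recover a valid rotation matrix from 6D via Gram-Schmidt orthonormalization."""
    a1, a2 = d6[..., :3], d6[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1   # remove b1 component
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    b3 = np.cross(b1, b2)                                    # third column by cross product
    return np.stack([b1, b2, b3], axis=-1)                   # columns b1, b2, b3
```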

Key challenges include:

  • Sparsity and length of context: In practice, keyframes can be widely spaced (long gaps), or the available context may be as little as the two endpoints.
  • Physical realism: Outputs must maintain foot contact, avoid jitter or sliding, and respect kinematic and dynamic constraints.
  • Style and multimodality: For a given set of keyframes, multiple valid transitions may exist.
  • Generalization and robustness: Systems must operate consistently for unseen subjects, out-of-distribution motions, or when inputs are noisy or only partial (Hwang et al., 20 Mar 2025).

2. Generative Modeling Approaches for Motion In-betweening

Robust in-betweening architectures span a range of generative paradigms.

Diffusion Models

Recent advances have shown diffusion models with explicit conditioning mechanisms are highly effective. The standard Markov noising process is used, parameterized as

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

with the reverse model trained to predict the clean signal or the added noise (Adewole et al., 10 Sep 2024, Cohan et al., 17 May 2024, Pinyoanuntapong et al., 2023, Xu et al., 22 Oct 2025). Conditioning is performed via "hard imputation" at each step by replacing masked (observed) context positions with the known keyframe values, and decoding is performed with either U-Net or Transformer-based denoisers.
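
For concreteness, the sketch below shows closed-form sampling of $x_t$ directly from $x_0$ (obtained by composing the Markov steps above) and the usual noise-prediction training objective; `model`, `cond`, and the schedule constants are generic placeholders rather than any one paper's settings:

```python
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # \bar{alpha}_t = prod_s (1 - beta_s)
    return betas, alpha_bar

def denoising_loss(model, x0, cond, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) in closed form and train the model to predict the noise."""
    B = x0.shape[0]                                  # x0: (batch, frames, features)
    t = torch.randint(0, alpha_bar.numel(), (B,))    # random diffusion step per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(B, 1, 1)                  # broadcast over (frames, features)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # closed form of the Markov chain
    return ((model(x_t, t, cond) - eps) ** 2).mean() # simple eps-prediction loss
```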

Transformer-based Feed-forward Models

Encoder-only Transformers use self-attention to propagate long-range dependencies within the motion sequence. SILK demonstrates that a single six-layer Transformer, using relative pose and velocity features together with root-aligned, 6D rotation representations, achieves state-of-the-art in-betweening without multi-stage or skeleton-aware modifications, provided sufficient training data and appropriate pose modeling (Akhoundi et al., 9 Jun 2025).

Latent and Neural-Field Models

Variational models (NeMF, NRMF) learn implicit motion manifolds:

  • Neural Motion Fields map continuous time $t$ and a latent vector $z$ to pose space, with $z$ optimized at inference to satisfy keyframe or style constraints, enabling robust interpolation even from very sparse observations (He et al., 2022); a minimal sketch of this test-time fitting follows the list below.
  • Neural Riemannian Motion Fields introduce neural distance fields for pose, velocity, and acceleration, enforcing proximity to a learned motion manifold at test time via projection and geometric integration (Yu et al., 11 Sep 2025). This regularizes higher-order dynamics, reducing drift and oversmoothing.
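
As referenced above, the neural-motion-field test-time fitting can be sketched as follows; `field(t, z)` and the plain MSE keyframe objective are illustrative stand-ins for the actual NeMF interface and losses:

```python
import torch

def fit_latent(field, keyframes, key_times, latent_dim, steps=500, lr=1e-2):
    """Optimize a latent code z so the motion field passes through sparse keyframes.

    field(t, z): maps continuous time and a latent code to a pose vector
    (hypothetical interface). keyframes: (L, F) tensor of target poses.
    """
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.stack([field(t, z) for t in key_times])
        loss = ((pred - keyframes) ** 2).mean()      # keyframe-fitting term
        loss.backward()
        opt.step()
    return z  # decode the full clip by evaluating field(t, z) at dense t
```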

Adversarial and Mixture-of-Experts Architectures

Adversarial recurrent networks combine LSTM generators with hybrid losses (reconstruction and LSGAN) and use explicit time-to-arrival and additive target noise embeddings for variable-length, robust transitions (Harvey et al., 2021). Mixture-of-experts with phase manifolds or gating on style/time-to-arrive improve diversity and controllability (Starke et al., 2023, Chu et al., 30 Sep 2024).

3. Conditioning and Robustness Mechanisms

Crucial to robustness is the strategy by which model inputs are mapped to desired constraints during both training and inference:

Keyframe and Mask Conditioning

Robust architectures use an explicit binary mask $m$ to signal which elements of the sequence are constrained at each timestep, with the masked values "imputed" at every denoising or prediction step (Cohan et al., 17 May 2024, Hwang et al., 20 Mar 2025). This guarantees exact keyframe reconstruction, independent of temporal density or spatial sparsity.
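
The imputation itself is a one-line masked blend, re-applied after every step; the reverse-loop placement shown in comments is schematic, and `denoise_step` is a hypothetical helper:

```python
def impute(x, keyframes, mask):
    """Overwrite constrained entries with the known keyframe values.

    mask is 1 where a value is observed and 0 elsewhere; re-applying this
    after every denoising step pins the constrained frames exactly.
    """
    return mask * keyframes + (1.0 - mask) * x

# schematic reverse-diffusion loop:
#   for t in reversed(range(T)):
#       x = denoise_step(model, x, t)     # hypothetical one-step denoiser
#       x = impute(x, keyframes, mask)    # re-impose keyframe constraints
```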

Domain- and Task-specific Conditioning

  • Scene-aware descriptors: For human-scene interaction, dual descriptors—coarse global (occupancy grid via ViT) and local per-keyframe (surface anchor offsets)—integrate environmental constraints directly into model inputs (Hwang et al., 20 Mar 2025).
  • Reference motion tokens: Whole clips or partial reference sequences (e.g., endpoints, trajectory samples) are encoded and concatenated in transformer-based systems to enforce long-range temporal structure (Xu et al., 22 Oct 2025).
  • Style, duration, and path: Temporal and stylistic control is implemented by directly injecting time-to-arrival (TTA) embeddings, one-hot or continuous style vectors, and explicit root-trajectory windows into MoE gating networks for versatile user control (Chu et al., 30 Sep 2024, Dai et al., 11 Mar 2025); a sketch of a sinusoidal TTA embedding follows below.
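
A sinusoidal TTA embedding in the positional-encoding style (in the spirit of Harvey et al., 2021) can be sketched as follows; the dimension and period constants are illustrative assumptions:

```python
import numpy as np

def tta_embedding(tta, dim=64, max_period=1000.0):
    """Sinusoidal time-to-arrival embedding, analogous to positional encodings.

    tta: number of frames remaining until the target keyframe.
    """
    freqs = np.exp(-np.log(max_period) * np.arange(0, dim, 2) / dim)
    angles = tta * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # length-dim vector
```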

Data and Loss Strategies

  • Delta-mode (Δ) architectures: Models operating in a local delta-regime—predicting only residuals with respect to a simple baseline (such as SLERP or zero-velocity)—are empirically more robust, generalizing well across global frame shifts or missing normalization steps (Oreshkin et al., 2022).
  • Higher-order loss terms: Imposing losses on velocities, accelerations, and global position power spectra (NPSS) reduces jitter, foot-sliding, and preserves long-term coherence (Yu et al., 11 Sep 2025, Dai et al., 11 Mar 2025).
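
In practice, the higher-order terms reduce to finite differences of the predicted and ground-truth sequences. A minimal sketch with illustrative weights (papers differ in exactly which quantities, local or global, they penalize):

```python
import torch

def higher_order_loss(pred, target, w_vel=1.0, w_acc=1.0):
    """Position + velocity + acceleration reconstruction loss.

    pred, target: (frames, features). Finite differences approximate the
    velocity and acceleration terms.
    """
    d = lambda x: x[1:] - x[:-1]                      # first temporal difference
    pos = ((pred - target) ** 2).mean()
    vel = ((d(pred) - d(target)) ** 2).mean()
    acc = ((d(d(pred)) - d(d(target))) ** 2).mean()
    return pos + w_vel * vel + w_acc * acc
```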

Handling Noisy or Partial Keyframes

Diffusion-based systems using imputation and denoising can robustly recover clean motions even when input keyframes are subjected to Gaussian noise, as in the SceneMI evaluation, which demonstrates substantial reduction in FID and mean point error with noise-aware sampling strategies (Hwang et al., 20 Mar 2025).

4. Quantitative Evaluation and Benchmarks

Evaluation protocols standardize on several quantitative metrics across datasets:

| Metric | Description |
| --- | --- |
| FID (Fréchet Distance) | Distance between embedding distributions of real vs. generated motions |
| Diversity | Mean pairwise distance among generated samples |
| Multimodality | Diversity among outputs for a fixed keyframe set |
| NPSS | Normalized Power Spectrum Similarity, a measure of temporal smoothness |
| L2Q, L2P | Mean L2 error in global joint rotations (quaternions) and in global 3D joint positions |
| Foot-skating | Percentage of frames with excessive foot slide while in contact |
| Keyframe error | Mean L2 error at constrained frames |
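
As an example of how such metrics are computed, below is a sketch of a foot-skating ratio; the horizontal-velocity threshold and the assumption of precomputed contact labels are illustrative, since benchmarks define the metric slightly differently:

```python
import numpy as np

def foot_skate_ratio(foot_pos, contact, thresh=0.025):
    """Fraction of in-contact frame transitions showing horizontal foot slide.

    foot_pos: (frames, feet, 3) world positions (y up);
    contact: (frames, feet) boolean contact labels.
    The 2.5 cm/frame threshold is an illustrative choice.
    """
    vel = np.linalg.norm(np.diff(foot_pos[..., [0, 2]], axis=0), axis=-1)  # xz speed
    in_contact = contact[1:] & contact[:-1]        # contact held across the step
    slide = (vel > thresh) & in_contact
    return slide.sum() / max(in_contact.sum(), 1)
```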

Key quantitative results include:

  • Diffusion-Transformer approach: FID as low as 0.392 (CMU, 20 keys), global diversity 2.99, and strong multimodality (up to 0.488) (Adewole et al., 10 Sep 2024).
  • Neural Motion Fields: lowest FID and foot-skating among neural-field and auto-regressive baselines for both short and long in-between intervals (He et al., 2022).
  • Conditional Diffusion (CondMDI): state-of-the-art FID (0.17–0.25), keyframe error ($0.08$–$0.37$ as keyframe count increases), and low foot-skating ($\sim 0.08$) (Cohan et al., 17 May 2024).
  • Scene-aware diffusion (SceneMI): on real IMU/video data, foot-skating reduced from $0.261$ to $0.163$, FID down to $0.118$ (vs. $3.136$ in strong baselines) (Hwang et al., 20 Mar 2025).
  • Mixture-of-Experts and Phase Models: L2P error 3.32 cm, robust foot-skating of 0.46 m/s across extreme transitions (Starke et al., 2023, Chu et al., 30 Sep 2024).

Practically, robustness is further validated by zero-shot subject/action generalization, handling of partial/VR-tracker inputs, smoothness in challenging kinematic contexts, and resistance to catastrophic failure under domain shift.

5. System Limitations and Failure Modes

Current methods exhibit several limitations:

  • Fixed output length: Many architectures (notably (Adewole et al., 10 Sep 2024)) assume a fixed sequence length due to model or data slicing, making arbitrary-duration in-betweening nontrivial without architectural modifications.
  • Sensitivity to keyframe sparsity: Performance degrades if the number of keyframes is extremely low ($|c| < 10$) or if the temporal gap is much larger than those seen in training.
  • Unrealistic/contradictory keys: Incoherent, physically implausible, or mutually inconsistent keyframes may yield implausible motion transitions.
  • Inference speed: Iterative diffusion steps (e.g., 300–1000 steps) can be slow, though DDIM sampling or progressive distillation offers acceleration at the cost of potential quality loss (Adewole et al., 10 Sep 2024).

Additional limitations in specific approaches include suboptimal handling of long chains in nonhuman skeletons (e.g., SinMDM and serpentine morphologies (Raab et al., 2023)) and lack of physical or biomechanical priors (e.g., MMM, (Pinyoanuntapong et al., 2023)).

6. Applications and Future Directions

Robust motion in-betweening now underpins interactive animation tools, zero-shot character retargeting, motion refinement from noisy sensors or video, and context-aware synthesis in AR/VR and robotics:

  • Interactive animation: Efficient in-betweeners integrated as plugins (e.g., MotionBuilder) can provide instant transition generation with user control over keyframes, style, and variation (Harvey et al., 2021, Chu et al., 30 Sep 2024).
  • Human-scene interaction: SceneMI leverages global/local descriptors to in-paint missing HSI data or reconstruct motion from monocular video (Hwang et al., 20 Mar 2025).
  • Nonhuman/creature retargeting: Single-motion and video-diffusion models enable high-quality interpolation for rigid, flexible, and exotic anatomies without large training datasets (Raab et al., 2023, Yun et al., 11 Mar 2025).

Future work aims at variable length/recursive synthesis via cascaded diffusion, multi-modal conditioning (e.g., combining text, spatial, and partial body controls), integration of explicit physics constraints for higher-order smoothness, and real-time inference acceleration. Hybrid models that combine the efficiency of shallow feed-forward architectures with the controllability of mask-conditioning and diffusion processes remain an active research direction.

7. Comparative Summary Table

| Methodology | Conditioning/Control | Robustness Mode | Sample Evaluation Result | Key Reference |
| --- | --- | --- | --- | --- |
| Diffusion + Transformer | Masked keys, indices | Generalizes across gaps, actions | FID 0.392, Diversity 2.99 | (Adewole et al., 10 Sep 2024) |
| Conditional Diffusion (CondMDI) | Mask + text, arbitrary masks | Partial/joint keys, multimodality | FID 0.17, keyframe err. 0.179 | (Cohan et al., 17 May 2024) |
| Phase MoE/DeepPhase | Phase, style, TTA, path | Extreme duration/style/path control | L2P 3.91 cm, real-time | (Chu et al., 30 Sep 2024, Starke et al., 2023) |
| Neural Motion Fields | Latent $z$, time, global root | Latent optimization for keys/styles | FID 0.024–0.365, FS 0.646–0.660 | (He et al., 2022) |
| Scene-aware Diffusion | Scene occupancy, anchors | Handles noisy keys, HSI refinement | FID 0.123, MPJPE 0.023 m | (Hwang et al., 20 Mar 2025) |
| Adversarial Recurrent | TTA, scheduled noise | Robust to context, stochastic paths | L2Q 0.17, real-time plugin demo | (Harvey et al., 2021) |
| Delta-mode Transformer | Residual to SLERP/ZV | Global shift/domain-robust | L2P 1.00 vs. SSMCT 1.10 | (Oreshkin et al., 2022) |
| Masked Token Transformer | [MASK] in-between | Fast, flexible editing | FID 0.071 (text), Diversity 9.579 | (Pinyoanuntapong et al., 2023) |
| Autoregressive GAN + DNA | Motion DNA, 2-stage GAN | Long-term, diverse, user-guided | <10 cm keyframe error, user study | (Zhou et al., 2020) |
| NDF/Riemannian Fields | Higher-order velocity/accel | Physically plausible, drift-free | FID 1.89 (vs. 2.04–3.42 baselines) | (Yu et al., 11 Sep 2025) |

These methods collectively delineate the state of the art in robust motion in-betweening, each offering architectural trade-offs for domain, latency, degree of explicit control, and breadth of robustness.
