Motion Capture Retargeting
- Motion capture retargeting maps captured motion data onto diverse target skeletons while ensuring semantic retention and physical plausibility.
- It employs hybrid methodologies, including skeleton- and geometry-aware architectures, disentanglement techniques, and optimization of energy functions to achieve high-fidelity retargeting.
- Recent advances integrate vision-language models and reinforcement learning to enhance semantic alignment and dynamic constraints in applications such as animation, robotics, and telepresence.
Motion capture retargeting is the task of transferring a captured motion sequence from a source character, often with an arbitrary skeleton or geometry, to a target character with potentially different morphology, topology, or physical constraints. The objective is to generate a target motion that preserves the semantics, style, and physical validity of the source while respecting the target's structural or dynamic idiosyncrasies. This challenge underpins applications in character animation, digital avatars, robotics, telepresence, and cross-species motion analysis.
1. Problem Formulation and Principles
The core challenge is to define a mapping that preserves semantic intent and physical plausibility, despite possibly drastic differences in skeleton structure, mesh geometry, or dynamical constraints.
A typical source motion may be represented as joint angle trajectories (a sequence of T frames over J joints, each with representation dimension D), surface meshes, 2D/3D keypoint tracks, or contact events. The target may differ in skeleton topology (e.g., different kinematic trees), bone lengths, skinning geometry, or degrees of freedom (e.g., non-humanoid robots, animals, or prosthetic hands).
Recent frameworks formalize this either as (a) a learning problem, training a model to infer the mapping from paired or unpaired data, or (b) an optimization problem, defining energy functions or constraints that encode structure preservation, semantics, contact, or physics, and solving for the best-matching target motion.
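The optimization view can be illustrated with a toy planar example. This is a sketch under simplifying assumptions (a 2D kinematic chain, finite-difference gradients, and end-effector position as the only "semantic" target); the names `fk` and `retarget_frame` are illustrative, not taken from any cited system:

```python
import numpy as np

def fk(theta, bones):
    """Planar forward kinematics: joint angles -> joint positions, shape (J+1, 2)."""
    angles = np.cumsum(theta)
    steps = np.stack([np.cos(angles), np.sin(angles)], axis=-1) * bones[:, None]
    return np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])

def retarget_frame(src_theta, src_bones, tgt_bones, iters=500, lr=0.1, eps=1e-5):
    """Minimize an energy (end-effector deviation) over the target's joint angles."""
    goal = fk(src_theta, src_bones)[-1]   # semantic goal extracted from the source
    theta = src_theta.copy()              # initialize from the source pose
    for _ in range(iters):
        grad = np.zeros_like(theta)
        for j in range(len(theta)):       # finite-difference gradient of the energy
            tp, tm = theta.copy(), theta.copy()
            tp[j] += eps
            tm[j] -= eps
            grad[j] = (np.sum((fk(tp, tgt_bones)[-1] - goal) ** 2)
                       - np.sum((fk(tm, tgt_bones)[-1] - goal) ** 2)) / (2 * eps)
        theta -= lr * grad
    return theta
```

Real systems replace the end-effector term with the richer energies discussed below (contacts, semantics, physics) and solve over whole sequences rather than single frames.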
2. Methodological Advances
The past years have witnessed a proliferation of methods designed to address both skeletal and geometric correspondence, semantics, and physical constraints.
2.1 Skeleton- and Geometry-Aware Architectures
Approaches such as "Skeleton-Aware Networks" introduce differentiable pooling/unpooling operators that map motions of homeomorphic skeletons into and out of a shared latent space, facilitating unpaired retargeting between characters with similarly structured but differently sampled kinematic chains (Aberman et al., 2020). Primal skeleton construction reduces the skeletons to common graphs, and skeleton-aware convolutions exploit hierarchical structure for temporal feature extraction.
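A minimal sketch of the pooling idea follows; the chain grouping, array shapes, and mean-pooling choice are illustrative assumptions, not the paper's actual operators:

```python
import numpy as np

# Hypothetical chain grouping for a 7-joint toy skeleton.
CHAINS = {"spine": [0, 1, 2], "l_arm": [3, 4], "r_arm": [5, 6]}

def skeleton_pool(features, chains):
    """Merge per-joint features (T, J, C) along each chain -> (T, K, C),
    so differently sampled chains share the same 'primal' representation."""
    return np.stack([features[:, idx].mean(axis=1) for idx in chains.values()], axis=1)

def skeleton_unpool(pooled, chains, num_joints):
    """Broadcast each primal-joint feature back to its member joints."""
    T, _, C = pooled.shape
    out = np.zeros((T, num_joints, C))
    for k, idx in enumerate(chains.values()):
        out[:, idx] = pooled[:, k:k + 1]
    return out
```

Two skeletons whose chains contain different numbers of joints pool to the same `(T, K, C)` latent, which is what lets the shared latent space mediate unpaired retargeting.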
To address contact, penetration, and interaction fidelity, methods such as MeshRet develop dense mesh interaction (DMI) fields grounded by semantically consistent sensors, capturing pairwise spatial relations between mesh regions and aligning these fields during retargeting (Ye et al., 2024). This ensures both contact preservation and reduced interpenetration, outperforming skeleton-only approaches in geometric accuracy and user preference.
2.2 Disentanglement and Invariance
A recurring design is the explicit disentanglement of motion, structure, and view. Invariance-driven models such as TransMoMo employ auto-encoders and invariance losses to separate motion codes from skeleton and camera, enabling motion transfer across large body-shape and view disparities without requiring paired data (Yang et al., 2020). Canonicalization operations in MoCaNet further refine this separation, enforcing that the reconstructed canonical skeleton is invariant to structure or view perturbations, which drastically reduces 3D joint MSE in synthetic and in-the-wild settings (Zhu et al., 2021).
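The invariance objective can be illustrated with a toy rotation-invariant "motion code"; per-frame pairwise joint distances stand in for the learned code here, so this is a sketch of the training signal, not TransMoMo's actual encoder:

```python
import numpy as np

def rotate_view(seq, angle):
    """Rotate a joint-position sequence (T, J, 3) about the vertical (z) axis."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return seq @ R.T

def motion_code(seq):
    """Toy view-invariant code: per-frame pairwise joint distances, (T, J, J)."""
    diff = seq[:, :, None, :] - seq[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

def invariance_loss(seq, angle):
    """Penalty the encoder would minimize: code drift under a view change."""
    return float(np.mean((motion_code(seq) - motion_code(rotate_view(seq, angle))) ** 2))
```

A learned encoder starts with no such invariance; the loss drives its motion code toward the behavior this hand-built code exhibits by construction.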
Personalized face/hand retargeting architectures use deformation spaces and learned corrective fields that decouple identity/expression (blendshape) or functional contact (atlas-based non-isometric shape matching), resulting in accurate transfer of nuanced, individualized dynamics (Chaudhuri et al., 2020, Lakshmipathy et al., 2024).
2.3 Semantics and Vision-Language Models
Preserving high-level motion semantics, beyond joint-space alignment, demands supervision at a semantic level. Recent advances, exemplified by "Semantics-aware Motion reTargeting" (SMT), employ off-the-shelf vision-language models (VLMs) such as BLIP-2 to extract semantic embeddings from rendered images of retargeted motions. By minimizing the embedding distance between source and target across corresponding frames, the pipeline preserves not only motion details but also semantic intent ("what is the character doing") (Zhang et al., 2023). Combined with geometric (e.g., interpenetration) losses, this approach yields state-of-the-art tradeoffs in MSE, contact fidelity, and semantic alignment.
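Stripped of the VLM itself, the semantic alignment term reduces to an embedding-distance loss. In the sketch below the per-frame embeddings are assumed precomputed (in SMT they would come from BLIP-2 on rendered frames), and cosine distance is one plausible choice of metric:

```python
import numpy as np

def semantic_loss(src_emb, tgt_emb):
    """Mean cosine distance between per-frame semantic embeddings, shape (T, D)."""
    src = src_emb / np.linalg.norm(src_emb, axis=-1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(src * tgt, axis=-1)))
```

The loss is 0 when source and target frames embed identically and approaches 1 as their semantic embeddings become orthogonal.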
2.4 Skeleton-Agnostic and Part-based Approaches
To address non-homeomorphic or arbitrary-topology targets, skeleton-agnostic encoders (e.g., PALUM) partition joints into semantic groups (e.g., torso, limbs, head) and use part-wise and cross-part attention pooled into a unified latent, followed by target-aware decoding (Liu et al., 12 Jan 2026). Cycle consistency is used to promote semantic preservation, and the approach generalizes across diverse skeletons (e.g., humanoids, animals, robots), outperforming prior masked-transformer and per-part processing methods in both intra- and cross-structural settings.
2.5 Physical Plausibility and Reinforcement Learning
Physics-based retargeting leverages simulation and reinforcement learning to ensure dynamic feasibility. Imitation learning is guided by reward terms enforcing not only kinematic fidelity (joint angles, velocities, end-effector positions) but also contact events and action smoothness (Reda et al., 2023; Zhang et al., 2023; Yoon et al., 2024). Multi-character scenarios are modeled with interaction graphs, using edge-based rewards over key marker pairs (across and within bodies) to preserve interplay, balance, and coordinated social cues (e.g., high-fives, assists, joint-carrying) (Zhang et al., 2023).
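A hedged sketch of such a composite reward is below; the weights, exponential kernels, and data layout are illustrative assumptions, not any specific paper's formulation:

```python
import numpy as np

def retarget_reward(ref, sim, prev_action, action, w=(0.6, 0.3, 0.1)):
    """ref/sim: dicts with 'joints' (J, 3) positions and 'contacts' (bool per foot)."""
    w_imit, w_contact, w_smooth = w
    # Kinematic fidelity: exponentiated joint-position tracking error.
    r_imit = np.exp(-2.0 * np.mean(np.sum((ref["joints"] - sim["joints"]) ** 2, axis=-1)))
    # Contact fidelity: fraction of matching contact events.
    r_contact = np.mean(ref["contacts"] == sim["contacts"])
    # Action smoothness: penalize large frame-to-frame action changes.
    r_smooth = np.exp(-0.5 * np.sum((action - prev_action) ** 2))
    return float(w_imit * r_imit + w_contact * r_contact + w_smooth * r_smooth)
```

With weights summing to 1, the reward peaks at 1.0 when the simulated character tracks the reference exactly, matches its contacts, and holds its action steady.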
3. Objective Functions and Training
Retargeting systems optimize composite objectives comprising several loss terms:
- Reconstruction Losses: Enforce self-consistency (reproduce original when source=target) and cycle-consistency (A→B→A).
- Adversarial Losses: Instantaneous (frame-based) or temporal (sequence-wise) discriminators judge plausibility, promoting realistic dynamics (Yang et al., 2020; Liu et al., 12 Jan 2026; Zhao et al., 2023).
- Semantic Alignment: Semantic embedding alignment with VLMs (Zhang et al., 2023) or canonical space embedding (Zhu et al., 2021).
- Contact and Interpenetration: Direct geometric penalties for violation of contacts, or for penetration events (Ye et al., 2024; Villegas et al., 2021).
- Skeleton and Geometry Preservation: Joint-distance matrix or DMI alignment (Ye et al., 2024; Aberman et al., 2020).
- Physically Motivated Terms: Reward functions in RL—imitation, contact, root, CoM tracking (Reda et al., 2023; Zhang et al., 2023).
- Regularization: Velocity smoothness, Laplacian smoothing, constraint guarantees via nonparametric regression in robot retargeting (Choi et al., 2021).
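Putting a few of the terms above together, a sketch of the composite objective (the weights and the second-difference regularizer are illustrative choices):

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """MSE between motions, e.g. (T, J, 3) joint positions."""
    return float(np.mean((x - x_hat) ** 2))

def velocity_smoothness(x):
    """Penalize frame-to-frame acceleration via second differences."""
    acc = x[2:] - 2.0 * x[1:-1] + x[:-2]
    return float(np.mean(acc ** 2))

def total_loss(src, recon, cycle, retargeted, w_rec=1.0, w_cyc=1.0, w_reg=0.1):
    """Weighted sum: self-reconstruction + A->B->A cycle + regularization."""
    return (w_rec * reconstruction_loss(src, recon)
            + w_cyc * reconstruction_loss(src, cycle)
            + w_reg * velocity_smoothness(retargeted))
```

Adversarial, semantic, and contact terms slot into the same weighted sum; in practice the weights are tuned per system and often scheduled over training.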
4. Benchmark Datasets, Metrics, and Validation
Evaluation employs datasets such as Mixamo, ScanRet, Truebones Zoo, AMASS, and GRAB, covering synthetic, real, and in-the-wild data (Zhang et al., 2023; Ye et al., 2024; Gong et al., 11 Dec 2025; Lakshmipathy et al., 2024). Metric choice depends on modality:
| Metric | Application | Definition/Detail |
|---|---|---|
| Skeleton MSE | Global and local joints | Mean squared error of global and local joint positions (Zhang et al., 2023) |
| Contact Error (CE) | Contact preservation | Vertex-based deviation from contact, only where the original had contact |
| Penetration (PEN%) | Physical plausibility | Fraction of penetrated vertices (Zhang et al., 2023; Ye et al., 2024) |
| Semantic (ITM, FID, SCL) | Semantic match | Image-text matching, FID on embeddings, SCL as embedding MSE |
| MPJPE, CD-Skeleton | Skeletal pose accuracy | Mean per-joint position error, chamfer L1/L2 over skeleton points |
| User Study/Binary Preference | Perceptual evaluation | Mean human preference over baselines (overall, semantics, artifacts) |
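Two of the table's metrics admit compact reference implementations. This is a sketch: the axis conventions (y-up with the floor at 0 for the contact term) are assumptions, and published variants differ in detail:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error over (T, J, 3) arrays, in the input's units."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def contact_error(pred_verts, contact_mask, floor_y=0.0):
    """Mean deviation from the floor for vertices the source marked as in contact."""
    if not contact_mask.any():
        return 0.0
    return float(np.mean(np.abs(pred_verts[contact_mask, 1] - floor_y)))
```

Note that MPJPE is usually reported after root alignment, and contact error only counts frames where the source motion asserted contact, matching the table's definition.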
Recent works systematically report ablations (removal of canonicalization, semantic or geometric penalization, one-stage vs. two-stage training), demonstrating that omitting key disentanglement or contact losses leads to statistical and perceptual degradation (Zhang et al., 2023; Ye et al., 2024; Liu et al., 12 Jan 2026).
5. Specialized Domains and Extensions
Motion capture retargeting operates not only for full bodies but also for specialized domains:
- Hand-object manipulation: Contact-rich retargeting via non-isometric matching, chart atlases, and constrained optimization supports diverse hands (human, alien, prosthetic) and object substitution (Lakshmipathy et al., 2024).
- Face retargeting: Personalized 3DMMs with learned geometric and albedo corrections enhance expression transfer and expression-specific reflectance for arbitrary individuals (Chaudhuri et al., 2020).
- Robot retargeting: MR HuBo and S3LE combine robot-to-human pose pairing, human body priors, and projection-invariant latent spaces to yield safe, high-fidelity mapping even without direct human-to-robot paired data, supported by filtering and nonparametric regression (Figuera et al., 2024; Choi et al., 2021).
- Category-agnostic motion capture: MoCapAnything implements prompt-driven retargeting from monocular video to arbitrary rigged assets, employing a factorized architecture, per-joint referencing, and hybrid analytical/numerical IK for high anatomical plausibility (Gong et al., 11 Dec 2025).
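The analytical half of such a hybrid IK can be sketched for the simplest case, a planar two-link chain. This is a toy version: the single elbow-down branch and the clamping of unreachable targets are assumptions of this sketch, not details of MoCapAnything:

```python
import numpy as np

def two_link_ik(target, l1, l2):
    """Closed-form IK for a planar 2-link chain; returns (shoulder, elbow) angles.
    Targets outside the reachable annulus are clamped to its boundary."""
    x, y = target
    d = np.clip(np.hypot(x, y), abs(l1 - l2) + 1e-9, l1 + l2 - 1e-9)
    cos_elbow = (d ** 2 - l1 ** 2 - l2 ** 2) / (2.0 * l1 * l2)   # law of cosines
    elbow = np.arccos(np.clip(cos_elbow, -1.0, 1.0))
    shoulder = np.arctan2(y, x) - np.arctan2(l2 * np.sin(elbow), l1 + l2 * np.cos(elbow))
    return shoulder, elbow

def two_link_fk(shoulder, elbow, l1, l2):
    """Forward kinematics used to verify the IK solution."""
    x = l1 * np.cos(shoulder) + l2 * np.cos(shoulder + elbow)
    y = l1 * np.sin(shoulder) + l2 * np.sin(shoulder + elbow)
    return np.array([x, y])
```

Hybrid pipelines use closed-form solutions like this where the sub-chain admits one, falling back to numerical optimization for longer or constrained chains.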
6. Limitations and Future Directions
Key limitations include:
- Reliance on 2D vision-language models or 2D projections may induce loss of 3D semantics (Zhang et al., 2023).
- Skeleton-agnostic models may not fully capture differences in kinematic chain lengths or non-humanoid topologies (Liu et al., 12 Jan 2026).
- Dynamic constraints are often only indirectly enforced; explicit physics-aware or differentiable simulation-guided retargeting remains an open problem (Reda et al., 2023).
- Generalization to multi-character, non-rigid, or deformable bodies, as well as end-to-end semantic supervision at the mesh or video level, remain active research areas.
Suggested future advances include the incorporation of 3D VLMs for direct mesh/point-cloud semantic supervision, richer temporal semantic models, improved chain structure encoding, and fully differentiable sim-to-real pipelines that directly optimize for anatomical and physical plausibility (Zhang et al., 2023; Gong et al., 11 Dec 2025; Liu et al., 12 Jan 2026).
7. Impact and Outlook
Motion capture retargeting has evolved from analytical kinematic mapping to highly expressive, physically plausible, and semantically-preserving methods that support arbitrary morphologies and modalities. State-of-the-art systems combine skeleton- and mesh-level reasoning, self-supervised semantic learning, dense contact and physical constraint modeling, and transformer-based part-agnostic architectures. The field continues to advance toward unified, category-agnostic, and semantics-aware frameworks suitable for emerging applications in animation, robotics, biomechanics, and cross-modal interaction.
Key references: (Zhang et al., 2023, Ye et al., 2024, Yang et al., 2020, Liu et al., 12 Jan 2026, Reda et al., 2023, Gong et al., 11 Dec 2025, Zhu et al., 2021, Chaudhuri et al., 2020, Lakshmipathy et al., 2024, Rekik et al., 2023, Villegas et al., 2021, Mokady et al., 2021, Figuera et al., 2024, Zhao et al., 2023).