MeshRet: Dense Mesh Motion Retargeting
- MeshRet is a framework that uses dense mesh interactions and semantically consistent sensors to retarget skinned motion while preserving contacts.
- It employs a dense mesh interaction field that encodes both local and global geometric relationships to ensure accurate spatio-temporal alignment.
- Its unified architecture integrates geometry processing, unsupervised loss objectives, and Transformer networks to overcome common two-stage retargeting challenges.
MeshRet is a framework for skinned motion retargeting that directly leverages dense geometric interactions between character meshes rather than relying on the conventional two-stage paradigm of skeleton retargeting followed by geometry correction. It introduces semantically consistent sensors (SCS) to establish dense, semantically aligned correspondences across character meshes and employs a dense mesh interaction (DMI) field to represent and align spatio-temporal geometric relationships. This unified strategy preserves both contact and non-contact relations, which discourages self-interpenetration and maintains contact fidelity during motion transfer. State-of-the-art results are reported on both synthetic and real human datasets (Ye et al., 2024).
1. Motion Retargeting Paradigms and Challenges
Motion retargeting historically involves mapping source character motions onto new target rigged meshes. Traditional systems adopt a two-stage process: first retargeting skeletal motion, followed by a geometry correction phase that seeks to resolve emerging artifacts such as self-interpenetration or contact mismatch. This staged approach introduces conflicts—geometry corrections enacted post hoc can disrupt nuanced interactions governed by the underlying skeleton, leading to issues with jitter, interpenetration, and loss of meaningful contacts.
MeshRet departs from this standard by reasoning directly on geometry throughout the retargeting task. Rather than decoupling skeleton and surface, it maintains a persistent representation of geometric relationships, ensuring that fine-grained contacts and non-contact interactions are respected in the transferred motion. This is achieved via the direct modeling of dense mesh-to-mesh interaction cues, eliminating the separation between skeletal and geometric processing.
2. Semantically Consistent Sensors (SCS)
Central to MeshRet is the extraction of semantically consistent sensor (SCS) features across diverse character topologies. For any template mesh $\mathcal{M}$ and skeletal joint configuration $\mathcal{J}$, a global vocabulary of semantic coordinates is defined by triplets $(b, l, \phi)$, where $b$ indicates the bone index, $l$ the normalized longitudinal distance on the medial axis, and $\phi$ an angular coordinate orthogonal to the bone. For each $(b, l, \phi)$, a ray from the bone axis is cast into the mesh to obtain an intersection position $p$ and an associated tangent frame $R$.
Each valid sensor is thus $s_i = (p_i, R_i)$, with the total set:
$$\mathcal{S} = \{\, s_i \,\}_{i=1}^{S}$$
Because the triplet parameterization is consistent across character meshes, SCS induce dense, semantically aligned correspondences even for meshes with heterogeneous topology. This alignment underpins robust transfer of geometric interaction patterns.
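The sensor placement described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of reference direction for the angle, the frame convention (columns are tangent, bitangent, normal), and the toy prism mesh are all assumptions made for the example. The ray test is the standard Möller-Trumbore algorithm.

```python
import numpy as np

def ray_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore ray/triangle test; returns the hit distance or None."""
    v0, v1, v2 = tri
    e1, e2 = v1 - v0, v2 - v0
    h = np.cross(direction, e2)
    det = e1 @ h
    if abs(det) < eps:                      # ray is parallel to the triangle plane
        return None
    s = origin - v0
    u = (s @ h) / det
    q = np.cross(s, e1)
    v = (direction @ q) / det
    if u < 0 or v < 0 or u + v > 1:         # intersection lies outside the triangle
        return None
    t = (e2 @ q) / det
    return t if t > eps else None           # keep only hits in front of the origin

def cast_sensor(head, tail, l, phi, verts, faces):
    """Place one sensor for semantic coordinate (bone, l, phi); None if no hit."""
    axis = (tail - head) / np.linalg.norm(tail - head)
    # Assumed convention: phi is measured from an arbitrary fixed reference axis.
    helper = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, helper)
    u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    origin = head + l * (tail - head)       # point on the bone axis
    direction = np.cos(phi) * u + np.sin(phi) * v
    best = None
    for f in faces:
        tri = verts[list(f)]
        t = ray_triangle(origin, direction, tri)
        if t is not None and (best is None or t < best[0]):
            best = (t, tri)
    if best is None:
        return None                         # invalid sensor: the ray missed the mesh
    t, tri = best
    position = origin + t * direction
    normal = np.cross(tri[1] - tri[0], tri[2] - tri[0])
    normal /= np.linalg.norm(normal)
    if normal @ direction < 0:              # orient the normal outward
        normal = -normal
    tangent = axis - (axis @ normal) * normal
    if np.linalg.norm(tangent) < 1e-8:      # bone axis parallel to the normal
        tangent = u - (u @ normal) * normal
    tangent /= np.linalg.norm(tangent)
    bitangent = np.cross(normal, tangent)
    return position, np.stack([tangent, bitangent, normal], axis=1)

# Toy mesh (hypothetical): four side walls of a square prism around the z-axis.
verts = np.array([[-1, -1, 0], [1, -1, 0], [1, 1, 0], [-1, 1, 0],
                  [-1, -1, 2], [1, -1, 2], [1, 1, 2], [-1, 1, 2]], dtype=float)
faces = []
for i in range(4):
    j = (i + 1) % 4
    faces += [(i, j, j + 4), (i, j + 4, i + 4)]

head, tail = np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 2.0])
position, frame = cast_sensor(head, tail, l=0.25, phi=0.0, verts=verts, faces=faces)
```

Because the same `(bone, l, phi)` grid can be evaluated on any rigged character, sensors with the same index land on semantically matching surface locations regardless of the mesh topology.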
3. Dense Mesh Interaction (DMI) Field
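The DMI field encodes dense local and global geometric interactions among mesh surface locations as tracked by SCS. For time frame $t$, the SCS positions $p_i^t$ and tangent frames $R_i^t$ are updated from their rest-pose values via forward kinematics using linear blend skinning (LBS):
$$p_i^t = \sum_{b} w_{i,b}\, T_b^t\, p_i$$
where $w_{i,b}$ are LBS weights and $T_b^t$ the SE(3) bone transforms. The tangent frames are transformed analogously by the rotational part of the blended transform.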
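A minimal NumPy sketch of this skinning step is given below. The array layout and the function name `lbs_update` are assumptions for illustration. Blending rigid transforms linearly does not generally yield an orthonormal frame, so the sketch projects each blended frame back onto a rotation with an SVD; that projection is an added assumption, not something the source specifies.

```python
import numpy as np

def lbs_update(rest_pos, rest_frames, weights, bone_T):
    """Pose sensors for one frame with linear blend skinning.

    rest_pos    (S, 3)     rest-pose sensor positions
    rest_frames (S, 3, 3)  rest-pose tangent frames (columns are the axes)
    weights     (S, B)     skinning weights, each row summing to 1
    bone_T      (B, 4, 4)  per-bone rigid transforms for this frame
    """
    blended = np.einsum("sb,bij->sij", weights, bone_T)        # (S, 4, 4)
    ones = np.ones((len(rest_pos), 1))
    homo = np.concatenate([rest_pos, ones], axis=1)            # homogeneous coords
    pos = np.einsum("sij,sj->si", blended, homo)[:, :3]
    frames = blended[:, :3, :3] @ rest_frames
    # Assumption: snap each blended frame to the nearest orthogonal matrix.
    U, _, Vt = np.linalg.svd(frames)
    return pos, U @ Vt

# One bone rotating 90 degrees about z and translating by +1 along x.
rot_z90 = np.array([[0, -1, 0, 1],
                    [1,  0, 0, 0],
                    [0,  0, 1, 0],
                    [0,  0, 0, 1]], dtype=float)
pos, frames = lbs_update(np.array([[1.0, 0.0, 0.0]]), np.eye(3)[None],
                         np.array([[1.0]]), rot_z90[None])
```

In the example, the sensor at `(1, 0, 0)` moves to `(1, 1, 0)` and its frame becomes the bone's rotation.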
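For each “observation” sensor $s_i$ and each of its $K$ “target” sensors $s_j$ (the nearest and farthest sensors), the following relative vector in the local tangent frame is computed:
$$d_{ij}^t = (R_i^t)^\top \left( p_j^t - p_i^t \right)$$
where $p_i^t$ and $R_i^t$ denote the position and tangent frame of sensor $s_i$ at frame $t$.
A sparse binary mask $M^t$ selects valid (observation, target) pairs, reducing the computational cost. The per-frame DMI $D^t$ is thus:
$$D^t = \{\, d_{ij}^t \mid M_{ij}^t = 1 \,\}$$
Stacking temporally yields the full DMI field $D = [D^1, \dots, D^T]$.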
By aligning the DMI fields of the source and target, MeshRet enforces the preservation of dense inter-sensor relations, effectively encoding both contact and spatial separation.
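The per-frame DMI computation can be sketched as below, assuming the relative vector is the target offset expressed in the observation sensor's tangent frame. The split into `k_near` nearest and `k_far` farthest targets, the `valid` flag per sensor, and the function name are illustrative assumptions.

```python
import numpy as np

def dmi_frame(pos, frames, k_near, k_far, valid=None):
    """Per-frame DMI sketch.

    pos (S, 3) sensor positions, frames (S, 3, 3) tangent frames.
    Returns vectors (S, K, 3), target indices (S, K), and a boolean mask (S, K),
    where K = k_near + k_far.
    """
    S = len(pos)
    assert k_near + k_far <= S - 1, "not enough other sensors to pick from"
    diff = pos[None, :, :] - pos[:, None, :]            # diff[i, j] = p_j - p_i
    local = np.einsum("iab,ija->ijb", frames, diff)     # R_i^T (p_j - p_i)
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                      # a sensor is never its own target
    order = np.argsort(dist, axis=1)                    # self sorts into the last column
    targets = np.concatenate(
        [order[:, :k_near], order[:, S - 1 - k_far:S - 1]], axis=1)
    rows = np.arange(S)[:, None]
    vectors = local[rows, targets]
    if valid is None:
        valid = np.ones(S, dtype=bool)
    mask = valid[:, None] & valid[targets]              # both endpoints must exist
    return vectors, targets, mask

# Three sensors on a line with identity frames.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
frames = np.tile(np.eye(3), (3, 1, 1))
vectors, targets, mask = dmi_frame(pos, frames, k_near=1, k_far=1)
```

Because every vector is expressed in a frame attached to the surface, the result is unchanged by a global rotation or translation of the character, which is what makes the field comparable between a source and a target. Applying `dmi_frame` to every frame and stacking the outputs with `np.stack` gives the temporal field.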
4. Training Objective and Optimization
MeshRet is trained with unsupervised, cycle-consistency-inspired loss functions as paired motion data are generally unavailable for heterogeneous mesh topologies. The optimization targets include:
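- DMI Consistency Loss: Ensures geometric interactions are faithfully transferred,
$$\mathcal{L}_{\text{DMI}} = \sum_{t} \sum_{i,j} M_{ij}^t \,\bigl\lVert \hat{d}_{ij}^{\,t} - d_{ij}^{\,t} \bigr\rVert$$
where $d_{ij}^t$ and $\hat{d}_{ij}^t$ are the tangent-space relative vectors between sensors $i$ and $j$ at frame $t$ for the source and retargeted motion, and $M$ is an existence mask.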
- Reconstruction Loss: Promotes pose stability by penalizing deviation of the retargeted pose from the source pose.
- Adversarial Loss: Employs a discriminator to enforce plausibility via a motion prior.
- End-Effector Orientation Loss: Encourages preservation of orientation for semantic end-effectors.
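The overall objective is a weighted sum:
$$\mathcal{L} = \lambda_{\text{DMI}}\,\mathcal{L}_{\text{DMI}} + \lambda_{\text{rec}}\,\mathcal{L}_{\text{rec}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}} + \lambda_{\text{ee}}\,\mathcal{L}_{\text{ee}}$$
where the $\lambda$ coefficients weight the DMI consistency, reconstruction, adversarial, and end-effector orientation terms, respectively.
This loss formulation encourages the transferred target motion to stay faithful to the source pose while preserving both local and global mesh interaction characteristics.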
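As a rough sketch of how the masked DMI term and the weighted sum fit together, the snippet below uses a squared Euclidean distance averaged over valid pairs. The distance, the normalization, the term names, and all numeric values are assumptions chosen for the example; the source does not specify them.

```python
import numpy as np

def dmi_consistency_loss(d_src, d_tgt, mask):
    """Masked mean squared difference between two DMI fields of shape (T, S, K, 3)."""
    sq_err = np.sum((d_src - d_tgt) ** 2, axis=-1)      # (T, S, K)
    mask = mask.astype(float)
    # Pairs with mask == 0 contribute nothing, however large their error is.
    return float((sq_err * mask).sum() / max(mask.sum(), 1.0))

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Toy field: 1 frame, 2 observation sensors, 2 targets each.
d_src = np.zeros((1, 2, 2, 3))
d_tgt = np.ones((1, 2, 2, 3))
mask = np.array([[[1, 0], [1, 0]]])
l_dmi = dmi_consistency_loss(d_src, d_tgt, mask)

# Placeholder values for the remaining terms and for every weight.
terms = {"dmi": l_dmi, "rec": 2.0, "adv": 4.0, "ee": 1.0}
weights = {"dmi": 1.0, "rec": 0.5, "adv": 0.25, "ee": 2.0}
loss = total_loss(terms, weights)
```

In a trainable setting the same expressions would be written in an autodiff framework so that gradients flow back to the retargeting network.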
5. Network Architecture
MeshRet’s architecture incorporates both deterministic geometry processing modules and trainable deep networks:
- SCS and FK Extraction: Computed deterministically from input mesh and skeleton.
- DMI Encoder: Each of the $S$ local sensor point-clouds ($K$ points each, one point per target sensor) is embedded by a per-sensor 6-layer PointNet, producing one feature vector per sensor. Aggregation across sensors forms a per-frame feature.
- Geometry Encoder: A PointNet over static SCS features yields global geometry embeddings for both source and target.
- Retargeting Network: An 8-layer Transformer encoder (4 heads) operates on concatenated frame-wise features and target geometry. A Transformer decoder with cross-attention incorporates sequence pose and source geometry. The output is the retargeted motion on the target mesh.
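The core idea behind the DMI encoder, a shared per-point MLP followed by a symmetric pooling operation, can be sketched in a few lines. This toy version uses two layers with random weights shared across sensors and aggregates sensors by concatenation; the layer count, sizes, weight sharing, and aggregation rule are simplifying assumptions and differ from the 6-layer per-sensor design described above.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small MLP; sizes like [3, 16, 32]."""
    return [(rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def pointnet_embed(points, layers):
    """points (..., K, 3) -> (..., F): shared MLP per point, then max-pool over K."""
    h = points
    for W, b in layers:
        h = np.maximum(h @ W + b, 0.0)      # linear layer followed by ReLU
    return h.max(axis=-2)                   # order-independent pooling over points

def frame_feature(dmi, layers):
    """dmi (S, K, 3) -> flat per-frame feature of length S * F."""
    per_sensor = pointnet_embed(dmi, layers)            # (S, F)
    return per_sensor.reshape(-1)                       # assumed: concatenation

rng = np.random.default_rng(0)
layers = init_mlp(rng, [3, 16, 32])
dmi = rng.normal(size=(5, 4, 3))            # 5 sensors, 4 target points each
feature = frame_feature(dmi, layers)        # shape (160,)
```

Max-pooling makes each sensor embedding independent of the order in which its target points are listed, while concatenation keeps the sensor order, which is meaningful here because sensor indices are semantically aligned across characters.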
Frame-to-frame alignment is enforced with explicit attention masking, ensuring spatio-temporal correspondence.
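One plausible reading of this masking, shown here purely as an assumption, is a banded attention mask in which each output frame may only attend to input frames within a small temporal window. The function names and the `window` parameter are hypothetical.

```python
import numpy as np

def frame_aligned_mask(num_frames, window=0):
    """Boolean (T, T) mask: query frame t may attend to key frames within +/- window."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(q, k, v, allowed):
    """Single-head scaled dot-product attention restricted to allowed pairs."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)          # disallowed pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True) # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                              # 4 frames, 8 features
strict = masked_attention(x, x, x, frame_aligned_mask(4, window=0))
```

With `window=0` every frame attends only to its own time step, so `strict` equals `x`; a wider window lets neighbouring frames contribute while still ruling out unrelated parts of the sequence.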
6. Handling Self-Interpenetration and Contact
Rather than operating on mesh vertices directly, MeshRet constrains the tangent-space relative vectors $d_{ij}^t$ between sensor pairs, which penalizes self-intersection implicitly. Penetration alters the sign and/or magnitude of these vectors, thereby increasing the DMI consistency loss. Contact preservation emerges because sensor pairs that are in contact in the source motion (distance below a threshold) must remain close for the loss to stay small. As a result, MeshRet unifies contact and collision-avoidance behavior through its geometric interaction alignment mechanism rather than handling them in separate post-processing stages.
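A small numeric illustration of the sign argument, under the assumed convention that the third column of a tangent frame is the outward surface normal: a target just outside the surface and one that has pushed through it lie at the same distance but have opposite normal components, so a retargeted pose that penetrates is penalized when compared with a source pose that merely touches.

```python
import numpy as np

def local_vector(p_obs, frame_obs, p_tgt):
    """Target offset expressed in the observation sensor's tangent frame."""
    return frame_obs.T @ (p_tgt - p_obs)

# Observation sensor at the origin; columns are tangent, bitangent, normal (+x).
frame = np.array([[0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
p_obs = np.zeros(3)

touching = local_vector(p_obs, frame, np.array([0.02, 0.0, 0.0]))      # just outside
penetrating = local_vector(p_obs, frame, np.array([-0.02, 0.0, 0.0]))  # pushed through

same_distance = np.isclose(np.linalg.norm(touching), np.linalg.norm(penetrating))
penalty = float(np.sum((touching - penetrating) ** 2))  # nonzero despite equal distance
```

A purely distance-based contact term would treat the two cases as identical, whereas the frame-relative vector distinguishes them.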
7. Evaluation and Empirical Results
Experiments are conducted on the Mixamo dataset (3,675 motion clips, 13 characters) and the ScanRet dataset (8,298 real motion clips on 100 scanned meshes). The metrics are global and local mean squared error (MSE, normalized by character height), contact error (the mean squared increase for contacting pairs), and penetration percentage (the per-frame rate of interpenetrating arm vertices).
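The two simpler metrics can be approximated as below. The exact definitions used in the evaluation are not given here, so this sketch assumes that "local" means root-relative joint positions (root at index 0) and that contact error is measured on distances between pairs that are in contact in the reference motion. Function names and the threshold are hypothetical.

```python
import numpy as np

def normalized_mse(pred, target, height, local=False):
    """MSE between joint positions (T, J, 3) after dividing by character height."""
    if local:
        # Assumed meaning of "local": positions relative to the root joint.
        pred = pred - pred[:, :1]
        target = target - target[:, :1]
    return float(np.mean(((pred - target) / height) ** 2))

def contact_error(ref_dist, out_dist, threshold):
    """Mean squared distance increase over pairs in contact in the reference.

    ref_dist, out_dist: (T, P) pairwise distances for the same P vertex pairs.
    """
    in_contact = ref_dist < threshold
    if not in_contact.any():
        return 0.0
    increase = np.maximum(out_dist - ref_dist, 0.0)[in_contact]
    return float(np.mean(increase ** 2))

target = np.zeros((2, 3, 3))
pred = target + 0.1                                     # uniform global drift
mse_global = normalized_mse(pred, target, height=2.0)
mse_local = normalized_mse(pred, target, height=2.0, local=True)
c_err = contact_error(np.array([[0.01, 0.5]]), np.array([[0.03, 0.9]]), threshold=0.02)
```

The example shows why both MSE variants are reported: a uniform drift of the whole body raises the global error but leaves the root-relative error at zero.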
| Method | MSE↓ | MSE$^{lc}$↓ | Contact↓ | Penetration (%)↓ |
|---|---|---|---|---|
| Copy | 0.026 | 0.006 | 1.702/0.387 | 5.26/2.16 |
| PMnet | 0.130 | 0.029 | 2.716/0.890 | 5.23/2.23 |
| SAN | 0.049 | 0.011 | 2.432/0.627 | 4.95/1.72 |
| R²ET | 0.063 | 0.017 | 2.209/0.589 | 4.21/2.01 |
| Ours | 0.047 | 0.009 | 0.772/0.284 | 3.45/1.59 |
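On both Mixamo and ScanRet, MeshRet attains the lowest contact error and penetration rate among the compared methods, with evaluation covering the test splits UC+UM, UC+SM, SC+UM, and SC+SM. In the table above, contact error falls from 1.702/0.387 for the best baseline (Copy) to 0.772/0.284, and penetration falls from the best baseline values of 4.21 (R²ET) and 1.72 (SAN) to 3.45/1.59. Pairwise user studies (600 trials) show a preference exceeding 80% for MeshRet on semantic preservation, contact accuracy, and overall quality (Ye et al., 2024).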
8. Principal Contributions and Impact
MeshRet’s primary innovation is the introduction of the dense mesh interaction field constructed atop semantically consistent sensors, enabling a single-stage, end-to-end optimization of skinned motion retargeting that preserves detailed geometric relations. By directly aligning DMI fields, it achieves skeleton- and geometry-aware transfer, high-fidelity contact preservation, and automatic avoidance of mesh interpenetration in a unified neural architecture. This framework establishes new benchmarks for contact and collision handling in retargeting tasks and provides a template for future approaches leveraging dense geometric cues in human motion processing (Ye et al., 2024).