MeshRet: Unified Motion Retargeting Framework
- The paper presents an end-to-end framework that retargets skinned character motions while optimizing dense geometric interactions.
- It replaces separate skeletal semantics and geometry correction with a unified sensor-based approach that minimizes self-interpenetration and contact errors.
- Empirical results demonstrate state-of-the-art performance on synthetic and real-scan datasets with significant improvements in contact accuracy and penetration metrics.
MeshRet is a unified skinned character motion retargeting framework that directly models dense geometric interactions between body parts through a spatio-temporal field representation. Unlike standard two-stage retargeting pipelines that separately handle skeletal semantics and geometry correction—often leading to conflicts manifesting as jitter, interpenetration, and contact errors—MeshRet achieves motion retargeting and geometric relationship preservation in an end-to-end manner by learning to align dense interaction statistics. The approach enables motion retargeting across diverse mesh topologies while minimizing both self-interpenetration and contact mismatch, and demonstrates state-of-the-art results on both synthetic (Mixamo) and real-scan (ScanRet) datasets (Ye et al., 2024).
1. Pipeline and Input Representation
The MeshRet pipeline takes as input a source motion sequence made up of root joint translations and 6D joint rotations over a sequence of frames. Source and target template geometries encode mesh vertices and rest-pose skeletons. The core stages are:
- Semantically Consistent Sensor (SCS) extraction: sensors are extracted on the source character, and likewise on the target character.
- Sensor forward kinematics and Dense Mesh Interaction (DMI) field construction: the source motion and SCS locations are mapped to a DMI field describing spatio-temporal interactions.
- Retargeting network: a transformer encoder/decoder predicts the target motion sequence. Unlike prior methods, motion semantics and geometric interactions are optimized simultaneously.
This framework obviates the need for post-hoc collision or contact correction by redefining the retargeting target: the preservation of the dense geometric interaction field itself.
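The 6D joint rotations used as input can be turned into rotation matrices with a Gram-Schmidt step, which is the standard construction for this representation. The sketch below is illustrative only: the function name `rot6d_to_matrix` and the column layout (first three numbers form the first column, last three an unnormalized second column) are assumptions, not details taken from the paper.

```python
import numpy as np

def rot6d_to_matrix(r6: np.ndarray) -> np.ndarray:
    """Convert (..., 6) 6D rotations to (..., 3, 3) rotation matrices.

    Assumed layout: r6[..., :3] is the first column of the matrix and
    r6[..., 3:] is an unnormalized second column.
    """
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    a2_orth = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth, axis=-1, keepdims=True)
    # The third column completes a right-handed orthonormal frame.
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)
```

Because the function broadcasts over leading axes, a motion stored as an array of shape (frames, joints, 6) maps to (frames, joints, 3, 3) in a single call.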
2. Semantically Consistent Sensors (SCS)
SCS are dense sets of semantically aligned sample points placed on both source and target meshes regardless of surface topology. Each sensor is parameterized by a semantic triplet:
- Bone index: identifies the skeleton bone that serves as the sensor's medial axis.
- Normalized offset: the position along that bone.
- Ray angle: the direction within the local plane orthogonal to that bone.
Each sensor position is found by casting a ray from the offset point on the bone, in the direction given by the ray angle, and recording its intersection with the mesh surface belonging to that bone. Each sensor yields a tuple consisting of the 3D intersection point and a tangent-space basis. Because the same triplets index the sensors on both characters, dense, semantically aligned correspondences are maintained even between dissimilar topologies.
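The ray-casting step can be sketched as follows, as a minimal illustration and not the authors' implementation. It uses the Möller-Trumbore ray-triangle test. The names `ray_triangle` and `place_sensor`, the choice of zero-angle direction within the orthogonal plane, and keeping the nearest hit are all assumptions; the tangent-space basis is omitted for brevity.

```python
import numpy as np

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: hit distance t >= 0 along the ray, or None."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = e1 @ p
    if abs(det) < eps:                 # ray is parallel to the triangle plane
        return None
    s = origin - v0
    u = (s @ p) / det
    q = np.cross(s, e1)
    v = (direction @ q) / det
    t = (e2 @ q) / det
    if u < 0 or v < 0 or u + v > 1 or t < 0:
        return None                    # outside the triangle or behind the origin
    return t

def place_sensor(head, tail, offset, angle, vertices, faces):
    """Place one sensor for a bone given by its head and tail joint positions.

    offset is the normalized position along the bone (0 = head, 1 = tail);
    angle is measured in the plane orthogonal to the bone (assumed convention).
    """
    axis = (tail - head) / np.linalg.norm(tail - head)
    origin = head + offset * (tail - head)
    # Build an orthonormal basis (u, w) of the plane orthogonal to the bone.
    helper = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(axis, helper)
    u /= np.linalg.norm(u)
    w = np.cross(axis, u)
    direction = np.cos(angle) * u + np.sin(angle) * w
    hits = [ray_triangle(origin, direction, *vertices[f]) for f in faces]
    hits = [t for t in hits if t is not None]
    if not hits:
        return None                    # the ray misses the supplied faces
    return origin + min(hits) * direction   # nearest intersection (assumption)
```

Running the same triplets through `place_sensor` on two different meshes would give point sets that correspond index by index, which is the property the sensors rely on. A practical version would restrict `faces` to those associated with the bone and use an accelerated ray caster.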
3. Dense Mesh Interaction Field and Sparsification
The DMI field encodes relative geometric relationships between SCS points over time:
- For each pair of sensors at each frame, the field stores a feature describing their relative geometric relationship.
- The full DMI field collects these features over all sensor pairs and all frames.
- For computational tractability, pairs are sparsified: a subset of observation sensors is selected, and each is paired with a fixed number of associated target sensors (half nearest, half furthest), yielding a much smaller final representation.
- The field can also be viewed as continuous, via weighted aggregation over SCS.
Sparsification preserves key contact and farfield interactions, ensuring critical relationships (e.g., hand-to-body, foot-to-ground) are prioritized in the retargeting process.
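The half-nearest, half-furthest sparsification can be sketched for a single frame as below. This is a hedged illustration: the function name `sparse_dmi`, the choice of feature (the offset between two sensors projected onto the observation sensor's tangent frame), and passing the observation indices in from outside are assumptions, not details confirmed by the source.

```python
import numpy as np

def sparse_dmi(pos, basis, obs, k):
    """Sparse pairwise features for one frame.

    pos:   (S, 3) sensor positions.
    basis: (S, 3, 3) per-sensor frames; rows are the frame axes (assumption).
    obs:   (n_obs,) indices of the observation sensors.
    k:     targets per observation sensor; requires k <= S - 1.
    Returns (nbr, feat) with shapes (n_obs, k) and (n_obs, k, 3).
    """
    S, n_obs = pos.shape[0], len(obs)
    offsets = pos[None, :, :] - pos[obs][:, None, :]      # (n_obs, S, 3)
    dist = np.linalg.norm(offsets, axis=-1)
    dist[np.arange(n_obs), obs] = np.inf                  # push self to the end
    order = np.argsort(dist, axis=1)
    n_near = k // 2
    n_far = k - n_near
    near = order[:, :n_near]                              # closest sensors
    far = order[:, S - 1 - n_far:S - 1]                   # furthest, skipping self
    nbr = np.concatenate([near, far], axis=1)
    picked = offsets[np.arange(n_obs)[:, None], nbr]      # (n_obs, k, 3)
    # Express each offset in the observation sensor's local frame.
    feat = np.einsum("oab,okb->oka", basis[obs], picked)
    return nbr, feat
```

Stacking the per-frame features over time gives an array of shape (frames, n_obs, k, 3), far smaller than a field over all sensor pairs.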
4. Loss Functions and Training Objectives
MeshRet is trained without reference to ground-truth target motions (unsupervised for target), using the following loss terms:
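- Pose regularization (reconstruction): computed against the original source motion.
- DMI consistency (geometry interaction preservation): penalizes differences between the DMI field of the source motion and that of the predicted target motion.
- Adversarial loss (motion plausibility).
- End-effector orientation loss.
- Total loss: combines the four terms above.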
All losses, except the adversarial and end-effector terms, are computed through DMI or the original source motion, allowing the framework to learn to synthesize plausible target motions that preserve fine-grained geometric relationships.
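How such terms might be combined can be sketched as follows. Everything specific here is an assumption made for illustration: the squared-error form of the reconstruction and DMI terms, the weighted sum, the weight values, and the names `mean_squared_error` and `total_loss`. The adversarial and end-effector terms are passed in as precomputed scalars because their exact form is not given here.

```python
import numpy as np

def mean_squared_error(a, b):
    """Mean squared difference between two equally shaped arrays."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def total_loss(src_pose, pred_pose, src_dmi, pred_dmi, adv_term, ee_term, weights):
    """Weighted sum of four loss terms (hypothetical form and weights)."""
    terms = {
        "rec": mean_squared_error(pred_pose, src_pose),   # stay close to source pose
        "dmi": mean_squared_error(pred_dmi, src_dmi),     # preserve interaction field
        "adv": float(adv_term),                           # supplied by a discriminator
        "ee": float(ee_term),                             # end-effector orientation
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

Returning the individual terms alongside the total makes it easy to log each contribution during training.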
5. Network Architecture
MeshRet employs a combination of PointNet and Transformer modules:
- SCS Geometry Encoder: PointNet-style aggregation over sensor features produces a global geometry encoding.
- DMI Encoder: A two-stage PointNet pipeline:
  - Per-sensor: the point cloud of DMI features attached to each observation sensor is encoded into a per-sensor feature.
  - Per-frame: aggregating across the observation sensors yields a per-frame feature.
- Transformer Retargeting Network: The encoder receives DMI and target geometry, and the decoder receives source joint pose and geometry. Configuration: 8 transformer layers, 4 attention heads, and feed-forward size 256. A specialized attention mask couples each target frame prediction to its corresponding DMI representation, ensuring temporal and spatial coherence.
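The two-stage PointNet idea (a shared per-point layer followed by a symmetric pooling, applied first within each sensor and then across sensors) can be sketched in NumPy as below. The single-layer MLPs, ReLU activation, max pooling, layer widths, and the names `shared_mlp` and `encode_dmi` are assumptions chosen to keep the example small; they are not the paper's configuration.

```python
import numpy as np

def shared_mlp(x, w, b):
    """Apply the same linear layer + ReLU to every point (last axis = features)."""
    return np.maximum(x @ w + b, 0.0)

def encode_dmi(dmi, w1, b1, w2, b2):
    """Two-stage PointNet-style encoding.

    dmi: (T, n_obs, k, 3) sparse interaction features.
    Returns a (T, d2) array with one feature vector per frame.
    """
    per_point = shared_mlp(dmi, w1, b1)            # (T, n_obs, k, d1)
    per_sensor = per_point.max(axis=2)             # pool over the k targets
    per_sensor = shared_mlp(per_sensor, w2, b2)    # (T, n_obs, d2)
    return per_sensor.max(axis=1)                  # pool over observation sensors

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 32)), np.zeros(32)
frames = encode_dmi(rng.normal(size=(5, 8, 4, 3)), w1, b1, w2, b2)
```

Because max pooling is symmetric, the per-frame feature does not depend on the order in which targets or observation sensors are listed, which suits an unordered set of sensor pairs.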
6. Contact Preservation and Self-Interpenetration Avoidance
DMI-driven loss terms enforce fidelity in both contact and near-contact relationships. For any sensor pair, large changes in vector direction or shrinkage (potentially reflecting mesh collision or loss of contact) are penalized via the DMI consistency loss. For contact pairs (identified by a distance threshold defined relative to arm diameter), this forces the model to preserve semantic and physical plausibility. No explicit mesh collision or contact penalty is required; preserving the DMI field is sufficient to implicitly avoid most self-intersections and maintain correct contacts.
A plausible implication is that, due to the explicit modeling of pairwise geometric relationships in sensor-tangent space, MeshRet generalizes robustly across both synthetic and real-world mesh scanning domains, where geometry topology can be highly variable.
7. Empirical Results and Benchmarks
MeshRet was evaluated quantitatively and qualitatively on the Mixamo+ (cartoon+ScanRet) and ScanRet datasets. Key metrics include contact error, penetration percentage, and joint mean squared error (MSE). Table 1 summarizes core results.
| Metric | Ours | PMnet | SAN | R²ET |
|---|---|---|---|---|
| Contact Error (Mixamo+) | 0.772 | 2.716 | 2.432 | 2.209 |
| Contact Error (ScanRet) | 0.284 | 0.890 | 0.627 | 0.589 |
| Penetration % (Mixamo+) | 3.45 | 5.23 | 4.95 | 4.21 |
| Penetration % (ScanRet) | 1.59 | 2.23 | 1.72 | 2.01 |
| Joint MSE (ScanRet) | 0.047 | 0.130 | 0.049 | 0.063 |
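The size of the contact-error gap in Table 1 can be checked with a few lines of arithmetic. The sketch below copies the contact-error rows from the table (with the "Ours" column labeled MeshRet) and computes the relative reduction against the best baseline in each row; the helper name `reduction_vs_best_baseline` is hypothetical.

```python
# Contact-error rows copied from Table 1.
contact_error = {
    "Mixamo+": {"MeshRet": 0.772, "PMnet": 2.716, "SAN": 2.432, "R2ET": 2.209},
    "ScanRet": {"MeshRet": 0.284, "PMnet": 0.890, "SAN": 0.627, "R2ET": 0.589},
}

def reduction_vs_best_baseline(row, ours="MeshRet"):
    """Percent reduction of our value relative to the lowest baseline value."""
    best = min(value for name, value in row.items() if name != ours)
    return 100.0 * (best - row[ours]) / best

reductions = {name: reduction_vs_best_baseline(row) for name, row in contact_error.items()}
```

On these numbers the reduction relative to the strongest baseline (R²ET in both rows) is roughly 65% on Mixamo+ and 52% on ScanRet.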
User studies (600 comparisons) reported approximately 81% preference for MeshRet across overall motion quality, contact accuracy, and semantics preservation.
References
"Skinned Motion Retargeting with Dense Geometric Interaction Perception" (Ye et al., 2024)