Scalable Physics-Informed DExterous Retargeting (SPIDER)
- The paper introduces a scalable framework that retargets large-scale human motion data into physics-compliant, dynamically feasible trajectories for dexterous robots and humanoids.
- The methodology combines constrained trajectory optimization with a sampling-based approach and curriculum-style virtual contact guidance to enforce accurate contact dynamics.
- Experiments demonstrate substantial improvements in retargeting success and computational efficiency, enabling rapid, scalable dataset generation and effective policy learning.
Scalable Physics-Informed DExterous Retargeting (SPIDER) is a trajectory optimization framework designed to address the challenge of translating large-scale, kinematic-only human demonstrations into dynamically feasible and contact-correct trajectories for dexterous robots and humanoids. By leveraging abundant human motion data from sources such as motion capture and video, SPIDER efficiently bridges the embodiment and dynamics gap between human demonstrators and robots, facilitating scalable dataset generation and policy learning across diverse robotic platforms.
1. Mathematical Formulation and Optimization
SPIDER frames retargeting as a constrained trajectory optimization problem with strict compliance to robot dynamics and contact mechanics. Given a human demonstration, the goal is to find a sequence of robot control commands whose resulting state trajectory both closely tracks the reference kinematics and obeys the physics constraints:

$$\min_{u_{0:T-1}} \; \sum_{t=0}^{T-1} \Big( \lVert x_t - x_t^{\mathrm{ref}} \rVert_Q^2 + \lVert u_t \rVert_R^2 \Big) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t)$$

Here, $x_t$ encodes the robot state (joint angles, velocities, object pose, contact forces), $u_t$ the applied torques, and $x_t^{\mathrm{ref}}$ the kinematic reference mapped from the human demonstration. The cost combines time-indexed trajectory tracking and control effort, weighted by $Q$ and $R$. The dynamics rollout $f$ can be realized by high-fidelity simulators (e.g., MuJoCo), which enforce contact complementarity, friction-cone constraints, and actuator limits.
Human reference trajectories are mapped onto the robot via inverse kinematics, which resolves end-effector and object keypoint correspondences explicitly to produce the reference $x_t^{\mathrm{ref}}$.
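To make the objective concrete, here is a minimal Python sketch of the tracking-plus-effort cost evaluated along a simulated rollout. The `step_fn` callable, the diagonal weights `Q` and `R`, and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def rollout_cost(u_seq, x0, x_ref, step_fn, Q, R):
    """Tracking-plus-effort cost for one candidate control sequence (sketch).

    step_fn(x, u) -> next state stands in for a physics simulator step
    (e.g., a MuJoCo rollout); Q and R are diagonal weight vectors.
    """
    x, cost = x0, 0.0
    for t, u in enumerate(u_seq):
        x = step_fn(x, u)                      # physics rollout enforces contacts and limits
        err = x - x_ref[t]                     # time-indexed tracking error
        cost += err @ (Q * err) + u @ (R * u)  # weighted tracking + control effort
    return cost
```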
2. Physics-Based Sampling and Virtual Contact Guidance
Due to the highly non-convex nature of multi-contact trajectory optimization, SPIDER employs a parallelized, sampling-based optimizer reminiscent of MPPI/CEM, with time- and iteration-annealed noise schedules. At each optimization iteration $i$, candidate control sequences are sampled around the current mean $U^i$ with a covariance whose scale decays over both the iteration index $i$ and the horizon time $t$:

$$U_j = U^i + \epsilon_j, \qquad \epsilon_j \sim \mathcal{N}\big(0, \Sigma^{i}_{t}\big)$$

The weighted average of the noise particles, using Boltzmann weights $w_j \propto \exp(-J_j / \lambda)$ over the rollout costs $J_j$, updates the control sequence iteratively. Early stopping is triggered if the cost improvement falls below a set tolerance.
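The following is a minimal sketch of one such annealed-sampling update, assuming a generic `cost_fn` that rolls out a candidate and returns its scalar cost; all constants and names are illustrative, not the paper's values.

```python
import numpy as np

def annealed_update(U_mean, cost_fn, i, n_particles=64, sigma0=0.3,
                    alpha=0.9, beta=0.98, temperature=1.0,
                    rng=np.random.default_rng(0)):
    """One MPPI/CEM-style iteration with iteration- and time-annealed noise (sketch)."""
    T, nu = U_mean.shape
    # Noise scale shrinks with the iteration index i and with horizon time t.
    sigma = sigma0 * (alpha ** i) * (beta ** np.arange(T))[:, None]
    noise = rng.normal(size=(n_particles, T, nu)) * sigma
    candidates = U_mean[None] + noise                 # sampled control sequences
    costs = np.array([cost_fn(U) for U in candidates])
    # Boltzmann weighting: lower rollout cost -> larger weight.
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return U_mean + np.einsum("j,jtk->tk", w, noise)  # weighted-average update
```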
To bias sampling towards human-intended contact modes, a curriculum-style virtual contact guidance is integrated. Reference contact pairs are extracted from the demonstration, and a quadratic penalty keeps each in-contact robot link's position relative to the object within a decreasing tolerance $\eta_i$:
$c_{k,t} \, \bigl\| {}^{\mathrm{r}}p_{k,t}^{o} - {}^{\mathrm{r}}p_{k,t}^{o,\mathrm{ref}} \bigr\|^2 \le \eta_i$
where $c_{k,t}$ is a binary indicator that activates the constraint when the demonstration signals contact between link $k$ and the object at time $t$. Because $\eta_i$ shrinks over iterations, the penalty bites harder later in the optimization, acting as a curriculum that transitions from broad exploration to tightly guided exploitation, with dynamic activation and filtering to prevent guidance drift caused by spurious or brief contacts.
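A minimal sketch of this guidance term, under assumed array shapes and a hypothetical penalty weight, might look as follows:

```python
import numpy as np

def contact_guidance_penalty(p_rel, p_rel_ref, contact_flags, eta_i, weight=100.0):
    """Curriculum-style virtual contact guidance cost (illustrative sketch).

    p_rel:         (T, K, 3) positions of each guided link relative to the object.
    p_rel_ref:     (T, K, 3) demonstrated relative positions.
    contact_flags: (T, K) binary indicators c_{k,t} from the demonstration.
    eta_i:         tolerance at iteration i; shrinks as the curriculum tightens.
    """
    sq_dist = np.sum((p_rel - p_rel_ref) ** 2, axis=-1)   # squared deviation per link/time
    violation = np.maximum(sq_dist - eta_i, 0.0)          # no cost inside the tolerance
    return weight * np.sum(contact_flags * violation)     # only demonstrated contacts count
```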
3. Dynamical Feasibility and Contact Alignment
Upon convergence, the optimized trajectories undergo further checks and refinement to guarantee physical and contact plausibility:
- Penetration Check: Ensures all signed distance functions remain non-negative, i.e., no penetrating contacts.
- Torque and Friction Validation: Ensures actuation respects joint torque limits and contact wrenches remain within prescribed friction cones.
- Contact-Mode Correction: If simulation-derived contact indicators diverge from the demonstration's reference indicators $c_{k,t}$, SPIDER locally re-optimizes the problematic segments with a tighter guidance penalty to realign contact sequences.
This two-stage process—global sampling followed by local contact-mode refinement—resolves the disconnect between human-demonstrated and robot-executable contact transitions, which is critical for skill transfer involving intricate object manipulation or whole-body coordination.
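The first two checks amount to simple post-hoc tests over the rollout; here is a sketch under assumed array layouts and a hypothetical numerical tolerance:

```python
import numpy as np

def check_feasibility(sdf_values, torques, torque_limits,
                      normal_forces, tangential_forces, mu, tol=1e-4):
    """Post-hoc feasibility checks on an optimized trajectory (illustrative sketch).

    sdf_values:        signed distances for all contact pairs over time (>= 0 means no penetration).
    torques:           (T, n) applied joint torques; torque_limits: (n,) actuator bounds.
    normal_forces:     (T, K) contact normal force magnitudes.
    tangential_forces: (T, K, 2) tangential force components; mu: friction coefficient.
    """
    no_penetration = np.all(sdf_values >= -tol)
    within_torque_limits = np.all(np.abs(torques) <= torque_limits + tol)
    inside_friction_cones = np.all(
        np.linalg.norm(tangential_forces, axis=-1) <= mu * normal_forces + tol)
    return bool(no_penetration and within_torque_limits and inside_friction_cones)
```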
4. Implementation, Scalability, and Efficiency
SPIDER is robot- and task-agnostic, demonstrated on five dexterous hands (e.g., Allegro 16-DOF, XHand/Inspire/Ability 12-DOF, Schunk 20-DOF) and four humanoid platforms (Unitree G1, H1-2, Fourier N1, Booster T1), covering object manipulation and whole-body tasks. Physics rollout employs MuJoCo Warp (100 Hz simulation, 50 Hz control), Genesis, or Isaac Gym, executing on a single GPU.
Typical hyperparameters include a 1.2 s planning horizon, a fixed budget of sampled particles and optimization iterations, annealing constants for the noise schedule, a cost temperature, and an early-stopping tolerance. On NVIDIA H100/RTX 4090 GPUs, retargeting throughput reaches 2.5 FPS (Oakink) and 3 Hz (GigaHands), one to two orders of magnitude faster than RL-based trajectory generation (e.g., ManipTrans, DexMachina at 0.1–0.2 Hz).
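As a rough illustration of how such a configuration could be bundled, the sketch below records the horizon and rate settings stated above; the remaining fields are hypothetical placeholders whose actual values are not given here.

```python
from dataclasses import dataclass

@dataclass
class SpiderConfig:
    """Illustrative hyperparameter bundle; only the first three values come from the text."""
    horizon_s: float = 1.2      # planning horizon (stated above)
    sim_hz: int = 100           # MuJoCo Warp simulation rate (stated above)
    control_hz: int = 50        # control rate (stated above)
    n_particles: int = 64       # sampled particles per iteration (placeholder)
    n_iterations: int = 50      # optimization iterations (placeholder)
    noise_sigma0: float = 0.3   # initial sampling noise scale (placeholder)
    iter_decay: float = 0.9     # per-iteration annealing factor (placeholder)
    time_decay: float = 0.98    # per-timestep annealing factor (placeholder)
    temperature: float = 1.0    # Boltzmann cost temperature (placeholder)
    stop_tol: float = 1e-3      # early-stopping tolerance (placeholder)
```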
Condensed pseudocode for the full pipeline is as follows:
```
algorithm SPIDER_Retarget(x_ref[0:T]):
    initialize U^0 via inverse kinematics
    for i = 1 ... N:
        compute annealed covariance Σ^i and contact tolerance η_i
        for j = 1 ... N_W in parallel:
            U_j = U^i + noise_j                   # noise_j ~ N(0, Σ^i)
            J_j = rolloutCost(U_j, x_ref, η_i)
        U^{i+1} = weightedAverage({U_j, J_j})     # Boltzmann-weighted update
        if earlyStop(J):
            break
    refineContacts(U*, x*)                        # local contact-mode correction
    robustify(U*, x*, dynamicsVariations)
    return U*, x*
```
5. Experimental Findings and Dataset Generation
Comprehensive experiments across datasets demonstrate that SPIDER yields significant improvements in retargeting success rates and efficiency. In ablation studies on Oakink and GigaHands (13 challenging trajectories), IK-only baselines achieve 0–13% success. Standard sampling increases this to 23–64%, and annealing raises it to 40–100%. Adding contact guidance (full SPIDER) produces a further 18-percentage-point improvement averaged across hands.
SPIDER was used to retarget six datasets: GigaHands (756 trajectories), Oakink (1022), ARCTIC (7), LAFAN1, AMASS, and OMOMO. The resulting compendium comprises 262 episodes, 800 hours of simulated motion, and 2.4 million frames spanning 103 objects and 5 hands.
The table below summarizes mean retargeting success rates and compares them against RL baselines:
| Dataset (Hand) | SPIDER Success Rate (Throughput) | RL Baseline (Throughput, Success) |
|---|---|---|
| GigaHands (Allegro) | ~0.81 | – |
| GigaHands (Inspire) | 0.88 | – |
| Oakink (various hands) | ~0.46–0.48 | ManipTrans 0.1 Hz, 39.5% |
| ARCTIC | 0.42 @ 1.5 Hz | DexMachina 0.05 Hz, 67.1% |
Every retargeted episode provides temporally indexed robot and object states, contact indicators and wrenches, feedforward torque commands, and robustification seeds. This structure is specifically designed to support downstream RL policy learning.
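As one possible way to organize these fields, here is a hypothetical per-episode container; the names, shapes, and dtypes are illustrative assumptions, not the released format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RetargetedEpisode:
    """Hypothetical container mirroring the per-episode fields described above."""
    robot_states: np.ndarray           # (T, nx) joint angles, velocities, etc.
    object_states: np.ndarray          # (T, 7) object pose (position + quaternion)
    contact_indicators: np.ndarray     # (T, K) binary contact flags
    contact_wrenches: np.ndarray       # (T, K, 6) per-contact forces and torques
    feedforward_torques: np.ndarray    # (T, nu) SPIDER torque commands
    robustification_seeds: np.ndarray  # dynamics-variation seeds used for robustification
```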
For humanoid loco-manipulation, SPIDER's data enables efficient RL with residual-policy architectures. When training a residual policy to augment the SPIDER-provided feedforward torques ($\tau_t = \tau_t^{\mathrm{ff}} + \pi_\theta(s_t)$), PPO converges robustly in approximately 1 million steps, substantially faster than training from noisy human data alone. Without SPIDER guidance, direct BC-then-RL pipelines stagnate due to insufficient contact information.
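A minimal sketch of how such a residual policy could be rolled out on top of the feedforward torques, assuming a gym-style environment API and a hypothetical residual bound:

```python
import numpy as np

def rollout_with_residual(env, policy, feedforward_torques, residual_scale=0.1):
    """Execute tau_t = tau_ff_t + pi_theta(s_t) with a bounded learned residual (sketch).

    env follows a gym-like reset/step API; residual_scale keeps early training
    close to the physics-consistent reference. All names and the scale value
    are illustrative assumptions.
    """
    obs = env.reset()
    total_reward = 0.0
    for tau_ff in feedforward_torques:
        residual = np.clip(policy(obs), -residual_scale, residual_scale)
        obs, reward, done, info = env.step(tau_ff + residual)
        total_reward += reward
        if done:
            break
    return total_reward
```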
6. Real-World Deployment and Practical Implications
SPIDER-generated trajectories have been executed open-loop on physical Franka+Allegro platforms, performing manipulation tasks such as light-bulb rotation, guitar playing, spoon pick-and-place, and charger unplugging without policy fine-tuning beyond domain randomization for robustification. These outcomes validate sim-to-real transfer capabilities inherent to SPIDER’s physics-informed retargeting strategy.
In summary, SPIDER synthesizes scalable, sampling-based MPC with curriculum-informed virtual contact constraints to systematically transform human kinematic demonstrations into robust robot trajectories. Its scalability, efficiency, and generality across multiple robot embodiments and datasets represent a significant step toward closing the data gap in dexterous robotics and enabling efficient data-driven policy learning at scale.