- The paper introduces InterMimic, a teacher-student framework that distills knowledge from multiple experts into a single policy for physics-based human-object interaction.
- InterMimic achieves longer tracking durations and lower human and object errors compared to SkillMimic on datasets like BEHAVE, producing realistic and diverse interactions.
- The framework generalizes zero-shot and utilizes novel techniques like Physical State Initialization and Interaction Early Termination, enabling integration with kinematic generators for downstream tasks.
The paper introduces InterMimic, a framework designed to train physically simulated humans to perform whole-body motor skills for interacting with diverse and dynamic objects. The method addresses challenges in human-object interaction (HOI) imitation, including imperfect MoCap data, human shape variability, and the integration of diverse skills into a single policy.
The core of InterMimic is a curriculum-based teacher-student distillation framework. Multiple teacher policies are trained on smaller subsets of interaction data to mimic, retarget, and refine motion capture data. These teachers then act as online experts, providing supervision and high-quality references for a student policy. The student policy is further refined through RL fine-tuning to surpass mere imitation and achieve higher-quality solutions.
Key components and contributions include:
- Teacher-Student Training Strategy: A teacher-student framework addresses the retargeting and refinement challenges in HOI imitation. Teacher policies provide refined HOI references with a unified embodiment and enhanced physical fidelity. Multiple teacher policies are trained in parallel on smaller data subsets, and their expertise is distilled into a single student policy. The student leverages demonstration-based distillation to bootstrap PPO training, then gradually shifts toward pure RL updates (see the combined-update sketch after this list).
- Physical State Initialization (PSI): To cope with imperfect reference data, PSI maintains an initialization buffer that mixes reference states from MoCap with simulated states from prior rollouts. Each new rollout starts from a state randomly selected from this buffer. Trajectories are scored by their expected discounted reward, and those above a threshold are added to the buffer under a FIFO eviction strategy (see the buffer sketch after this list).
- Interaction Early Termination (IET): Supplements standard Early Termination (ET) with three extra checks (see the sketch after this list): (i) object points deviate from their references by more than 0.5 m on average; (ii) the weighted average distance between the character's joints and the object surface deviates from the reference by more than 0.5 m; (iii) any required body-object contact is lost for more than 10 consecutive frames.
- Embodiment-Aware Reward: Weights $\boldsymbol w_d$ are inversely proportional to the distances between the joints and the object. The reward includes cost terms for joint position $E^h_p=\langle\boldsymbol\Delta^h_p,\boldsymbol w_d\rangle$, joint rotation $E^h_\theta=\langle\boldsymbol\Delta^h_\theta,\boldsymbol 1-\boldsymbol w_d\rangle$, and interaction tracking $E_d=\langle\boldsymbol\Delta_d,\boldsymbol w_d\rangle$, where $\langle\cdot,\cdot\rangle$ is the inner product, $\boldsymbol\Delta^h_p[i]=\|\hat{\boldsymbol p}^h[i]-\boldsymbol p^h[i]\|$, $\boldsymbol\Delta^h_\theta[i]=\|\hat{\boldsymbol\theta}^h[i]\ominus\boldsymbol\theta^h[i]\|$, and $\boldsymbol\Delta_d[i]=\|\hat{\boldsymbol d}[i]-\boldsymbol d[i]\|$. A reward sketch built from these terms appears after this list.
  - $\boldsymbol w_d$: weights inversely proportional to the distances between joints and the object
  - $E^h_p$: cost term for joint position
  - $\boldsymbol\Delta^h_p$: displacement for the position variable (timestep $t$ omitted)
  - $E^h_\theta$: cost term for joint rotation
  - $\boldsymbol\Delta^h_\theta$: displacement for the rotation variable (timestep $t$ omitted)
  - $E_d$: cost term for interaction tracking
  - $\boldsymbol\Delta_d$: displacement for the interaction-tracking variable (timestep $t$ omitted)
- Policy Representation: The state $\boldsymbol s_t$ comprises two components, $\boldsymbol s_t=\{\boldsymbol s_t^s,\boldsymbol s_t^g\}$. The first part, $\boldsymbol s_t^s$, contains human proprioception and object observations, expressed as $\{\{\boldsymbol\theta_t^h,\boldsymbol p_t^h,\boldsymbol\omega_t^h,\boldsymbol v_t^h\},\{\boldsymbol\theta_t^o,\boldsymbol p_t^o,\boldsymbol\omega_t^o,\boldsymbol v_t^o\},\{\boldsymbol d_t,\boldsymbol c_t\}\}$, where $\{\boldsymbol\theta_t^h,\boldsymbol p_t^h,\boldsymbol\omega_t^h,\boldsymbol v_t^h\}$ are the rotation, position, angular velocity, and linear velocity of all joints, and $\{\boldsymbol\theta_t^o,\boldsymbol p_t^o,\boldsymbol\omega_t^o,\boldsymbol v_t^o\}$ are the orientation, position, angular velocity, and linear velocity of the object. A state-assembly sketch appears after this list.
  - $\boldsymbol s_t$: the state, which serves as input to the policy
  - $\boldsymbol s_t^s$: human proprioception and object observations
  - $\{\boldsymbol\theta_t^h,\boldsymbol p_t^h,\boldsymbol\omega_t^h,\boldsymbol v_t^h\}$: rotation, position, angular velocity, and linear velocity of all joints
  - $\{\boldsymbol\theta_t^o,\boldsymbol p_t^o,\boldsymbol\omega_t^o,\boldsymbol v_t^o\}$: orientation, position, angular velocity, and linear velocity of the object
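To make the teacher-student update concrete, here is a minimal PyTorch-style sketch of blending a DAgger-style imitation term toward online teacher actions with the clipped PPO surrogate; the function name, the `dagger_weight` annealing schedule, and the loss weighting are illustrative assumptions, not the authors' released code.

```python
import torch

def distillation_ppo_loss(student_dist, teacher_action, ratio, advantage,
                          dagger_weight, clip_eps=0.2):
    """Blend a DAgger-style imitation term with the clipped PPO surrogate.

    student_dist   : torch.distributions.Normal over the student's actions
    teacher_action : action queried online from a frozen teacher policy
    ratio          : pi_student(a|s) / pi_old(a|s) for the sampled action
    advantage      : advantage estimate for the sampled action
    dagger_weight  : annealed from ~1 (pure distillation) toward 0 (pure RL)
    """
    # Behavior-cloning term: push the student toward the teacher's action.
    bc_loss = -student_dist.log_prob(teacher_action).sum(-1).mean()

    # Standard clipped PPO surrogate computed on the student's own rollouts.
    surrogate = torch.min(
        ratio * advantage,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage,
    )
    ppo_loss = -surrogate.mean()

    # Curriculum: start dominated by distillation, shift toward RL updates.
    return dagger_weight * bc_loss + (1.0 - dagger_weight) * ppo_loss
```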
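A minimal sketch of how a PSI buffer could be maintained, following the description above; the class name, buffer capacity, reward threshold, and discount are placeholder assumptions rather than the paper's values.

```python
import random
from collections import deque

class PhysicalStateInitBuffer:
    """Physical State Initialization (PSI) sketch: initial states are drawn
    from a FIFO buffer that mixes MoCap reference states with simulated
    states collected from sufficiently good prior rollouts."""

    def __init__(self, reference_states, capacity=4096, return_threshold=0.8):
        self.buffer = deque(reference_states, maxlen=capacity)  # FIFO eviction
        self.return_threshold = return_threshold

    def sample_initial_state(self):
        # Each new rollout starts from a randomly selected buffered state.
        return random.choice(self.buffer)

    def maybe_add_rollout(self, rollout_states, rewards, gamma=0.99):
        # Score the trajectory by its expected discounted reward.
        discounted = 0.0
        for r in reversed(rewards):
            discounted = r + gamma * discounted
        # Only trajectories above the threshold seed future initializations.
        if discounted >= self.return_threshold:
            self.buffer.extend(rollout_states)
```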
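The three IET checks can be expressed as a simple per-step test; the sketch below assumes object points and per-joint distances are already available in meters, and all argument names and shapes are illustrative.

```python
import numpy as np

def interaction_early_terminate(obj_pts, obj_pts_ref,
                                joint_obj_dist, joint_obj_dist_ref, weights,
                                contact_required, contact_detected,
                                missed_contact_frames, max_missed=10,
                                dist_threshold=0.5):
    """Interaction Early Termination (IET) sketch.

    obj_pts, obj_pts_ref      : (P, 3) simulated / reference object points
    joint_obj_dist(_ref)      : (J,)  per-joint distances to the object surface
    weights                   : (J,)  contact-proximity weights (sum to 1)
    contact_required/detected : bool, required body-object contact state
    missed_contact_frames     : running count of frames with lost contact
    Returns (terminate, updated missed_contact_frames).
    """
    # (i) object points drift more than 0.5 m from the reference on average.
    obj_dev = np.linalg.norm(obj_pts - obj_pts_ref, axis=-1).mean()
    if obj_dev > dist_threshold:
        return True, 0

    # (ii) weighted joint-to-object distances deviate > 0.5 m from reference.
    rel_dev = np.abs(joint_obj_dist - joint_obj_dist_ref)
    if np.dot(weights, rel_dev) > dist_threshold:
        return True, 0

    # (iii) a required contact stays lost for more than 10 consecutive frames.
    if contact_required and not contact_detected:
        missed_contact_frames += 1
    else:
        missed_contact_frames = 0
    return missed_contact_frames > max_missed, missed_contact_frames
```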
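The embodiment-aware cost terms translate directly into code. The sketch below assumes per-joint rotation errors are precomputed and uses illustrative exponential-kernel scales to turn the costs into a bounded reward; the normalization of $\boldsymbol w_d$ and the scale values `k` are assumptions, not the paper's exact choices.

```python
import numpy as np

def embodiment_aware_reward(p, p_ref, theta_err, d, d_ref, joint_obj_dist,
                            k=(10.0, 2.0, 20.0), eps=1e-6):
    """Embodiment-aware tracking reward sketch.

    p, p_ref       : (J, 3) simulated / reference joint positions
    theta_err      : (J,)   per-joint rotation errors (geodesic, precomputed)
    d, d_ref       : (J,)   simulated / reference joint-to-object distances
    joint_obj_dist : (J,)   current joint-to-object distances, defining w_d
    """
    # Weights inversely proportional to joint-object distance, normalized.
    w_d = 1.0 / (joint_obj_dist + eps)
    w_d = w_d / w_d.sum()

    delta_p = np.linalg.norm(p - p_ref, axis=-1)   # position displacement
    delta_d = np.abs(d - d_ref)                    # interaction displacement

    E_p = np.dot(delta_p, w_d)              # joint position cost, contact-weighted
    E_theta = np.dot(theta_err, 1.0 - w_d)  # rotation cost, weighted toward far joints
    E_d = np.dot(delta_d, w_d)              # interaction tracking cost

    # Map each cost to a bounded reward factor via an exponential kernel.
    return np.exp(-k[0] * E_p) * np.exp(-k[1] * E_theta) * np.exp(-k[2] * E_d)
```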
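Finally, a sketch of assembling the policy input from the quantities listed above. It assumes the second component $\boldsymbol s_t^g$ carries goal/reference observations and that $\boldsymbol d_t$ and $\boldsymbol c_t$ are joint-object distances and contact indicators; all shapes and names are illustrative.

```python
import numpy as np

def build_policy_state(joint_rot, joint_pos, joint_ang_vel, joint_lin_vel,
                       obj_rot, obj_pos, obj_ang_vel, obj_lin_vel,
                       interaction_dist, contact_flags, goal_obs):
    """Assemble s_t = {s_t^s, s_t^g}: simulated proprioception + object
    observations + interaction terms, followed by the goal observations."""
    s_sim = np.concatenate([
        joint_rot.ravel(), joint_pos.ravel(),          # human rotation / position
        joint_ang_vel.ravel(), joint_lin_vel.ravel(),  # human angular / linear velocity
        obj_rot.ravel(), obj_pos.ravel(),              # object orientation / position
        obj_ang_vel.ravel(), obj_lin_vel.ravel(),      # object angular / linear velocity
        interaction_dist.ravel(),                      # d_t: joint-object distances
        contact_flags.ravel().astype(np.float32),      # c_t: contact indicators
    ])
    return np.concatenate([s_sim, goal_obs.ravel()])
```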
The method was evaluated on the OMOMO, BEHAVE, HODome, IMHD, and HIMO datasets. Key metrics include success rate, duration, human tracking error, and object tracking error. The results demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets and generalizes in a zero-shot manner, integrating with kinematic generators. For example, on the BEHAVE dataset, InterMimic achieved longer tracking durations (42.6 seconds) than SkillMimic (12.2 seconds), with lower human (6.4 cm vs. 7.2 cm) and object (9.2 cm vs. 13.4 cm) tracking errors. Ablation studies validate the effectiveness of PSI and the joint PPO and DAgger updates.
The authors claim that InterMimic effectively handles versatile physics-based interaction animation, recovering motions with realistic and physically plausible details. By combining kinematic generators with InterMimic, a physics-based agent can achieve tasks such as interaction prediction and text-to-interaction generation.