
InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions (2502.20390v1)

Published 27 Feb 2025 in cs.CV, cs.GR, and cs.RO

Abstract: Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.

Authors (4)
  1. Sirui Xu (6 papers)
  2. Hung Yu Ling (6 papers)
  3. Yu-Xiong Wang (87 papers)
  4. Liang-Yan Gui (18 papers)

Summary

  • The paper introduces InterMimic, a teacher-student framework that distills knowledge from multiple experts into a single policy for physics-based human-object interaction.
  • InterMimic achieves longer tracking durations and lower human and object errors compared to SkillMimic on datasets like BEHAVE, producing realistic and diverse interactions.
  • The framework generalizes zero-shot and utilizes novel techniques like Physical State Initialization and Interaction Early Termination, enabling integration with kinematic generators for downstream tasks.

The paper introduces InterMimic, a framework designed to train physically simulated humans to perform whole-body motor skills for interacting with diverse and dynamic objects. The method addresses challenges in human-object interaction (HOI) imitation, including imperfect MoCap data, human shape variability, and the integration of diverse skills into a single policy.

The core of InterMimic is a curriculum-based teacher-student distillation framework. Multiple teacher policies are trained on smaller subsets of interaction data to mimic, retarget, and refine motion capture data. These teachers then act as online experts, providing supervision and high-quality references for a student policy. The student policy is further refined through RL fine-tuning to surpass mere imitation and achieve higher-quality solutions.
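
The gradual shift from demonstration-based distillation to RL updates can be sketched as a simple loss-weight schedule. This is an illustrative linear schedule under assumed names (`loss_weights`, `bc_floor`), not the authors' actual implementation:

```python
def loss_weights(step, total_steps, bc_floor=0.0):
    """Hypothetical schedule for mixing the two student objectives:
    start with pure behavior cloning against the teachers (DAgger-style
    supervision), then linearly shift weight toward PPO updates."""
    frac = min(step / total_steps, 1.0)
    w_bc = max(1.0 - frac, bc_floor)  # distillation (imitation) weight
    w_rl = 1.0 - w_bc                 # RL fine-tuning (PPO) weight
    return w_bc, w_rl
```

The combined student loss would then be `w_bc * bc_loss + w_rl * ppo_loss`, so early training bootstraps from teacher supervision while later training lets RL surpass demonstration replication.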

Key components and contributions include:

  • Teacher-Student Training Strategy: A teacher-student framework is introduced to address retargeting and refinement challenges in HOI imitation. Teacher policies provide refined HOI references with unified embodiment and enhanced physical fidelity. Multiple teacher policies are trained in parallel on smaller data subsets, and their expertise is distilled into a single student policy. The student policy leverages demonstration-based distillation to bootstrap PPO updates and gradually shifts to increased RL updates.
  • Physical State Initialization (PSI): To address issues with imperfect reference data, PSI is proposed, creating an initialization buffer with reference states from MoCap and simulation states from prior rollouts. For each new rollout, an initial state is randomly selected from this buffer. Trajectories are evaluated based on their expected discounted rewards, and those above a threshold are added to the buffer using a FIFO strategy.
  • Interaction Early Termination (IET): Supplements Early Termination (ET) with three extra checks: (i) Object points deviate from their references by more than 0.5 m on average. (ii) Weighted average distances between the character’s joints and the object surface exceed 0.5 m from the reference. (iii) Any required body-object contact is lost for over 10 consecutive frames.
  • Embodiment-Aware Reward: The weights $\boldsymbol{w}_d$ are inversely proportional to the distances between the joints and the object. The reward includes cost functions for joint position $E_p^h = \langle \boldsymbol{\Delta}^h_{p}, \boldsymbol{w}_d \rangle$, joint rotation $E_{\theta}^h = \langle \boldsymbol{\Delta}^h_{\theta}, \boldsymbol{1} - \boldsymbol{w}_d \rangle$, and interaction tracking $E_d = \langle \boldsymbol{\Delta}_{d}, \boldsymbol{w}_d \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product, $\boldsymbol{\Delta}^h_{p}[i]=\|\hat{\boldsymbol{p}}^h[i] - \boldsymbol{p}^h[i]\|$, $\boldsymbol{\Delta}^h_{\theta}[i]=\|\hat{\boldsymbol{\theta}}^h[i] \ominus \boldsymbol{\theta}^h[i]\|$, and $\boldsymbol{\Delta}_{d}[i]=\|\hat{\boldsymbol{d}}[i] - \boldsymbol{d}[i]\|$.
    • $\boldsymbol{w}_d$: Weights inversely proportional to the distances between the joints and the object
    • $E_p^h$: Cost function for joint position
    • $\boldsymbol{\Delta}^h_{p}$: Position displacement (timestep $t$ omitted)
    • $E_{\theta}^h$: Cost function for joint rotation
    • $\boldsymbol{\Delta}^h_{\theta}$: Rotation displacement (timestep $t$ omitted)
    • $E_d$: Cost function for interaction tracking
    • $\boldsymbol{\Delta}_{d}$: Interaction-tracking displacement (timestep $t$ omitted)
  • Policy Representation: The state $\boldsymbol{s}_t$ comprises two components, $\boldsymbol{s}_t = \{\boldsymbol{s}_t^s, \boldsymbol{s}_t^g\}$. The first part, $\boldsymbol{s}_t^s$, contains human proprioception and object observations, expressed as $\{\{\boldsymbol{\theta}_t^h, \boldsymbol{p}_t^h, \boldsymbol{\omega}_t^h, \boldsymbol{v}_t^h\}, \{\boldsymbol{\theta}_t^o, \boldsymbol{p}_t^o, \boldsymbol{\omega}_t^o, \boldsymbol{v}_t^o\}, \{\boldsymbol{d}_t, \boldsymbol{c}_t\}\}$, where $\{\boldsymbol{\theta}_t^h, \boldsymbol{p}_t^h, \boldsymbol{\omega}_t^h, \boldsymbol{v}_t^h\}$ are the rotation, position, angular velocity, and linear velocity of all joints, and $\{\boldsymbol{\theta}_t^o, \boldsymbol{p}_t^o, \boldsymbol{\omega}_t^o, \boldsymbol{v}_t^o\}$ are the orientation, position, angular velocity, and linear velocity of the object.
    • $\boldsymbol{s}_t$: The state, which serves as input to the policy
    • $\boldsymbol{s}_t^s$: Human proprioception and object observations
    • $\{\boldsymbol{\theta}_t^h, \boldsymbol{p}_t^h, \boldsymbol{\omega}_t^h, \boldsymbol{v}_t^h\}$: Rotation, position, angular velocity, and linear velocity of all joints
    • $\{\boldsymbol{\theta}_t^o, \boldsymbol{p}_t^o, \boldsymbol{\omega}_t^o, \boldsymbol{v}_t^o\}$: Orientation, position, angular velocity, and linear velocity of the object
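
The PSI buffer described above can be sketched as follows. The class name, capacity, and threshold are illustrative assumptions; only the mechanics (mixing MoCap and simulation states, return-thresholded admission, FIFO eviction, random initial-state sampling) come from the paper:

```python
import random
from collections import deque

class PSIBuffer:
    """Sketch of Physical State Initialization: an initialization buffer
    holding reference states from MoCap plus simulation states from prior
    rollouts, with FIFO eviction via deque(maxlen=...)."""
    def __init__(self, mocap_states, capacity=1000, reward_threshold=0.8):
        self.buffer = deque(mocap_states, maxlen=capacity)  # FIFO eviction
        self.reward_threshold = reward_threshold

    def maybe_add(self, rollout_states, discounted_return):
        # Only trajectories whose expected discounted reward exceeds
        # the threshold contribute their states to the buffer.
        if discounted_return >= self.reward_threshold:
            self.buffer.extend(rollout_states)

    def sample_initial_state(self):
        # Each new rollout starts from a randomly selected buffered state.
        return random.choice(list(self.buffer))
```

Seeding the buffer with MoCap reference states and then admitting only well-tracked simulation states keeps initializations physically plausible even when the reference data is imperfect.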
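
The three IET checks translate directly into a predicate. The thresholds (0.5 m, 0.5 m, 10 frames) are as stated in the summary; the function name and signature are illustrative:

```python
def interaction_early_terminate(obj_err_m, joint_obj_dist_err_m,
                                contact_lost_frames,
                                obj_thresh=0.5, dist_thresh=0.5,
                                contact_limit=10):
    """Sketch of Interaction Early Termination: terminate when
    (i)   object points deviate from the reference by > 0.5 m on average,
    (ii)  the weighted joint-to-object-surface distance deviates from the
          reference by > 0.5 m, or
    (iii) a required body-object contact is lost for > 10 frames."""
    return (obj_err_m > obj_thresh
            or joint_obj_dist_err_m > dist_thresh
            or contact_lost_frames > contact_limit)
```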
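
A minimal sketch of the embodiment-aware cost terms, assuming normalized inverse-distance weights (the exact weighting scheme is an assumption; the inner-product structure matches the formulas above):

```python
import numpy as np

def embodiment_aware_costs(delta_p, delta_theta, delta_d,
                           joint_obj_dist, eps=1e-6):
    """Illustrative implementation of E_p^h, E_theta^h, and E_d.
    Weights w_d are inversely proportional to joint-object distances,
    so joints near the object dominate the position and interaction
    terms, while distant joints dominate the rotation term via 1 - w_d."""
    w = 1.0 / (np.asarray(joint_obj_dist) + eps)
    w_d = w / w.sum()                         # normalized weights
    E_p = np.dot(delta_p, w_d)                # <Delta_p, w_d>
    E_theta = np.dot(delta_theta, 1.0 - w_d)  # <Delta_theta, 1 - w_d>
    E_d = np.dot(delta_d, w_d)                # <Delta_d, w_d>
    return E_p, E_theta, E_d
```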
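
The simulation-state component $\boldsymbol{s}_t^s$ amounts to a flat concatenation of the quantities listed above. A hypothetical assembly helper (names and flattening order are assumptions for illustration):

```python
import numpy as np

def build_state(theta_h, p_h, omega_h, v_h,
                theta_o, p_o, omega_o, v_o, d, c):
    """Illustrative flattening of s_t^s: human joint rotations, positions,
    angular velocities, and velocities; object orientation, position,
    angular velocity, and velocity; plus joint-object distances d_t and
    contact observations c_t."""
    parts = [theta_h, p_h, omega_h, v_h, theta_o, p_o, omega_o, v_o, d, c]
    return np.concatenate(
        [np.asarray(x, dtype=np.float32).ravel() for x in parts])
```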

The method was evaluated on the OMOMO, BEHAVE, HODome, IMHD, and HIMO datasets. Key metrics include success rate, duration, human tracking error, and object tracking error. The results demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets and generalizes in a zero-shot manner, integrating with kinematic generators. For example, on the BEHAVE dataset, InterMimic achieved longer tracking durations (42.6 seconds) than SkillMimic (12.2 seconds), with lower human (6.4 cm vs. 7.2 cm) and object (9.2 cm vs. 13.4 cm) tracking errors. Ablation studies validate the effectiveness of PSI and the joint PPO and DAgger updates.

The authors claim that InterMimic effectively handles versatile physics-based interaction animation, recovering motions with realistic and physically plausible details. By combining kinematic generators with InterMimic, a physics-based agent can achieve tasks such as interaction prediction and text-to-interaction generation.
