
HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos (2509.16757v3)

Published 20 Sep 2025 in cs.RO

Abstract: Enabling robust whole-body humanoid-object interaction (HOI) remains challenging due to motion data scarcity and the contact-rich nature. We present HDMI (HumanoiD iMitation for Interaction), a simple and general framework that learns whole-body humanoid-object interaction skills directly from monocular RGB videos. Our pipeline (i) extracts and retargets human and object trajectories from unconstrained videos to build structured motion datasets, (ii) trains a reinforcement learning (RL) policy to co-track robot and object states with three key designs: a unified object representation, a residual action space, and a general interaction reward, and (iii) zero-shot deploys the RL policies on real humanoid robots. Extensive sim-to-real experiments on a Unitree G1 humanoid demonstrate the robustness and generality of our approach: HDMI achieves 67 consecutive door traversals and successfully performs 6 distinct loco-manipulation tasks in the real world and 14 tasks in simulation. Our results establish HDMI as a simple and general framework for acquiring interactive humanoid skills from human videos.

Summary

  • The paper presents a three-stage pipeline that extracts human and object trajectories from videos, trains an RL policy with a residual action space, and enables zero-shot deployment on real robots.
  • The paper demonstrates that a unified object representation and tailored interaction rewards significantly improve exploration stability and performance in complex tasks.
  • The paper validates HDMI with real-world experiments, showing robust performance on tasks such as door traversal and box locomotion while adapting to varying environmental conditions.

HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos

The paper "HDMI: Learning Interactive Humanoid Whole-Body Control from Human Videos" introduces a novel framework called HDMI, which utilizes monocular RGB videos to endow humanoid robots with the ability to perform diverse whole-body interaction tasks. This approach leverages human demonstration videos to create a structured dataset for reinforcement learning (RL). The result is a general framework capable of producing robust and versatile humanoid-object interaction skills.

Framework Overview

HDMI comprises a three-stage pipeline designed to handle the complexity of whole-body humanoid-object interactions:

  1. Data Extraction and Retargeting: Human and object trajectories are extracted and retargeted from monocular RGB videos using pose estimation methods. This step generates a structured motion dataset, which includes desired contact points and reference trajectories.

    Figure 1: HDMI is a general framework for interactive skill learning. Monocular RGB videos are processed into a structured dataset as reference trajectories.

  2. Reinforcement Learning Policy Training: An interaction-centric RL policy is trained using the structured dataset. Key components include:
    • A unified object representation for accommodating diverse object types and interactions.
    • A residual action space that facilitates stable exploration and efficient learning of complex poses.
    • An interaction reward designed to promote robust and precise contact behavior.
  3. Zero-Shot Deployment: The trained policies are deployed on real humanoid robots without additional fine-tuning, demonstrating successful execution of interaction tasks in real-world scenarios.
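
As a rough illustration of stage (i), one entry of the structured motion dataset described above might be organized as follows. This is a hypothetical sketch: the field names, shapes, and frame rate are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MotionClip:
    """One retargeted video clip (illustrative layout, not the paper's schema)."""

    robot_ref: np.ndarray       # (T, n_dof) retargeted robot joint targets per frame
    object_ref: np.ndarray      # (T, 7) object pose: position (3) + quaternion (4)
    contact_points: np.ndarray  # (T, 3) desired contact position per frame
    fps: float = 30.0           # assumed playback rate of the reference

    def frame(self, t: int):
        """Return the reference robot pose, object pose, and contact point at frame t."""
        return self.robot_ref[t], self.object_ref[t], self.contact_points[t]
```

During RL training (stage ii), a time-indexed lookup like `frame(t)` would supply the co-tracking targets for both the robot and the object.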

Key Components

Unified Object Representation

The method employs a unified object representation, leveraging spatially invariant object observations that are transformed into the robot's local frame. This representation allows HDMI to generalize across varied object geometries and operate effectively with diverse objects, ensuring wide applicability.

Figure 2: Reference contact position (yellow dot) in three different tasks. The policy observes these during training and deployment.
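
A minimal sketch of the idea of a spatially invariant object observation, assuming a yaw-aligned robot frame (the paper's exact observation transform may differ):

```python
import numpy as np


def object_obs_in_robot_frame(obj_pos_w: np.ndarray,
                              robot_pos_w: np.ndarray,
                              robot_yaw: float) -> np.ndarray:
    """Express an object's world position in the robot's yaw-aligned local frame.

    The same relative robot-object configuration yields the same observation
    regardless of where the pair sits in the world, which is the invariance
    a unified object representation relies on.
    """
    c, s = np.cos(-robot_yaw), np.sin(-robot_yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])  # rotate world offset by -yaw
    return rot @ (obj_pos_w - robot_pos_w)
```

For example, an object one meter to the robot's left in world coordinates maps to the same local observation whether the robot faces north or east.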

Residual Action Space

The use of a residual action space is critical for learning challenging poses such as kneeling. Instead of absolute joint targets, the policy learns offsets from reference poses, anchoring initial exploration and significantly improving training efficiency and convergence speed.
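
The residual scheme can be sketched as follows; the clipping range and the residual scale are illustrative hyperparameters, not values from the paper:

```python
import numpy as np


def residual_action_to_target(ref_pose: np.ndarray,
                              policy_action: np.ndarray,
                              scale: float = 0.25) -> np.ndarray:
    """Map a policy's residual action to joint targets.

    The policy outputs a bounded offset around the time-indexed reference
    pose, so early in training the robot already tracks the demonstration
    instead of exploring from scratch.
    """
    return ref_pose + scale * np.clip(policy_action, -1.0, 1.0)
```

A zero action reproduces the reference pose exactly, which is what anchors exploration around feasible whole-body configurations such as kneeling.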

Interaction Reward

To compensate for the inherent limitations of kinematic reference trajectories, which may lack precise contact dynamics, HDMI introduces an interaction reward that incentivizes proper contact maintenance. This reward is crucial for ensuring task performance when dealing with imperfect motion references.
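
One plausible form of such a reward is a distance-based shaping term between an end-effector and the reference contact point; the kernel shape and `sigma` below are assumptions, and the paper's exact formulation may differ:

```python
import numpy as np


def interaction_reward(ee_pos: np.ndarray,
                       contact_ref: np.ndarray,
                       sigma: float = 0.05) -> float:
    """Reward keeping the end-effector near the desired contact point.

    Returns 1.0 at perfect contact and decays smoothly with distance,
    so the policy is incentivized to maintain contact even when the
    kinematic reference is slightly off.
    """
    d = np.linalg.norm(ee_pos - contact_ref)
    return float(np.exp(-((d / sigma) ** 2)))
```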

Real-World Evaluation

The effectiveness of HDMI is validated through a series of real-world experiments on challenging tasks such as door traversal and box locomotion. The framework demonstrated the ability to:

  • Execute long-horizon task sequences.
  • Adapt to environmental variations, such as terrain changes.
  • Perform whole-body coordination for object manipulation.

    Figure 3: Demonstrations on challenging real-world tasks, showcasing the robot's ability to adapt its movements and maintain whole-body coordination.

Simulation Ablations

Comprehensive ablation studies underscore the importance of HDMI's core components. Interaction rewards and residual action spaces are shown to be essential for handling imperfect reference motions and improving exploration stability.

Figure 4: Final success rate across 8 tasks, highlighting the critical role of interaction reward and contact-based termination.

Conclusion and Future Directions

The HDMI framework marks a significant advancement in enabling humanoids to learn complex interaction skills from RGB videos. Looking ahead, future work will aim to reduce the reliance on motion capture and explore the development of generalized models capable of handling multiple interaction tasks with a single policy. This would enhance the robot’s adaptability in real-world, uninstrumented environments and broaden the framework's applicability in practical settings.

The research offers promising avenues for integrative humanoid control, laying the groundwork for deploying robust interactive humanoids in dynamic human environments.
