PhysHSI: Unified Humanoid Interaction Framework
- PhysHSI is a unified framework for humanoid-scene interaction that integrates simulation-based training with robust, real-world perception.
- It employs adversarial motion priors from human motion capture data and hybrid reference state initialization to achieve lifelike, generalizable motions.
- The system features a coarse-to-fine object localization stack using LiDAR and AprilTag-based camera detection to ensure reliable operation across diverse tasks.
PhysHSI is a unified framework for enabling humanoid robots to perform diverse, physically plausible, and generalizable real-world interaction tasks such as carrying objects, sitting, lying, and standing up. The system integrates adversarial motion prior-based policy learning from human motion capture data in simulation with a robust, coarse-to-fine object localization stack that fuses LiDAR and camera inputs during real-world deployment. PhysHSI emphasizes both the lifelike motion and operational generalization required for natural, reliable humanoid-scene interactions.
1. System Components and Architecture
PhysHSI consists of two major subsystems:
- Simulation Training Pipeline: Human motion capture datasets (e.g., AMASS, SAMP) are retargeted to the target humanoid platform, producing annotated interaction data that include object manipulation frames. Policies are trained using an adversarial motion prior (AMP) reinforcement learning paradigm to imitate lifelike human behaviors, supported by techniques such as reference state initialization (RSI), domain randomization, and regularization for physically valid movement.
- Real-World Deployment System: On the Unitree G1 humanoid robot, PhysHSI couples a multi-stage perception module with policy execution. Object localization operates in a coarse-to-fine manner: LiDAR and odometry (via FAST-LIO) provide an initial pose estimate, while AprilTag-based camera detection refines localization when objects enter the field of view. The onboard Jetson Orin NX edge computer handles real-time sensor processing and control inference.
This architecture allows PhysHSI to exploit high-fidelity simulated experience and robust real-world perception for generalizable scene interaction.
2. Simulation Training Pipeline: Policy Learning via Adversarial Motion Priors
The reinforcement learning policy, denoted π, is trained to generate actions that result in natural and successful humanoid-scene interactions. Key technical details include:
- Motion Imitation: An adversarial discriminator 𝔻 is trained to distinguish policy-generated state trajectories oᵈₜ₋ₜ*:ₜ (incorporating privileged information such as object position, base height, and joint angles) from real human references. The style reward is given by rₜˢ = −log(1 − 𝔻(oᵈₜ₋ₜ*:ₜ)).
- Reward Structure: The cumulative reward combines task-specific signals (rₜᴳ), regularization (rₜᴿ, including terms for physical plausibility such as L2C2 smoothing), and discriminator-derived style rewards (rₜˢ):
rₜ = wᴳ rₜᴳ + wᴿ rₜᴿ + wˢ rₜˢ
This reward is maximized using proximal policy optimization (PPO):
E[∑ₜ γᵗ⁻¹ rₜ]
- Hybrid Reference State Initialization (RSI): To facilitate exploration and stability during training, episodes are initialized from randomly sampled reference states with randomized future scene configurations, improving generalization and sample efficiency.
- Domain Randomization and Asymmetric Actor-Critic Training: Noise, offsets, and delays are introduced to object poses and kinematics during training to close the sim-to-real gap. The critic receives richer observations than the actor, enhancing policy robustness under partial observability.
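The reward composition and PPO objective above can be sketched in a few lines. This is an illustrative implementation under assumed values: the weights wᴳ, wᴿ, wˢ and the discount γ are placeholders, not the paper's actual hyperparameters.

```python
import numpy as np

# Hypothetical reward weights; the paper's actual values are not given here.
W_G, W_R, W_S = 0.5, 0.2, 0.3  # task, regularization, style weights

def style_reward(d_prob):
    """AMP style reward r_s = -log(1 - D(o)), clipped for numerical safety."""
    return -np.log(np.clip(1.0 - d_prob, 1e-6, 1.0))

def total_reward(r_task, r_reg, d_prob):
    """Weighted sum r = wG*rG + wR*rR + wS*rS."""
    return W_G * r_task + W_R * r_reg + W_S * style_reward(d_prob)

def discounted_return(rewards, gamma=0.99):
    """Discounted return E[sum_t gamma^(t-1) r_t] over one rollout (t from 1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

In practice the discounted return is estimated over sampled rollouts and maximized by PPO's clipped surrogate objective; this sketch only mirrors the reward bookkeeping.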
3. Real-World Deployment: Coarse-to-Fine Object Localization
Accurate, continuous perception of dynamic scene objects—critical for lifelike interaction—is achieved through two-stage localization:
- Coarse Localization: At initialization, a LiDAR-based visualization is used to manually specify the initial object pose T₍b₀₎o₀ in the starting base frame. During execution, the robot's base pose T₍b₀₎bₜ is tracked via FAST-LIO odometry, yielding the object pose in the current base frame:
T₍bₜ₎oₜ = (T₍b₀₎bₜ)⁻¹ * T₍b₀₎o₀
- Fine Localization: When the object enters the camera's view (e.g., within 2.4 m), AprilTag detection yields accurate 3D localization. Temporary losses in detection are handled by propagating the last valid pose using base odometry.
- Dynamic Object Handling: For manipulation (e.g., when an object leaves the camera view post-grasp), the system masks the object pose and relies on proprioceptive feedback.
This hierarchical approach mitigates occlusion and field-of-view limitations present in real-world robot deployments.
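The coarse-localization update is a standard composition of homogeneous transforms. The sketch below, using NumPy 4×4 matrices, is an illustrative rendering of the formula T₍bₜ₎oₜ = (T₍b₀₎bₜ)⁻¹ · T₍b₀₎o₀; the frame names and example distances are assumptions for the demo, not values from the system.

```python
import numpy as np

def make_T(R, p):
    """Assemble a 4x4 homogeneous transform from rotation R and translation p."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def object_in_current_base(T_b0_bt, T_b0_o0):
    """Coarse update: T_{bt,ot} = (T_{b0,bt})^-1 @ T_{b0,o0}."""
    return np.linalg.inv(T_b0_bt) @ T_b0_o0

# Example: robot advances 1 m along x toward a static object 2.4 m ahead.
T_b0_o0 = make_T(np.eye(3), [2.4, 0.0, 0.0])  # object in initial base frame
T_b0_bt = make_T(np.eye(3), [1.0, 0.0, 0.0])  # odometry: current base pose
T_bt_ot = object_in_current_base(T_b0_bt, T_b0_o0)
# The object is now 1.4 m ahead in the current base frame.
```

The same propagation rule is what bridges temporary AprilTag dropouts: the last valid camera-derived pose is carried forward through the odometry chain.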
4. Empirical Validation: Success Rates and Motion Quality
PhysHSI is evaluated on four representative tasks (box carrying, sitting, lying, standing up) in both simulated and real-world settings with the following metrics:
- Success Rate (R₍succ₎): Defined by task completion within prescribed error bounds (e.g., box placement error < 0.1 m). In real-world tests, tasks such as box lifting achieve rates up to 8/10, with full-sequence completion at 6/10.
- Human-Likeness Score (S₍human₎): Assessed via Gemini-2.5-Pro (scale 0–5) for fluidity, style, and plausibility.
- Generalization: In simulation, PhysHSI maintains high success under “full-distribution” settings where object properties are randomly varied. Motions imitated via AMP are more lifelike than those produced by RL reward-shaping or kinematic tracking baselines.
- Precision: Real-world deployments exhibit bounded errors (e.g., box placement within 20 cm) and succeed across diverse indoor and outdoor environments.
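The success-rate metric R₍succ₎ described above reduces to a threshold check over trials. A minimal sketch, where the 0.1 m bound and the sample errors are illustrative:

```python
def box_placement_success(errors_m, threshold=0.1):
    """R_succ: count and fraction of trials with placement error below threshold."""
    succ = sum(e < threshold for e in errors_m)
    return succ, len(errors_m), succ / len(errors_m)

# Hypothetical per-trial placement errors in meters
successes, trials, rate = box_placement_success([0.05, 0.12, 0.08])
```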
5. Addressing Challenges: Generalization, Robust Perception, Sim-to-Real Adaptation
Key technical challenges and their corresponding solutions include:
- Generalization and Natural Motion: Adversarial motion priors combined with hybrid RSI enable effective transfer across unseen interaction scenarios without explicit reward engineering for motion style.
- Robust Perception Under Limited Sensing: The coarse-to-fine fusion of LiDAR and camera (AprilTag) data supports continuous, accurate object pose estimation, overcoming challenges with occlusion and limited sensor field-of-view.
- Sim-to-Real Gap: Domain randomization and asymmetric observations in policy learning bolster transferability to the real robot platform, handling the realities of noisy signals and hardware constraints.
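The domain randomization described above (noise, offsets, and delays on object poses) can be sketched as a perturbation applied to the actor's observations. All magnitudes below are assumed for illustration; the paper's actual randomization ranges are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_object_pose(pose_xyz, noise_std=0.02, offset_range=0.05,
                          delay_buffer=None, max_delay=3):
    """Illustrative domain randomization for the actor's object-pose observation:
    additive Gaussian noise, a bounded offset (held fixed per episode in practice),
    and an optional observation delay of up to max_delay steps."""
    offset = rng.uniform(-offset_range, offset_range, size=3)
    noisy = np.asarray(pose_xyz) + offset + rng.normal(0.0, noise_std, size=3)
    if delay_buffer is not None:
        delay_buffer.append(noisy)
        if len(delay_buffer) > max_delay:
            delay_buffer.pop(0)
        return delay_buffer[0]  # return a stale observation
    return noisy
```

Under the asymmetric actor-critic scheme, only the actor would see this perturbed pose; the critic would receive the clean, privileged value.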
6. Limitations and Future Directions
PhysHSI identifies avenues for expansion:
- Scalable Data Collection: Automated annotation and diversified task datasets could further generalize HSI policy learning.
- Active Perception and Sensor Fusion: Enabling autonomous exploration to optimize scene understanding and localization is expected to further reduce manual specification requirements.
- Manipulator Dexterity: Hardware upgrades for improved gripping and manipulation would extend system applicability to more varied objects and contexts.
- Complex Scene Semantics and Feedback: Integrating richer sensory modalities (depth, tactile feedback) would enable enhanced real-time adaptation and more nuanced interactive tasks.
7. Technical Synopsis and Outlook
PhysHSI represents a significant advance in physically plausible humanoid-scene interaction, incorporating AMP-based policy learning, domain randomization, and a coarse-to-fine perception stack within a unified framework. Its validated performance in both simulated and real robotic environments illustrates the effectiveness of combining principled simulation training with robust, multimodal real-world perception. The approach demonstrates scalable generalization, natural motion, and continuous scene understanding, laying a foundation for future work on complex humanoid interaction and adaptive robotics in dynamic physical environments (Wang et al., 13 Oct 2025).