HumanX: Toward Agile and Generalizable Humanoid Interaction Skills from Human Videos

Published 2 Feb 2026 in cs.RO and cs.LG | (2602.02473v1)

Abstract: Enabling humanoid robots to perform agile and adaptive interactive tasks has long been a core challenge in robotics. Current approaches are bottlenecked by either the scarcity of realistic interaction data or the need for meticulous, task-specific reward engineering, which limits their scalability. To narrow this gap, we present HumanX, a full-stack framework that compiles human video into generalizable, real-world interaction skills for humanoids, without task-specific rewards. HumanX integrates two co-designed components: XGen, a data generation pipeline that synthesizes diverse and physically plausible robot interaction data from video while supporting scalable data augmentation; and XMimic, a unified imitation learning framework that learns generalizable interaction skills. Evaluated across five distinct domains--basketball, football, badminton, cargo pickup, and reactive fighting--HumanX successfully acquires 10 different skills and transfers them zero-shot to a physical Unitree G1 humanoid. The learned capabilities include complex maneuvers such as pump-fake turnaround fadeaway jumpshots without any external perception, as well as interactive tasks like sustained human-robot passing sequences over 10 consecutive cycles--learned from a single video demonstration. Our experiments show that HumanX achieves over 8 times higher generalization success than prior methods, demonstrating a scalable and task-agnostic pathway for learning versatile, real-world robot interactive skills.

Summary

  • The paper presents a novel framework that directly transfers complex human interaction skills to humanoid robots without task-specific reward engineering.
  • It combines XGen's contact-aware data synthesis with XMimic’s two-stage teacher-student imitation learning, achieving over eightfold improvement in generalization success.
  • Real-world tests on the Unitree G1 demonstrate zero-shot policy transfer and robust adaptation in dynamic interaction scenarios.

HumanX: Acquisition of Agile and Generalizable Humanoid Interaction Skills from Human Videos

Framework Overview

HumanX proposes a full-stack framework for synthesizing and learning humanoid interaction skills directly from monocular human videos. The system eliminates the requirement for task-specific reward engineering by coupling a scalable data generation pipeline (XGen) with a unified imitation learning paradigm (XMimic). HumanX demonstrates robust transfer of 10 diverse interaction skills—including complex basketball maneuvers, multi-pattern football kicks, cargo handling, and reactive fighting—onto real humanoid hardware (Unitree G1), attaining zero-shot generalization from single video demonstrations and outperforming previous baselines by more than eightfold in generalization success rate. Figure 1

Figure 1: HumanX framework pipeline, starting with SMPL-based motion estimation from video and retargeted synthesis for humanoids, followed by contact-aware segmentation, object retargeting, and phase-dependent interaction trajectory generation in XGen.

XGen: Physically Plausible Data Synthesis and Augmentation

XGen is designed to extract humanoid motion and synthesize interaction trajectories using contact-aware segmentation and physics-driven object trajectory generation. Given monocular video input, XGen first recovers the 3D SMPL-based human pose sequence via state-of-the-art inverse kinematics and pose estimation models, subsequently retargeting the sequence to humanoid morphology using GMR (Araujo et al., 2 Oct 2025). The interaction is then segmented into contact and non-contact phases. During contact, object anchors (e.g., the midpoint between the palms) are propagated through the robot pose trajectory, with force-closure optimization enforcing physical plausibility. Non-contact phases utilize physics simulation (IsaacGym [makoviychuk2021isaac]) to sample object dynamics. Figure 2

Figure 2: Augmentation strategies for the contact phase—object mesh scaling and trajectory transformations extend the diversity of robot-object interactions.
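The contact-phase anchoring described above can be sketched as follows. This is an illustration of the idea, not the paper's implementation: `propagate_object_anchor` and its midpoint rule are assumed names, and real XGen additionally runs force-closure optimization over the resulting poses.

```python
import numpy as np

def propagate_object_anchor(left_palm, right_palm, contact_mask):
    """Sketch of contact-phase object placement: during contact frames,
    anchor the object at the midpoint between the palms; non-contact
    frames are left unset (NaN) for the physics simulator to fill in."""
    left_palm = np.asarray(left_palm, dtype=float)    # (T, 3) palm positions
    right_palm = np.asarray(right_palm, dtype=float)  # (T, 3)
    obj = np.full_like(left_palm, np.nan)             # NaN marks "simulate me"
    obj[contact_mask] = 0.5 * (left_palm[contact_mask] + right_palm[contact_mask])
    return obj

# Toy trajectory: 4 frames, contact on frames 0-1.
left = [[0, 0, 1], [0.1, 0, 1], [0.2, 0, 1], [0.3, 0, 1]]
right = [[0.4, 0, 1], [0.5, 0, 1], [0.6, 0, 1], [0.7, 0, 1]]
mask = np.array([True, True, False, False])
traj = propagate_object_anchor(left, right, mask)
```

The NaN frames would then be handed to the simulator, which integrates object dynamics forward (or backward) from the contact boundaries.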

Figure 3

Figure 3: Non-contact phase trajectory augmentation, enabling skills to generalize to varied object launches and arrivals by initial velocity randomization in simulation.

Critically, XGen offers efficient data augmentation modalities: mesh scaling, trajectory variation, and physics-simulation sampling. These mechanisms enable generalization to novel object parameters and environmental states from only a single video demonstration, eliminating the need for large-scale or high-fidelity interaction datasets while yielding a wide distribution of physically consistent demonstration clips.
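These three augmentation modalities can be sketched in a few lines. The function name and the specific ranges below are assumptions for illustration; the paper's actual parameters are not reported here, and the perturbed launch velocity would seed a physics re-simulation rather than being stored directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_clip(obj_traj, n_variants=8, scale_range=(0.8, 1.2),
                 shift_range=0.3, vel_jitter=0.5):
    """Hypothetical XGen-style augmentation: sample object-mesh scales,
    trajectory translations, and initial-velocity perturbations to turn
    one demonstration clip into many physically varied ones."""
    obj_traj = np.asarray(obj_traj, dtype=float)  # (T, 3) object positions
    variants = []
    for _ in range(n_variants):
        scale = rng.uniform(*scale_range)                        # mesh scaling
        shift = rng.uniform(-shift_range, shift_range, size=3)   # path translation
        dv = rng.uniform(-vel_jitter, vel_jitter, size=3)        # launch-velocity delta
        variants.append({"scale": scale, "traj": obj_traj + shift, "dv": dv})
    return variants

clips = augment_clip(np.zeros((10, 3)))
```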

XMimic: Unified Imitation Learning Architecture

XMimic implements a two-stage teacher-student training paradigm. In the first stage, teacher policies are trained with privileged state observations for each skill pattern, leveraging a composite imitation-regularized reward. Teachers track reference demonstrations using a unified reward structure encompassing body state, object state, body-object relative pose, contact graph, and regulatory terms for stability. PPO [schulman2017proximal] is employed for policy optimization. Figure 4

Figure 4: XMimic two-stage workflow: privileged teacher learns using interaction imitation reward; student distilled under perceptual constraints (limited to proprioception or partial perception—MoCap) for direct real-world deployment.
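The unified reward structure above can be pictured as a sum of per-term tracking kernels. The term names come from the paper; the `exp(-w * err^2)` kernel shape and the weight values are assumptions chosen only to make the sketch concrete.

```python
import numpy as np

def interaction_reward(err, weights=None):
    """Sketch of a unified imitation reward: one tracking term per
    component (body state, object state, body-object relative pose,
    contact graph, regularization), each rewarding low error against
    the reference demonstration."""
    weights = weights or {"body": 1.0, "object": 1.0,
                          "relative": 1.0, "contact": 0.5, "reg": 0.1}
    # Each term contributes exp(-w * e^2): 1 at perfect tracking, -> 0 as error grows.
    return sum(np.exp(-w * err[k] ** 2) for k, w in weights.items())

r_good = interaction_reward({"body": 0.0, "object": 0.0, "relative": 0.0,
                             "contact": 0.0, "reg": 0.0})
r_bad = interaction_reward({"body": 2.0, "object": 2.0, "relative": 2.0,
                            "contact": 2.0, "reg": 2.0})
```

Because the same term structure applies to every skill, only the reference clip changes between tasks, which is what makes the reward task-agnostic.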

The second stage distills the teacher policies into a deployable student, restricting the observation space to what is available in the chosen deployment mode: solely proprioceptive signals in NEP (no external perception), or MoCap-based object perception. Student optimization combines RL gradients with an explicit behavior cloning loss from the expert teachers, yielding a robust multi-pattern interaction policy.
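A minimal sketch of that combined objective, assuming the RL term (e.g. the PPO surrogate) is computed elsewhere and passed in, and that the two terms are combined additively with a tunable weight (both assumptions, not details from the paper):

```python
import numpy as np

def student_loss(student_action, teacher_action, rl_loss, bc_weight=1.0):
    """Distillation objective sketch: an RL loss plus a behavior-cloning
    term (mean squared error) pulling the student's action toward the
    privileged teacher's action for the same state."""
    bc = float(np.mean((np.asarray(student_action)
                        - np.asarray(teacher_action)) ** 2))
    return rl_loss + bc_weight * bc
```

When the student matches the teacher exactly, the BC term vanishes and only the RL term remains, so the student can still improve beyond pure cloning.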

Notably, XMimic supports the direct inference of external forces through joint state history analysis based on robot dynamics and PD control torques, enabling proprioception-only skill control in the absence of dedicated tactile or vision sensors.
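The underlying idea can be shown on a single joint. This 1-DoF sketch illustrates the principle rather than the paper's implementation: the PD command is known, so any residual between the observed acceleration and the modeled dynamics can be attributed to external contact.

```python
def infer_external_torque(m, q_ddot, gravity_torque, kp, kd,
                          q, q_des, qd, qd_des=0.0):
    """1-DoF proprioceptive force inference: solve the rigid-body
    equation tau_pd + tau_ext = m * q_ddot + gravity_torque for the
    external torque, given the commanded PD torque."""
    tau_pd = kp * (q_des - q) + kd * (qd_des - qd)  # commanded PD torque
    return m * q_ddot + gravity_torque - tau_pd

# A joint accelerating faster than its PD command explains implies
# an external push of +1.0 N*m in this toy setup.
tau_ext = infer_external_torque(m=1.0, q_ddot=2.0, gravity_torque=0.0,
                                kp=10.0, kd=1.0, q=0.0, q_des=0.1, qd=0.0)
```

On the real robot this would run over a joint-state history with the full multi-DoF dynamics, but the signal being exploited is the same.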

Generalization, Multi-Pattern Skill Acquisition, and Simulation Results

HumanX policies exhibit strong generalization to unseen object positions, trajectories, and targets beyond the scope of the demonstration video. This is achieved by combining broad offline diversity from XGen augmentation, online domain randomization (root/joint/object perturbation), and interaction-prioritized termination, preventing mode collapse and overfitting. Figure 5

Figure 5: Simulation demonstration—policy generalizes basketball catch-shot to trajectories and target locations not present in the original demonstration.
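The online randomization described above (root, joint, and object perturbation at episode reset) can be sketched as follows. The ranges are illustrative except for the ±0.3 m object-position range, which matches the evaluation range mentioned later; the joint count and dictionary layout are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_episode(base):
    """Perturb root pose, joint angles, and object state at every
    episode reset so the policy never trains on a single nominal
    initial condition (ranges are illustrative)."""
    ep = dict(base)
    ep["root_pos"] = base["root_pos"] + rng.uniform(-0.1, 0.1, size=3)
    ep["joints"] = base["joints"] + rng.uniform(-0.05, 0.05,
                                                size=len(base["joints"]))
    ep["obj_pos"] = base["obj_pos"] + rng.uniform(-0.3, 0.3, size=3)  # +/-0.3 m
    ep["obj_vel"] = base["obj_vel"] + rng.uniform(-0.5, 0.5, size=3)
    return ep

# 23 joints is a placeholder count, not the G1's exact DoF layout.
base = {"root_pos": np.zeros(3), "joints": np.zeros(23),
        "obj_pos": np.zeros(3), "obj_vel": np.zeros(3)}
ep = randomize_episode(base)
```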

Figure 6

Figure 6: Learned multi-pattern skills; XMimic autonomously selects optimal football-kicking or badminton-hitting pattern contingent on the real-time object state.

Figure 7

Figure 7: Generalization visualization—single demonstration yields robust skill coverage over wide ranges of initial object state.

Ablation reveals key factors: disturbed initialization enhances out-of-distribution generalization, interaction termination policies suppress local optima, and teacher-student distillation is essential for multi-pattern unification. HumanX consistently surpasses previous approaches (SkillMimic [wang2025skillmimic], OmniRetarget [yang2025omniretarget], HDMI [weng2025hdmi]) in success rate and object/body tracking error.

Real-World Deployment and Emergent Behaviors

HumanX policies transfer zero-shot to the Unitree G1 humanoid across both NEP (proprioception-only) and MoCap deployment modes. NEP mode enables high-frequency execution of basketball skills without any explicit external object sensing; MoCap mode supports sustained closed-loop human-robot interaction for tasks including catch-pass, object pickup, and reactive fighting with robust performance under perception outage and external disturbance. Figure 8

Figure 8: NEP basketball—dynamic skill execution through proprioception-only policy; no object perception required.

Figure 9

Figure 9: MoCap-based interaction—continuous object tracking enables sustained interaction and closed-loop adaptation, all from single video training.

Policies exhibit emergent adaptive behaviors: autonomous recovery from dropped objects, real-time response to adversarial disturbances (e.g., forceful kicks), discrimination between human feints and attacks, and object re-acquisition—indicative of policy-level reasoning instead of mere trajectory replay. Figure 10

Figure 10: Emergent robust adaptation—robot recovers from forceful disturbance and reacquires object without explicit reward engineering.

Sim-to-real stability is directly linked to comprehensive domain randomization in training. Excluding simulated external forces or perception loss during training can result in catastrophic real-world failures. Figure 11

Figure 11: Sim-to-real analysis—incorporating force disturbance and MoCap signal loss in training enhances deployment robustness.
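The MoCap-signal-loss randomization referenced in the figure can be sketched as a simple observation corruption at training time. The drop probability and hold-last-value strategy are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_mocap(obj_pos_seq, drop_prob=0.1):
    """Randomly mask MoCap frames and repeat the last valid reading,
    so the student policy learns to ride out tracking dropouts
    instead of failing when perception briefly disappears."""
    out = np.array(obj_pos_seq, dtype=float)  # (T, 3) object positions
    last = out[0].copy()
    for t in range(len(out)):
        if t > 0 and rng.random() < drop_prob:
            out[t] = last          # dropout: hold last valid observation
        else:
            last = out[t].copy()   # valid frame: update cache
    return out

seq = corrupt_mocap(np.arange(30).reshape(10, 3))
```

The same pattern extends to the force side: injecting random external pushes during training corresponds to the disturbance robustness shown above.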

Implications and Future Directions

HumanX establishes an end-to-end, task-agnostic framework for encoding complex human motion and interaction skills in humanoid robots, solely from accessible video input and without reliance on engineered rewards. Its scalable data synthesis and augmentation pipeline, coupled with unified imitation and perceptual distillation, achieve both high-fidelity replication and unprecedented generalization from minimal demonstration data.

The implications are multifold:

  • Scalable Data Utilization: Efficient transformation of human demonstrations into broad robot-interaction datasets ready for learning, without hardware-based teleoperation or expensive multi-view capture.
  • Task Generalization: Models generalize far outside of training distribution, supporting long-horizon, multi-agent, multi-object interaction, and reactive real-time adaptation.
  • Sensor-Agnostic Deployment: Policies operate using minimal perception (pure proprioception) or partial object tracking (MoCap), maximizing robustness and lowering deployment complexity.
  • Foundation for Autonomous Reasoning: Emergent behaviors demonstrate that high-level adaptive reasoning can arise under a unified imitation regime with diversified data and reward structures.

Looking forward, future work could target:

  • Full integration with multimodal perception (scene context, natural language-driven goals).
  • Expansion to multi-agent, multi-object, or collaborative tasks.
  • Cross-platform adaptation to morphologically diverse robots via higher-level retargeting.
  • Automated decomposition and compositional learning for complex synthesized human behaviors.

Conclusion

HumanX delivers a scalable, generalizable, and robust pipeline for learning real-world humanoid interaction skills directly from human video. By combining XGen’s data synthesis and XMimic’s unified imitation architecture, HumanX policies achieve both high success rates and strong generalization far beyond prior approaches (2602.02473). This represents a significant advancement toward foundation models for autonomous humanoid interaction, paving the way for further research in generalist robot skill learning under minimal supervision.


Explain it Like I'm 14

Overview of the Paper

This paper is about teaching humanoid robots (robots shaped like people) to do fast, smooth, and adaptable actions with objects and people—things like dribbling and shooting a basketball, kicking a football, picking up boxes, hitting a badminton shuttlecock, and even reacting to a human in a playful “sparring” scenario. The key idea is to let robots learn these skills by watching short human videos, rather than needing lots of special training or hand-crafted rules for each task.

Goals and Big Questions

The paper tries to answer simple but important questions:

  • Can a robot learn complex interaction skills (like catching and passing a ball) from just one short video of a human doing it?
  • Can these learned skills work in many slightly different situations (for example, the ball is thrown from a different angle or the box is in a new place)?
  • Can we avoid writing detailed “reward rules” for each task and instead use a general way for the robot to learn by imitation?
  • Can the learned skills be used on a real humanoid robot, not just in simulation?

How It Works (Methods Explained Simply)

The system is called HumanX, and it has two main parts that work together:

XGen: Making Good Practice Data from Human Videos

Think of XGen like a smart “video-to-practice” machine:

  • It watches a human video of a skill (for example, someone lifting a box or doing a basketball layup).
  • It converts the human’s movement into a version a robot can use (like matching human arms to robot arms).
  • It focuses on making the interaction physically believable, not just visually perfect. For example, if the hands hold a box, XGen ensures the box stays in the right place between the hands without magically popping or sliding through fingers.
  • It separates the action into “contact” (hands touching the object) and “non-contact” (object flying or resting) parts:
    • Contact part: It uses simple “anchor points” (like the midpoint between two palms) to keep the object correctly attached to the hands and tweaks the robot’s pose so the grip is physically stable.
    • Non-contact part: It uses a physics simulator (like a digital sandbox that obeys gravity and collisions) to create realistic object motion, such as the arc of a thrown ball.
  • It augments (varies) the data to create more practice examples from just one video:
    • Changes object size or shape (a bigger box, a different ball).
    • Shifts paths (the ball comes from a slightly different place or speed).
    • Adjusts positions (the box starts on a higher or lower shelf).

This is like turning one training clip into a big set of realistic practice drills, so the robot doesn’t just memorize—it learns to handle variations.

XMimic: Teaching the Robot to Imitate and Generalize

XMimic is the learning part—how the robot’s “brain” learns the skill:

  • It uses a “teacher–student” approach:
    • The teacher policy learns first with extra information (like precise object positions) in simulation, so it can master the skill accurately.
    • Then the student policy learns from the teacher, but with more realistic, limited information—like a real robot would have. This makes the student ready for the real world.
  • Two ways the robot perceives the world:
    • No External Perception (NEP): The robot doesn’t get camera/object tracking. Instead, it “feels” what’s happening through its own joints and forces (similar to how you can feel a ball in your hands without looking). This mode is great for skills where the robot keeps contact (like dribbling or layups) but not for catching flying objects.
    • MoCap mode: A motion-capture system gives the robot the position of objects (like the ball). Because tracking can hiccup (occlusions), XMimic trains with fake data dropouts so the robot stays stable even when sensor data briefly disappears.
  • A unified imitation reward: Rather than writing a different “score formula” for each task, XMimic uses a general setup that:
    • Encourages the robot’s body to move like the human.
    • Tracks the object’s motion correctly.
    • Keeps the right relationship between body parts and the object (e.g., hands aligned with a ball).
    • Matches contact timing (e.g., when to grab or release).
    • Stays smooth and natural.

To boost generalization (handling new variations), they also:

  • Randomly perturb starting positions and poses during training (like practicing from different starting stances).
  • Randomize physics (object weight, bounciness, friction) and add random pushes so the robot can cope with surprises.
  • Use “interaction termination” so the robot doesn’t cheat by only moving nicely—it must actually complete the interaction (e.g., catch and shoot).

Main Results and Why They Matter

The team tested HumanX on a real Unitree G1 humanoid robot and in simulation, across five areas: basketball, football, badminton, cargo pickup, and reactive fighting. They taught 10 different skills using only single human videos per skill. Key highlights:

  • Strong generalization (about 8× better than previous methods): The robot didn’t just copy a single demonstration—it adapted when the ball’s path, target location, or object position changed.
  • Real robot success:
    • NEP mode (no object sensing): The robot performed basketball skills like dribbling, layups, jumpshots, and even complex pump-fake turnaround fadeaway moves, with high success rates (around 80% on average for many moves).
    • MoCap mode (with object tracking): The robot did closed-loop interactions with a person, like over 10 back-and-forth basketball passes and more than 14 consecutive football returns.
  • Learned from single videos: Each skill came from just one human demonstration, then XGen created lots of realistic practice variations.
  • Emergent adaptive behaviors:
    • If someone took a box from the robot and set it down elsewhere, the robot walked over and picked it up again.
    • In the “fighting” demo, the robot reacted differently to feints versus real punches, showing basic interactive judgment.

Why this matters:

  • It proves robots can learn rich, human-like interactions with little data and without hand-writing specific reward rules for each task.
  • It shows a scalable path toward robots that can learn new skills from everyday videos and work in real homes, warehouses, or sports training scenarios.

Implications and Potential Impact

This research points to a future where:

  • Robots learn new skills as easily as watching YouTube clips—great for fast training without expensive, time-consuming setups.
  • Humanoid robots become more natural partners in human environments, handling objects, playing sports, assisting with chores, or cooperating with people.
  • Developers can build versatile robot skills without crafting detailed rules for every single task, saving time and making progress faster.

In short, HumanX shows that “learning by watching” can give humanoid robots agile, generalizable interaction skills, taking a big step toward robots that adapt smoothly to the messy, varied real world.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, actionable list of what remains missing, uncertain, or unexplored in the paper, to guide future research:

  • Automatic contact-phase segmentation is not addressed; XGen’s contact/non-contact phases are annotated by timestamps, leaving open robust, automatic detection of contact events from video under occlusions and motion blur.
  • The anchor-based interaction representation assumes a fixed relative pose during contact (e.g., midpoint between palms), which may fail for complex interactions involving sliding contacts, regrasping, multi-point contacts, or compliance; methods for learning or adapting anchors and handling time-varying relative poses are unstudied.
  • Object geometry and pose often require manual definition when not visible; there is no validated pipeline for reliable object reconstruction from monocular video under heavy occlusion, fast motion, or challenging textures.
  • XGen’s physics synthesis relies on simplified rigid-body models and “inverted damping” for reverse simulation; the physical fidelity and stability of this approach for fast-moving, spinning, or aerodynamically affected objects (e.g., shuttlecocks, balls with spin) remains unquantified.
  • Force-closure optimization is applied frame-by-frame without trajectory-level consistency; long-horizon grasp stability, contact switching, and slip modeling are not optimized jointly, risking jitter and suboptimal contact realism.
  • Deformable and articulated objects (e.g., cloth, bags, doors, tools with joints) are not considered; extending XGen/XMimic to non-rigid or articulated interactions is unexplored.
  • Data augmentation in XGen is limited to object scaling, translations, and initial-velocity randomization; richer variations (surface friction, compliance, shape changes, textures, lighting, environmental obstacles) and their impact on generalization are not examined.
  • XMimic primarily operates with proprioception (NEP) or MoCap-based object/human poses; there is no integration of onboard vision (RGB/RGB-D) for perception, limiting deployability beyond controlled MoCap environments.
  • NEP mode cannot handle non-contact interactions requiring exteroception (e.g., catching a flying object); how to bridge that gap with onboard sensing, predictive models, or learned haptics remains open.
  • MoCap-based deployment is demonstrated in a small, controlled capture volume; robustness to larger spaces, occlusions, ID switches, latency, and multi-object tracking beyond simulated frame loss is not systematically evaluated.
  • Policies are trained and deployed on a single embodiment (Unitree G1); despite cross-embodiment claims, zero-shot or few-shot transfer to different humanoid morphologies is not tested.
  • Real-world experiments cover a limited set of object types and environments (flat floors, few obstacles); performance in cluttered, outdoor, or uneven terrains is not assessed.
  • Badminton is only evaluated in simulation; real-world validation for very high-speed interactions (e.g., shuttlecock hitting) and limits of 100 Hz control are untested.
  • The unified reward uses a reference contact graph and fixed weights; sensitivity to weight choices, failure modes when contact timing/locations deviate from demonstrations, and automatic reward tuning are not reported.
  • The student policy architecture appears to be a feedforward MLP; no use of recurrent memory for POMDP settings (e.g., occlusions, delayed MoCap), and no comparison to memory-augmented policies.
  • External force inference from proprioception is theoretically motivated but not empirically ablated; the incremental benefit versus explicit force/torque sensing and the effect of actuator model errors or unmodeled friction are unknown.
  • Domain randomization ranges (e.g., friction, restitution, CoM offsets) are not detailed or stress-tested; the contribution of each DR component to sim-to-real performance and failure modes remains unclear.
  • Generalization tests use relatively narrow perturbation ranges (e.g., ±0.3 m) and single-object settings; scalability to wider state distributions, multi-object interactions, and multi-agent scenarios is not evaluated.
  • Interaction safety with humans (e.g., fighting task) lacks formal analysis or guarantees; safe-force limits, collision detection, compliance strategies, and emergency behaviors are not specified.
  • The framework relies on single-video demonstrations per skill plus augmentation; how performance scales with more demonstrations, noisy/low-quality videos, or semantic diversity is unexamined.
  • Multi-pattern learning is shown with 3 patterns per skill; scalability to dozens of patterns, pattern compositionality, and conflict resolution among patterns are open questions.
  • High-level sequencing, task composition, and goal conditioning (e.g., via language or symbolic instructions) are not explored; skills are trained and evaluated as stand-alone behaviors.
  • XGen’s reliance on monocular pose estimators (GVHMR) inherits scale and depth ambiguities; the impact of these errors on downstream learning and how to mitigate them (multi-view, IMUs, SLAM) is not addressed.
  • Real-time latency handling (policy, MoCap, control loop) and its effect on fast interactions are not quantified; controllers for latency compensation or predictive control are absent.
  • Reproducibility details (reward weights, network sizes, optimizer settings, full DR ranges) are sparse; standardized benchmarks and public datasets for fair comparison are not provided.

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging HumanX’s demonstrated pipeline (XGen + XMimic), NEP/MoCap perception modes, and sim-to-real results on a Unitree G1.

  • Video-to-skill compilation for warehouse material handling
    • Sectors: robotics, logistics, manufacturing
    • Tools/products/workflows: “Video-to-skill” compiler that turns smartphone clips of lifts/carries into deployable policies; XGen augmentation for object size/pose variability; XMimic teacher–student training; MoCap mode for object localization; DI/DR/IT in training for robustness
    • Assumptions/dependencies: Humanoid hardware with comparable DoFs and strength to G1; task objects within trained geometry/mass ranges; safe workspace and supervision; access to GPUs and Isaac Gym (or equivalent) for training; object mesh estimation (SAM-3D) or manual specification
  • Sports/entertainment humanoids for interactive demos
    • Sectors: entertainment, sports marketing, retail experiences
    • Tools/products/workflows: NEP mode for ball-in-hand skills (dribble, layup, jumpshot, pump-fake fadeaway); MoCap mode for sustained pass/kick interactions with visitors; multi-pattern policies (teacher–student) for varied moves
    • Assumptions/dependencies: Safety perimeter and compliant control; balls and prop dynamics within trained DR ranges; venue support for MoCap (when needed) and occlusion handling
  • Proprioception-only interaction controllers for perception-degraded settings
    • Sectors: industrial robotics, field robotics
    • Tools/products/workflows: Deploy NEP student policies that infer external forces from proprioception for contact-rich tasks (e.g., carrying, dribbling, stable grasp recovery); integration with existing PD controllers
    • Assumptions/dependencies: Tasks that don’t require pre-contact object tracking; adequate coverage of contact dynamics during training; platform-specific PD gains and torque limits
  • MoCap-augmented human–robot interaction (HRI) installations
    • Sectors: museums, theme parks, showrooms, live events
    • Tools/products/workflows: MoCap-driven tracking of props/partners for closed-loop catching/passing/kicking; training with simulated MoCap frame loss for real-world robustness; scripted multi-cycle engagements
    • Assumptions/dependencies: MoCap infrastructure and calibration; clear line of sight for markers; policies trained with dropout to handle occlusions; staff trained for safe operation
  • Rapid prototyping of new humanoid behaviors in R&D
    • Sectors: robotics startups, product engineering, academia
    • Tools/products/workflows: Internal pipeline to capture a single video, auto-synthesize interaction data with XGen (including anchor definitions and force-closure refinement), and train XMimic with DI/DR/IT; A/B testing across augmentation settings; skill regression tests
    • Assumptions/dependencies: Retargeting quality (GMR) for target morphology; GPU compute; reproducible simulation setup; safety review for deployment
  • Synthetic interaction dataset generation for imitation learning
    • Sectors: AI/ML data providers, robotics research
    • Tools/products/workflows: Use XGen to expand a single demo into large, physically plausible HOI datasets (scaled meshes, diverse trajectories, contact/non-contact segments); publish benchmark splits with generalization ranges
    • Assumptions/dependencies: Reliability of human pose estimation (GVHMR) and contact segmentation; physics parameters calibrated to target platforms; licensing/consent for source videos
  • Safety and robustness testing workflows for HRI
    • Sectors: certification, QA, safety consulting, standards bodies
    • Tools/products/workflows: Test plans leveraging interaction termination (IT), domain randomization, and sustained external force injections to quantify stability and recovery; generalization success metrics (GSR) as acceptance criteria
    • Assumptions/dependencies: Agreement on metrics and thresholds with regulators; reproducibility of simulation conditions; standardized reporting
  • Educational labs and courses on video-based skill learning
    • Sectors: academia, workforce upskilling
    • Tools/products/workflows: Teaching modules that walk through XGen synthesis, anchor selection, force-closure optimization, XMimic’s unified reward, and teacher–student distillation; Isaac Gym labs; evaluation on generalization
    • Assumptions/dependencies: Access to GPUs and simulators; compatible open-source implementations or licenses; institution safety policies for real-robot demos

Long-Term Applications

These applications are plausible extensions but require further research, scaling, or infrastructure (e.g., broader perception, dexterous hands, regulatory clearances).

  • Consumer humanoids learning household skills from user videos
    • Sectors: consumer robotics, smart home
    • Tools/products/workflows: Cloud/on-device “video-to-skill” services for chores (tidying, carrying, simple tool use); anchor-based authoring for diverse objects; multi-pattern policies for variability
    • Assumptions/dependencies: Robust perception in clutter (beyond MoCap), dexterous grippers/hands, stronger generalization to household objects, privacy-preserving training, safety certification
  • On-the-fly factory upskilling by filming expert operators
    • Sectors: manufacturing, logistics
    • Tools/products/workflows: Floor supervisors record one video per task; pipeline compiles policies and augments object/tool geometries; centralized skill library and deployment tooling; easy retargeting across plant robots
    • Assumptions/dependencies: Precise non-contact perception for pre-grasp phases (vision, depth, markers); variability in fixtures and tools; end-effector standardization; line integration and safety approvals
  • Assistive and rehabilitation robots learning from clinician demonstrations
    • Sectors: healthcare, eldercare
    • Tools/products/workflows: Policies for gentle hand-overs, object delivery, mobility aid assistance; multi-pattern teacher–student to adapt to patient variability; human-state perception beyond props
    • Assumptions/dependencies: High-fidelity compliance and tactile sensing; rigorous safety/ethics oversight; regulatory clearance (FDA/CE); caregiver-in-the-loop validation; low-impact failure modes
  • Skill marketplaces and standardized libraries for humanoids
    • Sectors: software platforms, robotics ecosystems
    • Tools/products/workflows: Distribution of pre-trained XMimic policies and associated XGen datasets with metadata on domain randomization ranges and supported morphologies; ROS2 packages and CI for skill regression
    • Assumptions/dependencies: Cross-embodiment compatibility (high-quality retargeting); IP/licensing for human-derived skills; versioning, benchmarking, and governance
  • Policy and governance frameworks for training from human video
    • Sectors: policy, legal, standards
    • Tools/products/workflows: Guidelines on consent, provenance, and IP for video-sourced skills; bias and safety audits; standardized generalization/safety metrics (e.g., GSR, IT-trigger rates) for certification
    • Assumptions/dependencies: Multi-jurisdiction legal harmonization; enforceable audit requirements; stakeholder buy-in (industry, academia, consumer advocates)
  • Collaborative multi-agent interaction (humans + multiple robots)
    • Sectors: robotics, entertainment, advanced manufacturing
    • Tools/products/workflows: Extend XGen to synthesize multi-object and multi-actor interactions; student policies coordinating patterns; shared perception (multi-camera, VIO) and communication to handle occlusions/latency
    • Assumptions/dependencies: Scalable perception and time synchronization; conflict-free role assignment; safety envelopes; formal verification of coordination
  • Embodied foundation models pre-trained from human video
    • Sectors: AI research, platform robotics
    • Tools/products/workflows: Use XGen to curate massive, physically plausible HOI datasets; pretrain generalist policies or world models; align with vision-language models for instruction following; RLHF for safety
    • Assumptions/dependencies: Large-scale compute; curation to avoid unsafe behaviors; evaluation suites for interaction generalization and alignment
  • Disaster response and public safety robots with proprioceptive fallback
    • Sectors: emergency services, public safety
    • Tools/products/workflows: Train safe interaction policies from rescue task videos; NEP fallback when vision fails (dust/smoke); object handling and debris clearing with interaction termination safeguards
    • Assumptions/dependencies: Ruggedized hardware; expanded DR for extreme terrains; teleop override; ethical and legal frameworks for deployment
  • Sports coaching and training aids that mimic pro moves
    • Sectors: sports science, education
    • Tools/products/workflows: Compile skills from professional footage for consistent demonstrations; interactive drills (passes, returns); analytics on movement and object trajectories
    • Assumptions/dependencies: Legal rights to use footage; field-scale perception and localization; safety around athletes and spectators
  • AR authoring tools for non-technical users to specify anchors/objects
    • Sectors: creative tools, HRI, education
    • Tools/products/workflows: AR apps to annotate anchors and object poses in captured videos, preview physics-based synthesis, and export trainable interaction clips
    • Assumptions/dependencies: Robust mobile SLAM; intuitive UX for contact phase specification; automatic plausibility checks; device performance constraints
  • Robot self-improvement via continuous learning from in-situ video
    • Sectors: operations, maintenance
    • Tools/products/workflows: Robots (or supervisors) capture in-situ clips of near-misses or failures; XGen synthesizes corrective interactions; XMimic updates student policy with new patterns while preserving safety
    • Assumptions/dependencies: Reliable on-site video capture; safe on-policy updates (offline or supervised online); drift monitoring and rollback mechanisms

These applications rely on HumanX’s core innovations—physics-governed interaction synthesis (XGen), unified interaction-imitation rewards, teacher–student distillation, proprioception-based force inference, and robust training strategies (DI/DR/IT)—and are bounded by dependencies such as platform capability, perception availability, data rights, and safety constraints.
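Of the robust training strategies listed above, domain randomization (DR) is the most mechanical to illustrate: physical parameters such as object mass, friction, and coefficient of restitution are perturbed per episode so policies do not overfit one simulated world. A minimal sketch, with placeholder ranges that are assumptions rather than the paper's actual settings:

```python
import random

def randomize_physics(rng, base):
    """Domain randomization sketch: perturb physical parameters per episode.
    The multiplier ranges below are illustrative placeholders, not the
    ranges used in HumanX."""
    return {
        "object_mass": base["object_mass"] * rng.uniform(0.8, 1.2),
        "friction": base["friction"] * rng.uniform(0.5, 1.5),
        # Restitution is clamped to 1.0 so collisions never gain energy.
        "restitution": min(1.0, base["restitution"] * rng.uniform(0.7, 1.3)),
    }

# Nominal object properties (hypothetical values, e.g. a small ball).
base = {"object_mass": 0.6, "friction": 0.9, "restitution": 0.85}
params = randomize_physics(random.Random(0), base)
```

In practice such sampling runs once per training episode, so the policy sees a distribution of dynamics rather than a single simulator configuration.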

Glossary

  • 6D pose: A six-parameter representation combining 3D position and 3D orientation of a rigid body. "represents the 6D pose (3D position and 3D orientation) of the human root,"
  • Adversarial Motion Prior (AMP): A learned prior using adversarial training to encourage natural-looking motions during imitation. "and includes an adversarial motion prior (AMP) term for naturalness"
  • Anchor: A predefined reference point on the body used to maintain relative pose with an object during contact. "a predefined anchor (e.g., the midpoint between the two palms) is used."
  • Behavior Cloning (BC): A supervised learning approach that imitates expert demonstrations by directly mapping observations to actions. "While behavior cloning (BC) offers a unified training paradigm,"
  • Center of Mass (CoM) offsets: Variations in the location of a robot’s center of mass used to improve robustness via randomization. "as well as robot friction coefficients, center of mass offsets, and perception noise."
  • Coefficient of restitution: A measure of how bouncy a collision is, influencing post-impact velocities of objects. "including object size, mass, and coefficient of restitution,"
  • Contact-aware refinement: Optimization that adjusts poses considering contact constraints to improve physical plausibility. "coupled with contact-aware refinement"
  • Contact graph: A representation specifying which body parts should be in contact with objects at specific times. "deviations from the reference contact graph"
  • Coriolis: Forces arising from rotational motion that affect the dynamics of moving bodies. "the sum of inertial, Coriolis, gravitational, and frictional components."
  • Cross-embodiment: Properties or representations that transfer consistently across different robot or human morphologies. "exhibits favorable cross‑embodiment properties,"
  • Damping coefficients: Parameters that model velocity-dependent resistive forces, used to adjust simulation behavior. "object damping coefficients are inverted."
  • Degrees of Freedom (DoF): The number of independent parameters that define a robot’s configuration. "(where $n$ is the number of robot DoFs)"
  • Domain Randomization (DR): Training technique that randomizes environment and physical parameters to improve sim-to-real robustness. "We apply domain randomization (DR) to various physical properties"
  • Force-aware interaction: Interaction behaviors that infer and respond to external forces using proprioceptive signals. "enabling force-aware interaction without dedicated force/torque sensors."
  • Force-closure constraints: Conditions ensuring that grips can resist external disturbances and maintain stable contact. "optimized under force‑closure constraints to ensure physical plausibility during contact."
  • Gaussian distribution (policy): A common stochastic policy parameterization where actions are sampled from a Gaussian. "the policy output is parameterized as a Gaussian distribution:"
  • GVHMR: A method for estimating human 3D pose and shape from video. "using GVHMR \cite{shen2024gvhmr}."
  • Human-Object Interaction (HOI): Tasks and behaviors involving coordinated motion and contact between a human/robot and objects. "introduced Human-Object Interaction (HOI) imitation,"
  • Interaction Termination (IT): An episode termination strategy that prioritizes interaction success by ending runs when key interaction errors grow. "we propose Interaction Termination (IT)."
  • Isaac Gym: A high-performance GPU-based physics simulation platform for training robot policies. "All training and simulation were conducted on the Isaac Gym platform"
  • Inverse Kinematics (IK): A method to compute joint configurations that achieve desired end-effector positions/orientations. "IK-based optimization."
  • MoCap (Motion Capture): A sensing system that tracks object or human motion using cameras and markers. "object observations are provided by a MoCap system."
  • No External Perception (NEP) mode: A deployment setting where the policy operates without external object sensing, relying on proprioception. "a No External Perception (NEP) mode"
  • Object mesh: The polygonal 3D geometry representing an object’s shape used in simulation and pose estimation. "The object mesh and its relative pose to the anchor are estimated"
  • PD controller: A proportional-derivative control law that converts desired actions into joint torques. "via a PD controller."
  • Privileged state observation: An observation that includes extra, non-deployable information (e.g., full object state) for training teacher policies. "the policy receives a privileged state observation"
  • Proprioception: Internal sensing of the robot’s own states like joint positions and velocities. "comprises proprioception $\boldsymbol{o}_{t}$"
  • Proximal Policy Optimization (PPO): A reinforcement learning algorithm that optimizes policies with clipped objective functions for stability. "optimized using PPO \cite{schulman2017proximal}"
  • Relative motion reward: A reward term encouraging correct spatial relationships between the robot and object during interaction. "The relative motion reward $r_{t}^{\text{rel}}$ encourages correct body–object relative spatial relationships,"
  • Retargeting: Mapping human motion data onto a robot’s kinematics while preserving task semantics. "Retargeting human motion to humanoids and applying reinforcement learning for imitation has shown significant promise"
  • SAM-3D: A tool for estimating 3D object meshes and poses from images or video. "using SAM-3D \cite{chen2025sam}."
  • Sim-to-real gap: Differences between simulation and real-world dynamics that hinder direct policy transfer. "the complex sim‑to‑real gap introduced by object dynamics,"
  • SMPL: A parametric human body model used to represent 3D human pose and shape. "SMPL \cite{Loper2023SMPLAS} joints."
  • Teacher-Student paradigm: A training setup where a teacher policy with privileged information is distilled into a deployable student policy. "two-stage teacher-student paradigm"
  • Unitree G1 humanoid: A specific commercially available humanoid robot used for real-world deployment. "Unitree G1 humanoid."
  • Zero-shot: Transferring or deploying learned skills to new settings without additional task-specific training. "transfers them zero‑shot to a physical Unitree G1 humanoid."
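Two of the glossary entries above (Gaussian distribution policy and PD controller) together describe the standard action pipeline in this class of frameworks: the policy outputs a Gaussian over target joint positions, a sample is drawn, and a PD law converts it to joint torques. A minimal sketch, using illustrative gains and a 3-DoF toy dimension rather than the paper's actual controller settings:

```python
import numpy as np

def sample_action(mean, log_std, rng):
    """Sample from a diagonal Gaussian policy (glossary: Gaussian distribution)."""
    std = np.exp(log_std)
    return mean + std * rng.standard_normal(mean.shape)

def pd_torque(q_target, q, qdot, kp, kd):
    """Proportional-derivative control law (glossary: PD controller):
    tau = Kp * (q_target - q) - Kd * qdot."""
    return kp * (q_target - q) - kd * qdot

# Illustrative 3-DoF example; Kp/Kd values are placeholders.
rng = np.random.default_rng(0)
mean = np.zeros(3)                                  # policy mean: target joint positions
action = sample_action(mean, np.full(3, -1.0), rng) # sampled target positions
tau = pd_torque(action, q=np.zeros(3), qdot=np.zeros(3), kp=50.0, kd=2.0)
# tau has shape (3,): one torque per joint
```

At deployment time the Gaussian is typically replaced by its mean (deterministic action), while the same PD conversion runs on the robot at the control frequency.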

Open Problems

We found no open problems mentioned in this paper.
