HuMI: Humanoid Manipulation Interface

Updated 3 July 2026

HuMI is a hardware–software abstraction that bridges human demonstrations and teleoperation inputs with modular, scalable control of high-DOF humanoid robots.
It employs modular decomposition and advanced retargeting techniques to achieve precise hand–eye coordination, grasp primitives, and locomotion tracking using diverse sensor inputs.
HuMI supports both direct teleoperation with low-latency feedback and robot-free data collection, enabling robust policy learning across complex, unstructured environments.

A Humanoid Manipulation Interface (HuMI) is a hardware–software abstraction enabling efficient, scalable, and robust mapping between human demonstrations or teleoperation inputs and whole-body humanoid robot control. HuMI solutions span a spectrum from modular teleop rigs and exoskeletons to VR+keypoint and MoCap-driven robot-free data-collection rigs, supporting both direct operator control and large-scale offline policy learning for coordinated manipulation, locomotion, and sensory-interactive tasks in unstructured environments (Qi et al., 31 Dec 2025).

1. Core Architectural Principles and Modularization

HuMI architectures emphasize modular decomposition of humanoid control into independent submodules, which typically include (i) hand–eye coordination, (ii) grasp primitives, (iii) arm end-effector tracking, and (iv) locomotion control (Qi et al., 31 Dec 2025, Myers et al., 31 Jul 2025). Operator input streams are mapped to these modules via VR controllers, exoskeletons, wearable IMU suits, or VR+keypoint-based interfaces, with all signals timestamped and synchronized for unified control over 44–55 DOF robot morphologies. Submodules can generally be activated independently, e.g., “trigger to engage arm IK,” “button for power or precision grasp,” or “joystick to toggle walking mode.”
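The independent activation of submodules described above can be sketched as a small dispatcher. This is an illustrative sketch only: the class names, field names, and command strings are hypothetical and do not come from any cited system.

```python
from dataclasses import dataclass


@dataclass
class OperatorInput:
    """One timestamped operator sample (all field names are hypothetical)."""
    timestamp: float
    trigger_pressed: bool = False  # held trigger engages arm IK tracking
    grasp_button: str = ""         # "power", "precision", or "" for no grasp
    walk_toggle: bool = False      # joystick click toggles walking mode


class ModularInterface:
    """Routes each operator input to the submodule it activates."""

    def __init__(self):
        self.walking = False  # locomotion mode persists between samples

    def step(self, inp):
        commands = {}
        if inp.trigger_pressed:
            commands["arm"] = "track_end_effector"
        if inp.grasp_button in ("power", "precision"):
            commands["hand"] = inp.grasp_button + "_grasp"
        if inp.walk_toggle:
            self.walking = not self.walking
        commands["legs"] = "walk" if self.walking else "stand"
        return commands
```

Because each branch is independent, an operator can hold the arm trigger while the legs remain in whatever mode was last toggled, which is the decoupling the modular design aims for.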

Key architectural features include:

Parallel submodules: Decoupling high-DOF control for the head, arms, hands, and legs into parallel, independently triggerable modules.
Intuitive operator mappings: E.g., direct joint-space mapping in exoskeletons (Myers et al., 31 Jul 2025, Ben et al., 18 Feb 2025), hand–eye tracking via simple trigger (Qi et al., 31 Dec 2025), pedal-based lower-body policy control (Ben et al., 18 Feb 2025), velocity/acceleration-based body-lean telelocomotion (Purushottam et al., 2022), and VR/IMU-driven keypoint rigs (Nai et al., 6 Feb 2026, Zhao et al., 17 Jun 2026).
Unified data logging: All (observation, action) streams are logged for scalable learning. Human-side and robot-side inputs (IMUs, cameras, encoders, force-torque sensors) are synchronized with latencies on the order of tens of milliseconds (Heng et al., 12 Mar 2026, Myers et al., 31 Jul 2025).

This modularity reduces operator fatigue, enables targeted focus on specific subskills, and supports efficient, high-quality demonstration collection required for high-performing learning-based whole-body manipulation policies (Qi et al., 31 Dec 2025).
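Unified logging requires aligning streams that arrive at different rates. The sketch below shows one common approach, nearest-timestamp matching with a skew tolerance. It is an assumption about how such alignment could be done, not the method of any cited system; the function names and the 20 ms default tolerance are hypothetical.

```python
from bisect import bisect_left


def nearest_sample(timestamps, values, t, max_skew):
    """Return the value whose timestamp is closest to t, or None if too far."""
    i = bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    if abs(timestamps[best] - t) > max_skew:
        return None
    return values[best]


def align_streams(reference_times, streams, max_skew=0.02):
    """Build one row per reference time; drop rows with a missing stream.

    `streams` maps a name to (sorted_timestamps, values).
    """
    rows = []
    for t in reference_times:
        row = {"t": t}
        for name, (ts, vals) in streams.items():
            sample = nearest_sample(ts, vals, t, max_skew)
            if sample is None:
                break  # one stream has no sample within tolerance
            row[name] = sample
        else:
            rows.append(row)
    return rows
```

Dropping incomplete rows keeps every logged (observation, action) pair consistent, at the cost of discarding samples during sensor dropouts.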

2. Human Demonstration and Teleoperation Methods

HuMI teleoperation and demonstration collection span direct teleoperation (robot in the loop) and robot-free recording. Approaches include:

Direct teleoperation: Exoskeletons and force-feedback leader–follower rigs provide direct, low-latency joint-space or pose control, often with adaptive force feedback that prevents joint overtravel and provides haptic cues (Myers et al., 31 Jul 2025, Ben et al., 18 Feb 2025). Closed-loop latencies below 15 ms and tracking errors under 2° per joint have been reported (Myers et al., 31 Jul 2025). Adaptive virtual springs and torque feedback loops enhance safety and immersion.
VR/IMU/Motion-capture rigs: Portable VR+IMU suits or markerless motion-capture solutions capture full human skeleton kinematics, augmented by instrumented hand controllers or gloves for high-DOF dexterous hand tracking (Heng et al., 12 Mar 2026, Nai et al., 6 Feb 2026, Zhao et al., 17 Jun 2026, Yu et al., 5 May 2026). These may capture 5–7 key SE(3) frames (pelvis, hands, feet, knees), supporting fine-grained motion retargeting.
Hands-free/Body-lean telelocomotion: A human–machine interface (HMI) maps operator body pitch and twist directly to robot base velocity or acceleration, freeing the hands for simultaneous manipulation (Purushottam et al., 2022, Purushottam et al., 2023). This facilitates dynamic mobile manipulation and heavy-load collaboration by mapping the human divergent component of motion (DCM) and center of pressure (CoP) to robot equilibrium and force modes (Purushottam et al., 26 May 2025).
Visual/tactile adaptation: In hand-focused HuMI variants, wearable exoskeletons are visually adapted via segmentation and domain adaptation pipelines to minimize the embodiment gap for vision-guided manipulation (Xu et al., 28 May 2025).
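The virtual-spring force feedback mentioned under direct teleoperation can be illustrated with a one-sided spring that acts only near a joint limit. This is a minimal sketch under assumed parameters: the function name, the margin, and the stiffness value are hypothetical, and the cited systems may shape and adapt the feedback differently.

```python
def limit_feedback_torque(q, q_min, q_max, margin=0.1, stiffness=5.0):
    """Torque (N·m) applied to the leader joint to discourage overtravel.

    Zero inside the safe band [q_min + margin, q_max - margin]; outside it,
    a linear spring pushes the operator back toward the band.
    """
    lower = q_min + margin
    upper = q_max - margin
    if q < lower:
        return stiffness * (lower - q)   # positive torque, away from q_min
    if q > upper:
        return -stiffness * (q - upper)  # negative torque, away from q_max
    return 0.0
```

The operator feels no resistance in the normal working range, and a growing restoring torque as the follower joint approaches its limit.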

3. Motion Retargeting, Inverse Kinematics, and Task-Space Abstractions

Central to HuMI is the problem of retargeting high-dimensional, morphology-mismatched human demonstration to robot execution. Standard solutions include:

Whole-body inverse kinematics (IK): At each demonstration timestep, the optimal joint configuration is solved as:

$q^*(t) = \underset{q \in \mathcal{Q}}{\arg\min} \sum_i \|f_i(q) - p_i^\mathrm{tracker}(t)\|^2 + \lambda \|q - q_\mathrm{nom}\|^2, \quad \text{s.t.}\; q_{\min} \le q \le q_{\max},\;\text{collision avoidance}$

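Here, $f_i(q)$ is the forward kinematics of keypoint $i$, $p_i^\mathrm{tracker}(t)$ is the tracked human keypoint target at time $t$, $q_\mathrm{nom}$ is a nominal posture, and $\lambda$ weights the posture regularizer. Regularization and Jacobian-based warm starts are frequently used for rapid convergence (Nai et al., 6 Feb 2026, Wang et al., 25 Jun 2026, Qi et al., 31 Dec 2025).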
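As a toy instance of the objective above, the sketch below solves the regularized problem for a planar two-link arm with a single tracked keypoint, using Gauss-Newton steps and clamping to the joint limits. Everything here is an assumption for illustration: the link lengths, function names, and solver choice are hypothetical, and collision avoidance is omitted entirely.

```python
import math

# Hypothetical link lengths (metres) for a planar two-link arm.
LINK_1, LINK_2 = 0.30, 0.25


def forward_kinematics(q):
    """Position f(q) of the single tracked keypoint (the arm tip)."""
    q1, q2 = q
    x = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    y = LINK_1 * math.sin(q1) + LINK_2 * math.sin(q1 + q2)
    return x, y


def jacobian(q):
    """2x2 Jacobian of forward_kinematics with respect to q."""
    q1, q2 = q
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return ((-LINK_1 * s1 - LINK_2 * s12, -LINK_2 * s12),
            (LINK_1 * c1 + LINK_2 * c12, LINK_2 * c12))


def solve_ik(target, q_init, q_nom, q_min, q_max, lam=1e-4, iters=50):
    """Minimize ||f(q) - target||^2 + lam * ||q - q_nom||^2 within joint limits."""
    q = list(q_init)  # warm start, e.g. the previous timestep's solution
    for _ in range(iters):
        x, y = forward_kinematics(q)
        ex, ey = target[0] - x, target[1] - y
        (j00, j01), (j10, j11) = jacobian(q)
        # Right-hand side: J^T e - lam * (q - q_nom)
        g0 = j00 * ex + j10 * ey - lam * (q[0] - q_nom[0])
        g1 = j01 * ex + j11 * ey - lam * (q[1] - q_nom[1])
        # Gauss-Newton matrix: J^T J + lam * I (positive definite for lam > 0)
        h00 = j00 * j00 + j10 * j10 + lam
        h01 = j00 * j01 + j10 * j11
        h11 = j01 * j01 + j11 * j11 + lam
        det = h00 * h11 - h01 * h01
        dq = ((h11 * g0 - h01 * g1) / det, (h00 * g1 - h01 * g0) / det)
        for k in range(2):
            # Project each joint back into [q_min, q_max].
            q[k] = min(max(q[k] + dq[k], q_min[k]), q_max[k])
    return q
```

In a streaming setting, passing the previous solution as `q_init` plays the role of the warm start, and a larger `lam` trades tracking accuracy for staying closer to the nominal posture.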

Task-space interfaces: Some HuMI abstractions expose structured, low-dimensional command spaces to mid- and high-level planners. In “CEER” (Luo et al., 19 May 2026), a 16D task space comprising root pose and two 6-DOF end-effector targets is tracked by impedance-compliant whole-body RL controllers, so that compliance is specified in task space and new skills can be integrated in a plug-and-play manner.
Contact-flow representations: The “OmniContact” framework (Yu et al., 24 Jun 2026) exploits a hybrid state of body trajectory anchors and binary contact indicators, enabling both robust RL skill libraries and symbolic planning through phase templates and heuristic re-anchoring for closed-loop recovery.
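To make the low-dimensional task-space abstraction concrete, the sketch below packs a root pose and two 6-DOF end-effector targets into a single 16-element command vector. The source does not say how the root pose is parameterized; the 4-element layout (x, y, height, yaw) used here is purely an assumption chosen so the dimensions sum to 16, and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskSpaceCommand:
    """Hypothetical 16D command: 4D root + two 6-DOF end-effector targets."""
    root: Tuple[float, ...]      # ASSUMED layout: (x, y, height, yaw)
    left_ee: Tuple[float, ...]   # (x, y, z, roll, pitch, yaw)
    right_ee: Tuple[float, ...]  # (x, y, z, roll, pitch, yaw)

    def to_vector(self) -> List[float]:
        """Flatten to the 16-element vector a low-level controller would track."""
        if len(self.root) != 4 or len(self.left_ee) != 6 or len(self.right_ee) != 6:
            raise ValueError("expected 4 root and 6 + 6 end-effector values")
        return [*self.root, *self.left_ee, *self.right_ee]
```

A planner that emits only this vector does not need to know the robot's joint layout, which is what makes such an interface reusable across skills.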

Kinematic/dynamic mapping accuracy is consistently high, with end-effector RMSE as low as 3.3 cm and joint tracking error under 2° in leading platforms (Myers et al., 31 Jul 2025, Luo et al., 19 May 2026), supporting dynamic whole-body tasks and generalization to diverse scenarios.
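For reference, an end-effector RMSE of the kind quoted above can be computed from paired target and achieved positions as follows. This is a generic sketch; the function name is hypothetical, and the cited papers may aggregate errors differently (for example per axis or per episode).

```python
import math


def ee_rmse(targets, achieved):
    """Root-mean-square Euclidean error between paired 3D positions (metres)."""
    if len(targets) != len(achieved) or not targets:
        raise ValueError("need equal-length, non-empty position lists")
    squared = [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(targets, achieved)
    ]
    return math.sqrt(sum(squared) / len(squared))
```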

4. High-Level Policy Learning and Multimodal Imitation

HuMI solutions are tightly integrated with scalable policy learning pipelines, moving beyond conventional behavior cloning:

Choice-Policy learning: Generates multiple candidate action sequences in parallel and scores each with a regression head trained to predict the negative mean squared error (MSE) of that candidate against the demonstrated actions. At inference, the candidate with the highest predicted score is executed, allowing representation of multimodal, phase-specialized subskills with sub-10 ms latency (Qi et al., 31 Dec 2025).
Diffusion-based and chunked-policy methods: Flow-matching or chunked policies (e.g., diffusion U-Nets, transformers) predict short-horizon keypoint/action sequences conditioned on visual and proprioceptive observations (Nai et al., 6 Feb 2026, Yu et al., 5 May 2026, Heng et al., 12 Mar 2026). These are retargeted and tracked by RL motion controllers in closed-loop.
Two-stage or co-training frameworks: Initial training on large-scale human motion data is followed by fine-tuning on robot observations, effectively bridging the human–robot embodiment gap and enabling generalization to novel objects, backgrounds, and task configurations (Heng et al., 12 Mar 2026, Hu et al., 20 Jun 2026).
Plug-and-play semantics: Hierarchical architectures (e.g., CEER, OmniContact) allow mid-level planners, LLM-based semantic decompositions, or symbolic programs to emit standardized command sequences over HuMI abstractions, supporting diverse and hierarchical skill chaining, re-planning, and autonomous recovery (Luo et al., 19 May 2026, Yu et al., 24 Jun 2026).

Empirical results show that these learning frameworks, trained on HuMI-collected data, outperform diffusion policies or naive behavior cloning in consistency, subphase specialization, and success under out-of-distribution (OOD) task perturbations (Qi et al., 31 Dec 2025, Hu et al., 20 Jun 2026).
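The Choice-Policy selection rule described above can be sketched in a few lines. The sketch covers only the scoring target and the argmax selection; the networks that propose candidates and predict scores are not shown, and the function names are hypothetical rather than taken from the cited work.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat action sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def score_targets(candidates, expert_action):
    """Training targets for the score head: negative MSE of each candidate."""
    return [-mse(c, expert_action) for c in candidates]


def select_action(candidates, predicted_scores):
    """At inference, execute the candidate with the highest predicted score."""
    best = max(range(len(candidates)), key=lambda k: predicted_scores[k])
    return candidates[best]
```

Because the score is a negative error, the best candidate has the score closest to zero, and selection is a single argmax over a small number of candidates, which is consistent with low inference latency.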

5. Task Domains, Performance, and Empirical Evaluations

HuMI interfaces have been validated on a spectrum of complex tasks requiring whole-body coordination, dynamic stabilization, contact-rich interaction, and long-horizon skill chaining. Representative benchmarks:

System	Task Example	Success Rate	Tracking Error	Data Efficiency / Latency
HuMI+Choice	Dishwasher loading	up to 10/10	—	—
	Whiteboard wiping (whole-body)	2× BC baseline	—	—
CHILD	Box pick-and-place + walk	100% (5/5)	<2°/joint	Latency <15 ms
HumDex	5-way loco-manipulation	91.7% demo	—	44 min/60 demos
BifrostUMI	Pick-and-place (robot-free)	100%	1.8±0.5 cm	20 min/500 demos
HumanoidUMI	Bimanual, dynamic, walking tasks	up to 95%	—	60+ demos/10 min
CEER	Room-scale loco-manipulation	~70%	3.3 cm (EE)	—
OmniContact	Chained box manipulation	98.7%	0.07 m (object)	22 h MoCap/Megaframes
HALOMI	Navigation, bimanual, dynamic	85–90%	~4–12 cm (EE)	300 demos / 3 tasks

Across platforms, HuMI methodology is reported to increase both data-collection throughput and policy generalization. Robot-free rigs collect demonstrations at 2–3× the rate of teleoperation, and task success rates of 70–100% are reported in both in-distribution and many OOD settings (Nai et al., 6 Feb 2026, Wang et al., 25 Jun 2026, Qi et al., 31 Dec 2025). Empirically, modular design and precise retargeting are central to success in tasks involving severe occlusion, contacts, and shifting workspace geometry.

6. Limitations, Open Challenges, and Future Directions

Persisting limitations include:

Embodiment gap: Even with view alignment and retargeting, systematic errors remain for large geometric mismatches and novel world layouts (Zhao et al., 17 Jun 2026). Proposed directions include explicit lower-body and scene-geometry priors and extended keypoint sets.
Sensing infrastructure reliance: Most current systems depend on instrumented VR trackers, IMUs, or motion-capture; markerless and vision-only pipelines are under development (Nai et al., 6 Feb 2026).
Control unification: Many systems require per-task controller retraining. Universal, morphology-agnostic whole-body controllers are an ongoing frontier (Heng et al., 12 Mar 2026, Nai et al., 6 Feb 2026, Wang et al., 25 Jun 2026).
Latency, dynamics, and generalization: Extremely fast or highly dynamic behaviors (e.g., throwing) are limited by closed-loop latency and the representational power of learned controllers (Wang et al., 25 Jun 2026, Zhao et al., 17 Jun 2026).
Haptic and tactile feedback: Only a subset of interfaces deliver bilateral force feedback or use touch sensing to close the action–perception loop. Integrating tactile sensing and haptics remains vital for scaling to contact-rich and delicate manipulation (Xu et al., 28 May 2025, Myers et al., 31 Jul 2025).

Anticipated advances entail integration of multi-modal sensing (vision, touch, force), fine-grained egocentric data, universal action spaces (e.g., contact-flow, EE-root, morphology-invariant keypoints), and scalable training leveraging both robot-free and on-robot demonstration pools. The open-source release of HuMI hardware designs and learning pipelines is catalyzing broad uptake and new task domains (Heng et al., 12 Mar 2026, Myers et al., 31 Jul 2025, Ben et al., 18 Feb 2025).

Key references: (Qi et al., 31 Dec 2025, Heng et al., 12 Mar 2026, Luo et al., 19 May 2026, Myers et al., 31 Jul 2025, Nai et al., 6 Feb 2026, Zhao et al., 17 Jun 2026, Yu et al., 24 Jun 2026, Wang et al., 25 Jun 2026, Xu et al., 28 May 2025, Ben et al., 18 Feb 2025)