Whole-Body Mobile Manipulation Interface
- HoMMI is a framework unifying perception, decision, and control for coordinated, whole-body mobile manipulation.
- It leverages multiple approaches including learning-from-demonstration, teleoperation, and policy-driven methods to address embodiment gaps and ensure scalability.
- The system employs multi-modal sensing and constraint-aware control to achieve high success rates in dynamic, real-world manipulation tasks while reducing operator workload.
A Whole-Body Mobile Manipulation Interface (HoMMI) defines both the hardware and algorithmic architectures enabling robots to achieve highly coordinated, bimanual, and mobile manipulation through whole-body synergy, typically by transferring or interpreting the demonstrated or intended motions of a human or operator. Systems under this umbrella address the embodiment gap, facilitate end-to-end demonstration data collection, and operationalize complex, cross-modal policies for deployment on high-DOF mobile platforms. Multiple research threads have converged on this paradigm, introducing pipeline innovations in human demonstration capture, cross-embodiment policy learning, multi-modal perception, constraint-aware control, and teleoperation interfaces for dynamic mobile manipulation.
1. Conceptual Foundations and Taxonomy
HoMMI architectures unify perception, decision, and control to enable mobile manipulators or humanoids to operate with agility in tasks that require simultaneous navigation and manipulation—often across nonholonomic bases, multiple arms, and sensorimotor feedback channels. The field encompasses diverse control strategies:
- Learning-from-Demonstration (LfD)–driven HoMMIs: These prioritize scalable data collection using pose and visual sensing to transfer human motion trajectories and gaze to robot executions, directly addressing embodiment gaps and appearance mismatches (Xu et al., 3 Mar 2026).
- Teleoperation-centric HoMMIs: Systems built around operator interfaces (exoskeletons, VR, IMUs, haptics, leader-follower arms, foot pedals) for direct whole-body or decoupled control, with emphasis on immersion, safety, and task efficiency (Belov et al., 8 Mar 2026, Purushottam et al., 2022, Purushottam et al., 26 May 2025, Purushottam et al., 2023).
- Policy-driven HoMMIs: Systems leveraging RL agents or diffusion-based models for joint or decoupled control of base and arms, often integrating intent-driven perception or modular action decoding (Honerkamp et al., 2024, Liu et al., 26 Feb 2026, Moyen et al., 3 Sep 2025).
Key challenges include managing the human–robot embodiment gap (kinematic, visual, and proprioceptive), scalable demonstration collection, whole-body redundancy, perceptual aliasing due to viewpoint shifts, and coordinated action decoding.
2. Data Acquisition and Observation Modalities
HoMMIs require high-fidelity data capturing diverse human motor skills for robust policy transfer and teleoperation. Demonstration-driven frameworks can collect this data without robot-in-the-loop teleoperation, while teleoperation-centric systems rely on operator devices. The main modalities are:
- Egocentric and Wrist-mounted Visual Sensing: HoMMI uses three synchronized iPhones, two as wrist cameras and one as a head camera, each recording RGB, depth/pointmaps, and 6-DoF poses at 60 Hz. This yields globally aligned, multi-view trajectories captured without a robot present, which avoids latent robot-specific biases and trajectory limitations in the data (Xu et al., 3 Mar 2026).
- Proprioception Streams: Recording current gripper poses and widths, essential for accurately reconstructing end-effector trajectories and mirroring grasp actions.
- Operator-centered Teleop Devices: Commodity IMUs, VR controllers, foot pedals, force-feedback exoskeletons, and leader arms capture coarse-to-fine intent and pose information (Belov et al., 8 Mar 2026, Moyen et al., 3 Sep 2025).
Observation preprocessing emphasizes geometry-aware encoding, positional embedding in 3D, and removal of operator-specific artifacts (e.g., masking human arms from the visual frame). These steps produce a structured, embodiment-invariant feature embedding for downstream policy learning.
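The preprocessing steps above (masking, lifting pixels into 3D, and re-expressing them in an embodiment-independent frame) can be illustrated with a minimal sketch. It assumes a pinhole camera model and rigid transforms given as `(R, t)` pairs; the function names, the `masked` flag, and the intrinsics tuple are hypothetical and not taken from the cited systems.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to a camera-frame point."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def apply(R, t, p):
    """Return R @ p + t for a 3x3 row-major rotation R and a translation t."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def invert(R, t):
    """Invert a rigid transform: (R, t)^-1 = (R^T, -R^T t)."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t_inv = tuple(-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3))
    return Rt, t_inv


def head_pixel_in_gripper_frame(pixel, depth, intrinsics,
                                world_T_head, world_T_gripper, masked=False):
    """Lift one head-camera pixel to 3D and express it in the gripper frame.

    Returns None for pixels that are masked (e.g. covering the operator's arm)
    or that have no valid depth. `intrinsics` is assumed to be (fx, fy, cx, cy).
    """
    if masked or depth <= 0.0:
        return None
    p_head = backproject(pixel[0], pixel[1], depth, *intrinsics)
    p_world = apply(world_T_head[0], world_T_head[1], p_head)
    R_inv, t_inv = invert(*world_T_gripper)
    return apply(R_inv, t_inv, p_world)
```

The resulting gripper-frame coordinates could then feed a 3D positional embedding; how that embedding is computed is left open here.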
3. Cross-Embodiment Policy Learning and Perception
To robustly transfer skills from human demonstration to heteromorphic robots, HoMMI pipelines deploy policy architectures leveraging:
- Embodiment-Agnostic Visual Encoding: Head imagery is lifted into geometry-aware tokens, masked for operator body occlusion, and concatenated with spatial embeddings. Head tokens are expressed in gripper-centric coordinates to minimize viewpoint and appearance mismatches (Xu et al., 3 Mar 2026).
- Relaxed Gaze/Head Action Representation: Instead of regressing raw 6-DoF head poses (which are ill-suited to robots with limited-DoF necks), HoMMI predicts a 3D look-at point, resolves it to a target viewing direction, and, if a full orientation is needed, constructs a rotation in SO(3) via projection and cross products. This enables active perception without demanding infeasible head motions.
- Short-horizon Conditional Prediction with Diffusion Models: The core policy maps short windows of observations to multi-step hand, gaze, and gripper actions, learning a distribution over action sequences with a diffusion-based objective and using DDIM-based sampling at inference (Xu et al., 3 Mar 2026). The final action vector is 23-dimensional: 2 hands × 9 DOF (18), one 3D gaze look-at point (3), and 2 gripper widths (2).
- Intent-Driven, Multi-Scale Feature Aggregation: Recent generative frameworks (e.g., InCoM) introduce latent intent modeling to dynamically allocate perceptual attention across multi-scale visual backbones, enabling robust perception as tasks switch between navigation-dominant and manipulation-dominant subtasks (Liu et al., 26 Feb 2026).
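The look-at representation described in the list above can be made concrete with a small sketch. It shows one common way to turn a 3D look-at point into a full rotation (project a reference "up" vector onto the plane orthogonal to the viewing direction, then complete the frame with a cross product) or into pan/tilt angles for a two-joint neck. The axis convention (x forward, y left, z up), the world-up default, and the function names are assumptions for illustration, not the cited implementation.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-9:
        raise ValueError("cannot normalize a near-zero vector")
    return tuple(c / n for c in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def look_at_rotation(head_pos, look_at, up=(0.0, 0.0, 1.0)):
    """Rotation matrix whose columns are the head axes (forward, left, up) in the world frame.

    Raises ValueError if the viewing direction is parallel to `up`.
    """
    fwd = normalize(tuple(l - h for l, h in zip(look_at, head_pos)))
    # Project `up` onto the plane orthogonal to the viewing direction.
    d = sum(u * f for u, f in zip(up, fwd))
    up_proj = normalize(tuple(u - d * f for u, f in zip(up, fwd)))
    # left = up x forward keeps the frame right-handed (forward x left = up).
    left = cross(up_proj, fwd)
    return [[fwd[i], left[i], up_proj[i]] for i in range(3)]


def pan_tilt(head_pos, look_at):
    """Pan (about z) and tilt (positive looking up) for a hypothetical 2-DoF neck."""
    dx, dy, dz = (l - h for l, h in zip(look_at, head_pos))
    return math.atan2(dy, dx), math.atan2(dz, math.hypot(dx, dy))
```

For a neck with fewer than three rotational joints, only the direction matters, so `pan_tilt` alone may suffice; the full rotation is needed only when the head's roll must also be specified.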
4. Constraint-aware Whole-Body Control
Robust deployment of whole-body policies requires certified control for high-DOF redundancy, physical limits, and task coupling. Modern HoMMI systems deploy:
- Unified Quadratic Program (QP): Running at up to 100 Hz, the QP minimizes a compound cost over joint-velocity increments, subject to equality and inequality constraints covering joint limits, base velocities, self-collision, CoM support, and an upright torso (Xu et al., 3 Mar 2026). Key cost terms include:
  - Bimanual end-effector SE(3) tracking.
  - Nominal posture, smoothing, and CoM regulation.
- Redundancy and Nullspace Resolution: Stack-of-tasks (SoT) hierarchical WBC solvers layer high-priority constraints (e.g., safety) above manipulation or navigation objectives, enforcing operational-space motion and safe postures on kinematically redundant systems (Arduengo et al., 2019, Moyen et al., 3 Sep 2025).
- Differential-admittance and Haptic Feedback: Variable-admittance controllers modulate compliance in the end-effector based on force cues or operator–robot interaction, ensuring safe transitions between stiff and compliant behavior and enabling haptic feedback loops (Arduengo et al., 2019, Purushottam et al., 26 May 2025, Purushottam et al., 2023).
- Dynamic Locomotion and Lean Compensation: For wheeled humanoids, reduced-order (inverted-pendulum/DCM) templates enable mapping of human body lean or pitch to robot CoM and wheel torques, with equilibrium reestimation during payload changes (Purushottam et al., 26 May 2025, Purushottam et al., 2023).
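To illustrate the kind of constrained QP described in the list above, here is a deliberately small sketch. It assumes a single tracking task with Jacobian `J` and task-space error `e`, a smoothing term that pulls the solution toward the previous increment `dq_prev`, and box limits on each joint-velocity increment as the only constraints. Projected gradient descent is used as a simple stand-in for a real QP solver; all names and the weight `lam` are hypothetical, and the cited systems include further constraints (self-collision, CoM support, torso orientation) that are omitted here.

```python
def solve_box_qp(J, e, dq_prev, lam, lo, hi, iters=500):
    """Projected gradient descent for

        min_dq  0.5 * ||J dq - e||^2 + 0.5 * lam * ||dq - dq_prev||^2
        s.t.    lo[j] <= dq[j] <= hi[j]

    J is a list of rows (task dims x joints); e is the task-space error.
    """
    m, n = len(J), len(J[0])
    # ||J||_F^2 + lam upper-bounds the largest eigenvalue of J^T J + lam * I,
    # so 1 / L is a safe step size for gradient descent.
    L = sum(v * v for row in J for v in row) + lam
    step = 1.0 / L
    # Start from the previous increment, clipped into the feasible box.
    dq = [min(max(dq_prev[j], lo[j]), hi[j]) for j in range(n)]
    for _ in range(iters):
        # Task residual r = J dq - e.
        r = [sum(J[i][j] * dq[j] for j in range(n)) - e[i] for i in range(m)]
        # Gradient g = J^T r + lam * (dq - dq_prev).
        g = [sum(J[i][j] * r[i] for i in range(m)) + lam * (dq[j] - dq_prev[j])
             for j in range(n)]
        # Gradient step followed by projection onto the box limits.
        dq = [min(max(dq[j] - step * g[j], lo[j]), hi[j]) for j in range(n)]
    return dq


if __name__ == "__main__":
    # Toy system: 2 task dimensions, 3 joints (one joint affects both dimensions).
    J = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    e = [0.3, 0.1]
    print(solve_box_qp(J, e, [0.0] * 3, 0.01, [-0.05] * 3, [0.05] * 3))
```

With tight limits the solver saturates the joints that reduce the larger error most and leaves a residual, which is the expected behavior when constraints take precedence over tracking. A production controller would typically use a dedicated QP solver and general linear constraints rather than this box-only projection.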
5. Teleoperation Interfaces and Human-in-the-Loop Control
Operator interfaces for HoMMI range from low-cost consumer-grade devices to fully instrumented exoskeletons:
| Interface Modality | Principal Function | Reference |
|---|---|---|
| Head-mounted IMU | Viewpoint/gaze pose for camera or "look-at" control | (Belov et al., 8 Mar 2026) |
| Leader arms/exoskeletons | Bimanual manipulator control via admittance/IK | (Belov et al., 8 Mar 2026, Purushottam et al., 2023) |
| Foot pedals/rudders | Mobile base velocity/strafe control | (Belov et al., 8 Mar 2026, Moyen et al., 3 Sep 2025) |
| Haptic force feedback | End-effector/state/force cues to the operator | (Purushottam et al., 26 May 2025, Purushottam et al., 2023) |
| VR/screen-based feedback | Immersive or multi-view observation to reduce disorientation | (Moyen et al., 3 Sep 2025) |
Architectures fuse these input streams (leader arms, IMU, pedals) at the teleop manager, mapping to robot actuation using operational-space control, QP solvers, or impedance-based tracking. Interface trade-offs include cost/latency (commodity vs. specialized), cognitive/physical workload, and data transfer efficiency.
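The fusion step can be sketched as a single function that merges the operator streams into one whole-body command. This is an illustrative assumption about how a teleop manager might be organized: the data classes, field names, deadband width, and velocity and angle limits below are all hypothetical, and the downstream mapping to actuation (operational-space control, QP, or impedance tracking) is not shown.

```python
import math
from dataclasses import dataclass


@dataclass
class TeleopInputs:
    left_pose: tuple        # leader-arm target for the left end-effector (x, y, z)
    right_pose: tuple       # leader-arm target for the right end-effector (x, y, z)
    head_yaw_pitch: tuple   # head-mounted IMU orientation in radians
    pedal_forward: float    # normalized pedal deflection in [-1, 1]
    pedal_strafe: float
    pedal_turn: float


@dataclass
class WholeBodyCommand:
    left_target: tuple
    right_target: tuple
    gaze_yaw_pitch: tuple
    base_twist: tuple       # (vx, vy, wz)


def deadband(x, width):
    """Zero small deflections and rescale the remainder back to [-1, 1]."""
    if abs(x) <= width:
        return 0.0
    return math.copysign((abs(x) - width) / (1.0 - width), x)


def clamp(x, lo, hi):
    return min(max(x, lo), hi)


def fuse(inputs, max_lin=0.5, max_ang=1.0, dead=0.1, yaw_limit=1.5, pitch_limit=0.6):
    """Merge leader-arm, IMU, and pedal streams into one whole-body command."""
    yaw, pitch = inputs.head_yaw_pitch
    return WholeBodyCommand(
        left_target=inputs.left_pose,
        right_target=inputs.right_pose,
        # Limit gaze commands to the assumed neck range.
        gaze_yaw_pitch=(clamp(yaw, -yaw_limit, yaw_limit),
                        clamp(pitch, -pitch_limit, pitch_limit)),
        # Pedals map to a planar base twist after deadband filtering.
        base_twist=(deadband(inputs.pedal_forward, dead) * max_lin,
                    deadband(inputs.pedal_strafe, dead) * max_lin,
                    deadband(inputs.pedal_turn, dead) * max_ang),
    )
```

Keeping the fusion stateless makes it easy to log the fused command alongside the raw streams, which is useful if the same sessions are later reused as demonstration data.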
6. Performance Evaluation and Empirical Insights
HoMMI frameworks have been validated on a wide range of manipulation and navigation tasks, with performance metrics including task success rate, completion time, robustness, workload (NASA-TLX), and imitation learning transferability:
- HoMMI Real-world Tasks: On laundry, delivery, and tablescape tasks, success rates reached 90%, 85%, and 80%, respectively, outperforming wrist-only (0–15%), RGB-only (0–45%), and head-only (0–5%) baselines. Disabling active neck motion reduced performance by 15–25%, underscoring the importance of gaze-driven active perception (Xu et al., 3 Mar 2026).
- Teleoperation Efficiency: User studies with commodity-grade HoMMI show 30–40% reduction in operator task time and 30% lower cognitive workload compared to keyboard or touchscreen control (Belov et al., 8 Mar 2026).
- Imitation Learning Generalization: Data collected with coupled whole-body teleoperation leads to markedly higher imitation-learning policy success (80% vs. 0%) on tasks requiring fine arm–base coordination (Moyen et al., 3 Sep 2025), and policies trained from as few as five whole-body MoMa-Teleop demonstrations transfer zero-shot to new obstacles, while whole-body GMM baselines fail entirely (Honerkamp et al., 2024).
- Balance and Dynamic Payload Handling: For dynamic mobile manipulation (heavy lifts, pushing 105% mass boxes), advanced HoMMIs achieve DCM tracking errors below 0.05 rad with haptic assist, and all pilots preferred automatic lean compensation (Purushottam et al., 26 May 2025, Purushottam et al., 2023).
7. Limitations, Open Problems, and Future Directions
Current HoMMI systems face several open challenges:
- Observation Horizon and Memory: Policies built on short observation stacks can struggle with occlusion and lack explicit memory or belief-state reasoning (Xu et al., 3 Mar 2026).
- Embodiment Gap Residuals: Differences in camera placements, hand kinematics, or finger compliance can cause transfer drop; design co-adaptation of data-collection instruments and robot effectors is suggested as a remedy.
- Perception-Action Scalability: Extending intent-driven perceptual modulation (as in InCoM) to longer time horizons, language-specified tasks, or sim2real regimes is an active research direction (Liu et al., 26 Feb 2026).
- Teleop Comfort and Fatigue: VR feedback increases combined cognitive and physical workload by ~20% and can slow operators by ~30%; ergonomics (RULA) should be actively monitored, and sessions should be designed to limit fatigue risk (Moyen et al., 3 Sep 2025).
- Haptic Feedback Coverage: While haptic interfaces greatly aid task performance and immersion, affordable and robust generalization to multi-DOF force/torque cues remains an engineering challenge.
Despite these challenges, the Whole-Body Mobile Manipulation Interface paradigm forms the backbone for scalable, robust, and safe deployment of mobile manipulators in unstructured human environments, and provides data and policies for advancing general-purpose robotic manipulation (Xu et al., 3 Mar 2026, Belov et al., 8 Mar 2026, Honerkamp et al., 2024, Purushottam et al., 26 May 2025, Purushottam et al., 2023, Liu et al., 26 Feb 2026, Moyen et al., 3 Sep 2025, Arduengo et al., 2019).