Human2LocoMan: Cross-Embodiment Loco-Manipulation
- Human2LocoMan is a cross-embodiment framework that transfers human task specifications to robots, enabling integrated locomotion and manipulation.
- The framework employs a modular transformer architecture that aligns multi-modal sensory inputs with coordinated action outputs for complex tasks.
- Teleoperated data collection and behavioral cloning with human-data pretraining yield significant improvements in task success and sample efficiency for long-horizon robot control.
Human2LocoMan is a family of frameworks for transferring human-level task specification, demonstration, or reasoning to embodied robot agents capable of integrated locomotion and manipulation (“loco-manipulation”). The term encompasses imitation-based cross-embodiment learning for quadrupedal robots (Niu et al., 19 Jun 2025), hierarchical demonstration-to-policy pipelines for humanoids (Fu et al., 13 Oct 2025), and foundation-model-based end-to-end language-to-control systems. The frameworks target versatile, long-horizon, and generalizable manipulation beyond fixed-arm setups, leveraging unified data representations, modular learning architectures, and coordinated control.
1. Teleoperation, Data Collection, and Representation Alignment
A core challenge in Human2LocoMan systems is unifying observation and action spaces across human and robot embodiments. In (Niu et al., 19 Jun 2025), teleoperation is achieved via an XR headset (Apple Vision Pro) that streams first-person stereo images and precise SE(3) head and wrist poses. A world-aligned unified frame, attached to each embodiment's main camera, standardizes coordinate systems. Human motions are linearly and rotationally mapped to robot torso and end-effector (EEF) targets (sketched in code after this list):
- Translational mapping: $p^{R}_t = p^{R}_0 + \alpha\,(p^{H}_t - p^{H}_0)$, where $\alpha$ is a workspace scaling factor and all positions are expressed in the unified frame.
- Rotational mapping: $R^{R}_t = R^{R}_0\,(R^{H}_0)^{\top} R^{H}_t$, which transfers the human's rotation relative to the start of teleoperation onto the robot's initial orientation.
- Gripper mapping: human fingertip distance is mapped linearly to the gripper joint angle.
A QP-based whole-body controller enforces kinematic and dynamic safety (via joint limits, manipulability and collision checks), and data are stored as time-aligned observation–action sequences in unified vector formats for both humans and robots.
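A minimal sketch of the pose retargeting above, assuming the standard relative-pose (delta) formulation in the unified frame; the class and parameter names (`scale`, `d_max`, etc.) are illustrative, not taken from the released code.

```python
import numpy as np

class DeltaPoseRetargeter:
    """Maps human wrist motion to robot EEF targets in the shared,
    world-aligned unified frame (relative-pose formulation)."""

    def __init__(self, p_h0, R_h0, p_r0, R_r0, scale=1.0):
        self.p_h0, self.R_h0 = p_h0, R_h0  # human pose at teleop start
        self.p_r0, self.R_r0 = p_r0, R_r0  # robot pose at teleop start
        self.scale = scale                 # workspace scaling factor

    def __call__(self, p_h, R_h):
        # Translational mapping: scaled human displacement since start.
        p_r = self.p_r0 + self.scale * (p_h - self.p_h0)
        # Rotational mapping: apply the human's relative rotation to the
        # robot's starting orientation.
        R_r = self.R_r0 @ self.R_h0.T @ R_h
        return p_r, R_r

def gripper_angle(fingertip_dist, d_max=0.10, a_max=np.deg2rad(60.0)):
    """Linear map from human fingertip distance [m] to gripper angle [rad]."""
    return np.clip(fingertip_dist / d_max, 0.0, 1.0) * a_max
```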
2. Modularized Cross-Embodiment Transformer Policy (MXT)
The MXT architecture (Niu et al., 19 Jun 2025) implements scalable cross-embodiment learning via a modality-aligned modular transformer. Key structural elements are:
- Per-modality tokenizers: Each sensory or proprioceptive observation (main image, wrist image, body/EEF/relative poses, gripper state) is encoded into a fixed token pool by small encoders (e.g., ResNet18 for images, MLPs for states).
- Transformer trunk: A shared encoder–decoder transformer aggregates the modality tokens and outputs latent action representations.
- Per-modality detokenizers: Specialized decoders (via cross-attention heads) generate the action chunks for each modality.
Observational and action modalities are masked or unmasked depending on embodiment (e.g., the human lacks wrist cameras, unimanual mode lacks certain EEFs). This modular alignment allows for pretraining the trunk on large-scale human data and transferring it with minimal robot data through reinitialized tokenizers/detokenizers. Cross-embodiment generalization is enforced by structural alignment, not explicit alignment losses.
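The modular layout can be pictured with the following sketch, assuming PyTorch; single-token-per-modality encoders and the module names are simplifications of the architecture described above, not the released implementation.

```python
import torch
import torch.nn as nn

class MXT(nn.Module):
    """Schematic MXT: per-modality tokenizers, a shared transformer trunk,
    and per-modality detokenizers driven by learned action queries."""

    def __init__(self, obs_dims, act_dims, d_model=128, n_heads=16,
                 n_layers=4, horizon=60):
        super().__init__()
        # One small encoder per observation modality (images would use a
        # ResNet18; linear layers stand in for all modalities here).
        self.tokenizers = nn.ModuleDict({
            m: nn.Linear(dim, d_model) for m, dim in obs_dims.items()})
        self.trunk = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=256, dropout=0.1, batch_first=True)
        # Learned queries decode one action chunk per action modality.
        self.queries = nn.ParameterDict({
            m: nn.Parameter(torch.randn(horizon, d_model)) for m in act_dims})
        self.detokenizers = nn.ModuleDict({
            m: nn.Linear(d_model, dim) for m, dim in act_dims.items()})

    def forward(self, obs):
        # Tokenize only the modalities this embodiment provides; absent
        # modalities (e.g. wrist camera on the human) simply never appear.
        tokens = torch.cat(
            [self.tokenizers[m](x).unsqueeze(1) for m, x in obs.items()], dim=1)
        out = {}
        for m, q in self.queries.items():
            tgt = q.unsqueeze(0).expand(tokens.size(0), -1, -1)
            h = self.trunk(src=tokens, tgt=tgt)
            out[m] = self.detokenizers[m](h)   # (B, horizon, act_dim)
        return out
```

Because tokenizers and detokenizers are keyed by modality, swapping embodiments only reinitializes the affected dictionary entries while the trunk weights carry over.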
3. Learning Objectives and Mathematical Formulation
Behavioral cloning is the principal training method in Human2LocoMan frameworks. The cross-embodiment loss is formulated per embodiment and modality as

$$\mathcal{L}^{e,m} = \frac{1}{B}\sum_{i=1}^{B}\sum_{t=1}^{H} \big\lVert \hat{a}^{\,e,m}_{i,t} - a^{\,e,m}_{i,t} \big\rVert_1$$

for a batch of size $B$, action-chunk horizon $H$, embodiment $e$, and modality $m$, where $\hat{a}^{\,e,m}_{i,t}$ is the predicted and $a^{\,e,m}_{i,t}$ the demonstrated action. The total loss on an embodiment sums over its active modalities $\mathcal{M}_e$:

$$\mathcal{L}^{e} = \sum_{m \in \mathcal{M}_e} \mathcal{L}^{e,m}.$$
Training proceeds in two stages:
- Pretraining on human data: $\theta^{\star} = \arg\min_{\theta} \mathcal{L}^{H}(\theta)$ over the human demonstration dataset.
- Finetuning on robot data (with module reinitialization): the trunk is initialized from $\theta^{\star}$, embodiment-specific tokenizers and detokenizers are reinitialized, and all parameters are optimized to minimize $\mathcal{L}^{R}$.
No auxiliary cross-embodiment or adversarial losses are required; modality structure and shared trunk suffice.
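A minimal sketch of this loss and the two-stage schedule, assuming PyTorch; the L1 chunk loss mirrors the equations above, while the data loaders, learning rates, and the `reinit_modules` helper are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def bc_loss(pred, target):
    """Behavioral-cloning loss: sum of per-modality L1 chunk losses.
    Only modalities active on the current embodiment appear in `target`."""
    return sum(F.l1_loss(pred[m], target[m]) for m in target)

def train_stage(model, loader, lr, weight_decay=1e-2):
    """One training stage (pretraining or finetuning) with AdamW."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr,
                            weight_decay=weight_decay)
    for obs, act in loader:
        opt.zero_grad()
        bc_loss(model(obs), act).backward()
        opt.step()

def human2locoman_training(model, human_loader, robot_loader, reinit_modules):
    # Stage 1: pretrain all modules on human demonstrations.
    train_stage(model, human_loader, lr=1e-4)   # illustrative rate
    # Stage 2: reinitialize embodiment-specific tokenizers/detokenizers,
    # keep the pretrained trunk, and finetune everything on robot data.
    for module in reinit_modules(model):
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()
    train_stage(model, robot_loader, lr=1e-5)   # illustrative rate
```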
4. Datasets and Experimental Protocols
Human2LocoMan (Niu et al., 19 Jun 2025)
- Six real-world long-horizon household tasks: unimanual & bimanual toy collection, shoe-rack organization, unimanual scooping, bimanual pouring.
- Human dataset: 210–340 trajectories/task (20–100 min/operator), main egocentric video, body and wrist/EEF pose, relative pose, grasping state.
- LocoMan robot dataset: 64–150 trajectories/task (7–23 min/task), main stereo + wrist RGB, full proprioceptive state, target 6D poses, gripper state.
- Data are aligned in chunked sequences, enabling direct cross-domain training.
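Concretely, the unified format can be pictured as one per-step record shared by both embodiments; this is a schematic layout, with field names chosen for illustration rather than taken from the released dataset.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UnifiedStep:
    """One time-aligned observation-action step in the unified vector
    format. Modalities an embodiment lacks (e.g. the wrist camera for
    the human) are stored as None."""
    main_image: np.ndarray          # egocentric RGB (stereo on the robot)
    wrist_image: np.ndarray | None  # robot only
    body_pose: np.ndarray           # torso 6D pose in the unified frame
    eef_pose: np.ndarray            # wrist/EEF 6D pose(s)
    rel_pose: np.ndarray            # EEF pose relative to the body
    gripper: np.ndarray             # fingertip distance or gripper angle
    action: np.ndarray              # target 6D poses + gripper command

def chunk(steps: list, horizon: int) -> list:
    """Slice an aligned trajectory into fixed-horizon training chunks."""
    return [steps[i:i + horizon] for i in range(len(steps) - horizon + 1)]
```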
Comparative and Transfer Protocols
- In-distribution (ID) and out-of-distribution (OOD) test splits with novel object configurations.
- Baselines: HIT (Humanoid Imitation Transformer), MXT trained from scratch (robot data only), and MXT with human pretraining.
5. Empirical Results and Ablations
Human2LocoMan delivers substantial performance benefits compared to standard policy learning and direct robot imitation:
| Policy | Success rate (ID) | Success rate (OOD) |
|---|---|---|
| HIT (baseline) | reference | reference |
| MXT, trained from scratch (robot data only) | above HIT | above HIT |
| MXT with human pretraining | +41.9% over HIT | +79.7% over HIT |
| MXT with human pretraining, half robot data (vs. robot-only) | +38.6% | +82.7% |
- Unimanual toy collection: success rate improves from ~70% to 95% (ID) and from ~33% to 92% (OOD).
- Bimanual pouring: success rate improves from ~58% to 91% (ID) and from ~17% to 83% (OOD).
Pretraining accelerates early substep completion and improves long-horizon task robustness, with validation losses indicating reduced overfitting. An ablation that aggregates the per-modality tokenizers/detokenizers into monolithic modules (MXT-Agg) shows markedly weaker transfer, confirming that fine-grained modality decomposition is essential for cross-embodiment transfer.
6. Implementation and Architectural Details
- Transformer trunk: 4 encoder + 4 decoder layers, hidden dim 128, feed-forward dim 256, 16 attention heads, encoder/decoder dropout 0.1 (raised to 0.4–0.5 during human pretraining).
- Tokenizers: ResNet18→16 tokens (main image), ResNet18→8 tokens (wrist), body/EEF/relative-pose MLP→4 tokens, gripper→4 tokens.
- Detokenizers: 6 tokens per action modality; action-chunk horizon of up to 180 steps.
- Optimization: AdamW with distinct learning rates for pretraining and for finetuning/scratch training, plus weight decay; batch size 16–24.
- Training schedule: 60k–100k steps depending on task.
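For reference, the listed hyperparameters can be gathered into a single config; this is a convenience sketch, and fields whose exact values the section does not state (e.g. learning rates) are omitted.

```python
from dataclasses import dataclass

@dataclass
class MXTConfig:
    """MXT hyperparameters as reported in Sec. 6 (learning rates omitted
    where the summary does not state them)."""
    enc_layers: int = 4
    dec_layers: int = 4
    d_model: int = 128
    ff_dim: int = 256
    n_heads: int = 16
    dropout: float = 0.1          # 0.4-0.5 during human pretraining
    main_image_tokens: int = 16   # ResNet18 tokenizer
    wrist_image_tokens: int = 8   # ResNet18 tokenizer
    pose_tokens: int = 4          # body / EEF / relative pose (MLP)
    gripper_tokens: int = 4
    action_tokens: int = 6        # per action modality
    max_chunk_horizon: int = 180
    batch_size: int = 16          # 16-24 depending on task
    train_steps: int = 60_000     # 60k-100k depending on task
```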
7. Significance, Limitations, and Extensions
Human2LocoMan demonstrates that unified data alignment, modular policy architectures, and human demonstration pretraining unlock scalable, efficient learning of versatile loco-manipulation. Its modular approach (MXT) supports rapid co-adaptation between new robot platforms and abundant human demonstration data.
Limitations include the reliance on high-fidelity teleoperation hardware and the requirement for structural modality alignment, which may require reengineering for embodiment-specific sensors or actuation topologies. A plausible future direction is the integration of Robotic Foundation Models or planning-capable LLM-based architectures for open-vocabulary, instruction-driven whole-body control, as seen in recent extensions of the Human2LocoMan paradigm to humanoids (Fu et al., 13 Oct 2025, Hao et al., 13 Apr 2025, Ren et al., 11 Mar 2026).
Code, hardware designs, and data are open-sourced at https://human2bots.github.io (Niu et al., 19 Jun 2025).