Ego Humanoid Manipulation Benchmark

Updated 2 July 2026

Ego Humanoid Manipulation Benchmark is a unified framework that evaluates vision–language–action policies on humanoid robots using egocentric human demonstrations and systematic alignment techniques.
It employs a detailed data-collection protocol and modular task decomposition for loco-manipulation tasks like pillow placement, trash disposal, toy transfer, and cart stowing.
Empirical results show improved generalization and transfer, with co-training yielding gains of up to 64 percentage points over robot-only baselines.

The Ego Humanoid Manipulation Benchmark defines a rigorous framework for evaluating vision-language-action policies on humanoid robots by leveraging egocentric human demonstrations, real robot data, and systematic cross-embodiment alignment techniques. Developed to address the data scarcity and embodiment gap endemic in whole-body loco-manipulation, it provides a unified suite of in-the-wild tasks, precisely parameterized evaluation metrics, scalable data-collection infrastructure, and reproducible alignment and learning protocols. The benchmark has demonstrated substantial improvements in generalization and manipulation-locomotion integration relative to robot-only or human-only data regimes, and serves as a canonical setting for scaling experiments, ablation studies, and transfer learning in humanoid robot learning research (Shi et al., 10 Feb 2026).

1. Benchmark Definition, Scope, and Task Taxonomy

The Ego Humanoid Manipulation Benchmark encompasses four real-world loco-manipulation tasks that require coupled whole-body locomotion and manipulation:

Pillow Placement: Walk while carrying a pillow, squat at a bed, and place the pillow on a deformable surface.
Trash Disposal: Walk to a covered bin while holding trash, and insert it horizontally into the opening.
Toy Transfer: Approach a toy, grasp with both hands, walk to a distant table, and place it down.
Cart Stowing: Push a cart, grasp a toy from a shelf, deposit it into the cart, and push the cart away.

Each task is hierarchically decomposed into a sequence of locomotion and manipulation stages $\{S_1,\ldots,S_K\}$. Stages are individually parameterized by success thresholds on end-effector position/orientation, pelvis pose, or object placement error. The overall trial score is the average of per-stage binary success indicators: $\text{Score} = \frac{1}{K} \sum_{k=1}^K s_k,\quad s_k\in\{0,1\}$. This modular decomposition enables granular analysis of subtask transfer and failure cases. The benchmark explicitly measures both stage-wise (locomotion vs. manipulation) and aggregate success rates (Shi et al., 10 Feb 2026).
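The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's evaluation code: the function names, the error values, and the thresholds are all hypothetical, and the sketch assumes a stage passes when its measured error does not exceed its threshold.

```python
from typing import Sequence


def stage_succeeded(error: float, threshold: float) -> bool:
    """Assumed rule: a stage passes when its error is within the threshold."""
    return error <= threshold


def trial_score(stage_success: Sequence[bool]) -> float:
    """Score = (1/K) * sum of binary stage indicators s_k over K stages."""
    if not stage_success:
        raise ValueError("a trial needs at least one stage")
    return sum(1 for s in stage_success if s) / len(stage_success)


# Hypothetical four-stage trial (illustrative numbers only).
errors = [0.02, 0.10, 0.01, 0.30]
thresholds = [0.05, 0.05, 0.05, 0.50]
stages = [stage_succeeded(e, t) for e, t in zip(errors, thresholds)]
score = trial_score(stages)  # three of four stages pass
```

Keeping the per-stage indicators alongside the averaged score is what makes the stage-wise locomotion and manipulation breakdowns possible.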

2. Dataset Construction and Collection Protocols

A portable egocentric hardware rig has been devised, integrating a PICO VR headset, five full-body trackers (100 Hz), Dex3 hand-joint capture (26 keypoints/hand), and a head-mounted ZED X Mini RGB camera (960×540 @ 20 Hz). The same system is employed for both robot teleoperation (via hand controllers for discrete locomotion and wrist-pose IK) and human demonstration (no robot present, full body and egocentric RGB only).

Human demonstration: In-the-wild indoor/outdoor scenes (homes, stores, parks) spanning diverse lighting, clutter, object types, and layouts. Approximately 1,200 episodes (~13 hours) were collected (~300 per task, ~40 s each). Task prompts are administered in natural language and demonstrators are instructed to maintain hand visibility and stable posture.
Robot teleoperation: Laboratory-controlled variants of the same tasks, teleoperated at ~60 s per episode, yielding 100 episodes per task (400 total).

This infrastructure enables scalable and diverse data capture under ecologically valid and transfer-amenable conditions (Shi et al., 10 Feb 2026).
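As a quick consistency check on the figures above, the stated episode counts and approximate durations can be converted to hours. The human total reproduces the reported ~13 hours; the robot total is only an estimate derived here from the stated ~60 s per episode, not a number reported in the source.

```python
def total_hours(episodes: int, seconds_per_episode: float) -> float:
    """Convert an episode count and a mean episode length into hours."""
    return episodes * seconds_per_episode / 3600.0


NUM_TASKS = 4

# Human demonstrations: ~300 episodes per task at ~40 s each.
human_hours = total_hours(NUM_TASKS * 300, 40)

# Robot teleoperation: 100 episodes per task at ~60 s each (derived estimate).
robot_hours = total_hours(NUM_TASKS * 100, 60)
```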

3. Alignment Pipeline for Cross-Embodiment Transfer

The core of the benchmark is a systematic human-to-humanoid alignment pipeline:

A. View Alignment

Depth estimation: MoGe is applied to the egocentric human RGB stream to obtain per-pixel depth maps.
3D reprojection: Human image points are backprojected into 3D and transformed (rotation $R_{rh}$, translation $t_{rh}$) into the robot's camera frame.
2D projection: Resulting points are projected into the robot camera to yield a 'warped' image; missing pixels (e.g., from occlusion) are inpainted via a latent-diffusion network. The process is fully deterministic.
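The geometric part of these steps can be sketched for a single pixel under a pinhole camera model. This is an illustrative sketch only: the pinhole model, the intrinsics tuple `(fx, fy, cx, cy)`, and all function names are assumptions, the depth value stands in for a monocular depth estimate, and the inpainting step is not shown.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # fx, fy, cx, cy (assumed pinhole)


def backproject(u: float, v: float, depth: float, intr: Intrinsics) -> Vec3:
    """Lift a pixel with known depth to a 3D point in the human camera frame."""
    fx, fy, cx, cy = intr
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(p: Vec3, R: Sequence[Sequence[float]], t: Sequence[float]) -> Vec3:
    """Apply p' = R p + t, here with R = R_rh and t = t_rh."""
    x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
    return (x, y, z)


def project(p: Vec3, intr: Intrinsics) -> Optional[Tuple[float, float]]:
    """Project a 3D point into the robot camera; None if behind the camera."""
    fx, fy, cx, cy = intr
    x, y, z = p
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u, v, depth, intr_h, R_rh, t_rh, intr_r):
    """Human pixel -> 3D point -> robot camera frame -> robot pixel."""
    return project(transform(backproject(u, v, depth, intr_h), R_rh, t_rh), intr_r)


# Hypothetical intrinsics for a 960x540 image; identity rotation.
intr = (500.0, 500.0, 480.0, 270.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = warp_pixel(480.0, 270.0, 2.0, intr, identity, (0.5, 0.0, 0.0), intr)
```

Warping a full image this way leaves holes wherever no source pixel lands and needs a depth test where several land on the same target pixel, which is why an inpainting stage follows the projection.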

B. Action Alignment

Upper body: Motion is encoded as 6-DoF delta poses $a_t^e=(\Delta x, \Delta q)$, with smoothing (Savitzky–Golay in $\mathbb{R}^3$ and so(3)) and downsampling (100 Hz→20 Hz).
Lower body: Pelvis trajectories are converted to discrete locomotion commands $c_t\in\{\text{forward},\text{backward},\text{left},\text{right},\text{turn}_\ast,\text{stand},\text{squat}\}$ by quantizing velocity, yaw rate, and height change.
Gripper: Binary grasp is inferred by thresholding the average finger joint curvature ($\kappa_f$).
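The lower-body and gripper rules can be illustrated with simple thresholding. Everything specific in this sketch is an assumption rather than the source's method: the threshold values, the sign conventions (forward, left, and counter-clockwise positive), the priority order among height, yaw, and translation, and the choice of "stand" as the default command.

```python
from typing import Sequence


def quantize_locomotion(vx: float, vy: float, yaw_rate: float, dh: float,
                        v_thresh: float = 0.1, yaw_thresh: float = 0.2,
                        h_thresh: float = 0.05) -> str:
    """Map pelvis motion to one discrete command (assumed priority order)."""
    # Height change first: a clear drop is a squat, a clear rise is a stand.
    if dh < -h_thresh:
        return "squat"
    if dh > h_thresh:
        return "stand"
    # Then turning, using the sign of the yaw rate.
    if abs(yaw_rate) > yaw_thresh:
        return "turn_left" if yaw_rate > 0 else "turn_right"
    # Then translation along the dominant axis.
    if abs(vx) >= abs(vy) and abs(vx) > v_thresh:
        return "forward" if vx > 0 else "backward"
    if abs(vy) > v_thresh:
        return "left" if vy > 0 else "right"
    return "stand"  # assumed default when nothing exceeds a threshold


def is_grasping(finger_curvatures: Sequence[float], kappa_thresh: float) -> bool:
    """Binary grasp from the mean finger joint curvature (kappa_f)."""
    if not finger_curvatures:
        raise ValueError("need at least one finger curvature")
    kappa_f = sum(finger_curvatures) / len(finger_curvatures)
    return kappa_f > kappa_thresh


command = quantize_locomotion(vx=0.5, vy=0.0, yaw_rate=0.0, dh=0.0)
grasp = is_grasping([0.9, 0.8, 0.7], kappa_thresh=0.5)
```

In practice the thresholds would need tuning per demonstrator and per task, since walking speed and squat depth vary between people.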

The aligned 18D action vector at time $t$ is: $a_t = [a_t^e (12),\; c_t (3),\; a_t^g (2),\; \Delta h (1)]^T$.
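The layout of this vector can be made concrete with a small assembly helper. The sketch only enforces the stated block sizes (12 + 3 + 2 + 1 = 18); how the 12 upper-body values split across end-effectors and how $c_t$ is encoded in 3 dimensions are not specified here, so both are treated as opaque. The names `ACTION_LAYOUT` and `assemble_action` are hypothetical.

```python
from typing import List, Sequence

# Block names and sizes in the order given by the action vector definition.
ACTION_LAYOUT = (("a_e", 12), ("c", 3), ("a_g", 2), ("dh", 1))


def assemble_action(a_e: Sequence[float], c: Sequence[float],
                    a_g: Sequence[float], dh: float) -> List[float]:
    """Concatenate the four blocks into one 18D action, checking sizes."""
    parts = {"a_e": list(a_e), "c": list(c), "a_g": list(a_g), "dh": [dh]}
    action: List[float] = []
    for name, dim in ACTION_LAYOUT:
        if len(parts[name]) != dim:
            raise ValueError(f"{name} must have {dim} values, got {len(parts[name])}")
        action.extend(float(x) for x in parts[name])
    return action


# Hypothetical values: zero upper-body deltas, an opaque 3D command encoding,
# one gripper closed and one open, and a small downward height change.
a_t = assemble_action([0.0] * 12, [1.0, 0.0, 0.0], [1.0, 0.0], -0.05)
```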

This pipeline bridges the embodiment gap in both perceptual and kinematic domains, enabling robust transfer across human and robot morphologies (Shi et al., 10 Feb 2026).

4. Vision–Language–Action Policy and Training Regimes

The benchmark policy is a vision-language-action network based on the π₀.₅ backbone, accepting aligned egocentric RGB (224×224) and task prompt text, and outputting the 18D action vector without proprioceptive inputs. Core components include:

Fusion: ViT and Transformer heads create a joint RGB+text embedding.
Decoders: A multi-head structure jointly regresses the Δ-pose, classifies the locomotion command and the gripper state, and matches the height change. The aggregate imitation loss combines the losses of these four heads.

Training schedule: Multi-source sampling with tunable ratio (e.g. 2:1 human:robot) leverages diverse navigation priors (human) against precise manipulation supervision (robot).
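One simple way to realize a tunable sampling ratio is to pick the source of each training example at random with probability proportional to the ratio. The sketch below assumes that per-sample scheme; the source does not specify whether mixing happens per sample or per batch, and the function and argument names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_batch(human: Sequence[str], robot: Sequence[str], batch_size: int,
                 human_to_robot: Tuple[int, int] = (2, 1),
                 rng: Optional[random.Random] = None) -> List[Tuple[str, str]]:
    """Draw a mixed batch; each item comes from the human pool with
    probability h / (h + r) for a ratio h:r, otherwise from the robot pool."""
    rng = rng or random.Random()
    h, r = human_to_robot
    p_human = h / (h + r)
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_human:
            batch.append(("human", rng.choice(human)))
        else:
            batch.append(("robot", rng.choice(robot)))
    return batch


# Hypothetical episode identifiers standing in for real data.
human_eps = [f"human_{i}" for i in range(300)]
robot_eps = [f"robot_{i}" for i in range(100)]
batch = sample_batch(human_eps, robot_eps, batch_size=3000,
                     rng=random.Random(0))
human_fraction = sum(1 for src, _ in batch if src == "human") / len(batch)
```

With a 2:1 ratio, roughly two thirds of each batch comes from human data regardless of the pool sizes, so the ratio can be tuned independently of how many episodes each source contains.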

This approach facilitates effective transfer learning and generalization by capitalizing on the complementary coverage of each data modality (Shi et al., 10 Feb 2026).

5. Evaluation Protocols and Quantitative Results

Metrics:

Task completion (average per-stage success, $\frac{1}{K} \sum_{k=1}^K s_k$)
Stage-wise locomotion and manipulation success rates
Failure-mode breakdowns

Baselines and experimental settings:

Robot-only (lab, 100 eps/task)
Human-only (wild, 300 eps/task)
Co-training (joint)
In-domain (lab scenes), generalization (scenes unseen by robot)

Key results:

Setting	In-Domain (%)	Generalization (%)
Robot-only	59	31
Co-training	78 (+19 pp)	82 (+51 pp)

Per-task generalization gains reach +43 to +64 percentage points. Ablations reveal 5–12 pp performance drop on view-alignment removal, largest for tasks with high object height variability. Increasing human demonstration volume consistently improves generalization, with optimal human:robot ratio depending on required manipulation precision (Shi et al., 10 Feb 2026).
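The percentage-point gains shown in parentheses in the results table are plain differences between success rates, which the following sketch reproduces from the tabulated values. The dictionary layout and function name are illustrative only.

```python
# Success rates (%) copied from the results table.
results = {
    "Robot-only": {"in_domain": 59, "generalization": 31},
    "Co-training": {"in_domain": 78, "generalization": 82},
}


def pp_gain(baseline: str, method: str, split: str) -> int:
    """Percentage-point gain: an absolute difference of two success rates."""
    return results[method][split] - results[baseline][split]


in_domain_gain = pp_gain("Robot-only", "Co-training", "in_domain")
generalization_gain = pp_gain("Robot-only", "Co-training", "generalization")
```

A percentage-point gain is an absolute difference, so it should not be read as a relative improvement over the baseline rate.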

6. Subtask Transfer, Scaling Laws, and Failure Analysis

Subtask transfer: Human-only training excels at navigation (100% S₁ success) but is unreliable for manipulation. On Cart Stowing stage S₂, for example, human-only reaches 5% and robot-only 15%, while co-training recovers to 60%.
Failure modes: Robot-only failures are 45% locomotion, 55% manipulation; human-only failures are 75% manipulation. Co-training balances and reduces failures (<15% each).
Scaling: As human demonstration volume increases (from 0 to 3× the robot data), generalization improves. The optimal human:robot sampling ratio is task-dependent: coarse-grasp tasks favor more human data, while precision tasks favor more robot data.

Sankey analysis shows human-only models fail chiefly on high-precision manipulation. Co-training reduces both navigation and manipulation failures (Shi et al., 10 Feb 2026).

7. Benchmark Significance and Broader Context

The Ego Humanoid Manipulation Benchmark is distinctive for coupling real-world, whole-body loco-manipulation in ecologically valid environments with a fully unified human and robot dataset, a principled alignment pipeline, and a reproducible evaluation suite. Its contributions are orthogonal to simulation-only approaches (e.g., HumanoidArena (Wang et al., 16 Jun 2026)): it introduces systematic cross-domain transfer at scale and directly addresses the embodiment gap in both perception and action.

The benchmark operationalizes findings that human demonstrations—when properly aligned—are sufficient to learn navigation and high-level manipulation logic, but that precise or dexterous actions may still benefit from explicit robot data. This suggests a hybrid paradigm for scaling data-driven humanoid learning. The modular scoring, alignment, and evaluation protocols define a robust foundation for downstream research in generalization, scaling, and transfer for humanoid VLA policy learning and provide a template for extensions to longer-horizon, multi-contact, or multi-agent domains (Shi et al., 10 Feb 2026).


References (2)

EgoHumanoid: Unlocking In-the-Wild Loco-Manipulation with Robot-Free Egocentric Demonstration (2026)

HumanoidArena: Benchmarking Egocentric Hierarchical Whole-body Learning (2026)
