EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

Published 6 Jun 2026 in cs.RO and cs.AI | (2606.08057v1)

Abstract: Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.


Authors (15)

Summary

The paper introduces EgoAERO, an end-to-end framework for learning dexterous manipulation from single egocentric RGB-D videos without requiring object assets.
The approach integrates semantic preprocessing, robust trajectory reconstruction, and adaptive contact optimization to generate physically plausible hand-object trajectories.
Experimental results show that the full pipeline substantially outperforms ablated variants on EgoDex-R, and that asset-free reconstructions yield tracking errors and downstream performance close to those of CAD-based reconstructions on HOI4D.

EgoAERO: Dexterous Manipulation Policy Learning from Egocentric Video without Object Assets

Motivation and Contributions

EgoAERO introduces a framework for learning dexterous manipulation directly from a single egocentric RGB-D human demonstration, removing the need for explicit object assets such as CAD models or pre-scanned meshes. The pipeline is motivated by the scalability and naturalness of everyday egocentric data, and it addresses a limitation of existing datasets: they often lack the structured object pose, geometry, or contact information that robot policy learning requires. EgoAERO reconstructs physically plausible hand-object trajectories without object assets and transfers them into executable robot policies via a two-stage residual learning process.

Figure 1: End-to-end overview of EgoAERO. Starting from a single egocentric RGB-D human demonstration, EgoAERO reconstructs contact-consistent hand-object trajectories without object assets, transfers them to a simulated dexterous hand through two-stage policy learning, and executes the learned manipulation behavior on a real-world robot.

Asset-free Egocentric Hand-Object Trajectory Reconstruction

The asset-free reconstruction component of EgoAERO is composed of several interdependent modules:

MLLM-Guided Semantic Preprocessing: Multimodal LLMs (MLLMs) parse semantic task-level elements from a small set of keyframes, generating context-aware prompts for targeted object segmentation (SAM3/SAM3D). This filtering ensures that the relevant manipulated objects and supporting elements are identified prior to low-level reconstruction.
Robust Object Tracking and Geometry Reconstruction: Without object assets, EgoAERO deploys correspondence-based memory-pool pose optimization and neural field-guided coarse-to-fine mesh generation. The system mitigates tracking drift under hand occlusion and low-texture conditions by leveraging temporally optimized local pose graphs, robust RANSAC-based correspondence filtering, and occlusion-aware neural field training.
Hand Pose Estimation and Correction: Initial hand pose estimates are derived from HaWoR MANO models and subsequently corrected for global translation using depth-aligned RGB-D information. The correction is restricted to translation, preserving the quality of articulated pose inference.
Ego Motion Compensation: Camera motion captured via head-mounted ego cameras is decoupled from the manipulation trajectory using RGB-D SLAM (ORB-SLAM3) and hand-mask-based pixel weighting, with all states mapped into a unified table frame that ensures spatial consistency.
Adaptive Contact Optimization: To address inaccuracies in fingertip position related to occlusion or pose estimation errors, EgoAERO applies temporally smoothed, geometry-level corrections to hand translation and local contact regions (thumb, opposing finger, thenar). Neither object geometry nor MANO articulation is altered, ensuring faithful replay of the original demonstration motion.
Figure 2: Overview of asset-free egocentric hand-object reconstruction, including semantic initialization, asset-free tracking and mesh generation, egocentric hand pose estimation, ego-motion compensation, and adaptive contact optimization.
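
The ego-motion compensation step above amounts to re-expressing per-frame observations in a fixed table frame. The following is a minimal sketch of that coordinate change, assuming 4x4 homogeneous transforms and assuming that a per-frame camera pose (for example from RGB-D SLAM) and a one-time table pose are already available. The function and variable names (`make_pose`, `camera_to_table`, `T_world_cam`, `T_world_table`) are hypothetical and are not taken from the paper, and the hand-mask-based pixel weighting is not modeled here.

```python
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(rotation, dtype=float)
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T

def camera_to_table(points_cam, T_world_cam, T_world_table):
    """Map Nx3 points from a moving camera frame into a fixed table frame.

    T_world_cam:   camera pose in the SLAM world frame for this video frame
                   (assumed to be given by the SLAM system).
    T_world_table: table pose in the same world frame (assumed estimated once).
    """
    points_cam = np.asarray(points_cam, dtype=float)
    # Chain the transforms: camera -> world -> table.
    T_table_cam = np.linalg.inv(T_world_table) @ T_world_cam
    # Append a homogeneous coordinate of 1 to every point.
    points_h = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (T_table_cam @ points_h.T).T[:, :3]

# A static point seen from two camera positions should land on the same
# table-frame coordinates once head motion is compensated.
T_world_table = make_pose(np.eye(3), [1.0, 0.0, 0.0])
T_world_cam_a = make_pose(np.eye(3), [0.0, 0.0, 0.0])
T_world_cam_b = make_pose(np.eye(3), [0.0, 0.5, 0.0])
p_a = camera_to_table([[1.0, 0.0, 0.0]], T_world_cam_a, T_world_table)
p_b = camera_to_table([[1.0, -0.5, 0.0]], T_world_cam_b, T_world_table)
```

Expressing hand and object states in one table frame means that head motion between frames no longer appears as apparent object or hand motion.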

Two-stage Residual Policy Learning Pipeline

EgoAERO converts reconstructed hand-object trajectories into robot-executable policies via a structured two-stage residual learning approach:

Stage I: Hand Trajectory Tracking: The policy tracks reconstructed human wrist and finger trajectories, using kinematic retargeting for initialization. The robot hand is incentivized to closely follow the original human motion in the task frame, imparting stability and reducing exploration overhead in high-dimensional action spaces.
Stage II: Object-Contact Residual Correction: Building upon Stage I, the residual policy receives object states, contact feedback, and simulated contact forces, and produces corrective action deltas that improve object trajectory tracking and contact stability. Rewards comprise both trajectory imitation and physically meaningful contact terms.

This decomposition avoids the pitfalls of sparse task-level rewards and enables robust learning from single demonstration sequences.
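
The two-stage structure can be illustrated by how a Stage II output might be combined with a Stage I action. The sketch below assumes normalized actions in [-1, 1] and a bounded, scaled residual. The summary does not state how the deltas are applied, so `compose_action`, `residual_scale`, and the clipping bounds are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def compose_action(base_action, residual, residual_scale=0.1, low=-1.0, high=1.0):
    """Add a bounded corrective delta to the Stage I tracking action.

    base_action:    action from the Stage I hand-tracking policy.
    residual:       raw Stage II output, clipped here to [-1, 1].
    residual_scale: hypothetical cap on how far Stage II may move the action.
    """
    base_action = np.asarray(base_action, dtype=float)
    residual = np.clip(np.asarray(residual, dtype=float), -1.0, 1.0)
    delta = residual_scale * residual
    # Keep the final command inside the actuator's normalized range.
    return np.clip(base_action + delta, low, high)

action = compose_action([0.5, -0.95], [1.0, -2.0])
```

Keeping the residual small relative to the base action is one way to preserve the demonstrated hand motion while still allowing contact-level corrections.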

Data Collection and Online Quality Assessment

A core system contribution is online quality assessment, which evaluates during data collection whether hand-object contact can be recovered within bounded corrections and whether tracking is stable. Sequences are accepted, flagged as repairable, or rejected based on per-finger contact recoverability, residual penetration, and repair budget metrics. This mechanism allows EgoAERO to construct EgoDex-R, a large-scale, asset-free egocentric dataset annotated with hand-object states, object geometry, contact windows, and task-level metadata.
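
The three-way decision described above can be sketched as a simple gate over per-sequence statistics. The summary names the criteria but not their definitions or thresholds, so the field names, the threshold values, and the decision order below are hypothetical and serve only to illustrate the accept / repairable / reject logic.

```python
from dataclasses import dataclass

@dataclass
class SequenceStats:
    recoverable_fingers: int   # fingers whose contact can be restored within bounds
    required_fingers: int      # fingers expected to be in contact for the task
    max_penetration_cm: float  # residual hand-object penetration after repair
    repair_cost: float         # total correction needed (arbitrary units)

def assess(stats, max_penetration_cm=0.5, repair_budget=1.0):
    """Return 'accept', 'repairable', or 'reject' for one recorded sequence.

    Both thresholds are placeholder values, not numbers from the paper.
    """
    if stats.recoverable_fingers < stats.required_fingers:
        return "reject"      # some required contact cannot be recovered
    if stats.max_penetration_cm > max_penetration_cm:
        return "reject"      # hand and object still interpenetrate too much
    if stats.repair_cost <= 0.0:
        return "accept"      # usable without any correction
    if stats.repair_cost <= repair_budget:
        return "repairable"  # usable after a bounded correction
    return "reject"          # correction would exceed the repair budget

label = assess(SequenceStats(3, 3, 0.1, 0.4))
```

Running such a check while recording would let an operator re-record a rejected sequence immediately instead of discovering the problem during policy training.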

Experimental Analysis

Simulation Results

EgoAERO's simulated manipulation results exhibit competitive tracking and imitation performance compared to CAD-based reconstruction approaches on benchmark datasets. Key metrics include:

Success Rate on EgoDex-R: EgoAERO achieves 49.5% with the full pipeline, substantially outperforming the "Only Hand Pose" (9.8%) and "w/o Adaptive Contact Optimization" (36.2%) ablations.
Tracking Error on HOI4D: EgoAERO's asset-free trajectories yield object rotation (10.9°), translation (0.68 cm), and fingertip error (1.58 cm), closely matching CAD-asset baseline values.

These results empirically validate the physical plausibility and policy efficacy of EgoAERO's asset-free reconstructions, even in the absence of object priors.
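
For reference, object rotation and translation errors of the kind reported above are commonly computed as a geodesic angle between rotation matrices and a Euclidean distance between positions. The sketch below uses those standard definitions. The summary does not give the paper's exact metric definitions or averaging scheme, so this is an assumption, and the function names are hypothetical.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    R_rel = np.asarray(R_pred, dtype=float).T @ np.asarray(R_gt, dtype=float)
    # trace(R) = 1 + 2*cos(angle); clip guards against rounding outside [-1, 1].
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def translation_error_cm(t_pred_m, t_gt_m):
    """Euclidean distance between two positions given in meters, in centimeters."""
    diff = np.asarray(t_pred_m, dtype=float) - np.asarray(t_gt_m, dtype=float)
    return float(np.linalg.norm(diff) * 100.0)

# Example: a 10 degree rotation about the z axis compared with the identity.
theta = np.radians(10.0)
R_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
rot_err = rotation_error_deg(R_z, np.eye(3))
trans_err = translation_error_cm([0.0, 0.0, 0.0068], [0.0, 0.0, 0.0])
```

Fingertip error can be computed with the same distance function applied per fingertip and averaged over fingers and frames.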

Real-world Dexterous Manipulation

EgoAERO demonstrates successful transfer of reconstructed trajectories to hardware (Unitree G1 + Inspire Hand), producing physically executable policy rollouts from single videos. Critical to real-world viability is the adaptive contact optimization, which corrects floating or missing contacts without distorting the overall demonstration structure.

Figure 3: Qualitative demonstration of EgoAERO reconstructing a trajectory from egocentric human video and enabling dexterous robot execution.

Implications and Future Directions

EgoAERO establishes a viable methodology for policy learning from naturalistic egocentric data without object assets, extending the accessibility and generalization potential of dexterous robot learning. Together, asset-free tracking, contact optimization, and residual policy learning provide a foundation for scalable imitation learning across diverse objects and everyday tasks.

The practical implication is the ability to transfer human dexterity to robots without tedious object preparation, enabling more diverse in-the-wild manipulation and expedited deployment. Theoretically, EgoAERO demonstrates that structured semantic parsing, robust geometry reconstruction, and contact-consistent replay can overcome the limitations posed by incomplete or occluded demonstrations.

Future directions include extension to bimanual manipulation, handling of severe occlusion, generalization across tasks, and reduction of the training overhead involved in simulation-to-hardware transfer.

Conclusion

EgoAERO delivers an asset-free, end-to-end framework for dexterous manipulation learning from single egocentric RGB-D videos. Through integrated modules for semantic initialization, robust tracking, hand pose estimation, ego motion compensation, and adaptive contact optimization, coupled with two-stage policy learning, EgoAERO produces physically plausible, robot-executable hand-object trajectories in both simulated and real-world settings. The results support asset-free reconstruction as a practical alternative to CAD-based models for policy learning, with promising implications for scalable, naturalistic robotic dexterity.
