EgoInfinity: 4D Hand-Object Data Engine
- EgoInfinity is a 4D hand–object data engine that converts RGB video into metrically calibrated, robot-ready manipulation trajectories.
- It employs a modular pipeline with cross-module calibration and interaction-aware refinement to ensure accurate, physically coherent outputs.
- The system enables video-to-action robotic learning, offering scalable, annotation-free methods for robust robotic manipulation.
EgoInfinity is a web-scale 4D hand–object interaction data engine that automates the extraction of physically coherent, agent-agnostic manipulation trajectories from arbitrary RGB internet video, in a form suitable for robot retargeting and video-to-action robot learning. It addresses a central limitation of learning from human video: in-the-wild manipulation clips lack metric 3D hand trajectories, 6-DoF object poses, and robot-executable commands. EgoInfinity converts such clips into metrically calibrated, robot-ready data at scale, without human-in-the-loop annotation (Wang et al., 16 Jun 2026).
1. System Architecture and Design Principles
EgoInfinity departs from the paradigm of static, lab-collected, or heavily annotated datasets by introducing a modular, fully automated pipeline that converts internet-derived RGB video into an agent-agnostic, metric 4D hand–object interaction representation. The architecture consists of distinct modules for perception, segmentation, reconstruction, interaction-aware refinement, and functional robot retargeting, each independently upgradable as foundation models improve.
Key principles are:
- Modularity: Each processing stage is realized as an independent module (e.g., hand reconstruction, depth/scale estimation, object segmentation, object reconstruction, interaction refinement, and retargeting), ensuring that subsystem upgrades do not require re-engineering the entire pipeline.
- Cross-module metric calibration: All hand and object pose outputs are fused into a common camera-world coordinate system with real-world metric scale, using monocular depth estimation (Flow3r), scale estimation (MoGe-2), and gravity calibration (GeoCalib).
- Interaction-aware refinement: Pose estimates are physically constrained depending on the hand–object interaction state (“static”, “moving”, “grasped”), suppressing drift and ensuring physically plausible contacts.
- Cross-embodiment functional retargeting: Retargeting estimates a robot-specific “root frame” in SE(3), mapping incomplete human hand trajectories (often without full-body visibility) into joint trajectories for arbitrary robot morphologies.
The unified processing flow is: raw RGB + text prompt → 4D hand & object reconstruction → interaction-aware refinement → agent-agnostic 4D representation → robot-specific retargeting → executable joint commands.
2. Perception and 4D Reconstruction Pipeline
EgoInfinity decomposes raw video into metrically valid 4D hand–object trajectories through a series of calibrated vision and geometry modules.
- Hand reconstruction: Each frame is processed with a monocular hand-reconstruction network (WiLoR), which estimates MANO parameters to generate a 3D hand mesh. Metric hand keypoints and global hand pose are then recovered by aligning the hand's scale to Flow3r-predicted depth maps.
Resulting trajectories are filtered for temporal jitter.
- Object discovery, segmentation, and 3D reconstruction: The system uses a text-prompt-driven segmenter (SAM-3) for object detection, with temporal propagation (SAM-2). Depth-based point clouds are back-projected per frame. When sufficient data exists, a mesh reconstruction network (SAM-3D) produces canonical object geometry and orientation; pose tracking (FoundationPose++) provides per-frame 6-DoF object trajectories.
- Unified metric calibration: All outputs are fused into a single, metric camera-world frame via focal length and scale estimation (MoGe-2) and gravity vector recovery (GeoCalib). This alignment enables consistent, physically valid contact reasoning across all modalities.
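The back-projection and scale-alignment steps described above can be illustrated with a minimal NumPy sketch. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and uses a median-of-ratios scale estimate; the source does not state the exact alignment objective, so `align_scale` and all function names here are hypothetical stand-ins rather than the pipeline's actual implementation.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy, mask=None):
    """Back-project a metric depth map (H, W) into an (N, 3) camera-frame point cloud.

    Assumes a pinhole model; `mask` optionally restricts output to object pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    valid = depth > 0
    if mask is not None:
        valid &= mask.astype(bool)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def align_scale(pred_depths, metric_depths):
    """Robust scalar s such that s * pred_depths approximates metric_depths.

    Median of per-sample ratios; this choice is an assumption, not the paper's objective.
    """
    pred = np.asarray(pred_depths, dtype=float)
    ref = np.asarray(metric_depths, dtype=float)
    ok = (pred > 1e-6) & (ref > 1e-6)
    return float(np.median(ref[ok] / pred[ok]))

# Usage: scale up-to-scale hand keypoint depths to the metric depth sampled at the same pixels.
depth = np.full((4, 4), 2.0)
points = backproject_depth(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
scale = align_scale([1.0, 2.0, 4.0], [2.0, 4.0, 8.2])
```

A median of ratios is used here because it tolerates a few outlier depth samples, which is convenient with noisy monocular depth; a least-squares fit would be an equally plausible choice.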
3. Interaction-Aware Refinement
The interaction-aware refinement module suppresses drift, pose flips, and contact inconsistencies by classifying each frame’s interaction state:
- Static: The object mask centroid stays within a Schmitt-trigger (hysteresis) threshold.
- Moving: The object mask centroid motion exceeds the threshold.
- Grasped: Hand mask overlaps the object mask by more than 30 px or any fingertip is within 6 cm of the object point cloud.
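The three-way classification above can be sketched as a small per-frame state machine. The 30 px overlap and 6 cm fingertip thresholds come from the text; the hysteresis values `low_px` and `high_px`, the use of frame-to-frame centroid displacement, and the rule that "grasped" takes priority over the motion state are assumptions made for illustration.

```python
import numpy as np

def classify_states(centroids, overlap_px, fingertip_dist_m,
                    low_px=2.0, high_px=5.0,
                    overlap_thresh_px=30, contact_thresh_m=0.06):
    """Label each frame as 'static', 'moving', or 'grasped'.

    centroids:        (T, 2) object mask centroids in pixels
    overlap_px:       (T,) hand/object mask overlap in pixels
    fingertip_dist_m: (T,) minimum fingertip-to-object distance in metres
    low_px / high_px: hypothetical Schmitt-trigger thresholds on centroid motion
    """
    centroids = np.asarray(centroids, dtype=float)
    states = []
    moving = False
    for t in range(len(centroids)):
        step = 0.0 if t == 0 else float(np.linalg.norm(centroids[t] - centroids[t - 1]))
        # Hysteresis: switch to moving only above high_px, back to static only below low_px.
        if moving and step < low_px:
            moving = False
        elif not moving and step > high_px:
            moving = True
        grasped = (overlap_px[t] > overlap_thresh_px
                   or fingertip_dist_m[t] < contact_thresh_m)
        states.append("grasped" if grasped else ("moving" if moving else "static"))
    return states

# Usage with a short synthetic track: the object jumps, settles, then is touched.
centroids = [[0, 0], [0, 0], [10, 0], [13, 0], [14, 0], [14, 0]]
states = classify_states(centroids, overlap_px=[0, 0, 0, 0, 0, 50],
                         fingertip_dist_m=[1, 1, 1, 1, 1, 1])
```

The two-threshold design keeps the label from flickering when centroid motion hovers near a single cutoff, which is the usual reason to prefer a Schmitt trigger over a plain threshold.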
Pose refinement strategy is dictated by state:
- Static: Lock object pose to the robust centroid; canonical orientation is preserved.
- Moving: Re-solve frame-wise PnP / least-squares alignment to the canonical mesh.
- Grasped: Rigidly attach object to the hand frame using the mean relative transform over the grasp interval.
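The grasped-state attachment can be sketched as follows, assuming hand and object poses are given as 4x4 homogeneous transforms in the shared camera-world frame. The rotation average used here is the chordal mean (arithmetic mean of rotation matrices projected back onto SO(3) via SVD); the source does not say how the mean is computed, so this choice and the function names are assumptions.

```python
import numpy as np

def project_to_so3(m):
    """Closest rotation matrix to m in the Frobenius sense (SVD projection)."""
    u, _, vt = np.linalg.svd(m)
    d = np.sign(np.linalg.det(u @ vt))  # guard against reflections
    return u @ np.diag([1.0, 1.0, d]) @ vt

def mean_relative_transform(hand_poses, obj_poses):
    """Average hand-to-object transform over a grasp interval.

    hand_poses, obj_poses: sequences of 4x4 world-frame poses, one per frame.
    """
    rels = [np.linalg.inv(h) @ o for h, o in zip(hand_poses, obj_poses)]
    rel = np.eye(4)
    rel[:3, :3] = project_to_so3(np.mean([r[:3, :3] for r in rels], axis=0))
    rel[:3, 3] = np.mean([r[:3, 3] for r in rels], axis=0)
    return rel

def attach_object(hand_poses, rel):
    """Object poses obtained by rigidly attaching the object to the hand frame."""
    return [h @ rel for h in hand_poses]
```

Replacing noisy per-frame object poses with `attach_object(hand_poses, rel)` forces the object to move rigidly with the hand during the grasp, which is what removes relative drift in this state.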
The refined poses are obtained by energy minimization, with strong penalties during grasp to suppress interpenetration and maintain contact correctness.
4. Functional Retargeting and Joint-Trajectory Optimization
Conventional retargeting pipelines relying on full-body visibility are infeasible for in-the-wild video. EgoInfinity compensates by learning a mapping from observed SE(3) hand trajectories (plus gravity) to a plausible robot root frame via a small, SE(3)-equivariant neural network. The network, built from Vector-Neuron layers, is trained on synthetic hand trajectories in MuJoCo, enabling root-frame inference even under partial visibility.
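To make the equivariance property concrete, the sketch below shows the rotation-equivariance of a single bias-free Vector-Neuron linear layer, in which each feature channel is a 3D vector and the learned weights only mix channels. It is a toy illustration of the building block, not the root-frame network itself; the shapes, the random weights, and the helper `rotation` are hypothetical.

```python
import numpy as np

def vn_linear(features, weight):
    """Vector-Neuron linear layer: mixes channels, never mixes xyz components.

    features: (C_in, 3) array, one 3D vector per channel
    weight:   (C_out, C_in) channel-mixing matrix (no bias, which would break equivariance)
    """
    return weight @ features

def rotation(angle_z, angle_x):
    """Rotation built from a z-axis rotation followed by an x-axis rotation."""
    cz, sz = np.cos(angle_z), np.sin(angle_z)
    cx, sx = np.cos(angle_x), np.sin(angle_x)
    rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return rx @ rz

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))   # input vector features
W = rng.normal(size=(2, 4))   # hypothetical learned weights
R = rotation(0.7, -0.4)

# Rotating the input rotates the output by the same rotation.
rotate_then_apply = vn_linear(V @ R.T, W)
apply_then_rotate = vn_linear(V, W) @ R.T
```

Because the weight matrix acts on the channel axis and the rotation acts on the coordinate axis, the two operations commute, so `rotate_then_apply` and `apply_then_rotate` agree up to floating-point error.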
Given the root frame, inverse kinematics is solved per frame. The objective includes a regularization term that encodes joint-limit, collision-avoidance, manipulability, and rest-pose priors, so that solutions remain kinematically feasible while preserving trajectory smoothness. Together with the solver initialization, the root-frame estimate allows stable tracking even when only the hands are observed.
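As a rough illustration of regularized per-frame inverse kinematics, the sketch below tracks a 2D target with a toy planar two-link arm using damped Gauss-Newton steps, a rest-pose prior, joint-limit clipping, and warm-starting from the previous frame. The arm, link lengths, weights, and the warm-start strategy are all assumptions chosen for a self-contained example; the actual objective also includes collision and manipulability terms that are not modeled here.

```python
import numpy as np

L1, L2 = 0.30, 0.25  # hypothetical link lengths in metres

def fk(q):
    """End-effector position of a planar two-link arm."""
    return np.array([L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
                     L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])])

def jacobian(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [L1 * c1 + L2 * c12, L2 * c12]])

def solve_ik(target, q_init, q_rest, q_min, q_max,
             w_rest=1e-4, damping=1e-2, iters=100):
    """Minimize ||target - fk(q)||^2 + w_rest * ||q - q_rest||^2 within joint limits."""
    q = np.array(q_init, dtype=float)
    for _ in range(iters):
        J = jacobian(q)
        err = target - fk(q)
        H = J.T @ J + (w_rest + damping) * np.eye(2)   # damped Gauss-Newton Hessian
        g = J.T @ err - w_rest * (q - q_rest)          # negative gradient of the cost
        q = np.clip(q + np.linalg.solve(H, g), q_min, q_max)
    return q

def track(targets, q0, **kw):
    """Solve IK frame by frame, warm-starting each solve from the previous solution."""
    qs, q = [], np.array(q0, dtype=float)
    for p in targets:
        q = solve_ik(p, q, **kw)
        qs.append(q)
    return np.array(qs)
```

Warm-starting keeps consecutive solutions on the same kinematic branch, which is one simple way to obtain smooth joint trajectories from independent per-frame solves.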
5. Evaluation Metrics and Benchmark Results
Quantitative evaluation employs a spectrum of metrics reflecting perception accuracy, physical plausibility, and retargeting fidelity:
| Metric | Definition / Description | Result Example |
|---|---|---|
| MPJPE | Mean Per-Joint Position Error: mean Euclidean distance between estimated and ground-truth joint positions | ≤ 3 cm for Unitree G1 |
| Geodesic Orientation Error | Mean angle between estimated and ground-truth joint rotations | Not specified |
| Contact-Point Distance | Fingertip-to-object distance, computed over all fingertips | Not specified |
| IK Success Rate | Fraction of frames with retargeter solution within tolerances | > 80% |
| Joint-Limit Margin | Mean distance to nearest joint limit | Stable margins reported |
| Manipulability Index | Manipulability measure averaged over time | Stable, not further quantified |
| Trajectory Smoothness | Mean squared joint velocity | Not specified |
These benchmarks (see Table 3 in Wang et al., 16 Jun 2026) demonstrate metric reliability, contact fidelity, and successful robot retargeting for a range of morphologies and tasks.
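Several of the metrics in the table have standard definitions that can be written in a few lines of NumPy. The sketch below is illustrative only: the manipulability measure is assumed to be Yoshikawa's index, and the averaging conventions (mean over frames and joints for MPJPE, sum over joints then mean over time for smoothness) are assumptions, since the source does not spell them out.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for arrays of shape (T, J, 3), in input units."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def geodesic_angle(R_est, R_gt):
    """Angle (radians) of the relative rotation between two 3x3 rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def manipulability(J):
    """Yoshikawa manipulability sqrt(det(J J^T)) for a task Jacobian J (assumed definition)."""
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))

def smoothness(q, dt):
    """Mean squared joint velocity from finite differences of a (T, D) joint trajectory."""
    qd = np.diff(q, axis=0) / dt
    return float(np.mean(np.sum(qd ** 2, axis=1)))

# Usage: a constant 3 cm offset on every joint gives an MPJPE of 0.03 m.
gt = np.zeros((5, 21, 3))
pred = gt + np.array([0.03, 0.0, 0.0])
error_m = mpjpe(pred, gt)
```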
6. Robotic Policy Learning and Real-World Execution
The end product is a time sequence of joint targets. Playback on hardware platforms (e.g., dual-arm Franka FR3, LEAP dexterous hand) is accomplished via low-level PD or operational-space controllers.
- Grasping policy pretraining: Human hand trajectories serve as priors for initializing closed-loop grasping policies. On LEAP, this produces over 80% success across unseen objects, attributed to high-fidelity imitation of human hand motion.
- Video-to-action execution: Direct replay of retargeted trajectories achieves robust task success rates in manipulation tasks like cutting, pouring, and wiping, eliminating the need for reinforcement learning for these primitives.
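Playback of joint targets through a low-level PD controller can be illustrated with a toy simulation. The sketch assumes unit-inertia, decoupled joints with no gravity or friction and uses hypothetical gains; it is not a model of the Franka FR3 or LEAP hardware, only a demonstration of how a PD loop follows a retargeted joint sequence.

```python
import numpy as np

def pd_playback(q_targets, dt=0.01, kp=400.0, kd=40.0, substeps=10):
    """Track a (T, D) sequence of joint targets with a PD law on unit-inertia joints.

    kp, kd are hypothetical gains (critically damped for unit inertia).
    Returns the simulated joint positions at each target step, shape (T, D).
    """
    q_targets = np.asarray(q_targets, dtype=float)
    q = q_targets[0].copy()
    qd = np.zeros_like(q)
    h = dt / substeps
    trace = []
    for q_des in q_targets:
        for _ in range(substeps):
            tau = kp * (q_des - q) - kd * qd  # PD law with zero desired velocity
            qd = qd + h * tau                 # unit inertia: acceleration equals torque
            q = q + h * qd                    # semi-implicit Euler integration
        trace.append(q.copy())
    return np.array(trace)

# Usage: ramp two joints from 0 to 0.5 rad over 1 s, then hold for 1 s.
ramp = np.linspace(0.0, 0.5, 100)
targets = np.concatenate([ramp, np.full(100, 0.5)])[:, None] * np.ones((1, 2))
trace = pd_playback(targets)
```

During the ramp the simulated joints lag the targets slightly, and they settle onto the final target during the hold phase; on real hardware, gains and feedforward terms would need tuning per platform.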
7. Scalability, Limitations, and Prospects
EgoInfinity’s modular structure supports long-term extensibility:
- Subcomponent replacement: For example, future SLAM-aware depth networks could generalize the system to dynamic or egocentric video.
- Integration of tactile or differentiable physics estimators: This addition could move from coarse contact penalties toward enforcing force-consistent, no-slip manipulations.
- Learned end-to-end interaction refiners: Such advances may yield joint optimization of hand and object meshes, further minimizing drift.
- Retargeting enhancements: Adding learned residuals could improve finger posture accuracy or adapt to new robot morphologies with reduced need for real-data relabeling.
Current limitations stem from monocular depth noise, approximate “grasped” state modeling (no per-fingertip contact optimization), and the need for root-frame network re-training for significantly divergent robot kinematics. Nevertheless, EgoInfinity establishes a scalable, annotation-free bridge from arbitrary RGB internet video to robot-executable, physically coherent manipulation trajectories, thereby enabling open-world robotic skill acquisition (Wang et al., 16 Jun 2026).