LIDEA: Human-to-Robot Imitation Learning via Implicit Feature Distillation and Explicit Geometry Alignment

Published 12 Apr 2026 in cs.RO and cs.CV | (2604.10677v1)

Abstract: Scaling up robot learning is hindered by the scarcity of robotic demonstrations, whereas human videos offer a vast, untapped source of interaction data. However, bridging the embodiment gap between human hands and robot arms remains a critical challenge. Existing cross-embodiment transfer strategies typically rely on visual editing, but they often introduce visual artifacts due to intrinsic discrepancies in visual appearance and 3D geometry. To address these limitations, we introduce LIDEA (Implicit Feature Distillation and Explicit Geometric Alignment), an imitation learning framework in which policy learning benefits from human demonstrations. In the 2D visual domain, LIDEA employs a dual-stage transitive distillation pipeline that aligns human and robot representations in a shared latent space. In the 3D geometric domain, we propose an embodiment-agnostic alignment strategy that explicitly decouples embodiment from interaction geometry, ensuring consistent 3D-aware perception. Extensive experiments empirically validate LIDEA from two perspectives: data efficiency and OOD robustness. Results show that human data substitutes up to 80% of costly robot demonstrations, and the framework successfully transfers unseen patterns from human videos for out-of-distribution generalization.


Authors (7)

Summary

The paper introduces a dual-stage pipeline achieving cross-embodiment alignment via implicit 2D feature distillation and explicit 3D geometry canonicalization.
It reduces required robot demonstration data by up to 80% while maintaining high manipulation success across diverse tasks.
Ablation studies confirm that both staged feature alignment and geometric alignment are necessary for robust, out-of-distribution performance.

LIDEA: Implicit Feature Distillation and Explicit Geometry Alignment for Human-to-Robot Imitation

Problem Statement and Motivation

Robust, generalizable visuomotor policies for robot manipulation are constrained by the high cost and limited diversity of robot demonstration data. Human demonstration videos, by contrast, are abundant and diverse, but using them for direct policy transfer remains difficult because of the "embodiment gap": discrepancies in appearance, 3D geometry, and semantics between humans and robots. Previous visual editing techniques and unified representation approaches fall short due to artifacts, kinematic incongruities, or reliance on fragile state estimation. LIDEA enables human-to-robot imitation learning by combining implicit 2D feature distillation with explicit 3D geometric alignment to bridge this gap.

LIDEA Framework Overview

LIDEA decomposes the cross-embodiment gap into two complementary domains: 2D visual representation and 3D geometric observation. The framework consists of:

Dual-stage transitive feature alignment in 2D: Establishes semantic equivalence across human, pseudo-robot, and real-robot domains via staged distillation.
Explicit geometric canonicalization in 3D: Decouples embodiment via agent-specific geometry filtering and reconstructs interaction grounding by injecting a canonical virtual gripper.
Figure 1: The LIDEA framework integrates staged 2D feature distillation on the left and explicit 3D canonicalization on the right, converging to a robust visuomotor policy leveraging mixed human-robot data.

Implicit 2D Feature Distillation

A two-stage transitive distillation pipeline aligns features for cross-embodiment equivalence:

Human to Pseudo-Robot: DINOv3-based visual encoders are aligned using HPP-5M, a dataset of $\sim$5M paired frames in which human hands are replaced with robot proxies. Region-of-Interaction cropping emphasizes manipulation over background context.
Pseudo-Robot to Real-Robot: A smaller paired dataset aligns photometric variations between rendered and real robot imagery, completing the bridge. The resultant latent space is numerically aligned across all domains ($E_H \approx E_P \approx E_R$).
Figure 2: HPP-5M dataset enables large-scale human-pseudo-robot alignment, supporting robust feature distillation across semantic interaction equivalence.
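
The two distillation stages above can be illustrated with a minimal sketch. The summary does not state the loss function or which encoder acts as teacher, so the cosine-distance loss, the choice of pseudo-robot features as the shared target, and the random arrays standing in for encoder outputs are all assumptions made for illustration only.

```python
import numpy as np

def cosine_distill_loss(student: np.ndarray, teacher: np.ndarray, eps: float = 1e-8) -> float:
    """Mean (1 - cosine similarity) over paired feature rows of shape (N, D)."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

rng = np.random.default_rng(0)

# Hypothetical stand-ins for encoder outputs on paired frames (4 frames, 8-dim features).
feat_pseudo = rng.normal(size=(4, 8))                      # E_P on pseudo-robot frames
feat_human = feat_pseudo + 0.1 * rng.normal(size=(4, 8))   # E_H on the paired human frames
feat_real = feat_pseudo + 0.1 * rng.normal(size=(4, 8))    # E_R on the paired real-robot frames

# Stage 1 (human to pseudo-robot) and stage 2 (pseudo-robot to real-robot).
stage1_loss = cosine_distill_loss(feat_human, feat_pseudo)
stage2_loss = cosine_distill_loss(feat_real, feat_pseudo)
print(f"stage 1: {stage1_loss:.4f}, stage 2: {stage2_loss:.4f}")
```

Driving both losses toward zero against a common intermediate domain is one way to obtain the transitive relation $E_H \approx E_P \approx E_R$ without ever pairing human frames with real-robot frames directly.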

Explicit 3D Geometric Alignment

LIDEA enforces a strictly embodiment-agnostic geometric representation:

All agent-specific points are filtered out of the point cloud: via visual segmentation for humans and proprioception-based occupancy for robots.
A canonical gripper geometry is inserted, parameterized by pose and opening state, ensuring spatial and structural consistency across human and robot observations.

This canonicalization is critical for depth-aware policy learning, particularly when object-centric interactions or precise spatial reasoning are required.
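
The filter-then-insert procedure can be sketched as follows. This is an illustrative assumption rather than the paper's implementation: the agent mask is taken as given (from segmentation or occupancy), and the canonical gripper is modeled as two parallel finger segments whose length, sampling density, and axis conventions are hypothetical choices.

```python
import numpy as np

def canonical_gripper_points(opening: float, finger_len: float = 0.05,
                             n_per_finger: int = 8) -> np.ndarray:
    """Two parallel finger segments in the gripper frame (hypothetical template).

    Fingers extend along +z and are separated along y by `opening` (metres).
    """
    z = np.linspace(0.0, finger_len, n_per_finger)
    left = np.stack([np.zeros_like(z), np.full_like(z, opening / 2), z], axis=1)
    right = np.stack([np.zeros_like(z), np.full_like(z, -opening / 2), z], axis=1)
    return np.concatenate([left, right], axis=0)

def canonicalize_cloud(points: np.ndarray, agent_mask: np.ndarray,
                       gripper_pose: np.ndarray, opening: float) -> np.ndarray:
    """Drop agent points, then append the canonical gripper in the world frame.

    points: (N, 3); agent_mask: (N,) bool, True where a point belongs to the
    human hand or robot arm; gripper_pose: 4x4 homogeneous transform.
    """
    scene = points[~agent_mask]
    local = canonical_gripper_points(opening)
    world = local @ gripper_pose[:3, :3].T + gripper_pose[:3, 3]
    return np.concatenate([scene, world], axis=0)

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
agent_mask = np.array([True, True, False, False, False])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]  # gripper located 1 m along x, no rotation

cloud = canonicalize_cloud(points, agent_mask, pose, opening=0.04)
print(cloud.shape)  # 3 scene points + 16 gripper points
```

Because the same template is injected for both human and robot episodes, the policy sees an identical end-effector geometry regardless of which embodiment produced the observation.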

Experimental Evaluation

Manipulation Tasks

Evaluations span four real-world manipulation tasks of increasing complexity and heterogeneity:

Close Laptop (articulated-object)
Stack (6 DoF pick-and-place)
Fold Towel (deformable-object)
Prepare Bread (long-horizon, multi-stage)
Figure 3: Benchmark tasks encompass articulated, rigid, deformable, and long-horizon manipulation challenges.

Data Efficiency and Comparative Results

LIDEA achieves substantial reductions in robot demonstration requirements, substituting up to 80% of robot data with human demonstration videos while matching or surpassing baseline policy performance. Policies trained with human data and minimal robot supervision retained high manipulation success rates across all tasks.

Figure 4: Data efficiency evaluation reveals that mixing human and robot demonstrations produces significantly higher success rates compared to pseudo-robot and robot-only baselines.
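
To make the substitution ratio concrete, the arithmetic can be sketched as below. The budget of 100 demonstrations is a hypothetical number, not a figure from the paper, and the sketch assumes human demonstrations replace robot demonstrations one for one.

```python
def split_demo_budget(total_demos: int, human_fraction: float) -> tuple[int, int]:
    """Return (n_robot, n_human) for a fixed demonstration budget."""
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must lie in [0, 1]")
    n_human = round(total_demos * human_fraction)
    return total_demos - n_human, n_human

# Hypothetical budget: substituting 80% of 100 demonstrations with human videos.
n_robot, n_human = split_demo_budget(total_demos=100, human_fraction=0.8)
print(n_robot, n_human)  # 20 80
```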

In comparisons, explicit visual editing baselines degrade on tasks requiring precise 3D perception, notably those involving deformable objects or long-horizon sequences. LIDEA's canonical 3D observation avoids these pitfalls.

Out-of-Distribution Generalization

On folding tasks with OOD appearance—novel towels and distractors—LIDEA-trained policies incorporating human videos consistently outperformed those trained solely on in-domain robot data. The policy reliably attended to functional targets, demonstrating strong transfer of robustness to visual variation.

Ablations

Ablations on the Stack task highlight:

Dual-stage distillation is essential: Removing either feature alignment stage or using off-the-shelf representations (e.g., vanilla DINOv3) without cross-embodiment alignment leads to severe performance drops.
Scene-specific and large-scale internet pretraining provides crucial priors for deployment generalization.
Strict geometric filtering and canonical gripper filling are indispensable; their absence yields negative transfer or deployment mismatch.

Empirical Analysis of Feature Space

Sequence-level similarity analysis and PCA visualization show that the aligned encoders bring the feature distributions of human and robot observations together while preserving temporal structure and consistent attention to the interaction region.

Figure 5: Sequence similarity trends and PCA embedding show that feature distillation aligns human and robot demonstration trajectories, focusing attention on manipulation-centric semantics.
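
An analysis of this kind can be sketched with per-frame cosine similarity and an SVD-based PCA projection. The exact metrics and preprocessing used in the paper are not given here, so the functions below and the synthetic trajectories standing in for encoder features are assumptions for illustration.

```python
import numpy as np

def framewise_cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between time-aligned feature sequences of shape (T, D)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return np.sum(a_n * b_n, axis=1)

def pca_project(features: np.ndarray, k: int = 2) -> np.ndarray:
    """Project rows of `features` onto their top-k principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)

# Synthetic stand-ins: 10 time steps of 16-dim features per embodiment.
human_traj = rng.normal(size=(10, 16))
robot_traj = human_traj + 0.05 * rng.normal(size=(10, 16))  # well-aligned case

sims = framewise_cosine(human_traj, robot_traj)
proj = pca_project(np.concatenate([human_traj, robot_traj], axis=0), k=2)

print(sims.round(3))
print(proj.shape)  # (20, 2): rows 0-9 human, rows 10-19 robot
```

With well-aligned encoders, the similarity curve stays close to 1 across the sequence and the two trajectories overlap in the projected plane; unaligned features would instead form separate clusters per embodiment.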

Implications and Future Directions

LIDEA's principled separation and alignment of feature and geometric domains present a scalable path for cross-embodiment policy transfer. Mixing robot and human data substantially improves data efficiency and generalization, enabling robust policies with minimal teleoperation.

The framework's explicit geometric alignment presently targets standard two-finger grippers. Future directions include extension to dexterous manipulation with multi-fingered hands and integrating aligned visual encoders into VLA/video-action learning frameworks for scalable imitation from unconstrained human demonstration corpora.

Conclusion

LIDEA establishes a powerful, data-efficient pipeline for human-to-robot imitation learning that avoids the limitations of prior visual editing or unified representation methods. By jointly leveraging implicit feature distillation and explicit geometric canonicalization, LIDEA bridges the embodiment gap, facilitating robust and generalizable manipulation policies that capitalize on the richness of human activity video data.
