EgoBridge: Joint Human-Robot Policy Alignment

Updated 28 September 2025
  • EgoBridge is a framework that aligns egocentric human demonstrations and robot observations to enable scalable imitation learning.
  • It employs a unified co-training architecture with a shared feature encoder and policy decoder to map observations into a common latent space.
  • Its use of Optimal Transport with action-aware cost functions substantially improves policy success rates and generalization across diverse real-world tasks.

EgoBridge is a framework for scalable, generalizable robot imitation learning that directly leverages rich, egocentric human demonstration data. It addresses the significant domain gap in sensor modalities, visual appearance, and kinematic structure between naturally collected human data and robot observations by explicitly aligning the policy latent space, primarily using Optimal Transport theory to ensure that action-relevant information is preserved for end-to-end policy learning (Punamiya et al., 23 Sep 2025).

1. Motivation and Problem Context

The collection of large-scale robot demonstration datasets is constrained by the cost and labor associated with teleoperation and manual annotation. In contrast, contemporary wearable devices—such as smart glasses and extended reality (XR) platforms—produce abundant egocentric human data capturing natural interactions and manipulations. Despite this, transferring knowledge directly from human to robot is impeded by (a) appearance domain shift (backgrounds, lighting, hand vs gripper shape), (b) sensor discrepancies (RGB-D vs robot camera), and (c) embodiment divergence (human arm vs robot manipulator kinematics). EgoBridge seeks to unify these domains by learning joint observation–action representations that are robustly aligned across embodiments, so robots can exploit the diversity of human demonstrations without requiring exhaustive robot data.

2. Co-Training Framework Architecture

EgoBridge introduces a unified co-training architecture comprising a shared feature encoder $f_{\phi}$ and a policy decoder $\pi_{\theta}$. The observation streams from both the human domain ($o^H$) and the robot domain ($o^R$) are mapped by $f_{\phi}$ to a common latent space $Z$. The policy $\pi_{\theta}$ then produces control actions $a$ in a unified action space. Unlike conventional approaches that align only visual features or employ naive behavior cloning, EgoBridge couples the standard supervised policy loss with a joint domain adaptation loss that ensures both latent features and associated actions are aligned between domains.

$$L_{\text{BC-cotrain}}(\phi, \theta) = \mathbb{E}_{(o,a) \sim D^H \cup D^R} \left[ L_{\text{BC}}\big(\pi_{\theta}(f_{\phi}(o)), a\big) \right]$$

where $D^H$ and $D^R$ are the human and robot datasets, respectively.
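
Below is a minimal PyTorch sketch of this co-training objective. The MLP encoder/decoder and the L2 regression surrogate for $L_{\text{BC}}$ are illustrative assumptions, not the architecture or loss used in the paper.

```python
# Hedged sketch: shared encoder f_phi + policy decoder pi_theta trained with
# behavior cloning over the union of human and robot batches (L_BC-cotrain).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """f_phi: maps human or robot observations into the common latent space Z."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class PolicyDecoder(nn.Module):
    """pi_theta: maps latent features to actions in the unified action space."""
    def __init__(self, latent_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)

def bc_cotrain_loss(encoder, policy, obs_h, act_h, obs_r, act_r):
    """L_BC-cotrain: supervised policy loss over (o, a) drawn from D^H and D^R."""
    obs = torch.cat([obs_h, obs_r], dim=0)
    act = torch.cat([act_h, act_r], dim=0)
    pred = policy(encoder(obs))
    # L2 regression stands in for the paper's behavior-cloning loss L_BC.
    return nn.functional.mse_loss(pred, act)
```

In the full method this loss is minimized jointly with the OT alignment term described next.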

3. Optimal Transport for Joint Policy Alignment

The key innovation in EgoBridge is the use of Optimal Transport (OT) to explicitly align the joint distribution of (feature, action) pairs. For mini-batches of human and robot trajectories, EgoBridge computes a transport plan between the sets $\{(f_{\phi}(o^H_i), a^H_i)\}$ and $\{(f_{\phi}(o^R_j), a^R_j)\}$ based on an action-aware cost metric. The OT loss is defined as:

$$L_{\text{OT-joint}}(\phi) = \sum_{i,j} (T^*_\epsilon)_{ij} \cdot C\left( (f_{\phi}(o^H_i), a^H_i),\, (f_{\phi}(o^R_j), a^R_j) \right)$$

where $T^*_\epsilon$ is computed by the Sinkhorn algorithm with entropic regularization $\epsilon$. Crucially, the cost function $C(\cdot, \cdot)$ uses a Dynamic Time Warping (DTW) metric on action trajectories to preferentially align temporally and behaviorally similar demonstrations. For a pair indexed $(i, j)$, the cost matrix is:

$$\tilde{C}_{ij} = \begin{cases} \lambda \cdot D_{ij} & \text{if } i = i^*(j) \\ D_{ij} & \text{otherwise} \end{cases}$$

where $D_{ij} = \|f_{\phi}(o^H_i) - f_{\phi}(o^R_j)\|^2$, and $i^*(j)$ indexes the best-matching human trajectory for a given robot example. The parameter $\lambda$ ensures that behaviorally matched trajectory pairs are favored in the alignment.
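
The sketch below illustrates how such a loss can be assembled from standard pieces: a plain DTW distance over action trajectories selects $i^*(j)$, the matched entries of the latent-feature cost are reweighted by $\lambda$, and the transport plan is obtained with a textbook Sinkhorn iteration. Shapes, hyperparameter values, and the exact form of the action-aware cost are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of L_OT-joint: entropic OT (Sinkhorn) over latent features with
# a DTW-based, action-aware reweighting of the cost matrix.
import torch

def dtw_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Dynamic Time Warping distance between two action trajectories
    of shapes (T_a, action_dim) and (T_b, action_dim)."""
    Ta, Tb = a.shape[0], b.shape[0]
    step = torch.cdist(a, b)                          # pairwise L2 step costs
    acc = torch.full((Ta + 1, Tb + 1), float("inf"))
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            acc[i, j] = step[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[Ta, Tb]

def sinkhorn_plan(C: torch.Tensor, eps: float = 0.05, iters: int = 100) -> torch.Tensor:
    """Entropy-regularized transport plan T*_eps between uniform marginals."""
    n, m = C.shape
    mu, nu = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    K = torch.exp(-C / eps)
    u = torch.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.t() @ u + 1e-12)
        u = mu / (K @ v + 1e-12)
    return u[:, None] * K * v[None, :]                # T* = diag(u) K diag(v)

def ot_joint_loss(z_h, z_r, traj_h, traj_r, lam=0.5, eps=0.05):
    """L_OT-joint for one mini-batch.
    z_h: (N, d) human latents; z_r: (M, d) robot latents;
    traj_h / traj_r: lists of action trajectories, one per latent."""
    D = torch.cdist(z_h, z_r) ** 2                    # D_ij = ||f(o_i^H) - f(o_j^R)||^2
    # DTW on actions picks i*(j): the behaviorally closest human example
    # for each robot example in the batch.
    dtw = torch.stack([torch.stack([dtw_distance(th, tr) for tr in traj_r])
                       for th in traj_h])             # (N, M)
    i_star = dtw.argmin(dim=0)
    weights = torch.ones_like(D)
    weights[i_star, torch.arange(D.shape[1])] = lam   # lam < 1 (assumed) favors matched pairs
    C = weights * D                                   # action-aware cost matrix C~
    plan = sinkhorn_plan(C.detach(), eps)             # plan treated as a constant
    return (plan * C).sum()                           # sum_ij (T*)_ij * C~_ij
```

Gradients reach the encoder only through the cost matrix; holding the Sinkhorn plan fixed is a common simplification and may differ from the paper's exact optimization.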

4. Empirical Policy Success Rate and Comparative Metrics

EgoBridge demonstrates a substantial and consistent improvement in policy performance relative to existing baselines. Specifically, the framework achieved up to a 44% absolute increase in policy success rate compared to human-augmented cross-embodiment methods, standard OT-based domain adaptation, and robot-only supervision. Success rate is defined as the proportion of evaluation episodes whose reward exceeds a target threshold (commonly set to $0.9$ for maximum Intersection-over-Union with the task goal). Both mean reward and success rate are reported across simulation benchmarks (e.g., Push-T) and real-world robotic manipulation tasks.
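
As a small, concrete illustration of how these metrics are computed (the $0.9$ threshold follows the text; the function, variable names, and example rewards are ours):

```python
# Hedged sketch: mean reward and success rate over a set of evaluation episodes.
import numpy as np

def evaluate(rewards, threshold: float = 0.9):
    """Return mean reward and the fraction of episodes whose reward
    (e.g., maximum IoU with the task goal) reaches the threshold."""
    rewards = np.asarray(rewards, dtype=float)
    return {"mean_reward": float(rewards.mean()),
            "success_rate": float((rewards >= threshold).mean())}

print(evaluate([0.95, 0.70, 0.92, 0.88]))
# {'mean_reward': 0.8625, 'success_rate': 0.5}
```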

5. Generalization Across Objects, Scenes, and Task Domains

A defining advantage of EgoBridge is its ability to generalize to novel contexts (objects, environments, or manipulations) seen only in the human data, without explicit robot-side supervision. Because the OT loss enforces alignment over the joint latent–action distribution, knowledge encoded in varied human demonstration styles and scenarios is preserved and transferred. For example, in drawer manipulation, a robot trained on only a subset of drawer quadrants can correctly manipulate objects at drawer locations present solely in the human demonstrations. This generalization is not observed in baselines whose alignment disregards actions or that lack cross-embodiment representation alignment.

6. Representative Real-World Task Domains

EgoBridge’s real-world evaluations span several manipulation domains:

  • Scoop Coffee: Single-arm scooping and transfer of coffee beans into target containers, with human and robot data differing in object type and spatial arrangement.
  • Drawer Manipulation: Pick and place into a $6 \times 4$ drawer array, plus closing drawers. Robot data is limited to three quadrants, while human demonstrations span all four, enabling extrapolation to new locations.
  • Laundry Folding: Bimanual folding and placement, where variation in shirt pose and folding strategy necessitates broad policy generalization.

The project website hosts supplementary task demonstration videos, latent space visualizations, ablation experiments, and more detailed metrics (https://ego-bridge.github.io/).

7. Technical Significance and Applications

EgoBridge establishes a scalable route for robots to learn manipulation policies from large and diverse human demonstration datasets collected via standard wearable devices. It transforms egocentric experience streams into robot-executable knowledge by enforcing joint representation alignment at the level of policy-relevant observation–action pairs. This method can accelerate development in domains where collecting large robot datasets is impractical, and it supports adaptation to the complex, unstructured settings (novel scenes, objects, or manipulation styles) that are critical for autonomous systems. The alignment strategy is extensible to other cross-embodiment policy domains, where action-aware adaptation is essential for reliable knowledge transfer.

References (1)