How to Leverage Human Egocentric Videos for Embodied Brain Supervision

Determine how to leverage the latent planning structure and hand–object interaction regularities in large-scale human egocentric videos as explicit supervision to strengthen egocentric vision–language models for embodied cognition without using robot data, thereby improving the sample efficiency and generalization of Vision–Language–Action (VLA) systems.

Background

The paper argues that most vision–language models are trained on third-person data, which creates a viewpoint mismatch for robots operating under egocentric perception. Collecting egocentric robot data at scale is costly and limited in diversity, whereas human first-person videos are abundant and naturally encode interaction context and causal structure.

The central challenge identified is converting raw human egocentric video into structured, reliable supervision that captures planning, state changes, and hand–object interactions. This motivates the explicit open question of how best to use the latent structure in human egocentric videos to supervise and enhance egocentric embodied brains without relying on robot-collected data.
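
As a concrete illustration of what such structured supervision could look like, the sketch below converts hypothetical clip-level annotations (an ordered plan, hand–object interaction events, and object state changes) into text-based training targets for an egocentric VLM. The schema (`EgoClip`, `HandObjectEvent`, `StateChange`) and the function `build_supervision_samples` are illustrative assumptions, not the pipeline proposed in the paper.

```python
# Hypothetical sketch: turning annotations extracted from a human egocentric
# clip into text supervision targets for an egocentric VLM. All class and
# field names are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HandObjectEvent:
    """A hand-object interaction observed in the clip."""
    t_start: float            # seconds from clip start
    t_end: float
    hand: str                 # "left" or "right"
    obj: str                  # object category, e.g. "mug"
    verb: str                 # contact verb, e.g. "grasp", "pour"


@dataclass
class StateChange:
    """An object state transition, e.g. cupboard open -> closed."""
    t: float
    obj: str
    before: str
    after: str


@dataclass
class EgoClip:
    """One annotated egocentric segment with its latent plan structure."""
    clip_id: str
    goal: str                         # high-level task, e.g. "make coffee"
    plan_steps: List[str]             # ordered sub-goals
    interactions: List[HandObjectEvent] = field(default_factory=list)
    state_changes: List[StateChange] = field(default_factory=list)


def build_supervision_samples(clip: EgoClip) -> List[Tuple[str, str]]:
    """Convert structured annotations into (prompt, target) text pairs
    that could be paired with the clip's frames during fine-tuning."""
    samples: List[Tuple[str, str]] = []

    # Planning supervision: predict the ordered sub-goals for the goal.
    samples.append((
        f"[{clip.clip_id}] What steps does the camera wearer take to {clip.goal}?",
        " -> ".join(clip.plan_steps),
    ))

    # Interaction supervision: describe each hand-object event.
    for ev in clip.interactions:
        samples.append((
            f"[{clip.clip_id}] What is the {ev.hand} hand doing "
            f"between {ev.t_start:.1f}s and {ev.t_end:.1f}s?",
            f"It {ev.verb}s the {ev.obj}.",
        ))

    # State-change supervision: capture the causal effect of the interaction.
    for sc in clip.state_changes:
        samples.append((
            f"[{clip.clip_id}] How does the {sc.obj} change around {sc.t:.1f}s?",
            f"The {sc.obj} goes from {sc.before} to {sc.after}.",
        ))

    return samples


if __name__ == "__main__":
    demo = EgoClip(
        clip_id="ego_0001",
        goal="make coffee",
        plan_steps=["pick up mug", "place mug under spout", "press brew button"],
        interactions=[HandObjectEvent(1.2, 2.0, "right", "mug", "grasp")],
        state_changes=[StateChange(4.5, "coffee machine", "idle", "brewing")],
    )
    for prompt, target in build_supervision_samples(demo):
        print(prompt, "=>", target)
```

The point of the sketch is only that planning, interaction, and state-change signals can each be serialized into prompt-target pairs without any robot data; how such annotations are obtained and verified from raw video is exactly the open question.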

References

An open question is how to leverage the latent planning structure and hand–object interaction regularities in human egocentric videos as supervision to strengthen egocentric embodied brains without robot data, thereby improving the sample efficiency and generalization of VLA systems.

PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence (Lin et al., arXiv:2512.16793, 18 Dec 2025), Section 1, Introduction