How to Leverage Human Egocentric Videos for Embodied Brain Supervision
Determine how to leverage the latent planning structure and hand–object interaction regularities in large-scale human egocentric videos as explicit supervision for egocentric vision–language models used in embodied cognition, without relying on any robot data. The goal is to improve the sample efficiency and generalization of Vision–Language–Action (VLA) systems.
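One way to read "explicit supervision" here is as an auxiliary objective: alongside the usual language-modeling loss, the model also predicts hand–object interaction signals (e.g. contact states) derived from egocentric video annotations. The sketch below is a minimal, hypothetical illustration of such a combined objective; the loss names, the contact-state formulation, and the weighting scheme are assumptions for illustration, not the method proposed in the cited paper.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy over a batch of (N, C) logits and N integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def combined_loss(lm_logits: np.ndarray, lm_labels: np.ndarray,
                  hoi_logits: np.ndarray, hoi_labels: np.ndarray,
                  lam: float = 0.5) -> float:
    """Language-modeling loss plus a weighted auxiliary hand-object
    interaction (HOI) loss, e.g. contact-state classification whose
    labels come from egocentric video. `lam` trades off the two terms
    (hypothetical choice, not from the source)."""
    return cross_entropy(lm_logits, lm_labels) + lam * cross_entropy(hoi_logits, hoi_labels)
```

With `lam = 0` the objective reduces to plain language modeling; raising `lam` pushes the shared representation toward encoding manipulation-relevant structure from the human videos.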
References
An open question is how to leverage the latent planning structure and hand–object interaction regularities in human egocentric videos as supervision to strengthen egocentric embodied brains without robot data, thereby improving the sample efficiency and generalization of VLA systems.
— PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
(arXiv:2512.16793, Lin et al., 18 Dec 2025), Section 1, Introduction