Researchers present a comprehensive empirical study of pre-trained visual representations (PVRs) for Embodied AI.
They find that scaling dataset size and diversity does not universally improve performance, but their largest model, VC-1, outperforms all prior PVRs on average.
Key terms:
Artificial Visual Cortex: A single pre-trained visual representation that, by analogy to the biological visual cortex, provides visual perception for embodied agents across a wide range of sensorimotor tasks.
Embodied AI: Artificial intelligence that interacts with the environment through a physical body or robotic system.
Pre-trained Visual Representations (PVRs): Visual foundation models reused, frozen or adapted, as the perception module when learning task-specific policies (see the sketch after this list).
CortexBench: A curated benchmark suite of 17 tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation.
VC-1: A large-scale model developed in the study that outperforms all prior PVRs on average and shows potential for task-specific adaptation.
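The following is a minimal sketch of how a PVR is typically used downstream: the encoder is kept frozen and a small policy head is trained per task. The encoder here is a torchvision ResNet-18 chosen purely as an illustrative stand-in, not the actual VC-1 model (a ViT loaded from its own checkpoint), and the class and dimensions are assumptions for the example, not the paper's code.

```python
# Sketch: frozen pre-trained visual encoder (stand-in for a PVR such as VC-1)
# feeding a small task-specific policy head. Encoder choice and sizes are
# illustrative assumptions, not the official VC-1 loading API.
import torch
import torch.nn as nn
import torchvision.models as models


class PVRPolicy(nn.Module):
    def __init__(self, action_dim: int, embed_dim: int = 512):
        super().__init__()
        # Stand-in encoder: ResNet-18 backbone with its classification head
        # removed. A real PVR (e.g., VC-1's ViT) would be loaded from its own
        # checkpoint instead.
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the PVR frozen

        # Small policy head trained per task (e.g., via imitation learning).
        self.policy_head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: batch of RGB observations, shape (B, 3, 224, 224)
        with torch.no_grad():
            feat = self.encoder(obs).flatten(1)  # (B, 512) for ResNet-18
        return self.policy_head(feat)


if __name__ == "__main__":
    policy = PVRPolicy(action_dim=7)  # e.g., a 7-DoF manipulation action space
    dummy_obs = torch.randn(2, 3, 224, 224)
    print(policy(dummy_obs).shape)  # torch.Size([2, 7])
```

Freezing the encoder mirrors the study's default evaluation protocol, in which the PVR's features are held fixed and only a lightweight policy is learned per CortexBench task; task-specific adaptation of the encoder itself is the extension the paper explores for VC-1.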