Utilizing Vision-Language Models for Physical Robot Control
Determine effective mechanisms for leveraging the emergent capabilities of large vision-language models (VLMs), pre-trained on Internet-scale data, to reliably control physical robots in real-world manipulation tasks.
References
While the recent advances in vision-language models (VLMs) present unprecedented opportunities to solve unseen problems, how to utilize their emergent capabilities to control robots in the physical world remains an open question.
                — MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting
                
                (Liu et al., arXiv:2403.03174, 5 Mar 2024), Abstract
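
One concrete mechanism in this space is mark-based visual prompting, as in MOKA: candidate keypoints are overlaid on the camera image as labeled marks, and the VLM is asked to select a mark rather than emit raw coordinates. The sketch below illustrates that general idea only; the `query_vlm` callable, the `Mark` type, and the prompt wording are hypothetical stand-ins, not MOKA's actual interface (a real system would also attach the annotated image to the VLM query).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Mark:
    """A candidate keypoint drawn on the image with a text label."""
    label: str
    xy: Tuple[int, int]  # pixel coordinates in the camera frame


def choose_grasp_point(marks: List[Mark],
                       query_vlm: Callable[[str], str]) -> Tuple[int, int]:
    """Ask the VLM to pick one labeled mark, then map its textual
    answer back to pixel coordinates for the low-level controller."""
    prompt = (
        "The image shows candidate points labeled "
        + ", ".join(m.label for m in marks)
        + ". Which label should the gripper move to? Answer with one label."
    )
    answer = query_vlm(prompt).strip()
    lookup = {m.label: m.xy for m in marks}
    return lookup[answer]


# Usage with a stubbed VLM that deterministically picks "B";
# a deployed system would call a hosted multimodal model instead.
marks = [Mark("A", (120, 80)), Mark("B", (200, 150)), Mark("C", (60, 210))]
point = choose_grasp_point(marks, lambda prompt: "B")
print(point)  # → (200, 150)
```

Constraining the VLM to a discrete set of marks sidesteps its weakness at producing precise continuous coordinates while still exploiting its open-world semantic reasoning.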