This paper presents Robotic Programmer (RoboPro), a robotic foundation model designed for zero-shot generalization in robotic manipulation tasks. The core idea is to leverage large vision-language models (VLMs) and atomic skill libraries to generate executable policy code from visual observations and free-form language instructions. This approach bridges the gap between high-level task descriptions and low-level robot actions through an intermediate, interpretable code representation.
A significant challenge in training such models is the high cost and inefficiency of collecting sufficient multimodal runtime code data. To address this, the authors introduce Video2Code, an automatic data curation pipeline that synthesizes executable code from extensive in-the-wild instructional videos. Video2Code operates in two stages:
- Plan Extraction: A Draft VLM (Gemini-1.5-Flash) processes instructional video clips and user instructions to generate a brief natural language plan, outlining the steps required to complete the task.
- Policy Code Generation: A Code LLM (DeepSeek-Coder-V2) takes the natural language plan, the original instruction, and definitions of a predefined API library as input. It then generates executable policy code in a step-by-step, Chain-of-Thought format, utilizing the provided APIs.
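The pipeline can be sketched in a few lines of Python. The `draft_vlm` and `code_llm` client objects and the prompt wording below are assumptions for illustration; the paper's exact prompts and model bindings are not reproduced here.

```python
# Illustrative Video2Code sketch; client objects and prompt wording are assumptions.

def video2code(video_clip: bytes, instruction: str, api_definitions: str,
               draft_vlm, code_llm) -> str:
    """Turn an in-the-wild instructional video and instruction into policy code."""
    # Stage 1: plan extraction with the Draft VLM (e.g., Gemini-1.5-Flash).
    plan_prompt = (
        "Watch the video and summarize, step by step, how the demonstrated task "
        f"is completed.\nInstruction: {instruction}"
    )
    plan = draft_vlm.generate(video=video_clip, prompt=plan_prompt)

    # Stage 2: policy code generation with the Code LLM (e.g., DeepSeek-Coder-V2),
    # conditioned on the plan, the instruction, and the API library definitions.
    code_prompt = (
        f"API library:\n{api_definitions}\n\n"
        f"Instruction: {instruction}\nPlan:\n{plan}\n\n"
        "Write executable policy code that completes the task step by step, "
        "calling only the APIs defined above."
    )
    return code_llm.generate(prompt=code_prompt)
```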
Using Video2Code, the authors synthesized 115k samples of robot execution code from the DROID dataset (Khazatsky et al., 2024). This automatically curated dataset is used for supervised fine-tuning of the RoboPro model.
The RoboPro model architecture combines a vision encoder and a pre-trained LLM connected by a lightweight MLP adapter. The vision encoder (SigLIP-L) processes the input image (RGB-D from a wrist camera), and the adapter projects the visual tokens into the embedding space of the LLM. The base LLM is a code-domain model, CodeQwen-1.5. Both visual and text tokens (including user instruction and API definitions) are concatenated and fed into the LLM, which is trained to generate executable runtime code.
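The composition can be summarized with a PyTorch-style sketch; the adapter dimensions and the exact interfaces of the encoder and LLM are assumptions beyond what is stated above.

```python
import torch
import torch.nn as nn

class RoboProSketch(nn.Module):
    """Illustrative composition only: vision encoder -> MLP adapter -> code LLM."""

    def __init__(self, vision_encoder, code_llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a SigLIP-L image tower
        self.adapter = nn.Sequential(             # lightweight MLP projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.code_llm = code_llm                  # e.g., a CodeQwen-1.5 backbone

    def forward(self, image, text_embeds):
        visual_tokens = self.vision_encoder(image)        # (B, N, vision_dim)
        visual_embeds = self.adapter(visual_tokens)       # project into the LLM embedding space
        # Concatenate visual tokens with embedded text (user instruction + API definitions).
        inputs = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.code_llm(inputs_embeds=inputs)        # autoregressive code generation
```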
The training procedure involves three stages: visual alignment (training the adapter), pre-training on general image-text pairs, and supervised fine-tuning on the 115k Video2Code dataset together with general VLM data (e.g., LLaVA-1.5).
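Read as a schedule over trainable components, the three stages might look as follows; which modules are frozen at each stage is an assumption consistent with common VLM recipes, not detail taken from the paper.

```python
# Hypothetical stage schedule: trainable modules and data per stage.
TRAINING_STAGES = [
    {"stage": "visual_alignment",
     "trainable": ["adapter"],                      # encoder and LLM frozen (assumption)
     "data": "image-caption pairs"},
    {"stage": "pretraining",
     "trainable": ["adapter", "code_llm"],          # assumption
     "data": "general image-text pairs"},
    {"stage": "supervised_finetuning",
     "trainable": ["adapter", "code_llm"],          # assumption
     "data": "115k Video2Code samples + general VLM data (e.g., LLaVA-1.5)"},
]
```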
The generated policy code consists of calls to functions defined in a structured API library (L_API). This library is divided into:
- Perception Modules (L_per): APIs for processing visual information, such as `get_obj_bbox(description: str)` to find object bounding boxes or `get_best_grasp_pos(grasp_bbox: bbox)` to determine grasp poses.
- Control Modules (L_con): APIs for executing robot actions, such as `move_to_pose(pose: Pose)`, `close_gripper()`, `open_gripper()`, and motion planning APIs like `follow_way(path: List[Pose])` and `generate_wipe_path(region: str)`.
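A hypothetical Python rendering of such a library is shown below; only the function names come from the description above, while the `Pose` and `bbox` types, return types, and docstrings are illustrative placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Pose:
    """Placeholder 6-DoF end-effector pose (position + quaternion)."""
    x: float
    y: float
    z: float
    qx: float
    qy: float
    qz: float
    qw: float

bbox = Tuple[float, float, float, float]  # placeholder: (x_min, y_min, x_max, y_max)

# --- Perception modules (L_per) ---
def get_obj_bbox(description: str) -> bbox:
    """Locate an object matching a free-form description and return its bounding box."""
    ...

def get_best_grasp_pos(grasp_bbox: bbox) -> Pose:
    """Estimate a grasp pose for the object inside the given bounding box."""
    ...

# --- Control modules (L_con) ---
def move_to_pose(pose: Pose) -> None: ...
def close_gripper() -> None: ...
def open_gripper() -> None: ...
def follow_way(path: List[Pose]) -> None: ...
def generate_wipe_path(region: str) -> List[Pose]: ...
```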
The policy code calls these APIs with appropriate parameters derived from visual reasoning and task instructions. The API implementation for a specific robot platform translates these calls into low-level actions (e.g., sending commands to an operational space controller).
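For an instruction such as "wipe the table with the sponge", the generated policy code might look like the following; this is an illustrative example assuming the API stubs sketched above are in scope, not output reproduced from the paper.

```python
# Illustrative policy code in the step-by-step (Chain-of-Thought) style described above.

# Step 1: locate the sponge and compute a grasp pose.
sponge_bbox = get_obj_bbox("sponge")
grasp_pose = get_best_grasp_pos(sponge_bbox)

# Step 2: pick up the sponge.
open_gripper()
move_to_pose(grasp_pose)
close_gripper()

# Step 3: generate a wiping path over the table and follow it.
wipe_path = generate_wipe_path("table surface")
follow_way(wipe_path)

# Step 4: release the sponge.
open_gripper()
```

On the robot, each of these calls is resolved by the platform-specific API implementation, e.g. `move_to_pose` forwarding the target pose to an operational space controller.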
Practical Implementation Aspects and Applications:
- Zero-Shot Deployment: RoboPro's primary benefit is its zero-shot generalization capability. Once trained, the model can be deployed on new robots, tasks, and environments provided a compatible API library is implemented for the target platform. This significantly reduces the need for task-specific data collection and fine-tuning on new setups.
- Scalable Data Generation: The Video2Code pipeline offers a much more efficient and lower-cost method for acquiring large-scale, multimodal training data compared to manual annotation or data collection in controlled simulation environments. This is crucial for training powerful foundation models for robotics.
- Interpretable Policies: Generating policy code results in a more interpretable representation of the robot's plan compared to directly generating low-level action sequences. This can aid in debugging and understanding model behavior.
- Adaptability: The model is robust to variations in API formats (renaming, refactoring) and can generalize to tasks requiring unseen skills, provided the API library is extended with the necessary functions (a minimal example of such an extension follows this list). This suggests that the learned procedural knowledge is transferable.
- Real-World Application: Experiments on a Franka Emika robot demonstrate that RoboPro can successfully execute various tasks in real-world scenarios without task-specific fine-tuning. The model exhibits visual reasoning abilities, such as selecting appropriate tools for a task.
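As a concrete illustration of the library extension mentioned under Adaptability above, adding an unseen skill can amount to exposing one more function definition in the API prompt and implementing it on the platform; the name, signature, and docstring below are hypothetical.

```python
# Hypothetical new control-module skill; publishing this signature and docstring in the
# API definitions is what lets the model call it, while the body is platform-specific.
def pour(source: str, target: str) -> None:
    """Pick up `source`, move above `target`, and tilt the gripper to pour its contents."""
    ...
```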
Implementation Considerations:
- API Library Development: The performance and capabilities of the system heavily depend on the richness and robustness of the API library implementation for the specific robot platform. Implementing reliable perception (e.g., object detection, pose estimation) and control primitives is critical.
- Computational Requirements: Running inference with a VLM and LLM requires substantial computational resources (GPU memory and processing power), especially for larger models. Optimization techniques like quantization or model serving frameworks might be necessary for real-time performance on embedded systems or cost-constrained deployments.
- Perception System Accuracy: The model's generated code relies on the output of perception APIs (e.g., accurate bounding boxes). Errors in the underlying perception system will likely lead to errors in the execution.
- Error Handling: The generated code needs to execute robustly. Implementing error-handling mechanisms for API calls (e.g., retries, alternative strategies) in the execution environment is important for reliable operation; a minimal retry wrapper is sketched after this list.
- Prompt Engineering: While the model is fine-tuned, the inference prompt (including API definitions and instructions) plays a significant role in guiding the code generation. Careful prompt design is necessary.
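A minimal sketch of the retry idea mentioned under Error Handling, assuming nothing about the API beyond it being a Python callable:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def call_with_retries(api_call: Callable[..., T], *args,
                      retries: int = 3, delay_s: float = 0.5, **kwargs) -> T:
    """Run a perception/control API call, retrying transient failures before giving up."""
    last_err: Optional[Exception] = None
    for attempt in range(1, retries + 1):
        try:
            return api_call(*args, **kwargs)
        except Exception as err:              # in practice, catch platform-specific error types
            last_err = err
            time.sleep(delay_s * attempt)     # simple linear backoff
    name = getattr(api_call, "__name__", repr(api_call))
    raise RuntimeError(f"{name} failed after {retries} attempts") from last_err
```

For example, `bbox = call_with_retries(get_obj_bbox, "red mug")` wraps a perception call so that a transient detection failure does not abort the whole policy.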
The paper shows that RoboPro achieves state-of-the-art zero-shot success rates on RLBench and LIBERO simulations, significantly outperforming existing code generation methods such as Code as Policies (CaP) and proprietary models such as GPT-4o (https://openai.com/index/hello-gpt-4o/). Ablation studies confirm the critical role of the Video2Code data pipeline and of the choice of base code LLM in reaching these performance levels. The research demonstrates a promising path toward general-purpose robotic agents that understand and execute complex tasks from multimodal instructions.