Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation (2501.04268v1)

Published 8 Jan 2025 in cs.RO and cs.CV

Abstract: Zero-shot generalization across various robots, tasks and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of LLMs and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model, enabling the capability of perceiving visual information and following free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address low efficiency and high cost in collecting runtime code data for robotic tasks, we devise Video2Code to synthesize executable code from extensive videos in-the-wild with an off-the-shelf vision-language model (VLM) and code-domain LLM. Extensive experiments show that RoboPro achieves the state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses the state-of-the-art model GPT-4o by 11.6%, which is even comparable to a strong supervised training baseline. Furthermore, RoboPro is robust to variations on API formats and skill sets.

This paper presents Robotic Programmer (RoboPro), a robotic foundation model designed for zero-shot generalization in robotic manipulation tasks. The core idea is to leverage large vision-language models (VLMs) and atomic skill libraries to generate executable policy code based on visual observations and free-form language instructions. This approach bridges the gap between high-level task descriptions and low-level robot actions using an intermediate, interpretable code representation.

A significant challenge in training such models is the high cost and inefficiency of collecting sufficient multimodal runtime code data. To address this, the authors introduce Video2Code, an automatic data curation pipeline that synthesizes executable code from extensive in-the-wild instructional videos. Video2Code operates in two stages:

  1. Plan Extraction: A Draft VLM (Gemini-1.5-Flash) processes instructional video clips and user instructions to generate a brief natural language plan, outlining the steps required to complete the task.
  2. Policy Code Generation: A Code LLM (DeepSeek-Coder-V2) takes the natural language plan, the original instruction, and definitions of a predefined API library as input. It then generates executable policy code in a step-by-step, Chain-of-Thought format, utilizing the provided APIs.

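A minimal sketch of this two-stage pipeline is given below, assuming hypothetical draft_vlm and code_llm client wrappers (standing in for Gemini-1.5-Flash and DeepSeek-Coder-V2) that expose a generate() method; the prompt templates are illustrative, not the authors' actual prompts.

    # Sketch of the two-stage Video2Code pipeline; the model wrappers and prompts
    # below are assumptions for illustration, not the paper's implementation.

    PLAN_PROMPT = (
        "You are watching a robot manipulation video for the task: '{instruction}'.\n"
        "Write a brief step-by-step natural language plan for completing this task."
    )

    CODE_PROMPT = (
        "API library definitions:\n{api_definitions}\n\n"
        "Task instruction: {instruction}\n"
        "Plan:\n{plan}\n\n"
        "Think step by step, then write executable Python policy code using only these APIs."
    )

    def video2code(video_clip, instruction, api_definitions, draft_vlm, code_llm):
        """Synthesize executable policy code from one in-the-wild instructional video."""
        # Stage 1 (Plan Extraction): the Draft VLM turns video + instruction into a plan.
        plan = draft_vlm.generate(
            video=video_clip,
            prompt=PLAN_PROMPT.format(instruction=instruction),
        )

        # Stage 2 (Policy Code Generation): the Code LLM turns the plan into runtime
        # code in a chain-of-thought style, restricted to the predefined API library.
        return code_llm.generate(
            prompt=CODE_PROMPT.format(
                api_definitions=api_definitions, instruction=instruction, plan=plan
            )
        )
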
Using Video2Code, the authors synthesized 115,000 runtime code samples from the DROID dataset (Khazatsky et al., 2024). This automatically curated dataset is used for supervised fine-tuning of the RoboPro model.

The RoboPro model architecture combines a vision encoder and a pre-trained LLM connected by a lightweight MLP adapter. The vision encoder (SigLIP-L) processes the input image (RGB-D from a wrist camera), and the adapter projects the visual tokens into the embedding space of the LLM. The base LLM is a code-domain model, CodeQwen-1.5. Both visual and text tokens (including user instruction and API definitions) are concatenated and fed into the LLM, which is trained to generate executable runtime code.
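
The following PyTorch-style sketch illustrates this arrangement; the class, the adapter shape, and the hidden dimensions are assumptions for illustration, and the vision_encoder and code_llm arguments stand in for SigLIP-L and CodeQwen-1.5 rather than the released components.

    import torch
    import torch.nn as nn

    class RoboProSketch(nn.Module):
        """Schematic RoboPro-style model: vision encoder -> MLP adapter -> code LLM."""

        def __init__(self, vision_encoder, code_llm, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.vision_encoder = vision_encoder  # e.g. a SigLIP-L image encoder
            self.adapter = nn.Sequential(         # lightweight MLP projector
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            self.code_llm = code_llm              # e.g. a CodeQwen-1.5 backbone

        def forward(self, image, text_embeds):
            # Encode the wrist-camera observation into a sequence of visual tokens.
            visual_tokens = self.vision_encoder(image)   # (B, N, vision_dim)
            visual_embeds = self.adapter(visual_tokens)  # project into the LLM embedding space

            # Concatenate visual tokens with the instruction and API-definition text tokens,
            # then let the code LLM autoregressively generate the runtime policy code.
            inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
            return self.code_llm(inputs_embeds=inputs_embeds)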

The training procedure involves three stages: visual alignment (training the adapter), pre-training on general image-text pairs, and supervised fine-tuning on the 115k Video2Code dataset together with general VLM data (such as LLaVA-1.5).

The generated policy code consists of calls to functions defined in a structured API library (L_API). This library is divided into:

  • Perception Modules (L_per): APIs for processing visual information, such as get_obj_bbox(description: str) to find object bounding boxes or get_best_grasp_pos(grasp_bbox: bbox) to determine grasp poses.
  • Control Modules (L_con): APIs for executing robot actions, such as move_to_pose(Pose), close_gripper(), open_gripper(), and motion-planning APIs like follow_way(path: List[Pose]) and generate_wipe_path(region: str). Stub signatures for both modules are sketched after this list.
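
As noted above, a stubbed version of such a library might look like the following; only the function names and signatures come from the paper, while the Pose and bbox types are placeholder definitions.

    from dataclasses import dataclass
    from typing import List, Tuple

    # Placeholder types; the paper's actual representations may differ.
    bbox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

    @dataclass
    class Pose:
        position: Tuple[float, float, float]             # (x, y, z)
        orientation: Tuple[float, float, float, float]   # quaternion (qx, qy, qz, qw)

    # Perception modules (L_per): extract task-relevant information from observations.
    def get_obj_bbox(description: str) -> bbox: ...
    def get_best_grasp_pos(grasp_bbox: bbox) -> Pose: ...

    # Control modules (L_con): execute actions and motion plans on the robot.
    def move_to_pose(pose: Pose) -> None: ...
    def close_gripper() -> None: ...
    def open_gripper() -> None: ...
    def follow_way(path: List[Pose]) -> None: ...
    def generate_wipe_path(region: str) -> List[Pose]: ...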

The policy code calls these APIs with appropriate parameters derived from visual reasoning and task instructions. The API implementation for a specific robot platform translates these calls into low-level actions (e.g., sending commands to an operational space controller).
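
For illustration only, a generated policy for an instruction like "put the red block into the bowl" might look like the snippet below; this is a hypothetical example in the style the paper describes, not an output quoted from it.

    # Instruction: "put the red block into the bowl"
    # Hypothetical RoboPro-style policy code using the API stubs sketched above.

    # Step 1: perceive the red block and compute a grasp pose.
    block_bbox = get_obj_bbox("red block")
    grasp_pose = get_best_grasp_pos(block_bbox)

    # Step 2: pick up the block.
    open_gripper()
    move_to_pose(grasp_pose)
    close_gripper()

    # Step 3: perceive the bowl and move over it to release the block.
    bowl_bbox = get_obj_bbox("bowl")
    place_pose = get_best_grasp_pos(bowl_bbox)  # simplification: reuse the grasp API as a placement target
    move_to_pose(place_pose)
    open_gripper()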

Practical Implementation Aspects and Applications:

  • Zero-Shot Deployment: RoboPro's primary benefit is its zero-shot generalization capability. Once trained, the model can be deployed on new robots, tasks, and environments provided a compatible API library is implemented for the target platform. This significantly reduces the need for task-specific data collection and fine-tuning on new setups.
  • Scalable Data Generation: The Video2Code pipeline offers a much more efficient and lower-cost method for acquiring large-scale, multimodal training data compared to manual annotation or data collection in controlled simulation environments. This is crucial for training powerful foundation models for robotics.
  • Interpretable Policies: Generating policy code results in a more interpretable representation of the robot's plan compared to directly generating low-level action sequences. This can aid in debugging and understanding model behavior.
  • Adaptability: The model shows robustness to variations in API formats (renaming, refactoring) and can generalize to perform tasks requiring unseen skills, provided the API library is extended with the necessary functions. This suggests that the learned procedural knowledge is transferable.
  • Real-World Application: Experiments on a Franka Emika robot demonstrate that RoboPro can successfully execute various tasks in real-world scenarios without task-specific fine-tuning. The model exhibits visual reasoning abilities, such as selecting appropriate tools for a task.

Implementation Considerations:

  • API Library Development: The performance and capabilities of the system heavily depend on the richness and robustness of the API library implementation for the specific robot platform. Implementing reliable perception (e.g., object detection, pose estimation) and control primitives is critical.
  • Computational Requirements: Running inference with a VLM and LLM requires substantial computational resources (GPU memory and processing power), especially for larger models. Optimization techniques like quantization or model serving frameworks might be necessary for real-time performance on embedded systems or cost-constrained deployments.
  • Perception System Accuracy: The model's generated code relies on the output of perception APIs (e.g., accurate bounding boxes). Errors in the underlying perception system will likely lead to errors in the execution.
  • Error Handling: The generated code needs to execute robustly. Implementing error-handling mechanisms for API calls (e.g., retries, alternative strategies) in the execution environment is important for reliable operation; a minimal retry wrapper is sketched after this list.
  • Prompt Engineering: While the model is fine-tuned, the inference prompt (including API definitions and instructions) plays a significant role in guiding the code generation. Careful prompt design is necessary.
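
Picking up the error-handling point above, one simple pattern is to wrap each API call in a bounded retry. The decorator below is a minimal sketch of that idea and is not part of the paper's system.

    import functools
    import logging
    import time

    def with_retries(max_attempts: int = 3, delay_s: float = 0.5):
        """Wrap an unreliable API call (e.g. a perception module) with bounded retries."""
        def decorator(api_call):
            @functools.wraps(api_call)
            def wrapper(*args, **kwargs):
                for attempt in range(1, max_attempts + 1):
                    try:
                        return api_call(*args, **kwargs)
                    except Exception as exc:  # in practice, catch platform-specific errors
                        logging.warning("%s failed (attempt %d/%d): %s",
                                        api_call.__name__, attempt, max_attempts, exc)
                        if attempt == max_attempts:
                            raise  # surface the failure so a higher-level strategy can react
                        time.sleep(delay_s)
            return wrapper
        return decorator

    # Example: guard a flaky perception call before the policy code relies on its result.
    # guarded_get_obj_bbox = with_retries(max_attempts=3)(get_obj_bbox)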

The paper shows that RoboPro achieves state-of-the-art zero-shot success rates on RLBench and LIBERO simulations, significantly outperforming existing code generation methods like Code as Policies (CaP) and proprietary models like GPT-4o. Ablation studies confirm the critical role of the Video2Code data pipeline and the choice of the base code LLM in achieving these performance levels. The research demonstrates a promising path towards building general-purpose robotic agents capable of understanding and executing complex tasks from multimodal instructions.

Authors (5)
  1. Senwei Xie
  2. Hongyu Wang
  3. Zhanqi Xiao
  4. Ruiping Wang
  5. Xilin Chen