BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation

Published 9 Oct 2025 in cs.RO, cs.AI, and cs.LG | (2510.08572v1)

Abstract: Scaling data and models has played a pivotal role in the remarkable progress of computer vision and language. Inspired by these domains, recent efforts in robotics have similarly focused on scaling both data and model size to develop more generalizable and robust policies. However, unlike vision and language, robotics lacks access to internet-scale demonstrations across diverse robotic tasks and environments. As a result, the scale of existing datasets typically suffers from the need for manual data collection and curation. To address this problem, here we propose BLAZER, a framework that learns manipulation policies from automatically generated training data. We build on the zero-shot capabilities of LLM planners and automatically generate demonstrations for diverse manipulation tasks in simulation. Successful examples are then used to finetune an LLM and to improve its planning capabilities without human supervision. Notably, while BLAZER training requires access to the simulator's state, we demonstrate direct transfer of acquired skills to sensor-based manipulation. Through extensive experiments, we show BLAZER to significantly improve zero-shot manipulation in both simulated and real environments. Moreover, BLAZER improves on tasks outside of its training pool and enables downscaling of LLM models. Our code and data will be made publicly available on the project page.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces BLAZER, a fully automated framework leveraging zero-shot LLM planning to generate and verify high-quality manipulation demonstrations without manual data collection.
The approach uses simulation for verification, enabling a finetuned LLaMA-8B to outperform larger models and achieve an 83.2% success rate in RLBench tasks.
The framework integrates a vision pipeline for object state estimation, ensuring robust performance and effective transfer to real-world and out-of-distribution tasks.

BLAZER: Bootstrapping LLM-based Manipulation Agents with Zero-Shot Data Generation

Introduction and Motivation

The BLAZER framework addresses a central challenge in robotics: the scarcity of large-scale, diverse, and high-quality manipulation demonstrations required for training generalizable robotic policies. Unlike computer vision and NLP, robotics lacks internet-scale datasets, making manual data collection prohibitively expensive and slow. BLAZER leverages the zero-shot planning capabilities of LLMs to automatically generate and verify manipulation demonstrations in simulation, thereby bootstrapping the training of smaller, more efficient LLM-based manipulation agents without human supervision.

Methodology

BLAZER formalizes robotic manipulation as a sequence of control primitives executed by a gripper in an environment $\mathcal{E}$ , with tasks $\tau$ defined by language instructions and object states. The framework consists of three main components:

Zero-Shot Data Generation: A large LLM (e.g., LLaMA-70B) is prompted to generate executable manipulation plans for a set of tasks $\mathcal{T}$ , using privileged simulator state information.
Automatic Verification: Generated plans are executed in simulation, and only successful executions are retained, forming a curated dataset $\mathcal{D}_\text{BLAZER}$ of positive demonstrations.
Supervised Finetuning: A smaller LLM (e.g., LLaMA-8B) is finetuned on $\mathcal{D}_\text{BLAZER}$ , specializing it for manipulation policy generation.

This pipeline is fully automatic and does not require manual annotation or demonstration at any stage. The approach is illustrated in the following figure:

Figure 1: Overview of BLAZER. Given a set of manipulation tasks $\tau\in\mathcal{T}$ , LLM-generated commands are verified in simulation and successful solutions are aggregated for finetuning.

Vision-Based Deployment

To enable real-world deployment, BLAZER introduces a vision pipeline that estimates object states from multi-view RGB-D data using foundation models (Molmo, Segment Anything, M2T2). This pipeline reconstructs the environment state $\tilde{\Sigma}_\mathcal{E}$ , which is then used as input for the finetuned LLM agent, allowing transfer from simulation to real-world tasks without additional training.

Experimental Evaluation

Simulation Tasks

BLAZER is evaluated on nine RLBench pick-and-place tasks, covering a range of manipulation scenarios:

Figure 2: Tasks in simulation. Nine RLBench tasks are shown with start and end states.

The LLaMA-8B model finetuned with BLAZER achieves an average success rate of 83.2%, outperforming all baselines, including the larger LLaMA-70B (77.0%) and multi-agent MALMM (80.9%). Notably, LLaMA-8B with BLAZER surpasses its teacher model (LLaMA-70B) despite having significantly fewer parameters, demonstrating the effectiveness of the bootstrapping approach.

Visual Observations

When ground truth object states are replaced with vision-based estimates, BLAZER maintains superior performance and robustness to perception noise:

Figure 3: Tasks in simulation using visual observations. BLAZER-trained LLaMA-8B outperforms LLaMA-70B and base LLaMA-8B with vision pipeline input.

Real-World Transfer and Generalization

BLAZER is deployed on a Franka Emika Panda robot for 12 real-world tasks, including both in-distribution and out-of-distribution scenarios. The finetuned LLaMA-8B agent achieves a 47.8% average success rate, outperforming LLaMA-70B (33.3%). The agent demonstrates strong generalization to novel tasks not seen during training.

Figure 4: Real world tasks visualization. Start and end states for in-distribution and out-of-distribution tasks.

Ablation Studies

BLAZER's performance scales with the number of generated training samples per task, but exhibits diminishing returns beyond 2000 samples. Additionally, the framework is compatible with smaller LLMs (e.g., LLaMA-3.2 3B), which achieve competitive success rates (84.9%) and enable deployment on resource-constrained platforms.

Figure 5: Ablation studies. Left: Success rate vs. number of samples. Right: Model size impact, showing strong performance for smaller models.

Implementation Considerations

Computational Requirements: Data generation and verification in simulation are parallelizable and can be scaled arbitrarily. Finetuning with LoRA enables efficient adaptation of smaller LLMs.
Deployment: The vision pipeline is modular and leverages off-the-shelf foundation models, facilitating integration with diverse sensor setups.
Limitations: Current training uses only positive demonstrations; incorporating negative samples via preference-based methods (e.g., DPO) could further improve policy accuracy.

Implications and Future Directions

BLAZER demonstrates that automatic bootstrapping of manipulation data via LLMs can yield specialized agents that outperform their teacher models and generalize to novel tasks. This approach reduces reliance on manual data collection and enables scalable training of robotic policies. Future work may explore preference-based finetuning, integration with multimodal LLMs, and extension to more complex manipulation domains.

Conclusion

BLAZER provides a scalable, fully automatic framework for training LLM-based manipulation agents using zero-shot data generation and simulation-based verification. The method achieves strong empirical results in both simulation and real-world settings, with robust generalization and efficient model scaling. BLAZER represents a significant step toward practical, data-efficient training of robotic manipulation policies using foundation models.

Markdown Report Issue