MolmoBot-Engine: Robotic Manipulation Simulation

Updated 3 July 2026

MolmoBot-Engine is a fully open-source pipeline that uses procedurally generated synthetic environments and expert trajectories for zero-shot sim-to-real robotic manipulation.
It generates over 1.8M episodes from 200,000 unique indoor scenes, ensuring diverse training data that improves robustness and generalization.
The system's modular design—including environment generation, simulator wrapping, expert demonstration export, and data recording—supports both large-scale and edge-deployable policy models.

MolmoBot-Engine is a fully open-source, end-to-end robotic manipulation simulation pipeline designed to enable zero-shot sim-to-real transfer of manipulation policies. Built on the MolmoSpaces ecosystem, MolmoBot-Engine procedurally generates diverse synthetic environments, produces large-scale expert trajectory datasets (MolmoBot-Data), and facilitates the training of several modern policy architectures. The core claim evidenced by MolmoBot-Engine is that sufficiently diverse and large synthetic training data can enable generalization to previously unseen real-world objects and environments—without requiring any real-world data collection or fine-tuning (Deshpande et al., 17 Mar 2026).

1. System Architecture and Data Generation Pipeline

MolmoBot-Engine organizes data generation into four modular components within the MolmoSpaces environment:

Environment Generator: Samples procedurally generated indoor scenes $E$ from a distribution $\mathcal{D}_{scene}$ of approximately 200,000 unique house layouts. Task-relevant objects and distractors are placed by sampling from $\mathcal{D}_{object}$ (using assets from iTHOR and Objaverse), filtered on size and semantic relevance. Domain randomization samples $\phi\sim \mathcal{D}_{domain}$ —covering lighting (position, color, intensity), textures (procedural maps), and physics parameters (friction, mass, damping; e.g., friction $\sim \mathcal{U}(0.2,1.0)$ ). Objects are positioned via collision-free 6-DoF transforms.
Simulator Wrapper: Instantiates scenes in MuJoCo, configuring either a Franka FR3 or Rainbow Robotics RB-Y1 robot ( $r\sim \mathcal{D}_{robot}$ ), with initial joint perturbations $\delta q_i \sim \mathcal{U}(-r_i, r_i)$ . Action-proportional noise is applied in TCP space, and camera extrinsics are perturbed per episode.
Expert Trajectory Exporter: Utilizes scripted demonstrators that plan collision-free, phase-based motions (pick, pick-and-place, open/articulate) by selecting from a precomputed grasp library and performing collision checks and IK feasibility verification. The Franka uses IK interpolation; RB-Y1 leverages the GPU-accelerated cuRobo optimizer (Sundaralingam et al., 2023). Automatic retries (up to 3) are attempted on failure.
Data Recorder & Storage: Captures at each timestep: multi-view RGB, proprioceptive state, action targets (absolute/delta joint, TCP twist), camera calibration, and privileged simulator data. Episodes are serialized as protobufs; the dataset comprises $\approx$ 300M frames over 1.8M episodes for downstream vision-language-action (VLA) model training.
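
The per-timestep contents listed for the recorder can be pictured as a simple record type. The following is a minimal sketch only: the class name, field names, and the `make_record` helper are hypothetical, the actual protobuf schema is not given here, and deriving the delta joint target as `q_target - q` is an assumption.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class TimestepRecord:
    rgb: dict                       # camera name -> HxWx3 uint8 image (multi-view)
    joint_positions: np.ndarray     # proprioceptive state
    action_abs_joint: np.ndarray    # absolute joint target
    action_delta_joint: np.ndarray  # joint target relative to the current joints
    action_tcp_twist: np.ndarray    # 6-vector: linear + angular TCP velocity
    camera_calibration: dict        # per-camera calibration data
    privileged: dict                # simulator-only state (e.g. object poses)


def make_record(images, q, q_target, tcp_twist, calibration, privileged):
    """Build one record; the delta target is derived as q_target - q (assumed)."""
    q = np.asarray(q, dtype=np.float64)
    q_target = np.asarray(q_target, dtype=np.float64)
    return TimestepRecord(
        rgb=images,
        joint_positions=q,
        action_abs_joint=q_target,
        action_delta_joint=q_target - q,
        action_tcp_twist=np.asarray(tcp_twist, dtype=np.float64),
        camera_calibration=calibration,
        privileged=privileged,
    )
```

Storing several action parameterizations side by side lets different policy classes pick their supervision target from the same episode.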

The overall data generation process is a draw from the joint procedural distribution:

$(E, r, \{o_i\}, \phi) \sim \mathcal{D}_{scene} \times \mathcal{D}_{robot} \times \mathcal{D}_{object} \times \mathcal{D}_{domain}$

where $r\in\{\text{Franka},\text{RB-Y1}\}$ and $\mathcal{D}_{scene}$ spans approximately 200,000 unique house layouts.
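
A draw from this joint distribution can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the pipeline's sampler: the factors are treated as independent, object placement $\{o_i\}$ is omitted, only friction stands in for the domain parameters $\phi$, and the names `EpisodeConfig`, `sample_episode_config`, and the 0.05 rad perturbation range are hypothetical. Seven joints are used, matching the Franka arm; the RB-Y1 would need a different joint list.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    scene_id: int        # index into the scene distribution D_scene
    robot: str           # r drawn from D_robot
    friction: float      # one physics parameter from D_domain
    joint_offsets: list  # initial joint perturbations delta q_i


def sample_episode_config(rng, num_scenes=200_000, joint_ranges=(0.05,) * 7):
    """Draw one configuration from independent factor distributions (assumed)."""
    return EpisodeConfig(
        scene_id=rng.randrange(num_scenes),
        robot=rng.choice(["Franka", "RB-Y1"]),
        friction=rng.uniform(0.2, 1.0),                             # friction ~ U(0.2, 1.0)
        joint_offsets=[rng.uniform(-r, r) for r in joint_ranges],   # delta q_i ~ U(-r_i, r_i)
    )


rng = random.Random(0)  # fixed seed so the draw is reproducible
cfg = sample_episode_config(rng)
```

Passing an explicit seeded generator keeps each episode's configuration reproducible, which is convenient when a failed episode has to be regenerated or debugged.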

2. Expert Demonstration Generation and Dataset Statistics

The expert trajectory system generated 1.8 million episodes (∼5,800 robot-hours), utilizing ∼4,500 A100-GPU hours ($\approx$ 1,024 episodes per GPU-hour). Grasp and waypoint planning exploits a precomputed grasp library (MolmoSpaces), broadphase collision filtering in MuJoCo, and batched IK testing; for the RB-Y1, cuRobo supplies fast collision-aware trajectory optimization.
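
The headline figures imply rough per-episode averages. The arithmetic below uses only numbers quoted in this article (about 300M frames, 1.8M episodes, about 5,800 robot-hours). Whether "frames" counts timesteps or individual camera images is not stated, so the frames-per-episode figure should be read with that caveat.

$\frac{300 \times 10^{6}\ \text{frames}}{1.8 \times 10^{6}\ \text{episodes}} \approx 167\ \text{frames per episode}$

$\frac{5{,}800\ \text{h} \times 3{,}600\ \text{s/h}}{1.8 \times 10^{6}\ \text{episodes}} \approx 11.6\ \text{s per episode}$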

Each demonstration proceeds in modular phases:

Pick: Pregrasp $\rightarrow$ Grasp $\rightarrow$ Lift
Pick-and-Place: Pregrasp $\rightarrow$ Grasp $\rightarrow$ Lift $\rightarrow$ Preplace $\rightarrow$ Place $\rightarrow$ Postplace $\rightarrow$ Stow
Open/Articulate: Pregrasp $\rightarrow$ Grasp $\rightarrow$ Articulate $\rightarrow$ Postarticulate
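
The phase sequences, together with the exporter's retry behavior (up to 3 retries), can be sketched as a small driver loop. This is a hypothetical sketch: `PHASES`, `run_demonstration`, and the `execute_phase` callback are illustrative names, "up to 3 retries" is read as one initial attempt plus three more, and restarting from the first phase on failure is an assumption (the source does not say whether a retry resumes mid-task).

```python
PHASES = {
    "pick": ["pregrasp", "grasp", "lift"],
    "pick_and_place": ["pregrasp", "grasp", "lift", "preplace", "place", "postplace", "stow"],
    "open": ["pregrasp", "grasp", "articulate", "postarticulate"],
}


def run_demonstration(task, execute_phase, max_retries=3):
    """Run all phases of a task, restarting from the first phase on failure.

    Returns the list of completed phases, or None if every attempt failed,
    in which case the episode would be discarded.
    """
    for attempt in range(1 + max_retries):
        completed = []
        for phase in PHASES[task]:
            if not execute_phase(phase, attempt):
                break  # this attempt failed; fall through to the next attempt
            completed.append(phase)
        else:
            return completed  # no break: every phase succeeded
    return None


def flaky(phase, attempt):
    """Toy executor that fails the grasp on the first attempt only."""
    return not (phase == "grasp" and attempt == 0)


result = run_demonstration("pick", flaky)
```

Here the first attempt fails at the grasp, and the second attempt completes all three phases.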

Failures in grasp or phase feasibility trigger up to three automatic retries; episodes that still fail are discarded. Actions are recorded as joint targets (absolute and delta) and/or TCP twist, enabling supervision for both imitation and flow-matching training objectives.
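
For orientation, a commonly used form of the conditional flow-matching objective for action chunks is given here. It is the generic form with a linear interpolation path, and the symbols (action chunk $a$, observation $o$, noise $\epsilon$, flow time $\tau$, velocity network $v_\theta$) are chosen for illustration; the exact loss used by MolmoBot-Engine may differ.

$x_\tau = (1-\tau)\,\epsilon + \tau\,a, \qquad \epsilon \sim \mathcal{N}(0, I),\quad \tau \sim \mathcal{U}(0,1)$

$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{(o,a),\,\epsilon,\,\tau}\left[\left\| v_\theta(x_\tau, \tau, o) - (a - \epsilon) \right\|^2\right]$

Under this path the regression target $a - \epsilon$ is the constant velocity that carries the noise sample to the expert action.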

3. Policy Classes and Training Regimes

MolmoBot-Engine supports training of three policy classes, leveraging the generated data:

Policy Class	Backbone/Encoder	Action Output	Specialization
MolmoBot	Molmo2-4B VLM, SigLIP2 (frozen)	Flow-matching chunks	Multi-frame VLA, chunked action heads
MolmoBot-Pi0	Paligemma 3B VLM, SigLIP	Abs. joint positions	Replica of $\pi_0$ (OpenPI codebase)
MolmoBot-SPOC	SigLIP2-Base, SigLIP	Discretized actions	Lightweight, deployable, RL-amenable

MolmoBot

Architecture: Employs a Molmo2-4B VLM backbone with a SigLIP2 vision encoder (frozen, 192 patch tokens per frame, multi-frame input).
Text/Image Fusion: Visual tokens and text instructions (with optional point coordinates) are integrated via bidirectional attention.
Action Head: DiT-inspired flow-matching transformer, cross-attending layer-wise to the VLM and the robot state.
Training: Single/multi-frame behavioral cloning (BC), batch size 1024, 200k steps (static), 100k steps (mobile).
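
At inference time, a flow-matching action head typically produces an action chunk by integrating a learned velocity field from noise toward an action over a fixed number of denoising steps. The sketch below illustrates that loop with plain Euler integration. It is not the MolmoBot implementation: `sample_action_chunk` is a hypothetical helper, the 16x7 chunk shape and 10 steps are illustrative, and `oracle_velocity` is a toy stand-in for a trained network.

```python
import numpy as np


def sample_action_chunk(velocity_fn, shape, num_steps, rng):
    """Integrate dx/dtau = v(x, tau) from tau=0 (noise) to tau=1 with Euler steps."""
    x = rng.standard_normal(shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt                # tau stays strictly below 1 inside the loop
        x = x + dt * velocity_fn(x, tau)
    return x


# Toy target chunk: 16 timesteps x 7 joints (illustrative shape).
target = np.linspace(-1.0, 1.0, 16 * 7).reshape(16, 7)


def oracle_velocity(x, tau):
    """Exact straight-line velocity toward `target`; stands in for a trained model."""
    return (target - x) / (1.0 - tau)


chunk = sample_action_chunk(
    oracle_velocity, target.shape, num_steps=10, rng=np.random.default_rng(0)
)
```

With the oracle field the final Euler step lands on the target chunk, which makes the loop easy to sanity-check before swapping in a learned model. Fewer steps reduce latency at the cost of integration accuracy when the field is not a straight line.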

MolmoBot-Pi0

Architecture: Paligemma 3B VLM + DiT flow head (exact openpi implementation).
Loss: Flow-matching objective plus BC on the gripper.
Training: 200k steps, batch size 1024, 1k warmup steps.

MolmoBot-SPOC

Architecture: Lightweight, chunked transformer, SigLIP2-Base vision encoder (frozen).
Action Decoder: Learnable queries with bidirectional attention; actions are discretized into 256 quantile bins and trained with a categorical cross-entropy loss.
Parameter Count: $<$50M; a single forward pass produces a 16-step chunk; supports RL fine-tuning in sim.
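
Quantile binning places bin edges so that each of the 256 classes covers roughly the same share of the training actions, which keeps the classification targets balanced. The following is a minimal one-dimensional sketch, assuming binning is applied per action dimension; `fit_quantile_bins`, `encode`, and `decode` are hypothetical names, and decoding to the within-bin median is one reasonable choice rather than the documented one.

```python
import numpy as np


def fit_quantile_bins(actions, num_bins=256):
    """Return interior bin edges and one representative value per bin."""
    edges = np.quantile(actions, np.arange(1, num_bins) / num_bins)            # 255 edges
    centers = np.quantile(actions, (np.arange(num_bins) + 0.5) / num_bins)     # bin medians
    return edges, centers


def encode(actions, edges):
    """Map continuous values to integer class indices in [0, num_bins - 1]."""
    return np.searchsorted(edges, actions, side="right")


def decode(tokens, centers):
    """Map class indices back to continuous values."""
    return centers[tokens]


rng = np.random.default_rng(0)
train_actions = rng.standard_normal(100_000)  # toy stand-in for one action dimension
edges, centers = fit_quantile_bins(train_actions)
tokens = encode(train_actions, edges)
recon = decode(tokens, centers)
```

Because the edges follow the data distribution, frequently used action ranges get finer resolution than rare extremes, unlike uniform-width bins.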

4. Sim-to-Real Transfer Evaluation

Zero-shot evaluation (no real-world fine-tuning) was conducted on Franka FR3 (tabletop manipulation) and RB-Y1 (mobile manipulation) platforms. Experiments spanned kitchen, office, workroom, and bedroom scenes with diverse, unseen objects and receptacles.

Key Metrics (Pick-and-Place, Tabletop):

Policy	Sim-to-Real Success (%)
MolmoBot	79.2
$\pi_{0.5}$-DROID	39.2
MolmoBot-Pi0	46.7

Mobile Manipulation (RB-Y1, Zero-Shot):

Policy	Pick	Pick-Place	Open	Door-Open
MolmoBot Multitask	44.8	22.5	25.2	70.2
MolmoBot Door Specialist	-	-	-	77.7
MolmoBot-SPOC Rigid	10.5	1.8	-	-
MolmoBot-SPOC Articulated	-	-	21.8	58.8

As evidenced by these results, MolmoBot substantially outperforms the $\pi_{0.5}$-DROID and MolmoBot-Pi0 baselines, confirming effective zero-shot generalization across unseen objects, environments, and lighting conditions (Deshpande et al., 17 Mar 2026).

5. Core Findings and System Characteristics

Several insights emerge from the empirical investigation of MolmoBot-Engine:

Scale and Diversity: 1.8M simulation episodes across 94,000 environments and 20,000 assets are sufficient for robust zero-shot transfer—outperforming approaches relying on >10,000 hours of real-world data.
Procedural Randomization and Robustness: Rendering with MuJoCo, substantial variability in appearance and dynamics, and explicit expert retry behavior confer tolerance to real-world disturbances.
Architectural Modularity: Flow-matching action heads are combined with large VLM backbones (Molmo2, Paligemma), while the lightweight SPOC policy uses a discretized action decoder; SPOC offers policies with <50M parameters, suitable for edge deployment and straightforward RL fine-tuning.
Limitations: The environment presently supports rigid and articulated objects; high-fidelity contact-rich tasks (peg-in-hole, cloth, fluid dynamics) are not directly modeled, and may require enhanced simulation or generative world modeling.
Future Directions: Extending MolmoBot-Engine for richer task domains and systematically evaluating RL fine-tuning of lightweight policies (notably SPOC) in sim is identified as a priority.

6. Impact and Implications

MolmoBot-Engine provides a systematic demonstration that extremely large, diverse, and procedurally generated synthetic data can replace real-world demonstrations for many manipulation scenarios. This enables open, large-scale sim-to-real work and presents new opportunities in scaling robot foundation models without the bottleneck of real-world annotation or fine-tuning (Deshpande et al., 17 Mar 2026).

A plausible implication is that simulation-only pipelines with sufficient diversity and procedural breadth can serve as foundation-layer infrastructure for robotic policy pretraining, analogous to "ImageNet-pretraining" in vision but grounded in synthetic manipulation interaction. The practical upshot is the opening of robot learning to a fully open, simulation-driven paradigm for a range of physical systems and manipulation challenges.

References (1)

MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation (2026)
