
MolmoBot-Engine: Robotic Manipulation Simulation

Updated 3 July 2026
  • MolmoBot-Engine is a fully open-source pipeline that uses procedurally generated synthetic environments and expert trajectories for zero-shot sim-to-real robotic manipulation.
  • It generates over 1.8M episodes from 200,000 unique indoor scenes, ensuring diverse training data that improves robustness and generalization.
  • The system's modular design—including environment generation, simulator wrapping, expert demonstration export, and data recording—supports both large-scale and edge-deployable policy models.

MolmoBot-Engine is a fully open-source, end-to-end robotic manipulation simulation pipeline designed to enable zero-shot sim-to-real transfer of manipulation policies. Built on the MolmoSpaces ecosystem, MolmoBot-Engine procedurally generates diverse synthetic environments, produces large-scale expert trajectory datasets (MolmoBot-Data), and facilitates the training of several modern policy architectures. The core claim evidenced by MolmoBot-Engine is that sufficiently diverse and large synthetic training data can enable generalization to previously unseen real-world objects and environments—without requiring any real-world data collection or fine-tuning (Deshpande et al., 17 Mar 2026).

1. System Architecture and Data Generation Pipeline

MolmoBot-Engine organizes data generation into four modular components within the MolmoSpaces environment:

  1. Environment Generator: Samples procedurally generated indoor scenes $E$ from a distribution $\mathcal{D}_{scene}$ of approximately 200,000 unique house layouts. Task-relevant objects and distractors are placed by sampling from $\mathcal{D}_{object}$ (using assets from iTHOR and Objaverse), filtered on size and semantic relevance. Domain randomization samples $\phi \sim \mathcal{D}_{domain}$, covering lighting (position, color, intensity), textures (procedural maps), and physics parameters (friction, mass, damping; e.g., friction $\sim \mathcal{U}(0.2, 1.0)$). Objects are positioned via collision-free 6-DoF transforms.
  2. Simulator Wrapper: Instantiates scenes in MuJoCo, configuring either a Franka FR3 or Rainbow Robotics RB-Y1 robot ($r \sim \mathcal{D}_{robot}$), with initial joint perturbations $\delta q_i \sim \mathcal{U}(-r_i, r_i)$. Action-proportional noise is applied in TCP space, and camera extrinsics are perturbed per episode.
  3. Expert Trajectory Exporter: Uses scripted demonstrators that plan collision-free, phase-based motions (pick, pick-and-place, open/articulate) by selecting from a precomputed grasp library, running collision checks, and verifying IK feasibility. The Franka uses IK interpolation; the RB-Y1 uses the GPU-accelerated cuRobo optimizer (Sundaralingam et al., 2023). On failure, the demonstrator automatically retries up to 3 times.
  4. Data Recorder & Storage: At each timestep, captures multi-view RGB, proprioceptive state, action targets (absolute/delta joint, TCP twist), camera calibration, and privileged simulator data. Episodes are serialized as protobufs; the dataset comprises $\approx$300M frames over 1.8M episodes for downstream vision-language-action (VLA) model training.

The overall data generation process is a draw from the joint procedural distribution:

$$(E, r, \{o_i\}, \phi) \sim \mathcal{D}_{scene} \times \mathcal{D}_{robot} \times \mathcal{D}_{object} \times \mathcal{D}_{domain}$$

where $r \in \{\text{Franka}, \text{RB-Y1}\}$.
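
The joint draw above can be illustrated with a minimal sketch. It assumes the four factors are sampled independently (as the product form suggests) and uniformly over scenes and robots, which the source does not state. `EpisodeConfig`, `sample_episode`, `asset_pool`, and `joint_ranges` are hypothetical names, not MolmoBot-Engine identifiers, and only friction stands in for the full set of domain parameters.

```python
import random
from dataclasses import dataclass

# Hypothetical constants for illustration; not taken from the MolmoBot-Engine code.
NUM_SCENES = 200_000            # approximate size of D_scene quoted in the text
ROBOTS = ("Franka", "RB-Y1")    # support of D_robot


@dataclass
class EpisodeConfig:
    scene_id: int         # E ~ D_scene
    robot: str            # r ~ D_robot
    object_ids: list      # {o_i} ~ D_object
    friction: float       # one component of phi ~ D_domain
    joint_offsets: list   # delta q_i ~ U(-r_i, r_i)


def sample_episode(rng, asset_pool, num_objects, joint_ranges):
    """Draw one episode configuration, sampling each factor independently."""
    return EpisodeConfig(
        scene_id=rng.randrange(NUM_SCENES),
        robot=rng.choice(ROBOTS),
        object_ids=rng.sample(asset_pool, num_objects),  # distinct assets
        friction=rng.uniform(0.2, 1.0),
        joint_offsets=[rng.uniform(-r, r) for r in joint_ranges],
    )


rng = random.Random(0)
cfg = sample_episode(rng, asset_pool=list(range(1000)), num_objects=5,
                     joint_ranges=[0.05] * 7)
print(cfg)
```

Seeding the generator per episode, as done here with `random.Random(0)`, is one simple way to make a sampled configuration reproducible.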

2. Expert Demonstration Generation and Dataset Statistics

The expert trajectory system generated 1.8 million episodes (∼5,800 robot-hours), using ∼4,500 A100-GPU hours ([…] 1,024 episodes per GPU-hour). Grasp and waypoint planning uses a precomputed grasp library (MolmoSpaces), broadphase collision filtering in MuJoCo, and batched IK testing; for the RB-Y1, cuRobo supplies fast collision-aware trajectory optimization.
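
As a rough sanity check on these figures, the per-episode averages can be derived directly from the quoted totals. The sketch below uses only numbers stated in the text (1.8M episodes, about 300M frames, about 5,800 robot-hours); the variable names are illustrative, and the implied frame rate assumes each frame corresponds to one recorded timestep.

```python
# Totals quoted in the text; variable names are illustrative.
episodes = 1.8e6        # expert episodes
frames = 300e6          # approximate recorded frames
robot_hours = 5_800     # approximate simulated robot time

frames_per_episode = frames / episodes
seconds_per_episode = robot_hours * 3600 / episodes
# Assumes one frame per recorded timestep (not stated in the source).
implied_frame_rate_hz = frames / (robot_hours * 3600)

print(f"{frames_per_episode:.0f} frames per episode")
print(f"{seconds_per_episode:.1f} s per episode")
print(f"{implied_frame_rate_hz:.1f} frames per simulated second")
```

Under these assumptions an average episode is about 167 frames and 11.6 simulated seconds long.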

Each demonstration proceeds in modular phases:

  • Pick: Pregrasp $\to$ Grasp $\to$ Lift
  • Pick-and-Place: Pregrasp $\to$ Grasp $\to$ Lift $\to$ Preplace $\to$ Place $\to$ Postplace $\to$ Stow
  • Open/Articulate: Pregrasp $\to$ Grasp $\to$ Articulate $\to$ Postarticulate
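
The phase sequences above, combined with the bounded retry behavior of the scripted experts, can be sketched as a small state machine. This is a minimal illustration, not the MolmoBot-Engine implementation: `PHASES`, `run_expert`, and the `execute_phase` callback are hypothetical, and retrying per phase (rather than per episode) is an assumption.

```python
# Hypothetical phase tables mirroring the sequences listed above.
PHASES = {
    "pick": ["pregrasp", "grasp", "lift"],
    "pick_and_place": ["pregrasp", "grasp", "lift", "preplace",
                       "place", "postplace", "stow"],
    "open": ["pregrasp", "grasp", "articulate", "postarticulate"],
}
MAX_RETRIES = 3  # "up to three automatic retries"


def run_expert(task, execute_phase, max_retries=MAX_RETRIES):
    """Run the phases of `task` in order.

    `execute_phase(phase, attempt)` returns True on success. A failed phase
    is retried up to `max_retries` times (assumed to be per phase). Returns
    the list of completed phases, or None if the episode should be discarded.
    """
    completed = []
    for phase in PHASES[task]:
        for attempt in range(1 + max_retries):
            if execute_phase(phase, attempt):
                completed.append(phase)
                break
        else:
            return None  # all attempts failed: discard the episode
    return completed


def flaky_grasp(phase, attempt):
    """Toy executor: the first grasp attempt fails, everything else succeeds."""
    return not (phase == "grasp" and attempt == 0)


print(run_expert("pick", flaky_grasp))
```

Returning `None` for a discarded episode keeps failed rollouts out of the recorded dataset without raising an exception in the generation loop.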

Failures in grasp or phase feasibility trigger up to three automatic retries; episodes that remain unsuccessful are discarded. Actions are recorded as joint targets (absolute or delta) and/or TCP twist, enabling supervision under both imitation and flow-matching training objectives.
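
To make the flow-matching supervision concrete, here is a minimal sketch of a commonly used linear-path flow-matching loss for a single flattened action chunk. The exact objective, time convention, and conditioning used by MolmoBot are not reproduced in this text, so the form below (noise at $t=0$, expert action at $t=1$, regression onto the constant path velocity) is an assumption, and `flow_matching_loss` and `predict_velocity` are hypothetical names.

```python
import random


def flow_matching_loss(predict_velocity, action, rng):
    """Single-sample estimate of a linear-path flow-matching loss.

    `action` is one flattened expert action chunk (list of floats).
    `predict_velocity(x_t, t)` returns a list of the same length.
    Assumed convention: x_t = (1 - t) * noise + t * action.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    # Point on the straight path from noise to the expert action.
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # Velocity of that path, which the model is trained to regress.
    target = [a - n for n, a in zip(noise, action)]
    pred = predict_velocity(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(action)


rng = random.Random(0)
zero_model = lambda x_t, t: [0.0] * len(x_t)
print(flow_matching_loss(zero_model, [0.1, -0.2, 0.3], rng))
```

In practice the velocity predictor would also be conditioned on observations and language, and the loss would be averaged over a batch; both are omitted here for brevity.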

3. Policy Classes and Training Regimes

MolmoBot-Engine supports training of three policy classes, leveraging the generated data:

| Policy Class | Backbone/Encoder | Action Output | Specialization |
| --- | --- | --- | --- |
| MolmoBot | Molmo2-4B VLM, SigLIP2 (frozen) | Flow-matching chunks | Multi-frame VLA, chunked action heads |
| MolmoBot-Pi0 | PaliGemma 3B VLM, SigLIP | Abs. joint positions | Replica of $\pi_0$ (OpenPI codebase) |
| MolmoBot-SPOC | SigLIP2-Base, SigLIP | Discretized actions | Lightweight, deployable, RL-amenable |

MolmoBot

  • Architecture: Employs a Molmo2-4B VLM backbone with a SigLIP2 vision encoder (frozen, 192 patch tokens/frame; up to […] frames).
  • Text/Image Fusion: Visual tokens and text instructions (with optional point coordinates) are integrated via bidirectional attention.
  • Action Head: DiT-inspired flow-matching transformer that cross-attends, layer by layer, to the VLM and the robot state.
  • Training: Single- or multi-frame behavioral cloning (BC) with batch size 1024, 200k steps (static) or 100k steps (mobile), learning rate […], and denoising steps […].
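
At inference time, a flow-matching action head produces an action chunk by integrating its learned velocity field over a fixed number of denoising steps. The sketch below shows plain Euler integration under an assumed convention (noise at $t=0$, action at $t=1$); the integrator, step count, and conditioning actually used by MolmoBot are not given in this text, and `sample_action_chunk` and `predict_velocity` are hypothetical names.

```python
import random


def sample_action_chunk(predict_velocity, dim, num_steps, rng):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy velocity field that transports every sample to one fixed target action.
target = [0.5, -1.0, 2.0]
toy_field = lambda x, t: [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]

chunk = sample_action_chunk(toy_field, dim=3, num_steps=10, rng=random.Random(0))
print(chunk)
```

With this toy field the final Euler step lands on the target up to floating-point error, which makes the sketch easy to check; a trained model would instead condition the velocity on images, language, and robot state.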

MolmoBot-Pi0

  • Architecture: PaliGemma 3B VLM + DiT flow head (exact openpi implementation).
  • Loss: […] plus BC on gripper.
  • Training: 200k steps, learning rate […], batch size 1024, 1k warmup steps.

MolmoBot-SPOC

  • Architecture: Lightweight, chunked transformer, SigLIP2-Base vision encoder (frozen).
  • Action Decoder: […] learnable queries with bidirectional attention; actions are discretized into 256 quantile bins and trained with a categorical cross-entropy loss.
  • Parameter Count: $<$50M; a single forward pass produces a 16-step action chunk; supports RL fine-tuning in sim.
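
The quantile-bin discretization used by the SPOC-style decoder can be illustrated with a short sketch. It assumes bin edges are fit separately for each action dimension from the training data and that a bin is decoded to its midpoint; neither detail is stated in the source. The helper names `fit_quantile_edges`, `encode`, and `decode` are hypothetical.

```python
import bisect
import random
import statistics

NUM_BINS = 256  # number of quantile bins quoted in the text


def fit_quantile_edges(values, num_bins=NUM_BINS):
    """Return num_bins - 1 cut points that split the data into equal-mass bins."""
    return statistics.quantiles(values, n=num_bins)


def encode(value, edges):
    """Map a continuous action value to a bin index in [0, len(edges)]."""
    return bisect.bisect_right(edges, value)


def decode(index, edges, lo, hi):
    """Map a bin index back to the midpoint of its bin (an assumed choice)."""
    bounds = [lo] + list(edges) + [hi]
    return 0.5 * (bounds[index] + bounds[index + 1])


# Toy data standing in for one action dimension of the expert dataset.
rng = random.Random(0)
data = [rng.gauss(0.0, 0.1) for _ in range(10_000)]

edges = fit_quantile_edges(data)
idx = encode(0.03, edges)
recon = decode(idx, edges, min(data), max(data))
print(idx, recon)
```

Quantile bins place more resolution where actions are dense, so the class labels for the cross-entropy loss are roughly balanced across the 256 bins.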

4. Sim-to-Real Transfer Evaluation

Zero-shot evaluation (no real-world fine-tuning) was conducted on Franka FR3 (tabletop manipulation) and RB-Y1 (mobile manipulation) platforms. Experiments spanned kitchen, office, workroom, and bedroom scenes with diverse, unseen objects and receptacles.

Key Metrics (Pick-and-Place, Tabletop):

| Policy | Sim-to-Real Success (%) |
| --- | --- |
| MolmoBot ([…]) | 79.2 ([…]) |
| […]-DROID | 39.2 ([…]) |
| MolmoBot-Pi0 | 46.7 ([…]) |

Mobile Manipulation (RB-Y1, Zero-Shot):

| Policy | Pick | Pick-Place | Open | Door-Open |
| --- | --- | --- | --- | --- |
| MolmoBot Multitask | 44.8 | 22.5 | 25.2 | 70.2 |
| MolmoBot Door Specialist | - | - | - | 77.7 |
| MolmoBot-SPOC Rigid | 10.5 | 1.8 | - | - |
| MolmoBot-SPOC Articulated | - | - | 21.8 | 58.8 |

On the tabletop pick-and-place benchmark, MolmoBot (79.2%) substantially outperforms the […] (39.2%) and MolmoBot-Pi0 (46.7%) baselines, indicating effective zero-shot generalization across unseen objects, environments, and lighting conditions (Deshpande et al., 17 Mar 2026).

5. Core Findings and System Characteristics

Several insights emerge from the empirical investigation of MolmoBot-Engine:

  • Scale and Diversity: 1.8M simulation episodes across 94,000 environments and 20,000 assets are sufficient for robust zero-shot transfer—outperforming approaches relying on >10,000 hours of real-world data.
  • Procedural Randomization and Robustness: Rendering with MuJoCo, substantial variability in appearance and dynamics, and explicit expert retry behavior confer tolerance to real-world disturbances.
  • Architectural Modularity: The same data supports both large (Molmo2, PaliGemma) backbones with flow-matching action heads and a lightweight (SPOC) backbone with a discretized action decoder; the latter yields policies with <50M parameters, suitable for edge deployment, with straightforward RL fine-tuning.
  • Limitations: The environment presently supports rigid and articulated objects; high-fidelity contact-rich tasks (peg-in-hole, cloth, fluid dynamics) are not directly modeled, and may require enhanced simulation or generative world modeling.
  • Future Directions: Extending MolmoBot-Engine for richer task domains and systematically evaluating RL fine-tuning of lightweight policies (notably SPOC) in sim is identified as a priority.

6. Impact and Implications

MolmoBot-Engine provides a systematic demonstration that extremely large, diverse, and procedurally generated synthetic data can replace real-world demonstrations for many manipulation scenarios. This enables open, large-scale sim-to-real work and presents new opportunities in scaling robot foundation models without the bottleneck of real-world annotation or fine-tuning (Deshpande et al., 17 Mar 2026).

A plausible implication is that simulation-only pipelines with sufficient diversity and procedural breadth can serve as foundation-layer infrastructure for robotic policy pretraining, analogous to "ImageNet-pretraining" in vision but grounded in synthetic manipulation interaction. The practical upshot is the opening of robot learning to a fully open, simulation-driven paradigm for a range of physical systems and manipulation challenges.
