MolmoBot: Zero-Shot Sim-to-Real Manipulation

Updated 3 July 2026
  • MolmoBot is a family of generalist robot manipulation policies that combine multi-frame vision–language transformers with synthetic training data to achieve robust zero-shot sim-to-real transfer.
  • Its main architecture pairs a 4B-parameter Molmo2 backbone with a DiT-style flow-matching action head; the MolmoBot-Pi0 and MolmoBot-SPOC variants cover a data-isolation baseline and a lightweight deployment target, respectively.
  • The system demonstrates strong empirical performance across static and mobile tasks with high real-world success rates and effective sim-to-real generalization.

MolmoBot is a family of generalist robot manipulation policies that demonstrates effective zero-shot sim-to-real transfer for both static and mobile manipulation, challenging the prevailing assumption that simulation alone is insufficient for robust real-world deployment. Trained on large-scale, procedurally generated synthetic data with multi-frame vision–language policy architectures, MolmoBot achieves high real-world task success without any real-world fine-tuning, outperforming prior methods that rely on extensive real-world robot hours (Deshpande et al., 17 Mar 2026).

1. Architecture and Policy Variants

1.1 Molmo2-Based Multi-Frame Vision–Language Backbone

The principal MolmoBot policy is built upon a 4B-parameter Molmo2 vision–language transformer, pre-trained on billions of image–text pairs with video grounding capabilities. This backbone uses a frozen SigLIP2 vision encoder to tokenize up to $F=3$ RGB frames per view, each producing 192 patch tokens. These tokens are projected into the Molmo2 embedding space and prepended with view and time-offset indices. Inputs to the backbone comprise visual tokens from all timesteps and views, tokenized natural-language instructions, optional 2D point prompts for grounding, and proprioceptive state embeddings (joint positions and base pose). For $F>1$, MolmoBot interleaves tokens from $t$, $t-D$, and $t-2D$ (with $D=8$ steps), allowing the model to reason jointly over scene configuration and motion history.
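The frame-selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the clamping of indices at the start of an episode, and the two-camera example are assumptions. Only the frame count, the stride of 8 steps, and the 192 tokens per frame come from the text.

```python
from typing import List

TOKENS_PER_FRAME = 192  # patch tokens per frame per view, as stated in the text


def history_frame_indices(t: int, num_frames: int = 3, stride: int = 8) -> List[int]:
    """Return the timesteps t, t-D, t-2D, ... used as visual context.

    Clamping to 0 near the start of an episode is an assumption; the
    text does not say how early timesteps are padded.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")
    return [max(t - k * stride, 0) for k in range(num_frames)]


def visual_token_count(num_views: int, num_frames: int = 3) -> int:
    """Number of visual tokens fed to the backbone for one decision step."""
    return num_views * num_frames * TOKENS_PER_FRAME


if __name__ == "__main__":
    print(history_frame_indices(40))  # [40, 32, 24]
    print(visual_token_count(2))      # 1152 (two hypothetical cameras)
```

With three frames per view, the visual context grows linearly in the number of cameras, which is one reason a fixed, small frame budget is attractive.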

1.2 Flow-Matching Diffusion Action Head

The DiT-style action head uses flow matching to regress continuous robot actions. For a hidden sequence $x = (\mathrm{vision} \oplus \mathrm{language} \oplus \mathrm{proprioception})$, $L$ DiT blocks apply cross-attention to $x$. Action trajectories $a_{1:H} \in \mathbb{R}^{D \times H}$ (joint positions, gripper, and base commands across $H$ steps) are noised, and the head predicts the corresponding denoising flows under a flow-matching loss. Training samples […] denoising steps per example in parallel. At inference, denoising diffusion sampling generates the action chunk $a_{1:H}$ for real robot execution.
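To make the training and sampling loop concrete, here is a minimal sketch of flow matching on a flat action vector. The text does not spell out the noising path or the sampler, so the sketch assumes the common straight-line (rectified-flow) formulation with Euler integration. All function names are hypothetical, and `predict` stands in for the DiT head conditioned on the backbone features.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(noise: Vector, action: Vector, tau: float) -> Vector:
    """Point on the straight path from noise (tau=0) to action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]


def target_velocity(noise: Vector, action: Vector) -> Vector:
    """Constant velocity of the straight path; this is the regression target."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict: VelocityFn, action: Vector,
                       num_timesteps: int, rng: random.Random) -> float:
    """Mean squared velocity error averaged over several flow timesteps."""
    total = 0.0
    for _ in range(num_timesteps):
        tau = rng.random()
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        pred = predict(interpolate(noise, action, tau), tau)
        target = target_velocity(noise, action)
        total += sum((p - v) ** 2 for p, v in zip(pred, target)) / len(action)
    return total / num_timesteps


def euler_sample(predict: VelocityFn, noise: Vector, num_steps: int) -> Vector:
    """Integrate dx/dtau = predict(x, tau) from tau=0 to tau=1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        velocity = predict(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x
```

Drawing several values of `tau` for the same example, as `flow_matching_loss` does, reuses one expensive backbone forward pass across many cheap action-head updates. That is the usual motivation for sampling multiple timesteps per example.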

1.3 MolmoBot-Pi0 and MolmoBot-SPOC

MolmoBot-Pi0 replicates the published $\pi_0$ architecture (PaliGemma-3B VLM encoder, DiT flow head, frozen SigLIP vision) to isolate the impact of synthetic data. MolmoBot-SPOC is a lightweight, transform-and-decode policy derived from SPOC with a single-frame frozen SigLIP2 encoder, SigLIP text encoder, bidirectional decoder, and discrete quantile-binned actions (256 bins/dimension), supporting efficient RL fine-tuning and edge deployment.
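Quantile binning, as used for the MolmoBot-SPOC action space, chooses bin edges so that each of the 256 bins receives roughly the same number of training values. The sketch below shows the idea for a single action dimension. The helper names, the use of per-bin means for decoding, and the fallback for empty bins are assumptions rather than details from the source.

```python
from bisect import bisect_right
from typing import List, Sequence


def quantile_edges(values: Sequence[float], num_bins: int = 256) -> List[float]:
    """Interior bin edges chosen so each bin holds roughly equal counts."""
    if len(values) < num_bins:
        raise ValueError("need at least num_bins values to fit the edges")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // num_bins] for k in range(1, num_bins)]


def encode(value: float, edges: Sequence[float]) -> int:
    """Map a continuous action value to a token in [0, len(edges)]."""
    return bisect_right(edges, value)


def bin_centers(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Decode table: mean of the training values that fall in each bin."""
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for v in values:
        b = encode(v, edges)
        sums[b] += v
        counts[b] += 1
    centers = []
    for b, (s, c) in enumerate(zip(sums, counts)):
        # Empty bins (possible with duplicated values) fall back to an edge.
        centers.append(s / c if c else edges[min(b, len(edges) - 1)])
    return centers
```

Compared with uniform bins, quantile bins spend resolution where actions are dense (for example, small joint motions) at the cost of coarser bins in the tails.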

2. Synthetic Data Generation and MolmoBot-Data

2.1 MolmoBot-Engine and MolmoSpaces

MolmoBot-Engine, built on the MolmoSpaces simulation ecosystem, procedurally generates diverse training episodes. MolmoSpaces comprises 232,000 simulated homes, 48,000 articulated objects (doors, drawers, cabinets), and 11,000 rigid objects. For each episode, task-relevant objects are placed at random 6-DoF poses; environment, lighting, textures, dynamics, camera extrinsics, and initial robot/joint configurations are randomized. Scripted experts (IK for Franka FR3, cuRobo for RB-Y1) generate collision-free demonstration trajectories at high throughput ([…] episodes per GPU-hour on 100 A100 GPUs).
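As an illustration of random 6-DoF object placement, the sketch below draws a position uniformly from a box and an orientation uniformly from all rotations. It is not taken from MolmoBot-Engine: the `Pose6D` type, the `sample_pose` function, and the workspace bounds are hypothetical, and the real engine presumably also enforces support surfaces and collision checks.

```python
import math
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Pose6D:
    position: Tuple[float, float, float]
    quaternion: Tuple[float, float, float, float]  # (w, x, y, z), unit norm


def sample_pose(rng: random.Random,
                lower: Sequence[float],
                upper: Sequence[float]) -> Pose6D:
    """Uniform position inside a box plus a uniformly random orientation."""
    x, y, z = (rng.uniform(lo, hi) for lo, hi in zip(lower, upper))
    # Normalising a 4-D standard Gaussian sample yields a rotation that is
    # uniformly distributed over SO(3).
    norm = 0.0
    while norm < 1e-12:
        q = [rng.gauss(0.0, 1.0) for _ in range(4)]
        norm = math.sqrt(sum(c * c for c in q))
    w, qx, qy, qz = (c / norm for c in q)
    return Pose6D((x, y, z), (w, qx, qy, qz))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so an episode can be regenerated
    # Hypothetical tabletop workspace bounds in metres.
    print(sample_pose(rng, lower=(-0.3, -0.3, 0.0), upper=(0.3, 0.3, 0.2)))
```

Seeding the generator per episode makes every randomized scene reproducible, which helps when a failed demonstration needs to be replayed.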

2.2 Dataset Statistics

MolmoBot-Data consists of 1.8 million expert episodes (299 million frames, 5,817 robot-hours) across eight static and mobile manipulation tasks, as detailed below.

| Task | Robot | Episodes | Frames | Environments |
|---|---|---|---|---|
| Pick | Franka | 781.8K | 56.9M | 73.2K |
| Pick-Place | Franka | 554.2K | 143.9M | 61.6K |
| Next-To | Franka | 182.7K | 54.7M | 44.9K |
| Color | Franka | 28.6K | 7.5M | 5.3K |
| Door-Open | RB-Y1 | 99.3K | 19.5M | 16.7K |
| Open (drawers) | RB-Y1 | 46.6K | 6.9M | 10.5K |
| Pick | RB-Y1 | 62.7K | 7.5M | 28.1K |
| Pick-Place | RB-Y1 | 14.8K | 2.4M | 9.3K |

The dataset supports supervised training across diverse object categories, environments, and manipulation types (Deshpande et al., 17 Mar 2026).
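The headline totals can be checked against the per-task rows with a few lines of arithmetic. The snippet below sums the table and derives two quantities that the source does not report directly: the mean episode length and the implied frame rate. The frame rate assumes that one "frame" corresponds to one recorded timestep.

```python
# (task, robot, episodes in thousands, frames in millions), copied from the table
ROWS = [
    ("Pick", "Franka", 781.8, 56.9),
    ("Pick-Place", "Franka", 554.2, 143.9),
    ("Next-To", "Franka", 182.7, 54.7),
    ("Color", "Franka", 28.6, 7.5),
    ("Door-Open", "RB-Y1", 99.3, 19.5),
    ("Open (drawers)", "RB-Y1", 46.6, 6.9),
    ("Pick", "RB-Y1", 62.7, 7.5),
    ("Pick-Place", "RB-Y1", 14.8, 2.4),
]
TOTAL_ROBOT_HOURS = 5817  # stated in the text

episodes_k = sum(row[2] for row in ROWS)
frames_m = sum(row[3] for row in ROWS)

# Derived quantities (not reported in the source).
mean_frames_per_episode = (frames_m * 1e6) / (episodes_k * 1e3)
implied_frames_per_second = (frames_m * 1e6) / (TOTAL_ROBOT_HOURS * 3600)

if __name__ == "__main__":
    print(f"episodes: {episodes_k:.1f}K")   # 1770.7K, i.e. about 1.8 million
    print(f"frames:   {frames_m:.1f}M")     # 299.3M, i.e. about 299 million
    print(f"mean frames per episode: {mean_frames_per_episode:.0f}")
    print(f"implied frame rate: {implied_frames_per_second:.1f} Hz")
```

The row sums agree with the stated totals of 1.8 million episodes and 299 million frames after rounding.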

3. Training Paradigm and Data Augmentation

3.1 Objectives and Optimization

All MolmoBot policies are trained by behavior cloning from MolmoBot-Data. The main variant uses a batch size of 1,024 and the AdamW optimizer (learning rate […], weight decay […], warmup 2,000 steps) for 200K steps on static tasks and 100K on mobile tasks. The flow-matching DiT head samples […] flow timesteps per example. Trajectories are upsampled in the dataset via grasp retries, additional samples after successful picks, and post-completion augmentation.

MolmoBot-Pi0 and MolmoBot-SPOC are trained under identical or proportionally reduced regimens, the latter using batch size 512, 100K steps, and 256-bin quantile tokenization.

3.2 Domain Randomization and Augmentation

Episodes are rendered with stochastic variation in lighting, texture, mass, friction, damping, camera extrinsics, initial joint offsets, and action noise. Training further applies OFA-style image augmentations (ColorJitter, GaussianBlur, RandomPosterize, RandomSharpness, RandomGrayscale) and procedural language prompt randomization to enhance robustness.

4. Evaluation and Zero-Shot Transfer

4.1 Robotic Platforms and Tasks

Evaluation spans both static (Franka FR3, 7-DoF arm with Robotiq gripper) and mobile (Rainbow Robotics RB-Y1, holonomic base and dual 7-DoF arms) hardware, with real and simulated tasks including pick, pick-and-place, next-to, color sorting, door opening, drawer/cabinet opening, and mobile pick-and-place.

4.2 Zero-Shot Real-World and Simulation Results

Key performance results include:

  • Franka FR3 real pick-and-place (120 trials):
    • MolmoBot ($F=2$): […]
    • MolmoBot ($F=3$): […]
    • MolmoBot-Pi0: […]
    • […]-DROID (zero-shot): […]
    • […]-DROID (zero-shot): […]
  • Franka FR3 held-out simulation (“Pick-MSProc”):
    • MolmoBot ($F=2$): […]
    • MolmoBot-Pi0: […]
    • […]: […]
    • […] + fine-tuned: […]
  • RB-Y1 mobile manipulation:
    • MolmoBot (door opening) in sim: […] (generalist), […] (specialist)
    • In 9 real-world trials: 4/9 successful grasps, 2/9 full door opens

MolmoBot thus exceeds the sim-to-real performance of baselines trained on […] real-world robot hours, substantiating the impact of extensive, procedurally diverse simulation.
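With only nine real-world door-opening trials, the reported counts carry wide statistical uncertainty. As a rough guide, the sketch below computes a 95% Wilson score interval for a success proportion. The interval is an illustrative calculation on the reported counts, not a figure from the source, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for label, k in (("successful grasps", 4), ("full door opens", 2)):
        low, high = wilson_interval(k, 9)
        print(f"{label}: {k}/9, 95% interval {low:.2f} to {high:.2f}")
```

For 4/9 the interval spans roughly 0.19 to 0.73, and for 2/9 roughly 0.06 to 0.55, so the mobile results are best read as a feasibility signal rather than a precise success rate.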

5. Empirical Insights and Ablation Analyses

Dataset and architectural ablations reveal:

  • Increasing trajectory count (10K→25K→50K) monotonically improves real pick success ([…]).
  • Object diversity strongly benefits simulation but less so real-world transfer.
  • Expanding environment count beyond 1,000 has marginal real-world impact for local tasks, suggesting local workspace factors dominate.
  • Flow denoising timesteps: real-world success peaks at […]; simulated success increases up to […].
  • Absolute joint targets outperform delta targets for real manipulation (real: […] absolute vs […] delta; sim: both […]).
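To make the last ablation concrete, the two action parameterizations can be written as a pair of conversions. This is a hedged sketch: the function names are hypothetical, and it assumes each delta is taken relative to the previous target in the chunk, which the source does not specify.

```python
from typing import List, Sequence

Joints = List[float]


def to_delta_targets(current: Sequence[float],
                     absolute: Sequence[Sequence[float]]) -> List[Joints]:
    """Express each absolute joint target relative to the preceding one."""
    deltas: List[Joints] = []
    previous = list(current)
    for target in absolute:
        deltas.append([t - p for t, p in zip(target, previous)])
        previous = list(target)
    return deltas


def to_absolute_targets(current: Sequence[float],
                        deltas: Sequence[Sequence[float]]) -> List[Joints]:
    """Accumulate deltas starting from the current joint configuration."""
    targets: List[Joints] = []
    previous = list(current)
    for delta in deltas:
        previous = [p + d for p, d in zip(previous, delta)]
        targets.append(previous)
    return targets


if __name__ == "__main__":
    start = [0.0, 0.5]
    chunk = [[0.1, 0.4], [0.3, 0.2]]
    deltas = to_delta_targets(start, chunk)
    print(deltas)
    print(to_absolute_targets(start, deltas))
```

Because delta targets must be accumulated from a starting configuration, any mismatch between the assumed and the actual robot state carries through every later step. Absolute targets do not depend on that starting point.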

A plausible implication is that scale in both scene and object diversity mainly enhances simulation generalization, while sim-to-real transfer is bottlenecked primarily by local perception-action couplings.

6. Limitations and Prospective Directions

MolmoBot-Engine is currently restricted to rigid-object and articulated-object manipulation (doors, drawers, pick-and-place). Extension to contact-rich, deformable, fluid, or soft-body tasks is noted as a key future direction. MolmoBot-SPOC’s compactness and discretized actions are conducive to on-robot RL fine-tuning (e.g., FLaRe) and resource-constrained deployment. Additional research avenues include lightweight real-world fine-tuning (e.g., LoRA on the DiT head), cross-modal domain adaptation, and improved physical fidelity via learned world models (Deshpande et al., 17 Mar 2026).
