MolmoBot: Zero-Shot Sim-to-Real Manipulation
- MolmoBot is a family of generalist robot manipulation policies that uses multi-frame vision–language transformers and synthetic data to achieve robust zero-shot sim-to-real transfer.
- Its architecture features a 4B-parameter Molmo2 backbone, a DiT-style flow-matching diffusion action head, and variants like MolmoBot-Pi0 and MolmoBot-SPOC for diverse applications.
- The system demonstrates strong empirical performance across static and mobile tasks with high real-world success rates and effective sim-to-real generalization.
MolmoBot is a family of generalist robot manipulation policies that demonstrates effective zero-shot sim-to-real transfer for both static and mobile manipulation, challenging the prevailing assumption that simulation alone is insufficient for robust real-world deployment. Leveraging large-scale, procedurally generated synthetic data and multi-frame vision–language policy architectures, MolmoBot achieves high real-world task success without any real-world fine-tuning, outperforming prior methods that rely on extensive real-world robot hours (Deshpande et al., 17 Mar 2026).
1. Architecture and Policy Variants
1.1 Molmo2-Based Multi-Frame Vision–Language Backbone
The principal MolmoBot policy is built upon a 4B-parameter Molmo2 vision–language transformer, pre-trained on billions of image–text pairs with video grounding capabilities. This backbone uses a frozen SigLIP2 vision encoder to tokenize up to F RGB frames per view, each producing 192 patch tokens. These tokens are projected into the Molmo2 embedding space and tagged with view and time-offset indices. Inputs to the backbone comprise visual tokens from all timesteps and views, tokenized natural-language instructions, optional 2D point prompts for grounding, and proprioceptive state embeddings (joint positions and base pose). When more than one frame is used, MolmoBot interleaves tokens from several timesteps, allowing the model to reason jointly over scene configuration and motion history.
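The input layout described above can be sketched as a simple token-count calculation. This is a minimal illustration, not the actual Molmo2 tokenizer: the names `VisualToken`, `build_visual_tokens`, and `sequence_layout` are hypothetical, the ordering of segments and frames is an assumption, and the view, frame, and text-token counts in the usage lines are arbitrary. Only the figure of 192 patch tokens per frame comes from the text.

```python
from dataclasses import dataclass

PATCH_TOKENS_PER_FRAME = 192  # stated in the text above


@dataclass(frozen=True)
class VisualToken:
    view: int         # camera index
    time_offset: int  # 0 = current frame, larger = further in the past
    patch: int        # patch index within the frame


def build_visual_tokens(num_views, num_frames):
    """Lay out patch tokens oldest frame first, each tagged with view and time offset.

    The ordering (time-major, then view, then patch) is an assumption.
    """
    tokens = []
    for time_offset in reversed(range(num_frames)):
        for view in range(num_views):
            for patch in range(PATCH_TOKENS_PER_FRAME):
                tokens.append(VisualToken(view, time_offset, patch))
    return tokens


def sequence_layout(num_views, num_frames, num_text_tokens,
                    num_point_tokens=0, num_proprio_tokens=1):
    """Return (segment, length) pairs for one backbone input; segment order is assumed."""
    visual = len(build_visual_tokens(num_views, num_frames))
    return [
        ("visual", visual),
        ("instruction", num_text_tokens),
        ("point_prompt", num_point_tokens),
        ("proprioception", num_proprio_tokens),
    ]


# Arbitrary example: 2 cameras, 3 frames each, a 12-token instruction.
layout = sequence_layout(num_views=2, num_frames=3, num_text_tokens=12)
total_tokens = sum(length for _, length in layout)
print(layout, total_tokens)
```

The point of the sketch is that the visual segment grows linearly with both the number of views and the number of frames, which is why the frame count is a meaningful design knob for a multi-frame policy.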
1.2 Flow-Matching Diffusion Action Head
The DiT-style action head uses flow matching to regress continuous robot actions. Its DiT blocks cross-attend to the hidden sequence produced by the backbone. Action trajectories (joint positions, gripper, and base commands over a multi-step horizon) are noised, and the head predicts the corresponding denoising flow under a flow-matching loss. Training samples several denoising timesteps per example in parallel. At inference, iterative denoising generates the action chunk for real robot execution.
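To make the training target and the sampling loop concrete, the following is a minimal sketch of flow matching on a flattened action vector. It assumes the common straight-line interpolation between Gaussian noise and the clean action, with an Euler integrator at inference; the source does not state which noising schedule or sampler MolmoBot uses, so these choices, the function names, and the toy action values are all assumptions. The `oracle_velocity` function stands in for the trained DiT head.

```python
import random


def interpolate(noise, action, tau):
    """Point on the straight path from noise (tau=0) to the clean action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Regression target under the assumed linear path: a constant velocity."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, action, num_timesteps, rng):
    """Mean squared error averaged over several sampled timesteps for one example."""
    total = 0.0
    for _ in range(num_timesteps):
        tau = rng.random()
        noise = [rng.gauss(0.0, 1.0) for _ in action]
        pred = predict_velocity(interpolate(noise, action, tau), tau)
        target = target_velocity(noise, action)
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(action)
    return total / num_timesteps


def sample_action(predict_velocity, dim, num_steps, rng):
    """Euler integration of the predicted velocity from noise (tau=0) to tau=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
clean_action = [0.3, -0.1, 0.8]  # made-up flattened action chunk


def oracle_velocity(x, tau):
    """Exact velocity field when every path ends at clean_action (stand-in for the head)."""
    return [(a - xi) / (1.0 - tau) for a, xi in zip(clean_action, x)]


loss = flow_matching_loss(oracle_velocity, clean_action, num_timesteps=8, rng=rng)
sampled = sample_action(oracle_velocity, dim=3, num_steps=10, rng=rng)
print(loss, sampled)
```

With the oracle field the loss is numerically zero and the sampler lands on `clean_action`, which is a quick sanity check that the target and the integrator use the same time convention.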
1.3 MolmoBot-Pi0 and MolmoBot-SPOC
MolmoBot-Pi0 replicates the published π0 architecture (PaliGemma-3B VLM encoder, DiT flow head, frozen SigLIP vision encoder) to isolate the impact of synthetic data. MolmoBot-SPOC is a lightweight, transform-and-decode policy derived from SPOC with a single-frame frozen SigLIP2 encoder, SigLIP text encoder, bidirectional decoder, and discrete quantile-binned actions (256 bins/dimension), supporting efficient RL fine-tuning and edge deployment.
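Quantile binning of one action dimension can be sketched as follows. This is an illustrative implementation under stated assumptions, not the MolmoBot-SPOC tokenizer: the function names are hypothetical, decoding by per-bin mean is one possible choice, and the uniform sample data is made up. Only the bin count of 256 comes from the text.

```python
from bisect import bisect_right

NUM_BINS = 256  # bins per action dimension, as stated in the text


def fit_quantile_edges(values, num_bins=NUM_BINS):
    """Return num_bins - 1 interior edges so each bin holds roughly equal data mass."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (i * n) // num_bins)] for i in range(1, num_bins)]


def encode(value, edges):
    """Map a continuous value to a bin index in [0, len(edges)]."""
    return bisect_right(edges, value)


def fit_bin_values(values, edges):
    """Representative value per bin: the mean of the training values that fall in it."""
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for v in values:
        b = encode(v, edges)
        sums[b] += v
        counts[b] += 1
    centers = []
    for b in range(len(edges) + 1):
        if counts[b]:
            centers.append(sums[b] / counts[b])
        else:
            # Empty bin (possible with heavily repeated values): use a neighbouring edge.
            centers.append(edges[min(b, len(edges) - 1)])
    return centers


# Made-up training data for one action dimension.
samples = [i / 1000.0 for i in range(-1000, 1000)]
edges = fit_quantile_edges(samples)
centers = fit_bin_values(samples, edges)

token = encode(0.1234, edges)   # discrete action token
recovered = centers[token]      # continuous value sent to the robot
print(token, recovered)
```

Because the edges follow the data quantiles, densely used regions of the action range get finer resolution than rarely used ones, which is the usual motivation for quantile bins over uniform bins.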
2. Synthetic Data Generation and MolmoBot-Data
2.1 MolmoBot-Engine and MolmoSpaces
MolmoBot-Engine, built on the MolmoSpaces simulation ecosystem, procedurally generates diverse training episodes. MolmoSpaces comprises 232,000 simulated homes, 48,000 articulated objects (doors, drawers, cabinets), and 11,000 rigid objects. For each episode, task-relevant objects are placed at random 6-DoF poses, and the environment, lighting, textures, dynamics, camera extrinsics, and initial robot and joint configurations are randomized. Scripted experts (IK for Franka FR3, cuRobo for RB-Y1) generate collision-free demonstration trajectories at high throughput on 100 A100 GPUs.
2.2 Dataset Statistics
MolmoBot-Data consists of 1.8 million expert episodes (299 million frames, 5,817 robot-hours) across eight static and mobile manipulation tasks, as detailed below.
| Task | Robot | Episodes | Frames | Env Count |
|---|---|---|---|---|
| Pick | Franka | 781.8K | 56.9M | 73.2K |
| Pick-Place | Franka | 554.2K | 143.9M | 61.6K |
| Next-To | Franka | 182.7K | 54.7M | 44.9K |
| Color | Franka | 28.6K | 7.5M | 5.3K |
| Door-Open | RB-Y1 | 99.3K | 19.5M | 16.7K |
| Open (drawers) | RB-Y1 | 46.6K | 6.9M | 10.5K |
| Pick | RB-Y1 | 62.7K | 7.5M | 28.1K |
| Pick-Place | RB-Y1 | 14.8K | 2.4M | 9.3K |
The dataset supports supervised training across diverse object categories, environments, and manipulation types (Deshpande et al., 17 Mar 2026).
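As a quick consistency check, the per-task rows can be summed and compared against the headline figures. The sketch below copies the episode and frame counts from the table; the implied frame rate is a derived quantity that assumes robot-hours are computed directly from the frame count at a constant rate, which the source does not state.

```python
# (task, robot, episodes in thousands, frames in millions), copied from the table.
ROWS = [
    ("Pick", "Franka", 781.8, 56.9),
    ("Pick-Place", "Franka", 554.2, 143.9),
    ("Next-To", "Franka", 182.7, 54.7),
    ("Color", "Franka", 28.6, 7.5),
    ("Door-Open", "RB-Y1", 99.3, 19.5),
    ("Open (drawers)", "RB-Y1", 46.6, 6.9),
    ("Pick", "RB-Y1", 62.7, 7.5),
    ("Pick-Place", "RB-Y1", 14.8, 2.4),
]
ROBOT_HOURS = 5817  # headline figure from the text

total_episodes_k = sum(row[2] for row in ROWS)
total_frames_m = sum(row[3] for row in ROWS)

# Assumption: robot-hours = frames / (constant frame rate * 3600).
implied_hz = total_frames_m * 1e6 / (ROBOT_HOURS * 3600)

# Mean episode length in frames for each (task, robot) pair.
frames_per_episode = {
    (task, robot): frames_m * 1e6 / (episodes_k * 1e3)
    for task, robot, episodes_k, frames_m in ROWS
}

print(f"episodes: {total_episodes_k:.1f}K, frames: {total_frames_m:.1f}M")
print(f"implied rate: {implied_hz:.1f} frames/s")
```

The rows sum to about 1,770.7K episodes and 299.3M frames, consistent with the rounded totals of 1.8 million episodes and 299 million frames, and the frame count and robot-hours together imply a rate of roughly 14 frames per second under the stated assumption.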
3. Training Paradigm and Data Augmentation
3.1 Objectives and Optimization
All MolmoBot policies are trained by behavior cloning on MolmoBot-Data. The main variant is trained with a batch size of 1,024 and the AdamW optimizer (with weight decay and a 2,000-step learning-rate warmup) for 200K steps on static tasks and 100K steps on mobile tasks. The flow-matching DiT head samples several flow timesteps per example. Trajectories are upsampled in the dataset via grasp retries, additional samples after successful picks, and post-completion augmentation.
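The warmup portion of this schedule can be sketched as below. The batch size, warmup length, and step counts are taken from the text; the peak learning rate is left as a required argument because its value is not given here, and holding the rate constant after warmup is an assumption (the source does not describe the post-warmup schedule). `TrainConfig` and `learning_rate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    peak_lr: float               # not specified here; must be supplied by the caller
    batch_size: int = 1024       # from the text
    warmup_steps: int = 2000     # from the text
    total_steps: int = 200_000   # static tasks; 100_000 for mobile tasks


def learning_rate(step, cfg):
    """Linear warmup to peak_lr, then constant (the constant tail is an assumption)."""
    if step < cfg.warmup_steps:
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    return cfg.peak_lr


# With peak_lr=1.0 the function returns the warmup scale factor directly.
cfg = TrainConfig(peak_lr=1.0)
for step in (0, 999, 1999, 50_000):
    print(step, learning_rate(step, cfg))
```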
MolmoBot-Pi0 and MolmoBot-SPOC are trained under identical or proportionally reduced regimens, the latter using batch size 512, 100K steps, and 256-bin quantile tokenization.
3.2 Domain Randomization and Augmentation
Episodes are rendered with stochastic variation in lighting, texture, mass, friction, damping, camera extrinsics, initial joint offsets, and action noise. Training further applies OFA-style image augmentations (ColorJitter, GaussianBlur, RandomPosterize, RandomSharpness, RandomGrayscale) and procedural language prompt randomization to enhance robustness.
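Per-episode domain randomization of this kind is typically implemented as a seeded sampler over simulation parameters. The sketch below illustrates the idea only: `EpisodeRandomization`, `sample_randomization`, and `add_action_noise` are hypothetical names, and every numeric range is an illustrative placeholder rather than a value used by MolmoBot-Engine.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeRandomization:
    light_intensity: float
    texture_id: int
    mass_scale: float
    friction_scale: float
    damping_scale: float
    camera_offset_m: tuple     # xyz perturbation of the camera extrinsics
    joint_offset_rad: tuple    # perturbation of the initial joint configuration
    action_noise_std: float


def sample_randomization(rng, num_textures=1000, num_joints=7):
    """Draw one set of per-episode parameters. All ranges are illustrative placeholders."""
    return EpisodeRandomization(
        light_intensity=rng.uniform(0.5, 1.5),
        texture_id=rng.randrange(num_textures),
        mass_scale=rng.uniform(0.8, 1.2),
        friction_scale=rng.uniform(0.8, 1.2),
        damping_scale=rng.uniform(0.8, 1.2),
        camera_offset_m=tuple(rng.uniform(-0.02, 0.02) for _ in range(3)),
        joint_offset_rad=tuple(rng.uniform(-0.05, 0.05) for _ in range(num_joints)),
        action_noise_std=rng.uniform(0.0, 0.01),
    )


def add_action_noise(action, std, rng):
    """Perturb an expert action with zero-mean Gaussian noise."""
    return [a + rng.gauss(0.0, std) for a in action]


# Seeding per episode keeps each randomized episode reproducible.
params = sample_randomization(random.Random(42))
noisy = add_action_noise([0.0] * 7, params.action_noise_std, random.Random(7))
print(params)
```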
4. Evaluation and Zero-Shot Transfer
4.1 Robotic Platforms and Tasks
Evaluation spans both static (Franka FR3, 7-DoF arm with Robotiq gripper) and mobile (Rainbow Robotics RB-Y1, holonomic base and dual 7-DoF arms) hardware, with real and simulated tasks including pick, pick-and-place, next-to, color sorting, door opening, drawer/cabinet opening, and mobile pick-and-place.
4.2 Zero-Shot Real-World and Simulation Results
The evaluation covers the following comparisons:
- Franka FR3 real pick-and-place (120 trials): MolmoBot (F=2), MolmoBot (F=3), and MolmoBot-Pi0 against two DROID baselines evaluated zero-shot.
- Franka FR3 held-out simulation (“Pick-MSProc”): MolmoBot (F=2) and MolmoBot-Pi0 against a baseline and its fine-tuned counterpart.
- RB-Y1 mobile manipulation: MolmoBot door opening in simulation as both a generalist and a specialist policy; in 9 real-world trials, 4/9 successful grasps and 2/9 full door opens.
MolmoBot thus outperforms baselines trained on extensive real-world robot hours, substantiating the impact of extensive, procedurally diverse simulation.
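When reading trial counts as small as the 9 real-world door-opening trials, it can help to attach an uncertainty range to the observed rate. The sketch below computes a Wilson score interval; the choice of this interval and the 95% level (z = 1.96) are assumptions made for illustration and are not part of the reported evaluation.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives about 95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Real-world RB-Y1 door opening: counts taken from the text.
grasp_low, grasp_high = wilson_interval(4, 9)
open_low, open_high = wilson_interval(2, 9)
print(f"grasp: {grasp_low:.2f} to {grasp_high:.2f}")
print(f"full open: {open_low:.2f} to {open_high:.2f}")
```

With only 9 trials the intervals are wide, so the mobile-manipulation numbers are best read as evidence that transfer is possible rather than as a precise success rate.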
5. Empirical Insights and Ablation Analyses
Dataset and architectural ablations reveal:
- Increasing trajectory count (10K→25K→50K) monotonically improves real pick success.
- Object diversity strongly benefits simulation but less so real-world transfer.
- Expanding environment count beyond 1,000 has marginal real-world impact for local tasks, suggesting local workspace factors dominate.
- Flow denoising timesteps: real-world success peaks at a specific count, while simulated success increases with the count up to a limit.
- Absolute joint targets outperform delta targets for real manipulation, whereas the two parameterizations perform the same in simulation.
A plausible implication is that scale in both scene and object diversity mainly enhances simulation generalization, while sim-to-real transfer is bottlenecked primarily by local perception-action couplings.
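The difference between absolute and delta joint targets in the ablation above can be illustrated with a toy model. The source reports the empirical gap but does not give a mechanism; the sketch below shows one commonly cited candidate, namely that a constant per-step execution error accumulates under delta targets but not under absolute targets. The function names, the definition of a delta as the difference between consecutive targets, and the error value are all assumptions.

```python
def to_delta_targets(start, absolute_targets):
    """Convert absolute joint targets into per-step deltas from the previous target."""
    deltas, prev = [], list(start)
    for tgt in absolute_targets:
        deltas.append([t - p for t, p in zip(tgt, prev)])
        prev = tgt
    return deltas


def execute_absolute(targets, step_error):
    """Toy execution: each step lands step_error away from its absolute target."""
    return [[t + step_error for t in tgt] for tgt in targets]


def execute_delta(start, deltas, step_error):
    """Toy execution: each delta is applied to the reached position, so errors add up."""
    pos, path = list(start), []
    for d in deltas:
        pos = [p + di + step_error for p, di in zip(pos, d)]
        path.append(pos)
    return path


start = [0.0]
targets = [[0.1 * (k + 1)] for k in range(10)]   # made-up single-joint trajectory
deltas = to_delta_targets(start, targets)

abs_path = execute_absolute(targets, step_error=0.01)
delta_path = execute_delta(start, deltas, step_error=0.01)

abs_final_error = abs(abs_path[-1][0] - targets[-1][0])
delta_final_error = abs(delta_path[-1][0] - targets[-1][0])
print(abs_final_error, delta_final_error)
```

In this toy setting the absolute-target error stays at one step's worth while the delta-target error grows with the number of steps. A perfectly modeled simulator has no such per-step error, which would be consistent with the two parameterizations tying in simulation.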
6. Limitations and Prospective Directions
MolmoBot-Engine is currently restricted to rigid and articulated manipulation (doors, drawers, pick-and-place). Extension to contact-rich, deformable, fluid, or soft-body tasks is noted as a key future direction. MolmoBot-SPOC’s compactness and discretized actions are conducive to on-robot RL fine-tuning (e.g., FLaRe) and resource-constrained deployment. Additional research avenues include lightweight real-world fine-tuning (e.g., LoRA on the DiT head), cross-modal domain adaptation, and improved physical fidelity via learned world models (Deshpande et al., 17 Mar 2026).