MolmoBot-Data: Synthetic Robotics Demonstrations
- MolmoBot-Data is a fully-synthetic dataset featuring expert demonstrations for articulated object manipulation and pick-and-place robotics tasks.
- It is generated using the MolmoBot-Engine and MolmoSpaces pipelines, which create diverse indoor environments with extensive domain randomization and detailed annotations.
- The dataset supports robust zero-shot sim-to-real transfer with high success rates across varied tasks and robot platforms, underscoring its significance for robotics research.
MolmoBot-Data is a large-scale, fully-synthetic dataset of expert demonstrations for articulated object manipulation and pick-and-place robotics tasks. Developed to support zero-shot sim-to-real transfer, it is a foundational resource for training and benchmarking embodied vision-language manipulation agents that generalize to unseen physical environments with no reliance on real-world data collection or finetuning (Deshpande et al., 17 Mar 2026).
1. Data Generation and Composition
MolmoBot-Data is generated end-to-end using MolmoBot-Engine, an open-source simulation pipeline built on top of MolmoSpaces. MolmoSpaces provides 232,000 procedurally generated indoor environments such as houses, offices, kitchens, and bedrooms, with diverse layouts and semantic metadata. For each episode, an environment seed is sampled (about 94,300 distinct seeds appear in the dataset), and the task-relevant assets, either rigid pickup objects or articulable elements, are randomized in 6-DoF pose subject to collision and reachability constraints.
Asset diversity is achieved by sourcing pickup objects from AI2-THOR and Objaverse, filtered for graspability (≤ 50 cm × 50 cm × 15 cm) and watertight colliders. The finalized dataset spans 11,400 unique pickup objects and 9,400 receptacle classes. Extensive domain randomization is applied across lighting (number, type, position, color, intensity), textures (procedural and real), object dynamics (mass, friction), and 6-DoF pose of objects and the robot base to ensure broad generalization.
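The size and collider filter can be illustrated with a minimal sketch. The `Asset` record, the `is_graspable` helper, and the rule of comparing sorted bounding-box extents against the sorted size limit are all assumptions made for illustration; the source states only the size bound and the watertightness requirement.

```python
from dataclasses import dataclass

# Size limit quoted in the text, in metres: 50 cm x 50 cm x 15 cm.
MAX_EXTENT_M = (0.50, 0.50, 0.15)


@dataclass(frozen=True)
class Asset:
    """Hypothetical asset record; not the dataset's actual schema."""
    name: str
    extent_m: tuple  # bounding-box size (x, y, z) in metres
    watertight: bool


def is_graspable(asset: Asset) -> bool:
    """Assumed rule: sorted extents must fit inside the sorted size limit."""
    dims = sorted(asset.extent_m, reverse=True)
    limits = sorted(MAX_EXTENT_M, reverse=True)
    return asset.watertight and all(d <= m for d, m in zip(dims, limits))


candidates = [
    Asset("mug", (0.09, 0.12, 0.10), True),
    Asset("floor_lamp", (0.30, 0.30, 1.60), True),     # too tall
    Asset("open_mesh_bowl", (0.20, 0.20, 0.08), False),  # collider not watertight
]
kept = [a.name for a in candidates if is_graspable(a)]
print(kept)
```

Sorting the extents makes the check independent of how an asset happens to be oriented in its file, which is one plausible reading of the bound rather than a documented behavior.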
Demonstrations are collected on two platforms:
- Franka FR3: 7-DoF arm with parallel-jaw gripper, fixed pedestal.
- Rainbow Robotics RB-Y1: holonomic 3-DoF base, 6-DoF torso, 2-DoF head, dual 7-DoF arms.
Episodes utilize randomized joint configurations, action noise injection (truncated Gaussian in TCP and base spaces), and camera pose perturbations. Demonstrations are produced by an expert policy (π*) that samples precomputed 6-DoF grasps, executes phase-based closed-loop trajectories, and discards episodes that fail after three retries.
2. Dataset Statistics and Structure
MolmoBot-Data comprises 1.8 million trajectories totaling 299 million frames and 5,817 hours of experience. The average trajectory lasts 11.8 seconds, with per-task averages ranging from 4.8 to 20.1 seconds. The dataset was generated in approximately 4,500 GPU-hours, leveraging parallel simulation (100×A100 at 1,024 eps/GPU-h). Episodes are organized by task and robot platform, as summarized below:
| Task | Robot | Episodes | Environments | Objects / Receptacles | Avg. Length (s) | Total Hours |
|---|---|---|---|---|---|---|
| Pick | Franka | 781.8k | 73.2k | 10.7k | 4.8 | 1,042 |
| Pick-and-Place | Franka | 554.2k | 61.6k | 7.0k / 494 | 17.1 | 2,638 |
| Pick-and-Place-NextTo | Franka | 182.7k | 44.9k | 931 / 9.0k | 20.1 | 1,022 |
| Pick | RB-Y1 | 62.7k | 28.1k | 4.2k | 10.6 | 185 |
| Pick-and-Place | RB-Y1 | 14.8k | 9.3k | 2.4k / 183 | 14.0 | 58 |
| Door-open | RB-Y1 | 99.3k | 16.7k | – | 19.6 | 542 |
| Open (drawers) | RB-Y1 | 46.6k | 10.5k | 217 | 14.8 | 192 |
| PnP-Color | Franka | 28.6k | 5.3k | 3.1k / 183 | 17.4 | 138 |
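The per-task rows can be cross-checked against the headline totals with a few lines of arithmetic. The sketch below copies the episode counts, average lengths, and hours from the table; the only assumption is that hours are approximately episodes times average length, which holds up to rounding of the reported averages.

```python
# (task, robot, episodes, avg_length_s, reported_hours), copied from the table.
ROWS = [
    ("Pick", "Franka", 781_800, 4.8, 1042),
    ("Pick-and-Place", "Franka", 554_200, 17.1, 2638),
    ("Pick-and-Place-NextTo", "Franka", 182_700, 20.1, 1022),
    ("Pick", "RB-Y1", 62_700, 10.6, 185),
    ("Pick-and-Place", "RB-Y1", 14_800, 14.0, 58),
    ("Door-open", "RB-Y1", 99_300, 19.6, 542),
    ("Open (drawers)", "RB-Y1", 46_600, 14.8, 192),
    ("PnP-Color", "Franka", 28_600, 17.4, 138),
]

total_episodes = sum(row[2] for row in ROWS)
total_hours = sum(row[4] for row in ROWS)
# Dataset-wide mean trajectory length implied by the totals, in seconds.
mean_length_s = total_hours * 3600 / total_episodes

for task, robot, episodes, avg_s, hours in ROWS:
    estimated_hours = episodes * avg_s / 3600
    print(f"{task:24s}{robot:8s}{estimated_hours:8.1f} h (reported {hours})")

print(total_episodes, total_hours, round(mean_length_s, 1))
```

The rows sum to about 1.77 million episodes and 5,817 hours, which implies a mean length of roughly 11.8 seconds, in line with the figures quoted in the text.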
Tasks include pick, pick-and-place (standard, next-to, colored receptacle), opening drawers and doors, and mobile pick-and-place. Each task provides multi-view RGB sensor streams (up to 5 cameras per robot, at resolutions from 624×352 to 1024×576), proprioceptive state (joint positions/velocities, TCP and base poses), action targets, and a rich set of annotations and privileged metadata (object start/goal pose, grasp indicators, policy phase, retry counts, and 2D keypoints).
3. Observations, Actions, and Policy Architectures
Observations comprise multi-view and multi-frame RGB images, proprioceptive robot state, and synchronized action labels. MolmoBot-Data is formatted for compatibility with vision-language-action (VLA) policy classes:
- MolmoBot: Uses a frozen SigLIP2 vision encoder (192 tokens/image), fine-tuned Molmo2-4B text encoder, and a flow-matching diffusion transformer (DiT) action head that predicts H=16-step action chunks in continuous joint or base velocity space.
- MolmoBot-Π₀: Uses the PaliGemma 3B vision-language model with a flow-matching head, mirroring the π₀ architecture for direct baseline comparison.
- MolmoBot-SPOC: A lightweight transformer with SigLIP vision and text encoders, a proprioceptive MLP, and discrete action quantization (256 bins per dimension) decoded by a non-causal decoder.
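Discrete action quantization with 256 bins per dimension can be sketched as follows. This is a minimal illustration that assumes quantile-based bin edges fitted on one action dimension; the function names, the synthetic Gaussian samples, and the midpoint decoding rule are hypothetical and not taken from the MolmoBot-SPOC implementation.

```python
import bisect
import random

NUM_BINS = 256  # bins per action dimension, as stated in the text


def fit_quantile_edges(values, num_bins=NUM_BINS):
    """Return num_bins - 1 interior edges at evenly spaced empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def encode(value, edges):
    """Map a continuous value to a bin index in [0, len(edges)]."""
    return bisect.bisect_right(edges, value)


def decode(index, edges, lo, hi):
    """Map a bin index back to the midpoint of its interval."""
    bounds = [lo] + list(edges) + [hi]
    return 0.5 * (bounds[index] + bounds[index + 1])


rng = random.Random(0)
# Synthetic stand-in for one action dimension of the expert data.
samples = [rng.gauss(0.0, 0.2) for _ in range(50_000)]
edges = fit_quantile_edges(samples)

token = encode(0.05, edges)
recovered = decode(token, edges, min(samples), max(samples))
print(token, round(recovered, 4))
```

Quantile edges place more bins where actions are dense, so small corrective motions near zero are resolved more finely than rare large motions. Uniform edges would be a simpler alternative with coarser resolution in the dense region.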
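The flow-matching loss for MolmoBot, written in the standard conditional flow-matching form, is:

$$\mathcal{L}(\theta) = \mathbb{E}_{(o,\, a) \sim \mathcal{D},\; \tau \sim \mathcal{U}[0,1],\; \epsilon \sim \mathcal{N}(0, I)} \left\| v_\theta(a_\tau, \tau, o) - (a - \epsilon) \right\|^2, \qquad a_\tau = (1 - \tau)\,\epsilon + \tau\, a,$$

with $v_\theta$ as the DiT head, $\tau$ as a continuous denoising index, and $\mathcal{D}$ as the distribution of expert rollouts. Here $o$ denotes the observation and $a$ the expert action chunk.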
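A toy, single-sample version of a conditional flow-matching objective makes the roles of the DiT head, the denoising index, and the expert action chunk concrete. The sketch assumes a straight-line path from Gaussian noise to the expert chunk, an assumed action dimension of 7, and a placeholder velocity function in place of the real network; conditioning on observations is omitted for brevity.

```python
import random

H = 16          # action-chunk horizon from the text
ACTION_DIM = 7  # assumed: one value per arm joint


def interpolate(noise, action, tau):
    """Point on the straight path from noise (tau=0) to action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for n, a in zip(noise, action)]


def flow_matching_loss(velocity_fn, action, rng):
    """Single-sample conditional flow-matching loss (mean squared error)."""
    tau = rng.random()                              # continuous denoising index
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = interpolate(noise, action, tau)
    target = [a - n for a, n in zip(action, noise)]  # derivative of the path in tau
    pred = velocity_fn(noisy, tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(action)


def zero_velocity(noisy, tau):
    """Placeholder for the DiT head; always predicts zero velocity."""
    return [0.0] * len(noisy)


rng = random.Random(0)
expert_chunk = [0.1] * (H * ACTION_DIM)  # flattened stand-in for an expert chunk
loss = flow_matching_loss(zero_velocity, expert_chunk, rng)
print(loss)
```

At inference time a trained velocity function would be integrated from noise at tau=0 toward tau=1 in a small number of steps to produce an action chunk.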
4. Annotation, Noise Models, and Metadata
All demonstrations are executed and logged in closed-loop in MuJoCo (15 Hz, Franka; 10 Hz, RB-Y1). Each episode records RGBs, camera intrinsics/extrinsics, robot state, action labels (absolute/delta joint positions, TCP twist, gripper command), and task-relevant metadata.
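The logging rates allow a rough consistency check against the reported 299 million frames. The sketch below assumes one logged frame per control step (independent of the number of cameras) and takes the per-platform hours by summing the corresponding rows of the statistics table; both are assumptions about how the frame count is defined.

```python
CONTROL_HZ = {"Franka": 15, "RB-Y1": 10}  # logging rates from the text

# Total hours per platform, summed from the per-task table.
HOURS = {
    "Franka": 1042 + 2638 + 1022 + 138,
    "RB-Y1": 185 + 58 + 542 + 192,
}


def estimated_frames(hours, hz):
    """Frames logged over `hours` of experience at a fixed rate `hz`."""
    return hours * 3600 * hz


frames = {robot: estimated_frames(HOURS[robot], CONTROL_HZ[robot]) for robot in HOURS}
total_frames = sum(frames.values())
print(frames, total_frames)
```

Under these assumptions the estimate comes to about 296.5 million frames, within roughly 1% of the reported total.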
Noise models include:
- Action noise: Gaussian noise added to TCP and base actions during demonstration collection only, then clipped (e.g., σ_pos = 0.1·||Δx||, bounded at ±2 cm).
- Camera perturbation: Extrinsic noise of ±1–2 cm in position and ±4–8° in orientation, depending on the platform.
- Domain randomization: Applied to texture, lighting, and dynamics throughout.
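The action-noise model for TCP position can be sketched directly from the quoted parameters. The function name and the per-axis application of the noise are assumptions; clipping is used here as a simple stand-in for a truncated Gaussian, which would instead resample values that fall outside the bound.

```python
import math
import random

SIGMA_SCALE = 0.1  # sigma_pos = 0.1 * ||dx||, from the text
CLIP_M = 0.02      # +/- 2 cm bound, from the text


def noisy_tcp_delta(delta_xyz, rng):
    """Add per-axis Gaussian noise scaled by step size, clipped to +/- CLIP_M."""
    sigma = SIGMA_SCALE * math.sqrt(sum(d * d for d in delta_xyz))
    out = []
    for d in delta_xyz:
        noise = rng.gauss(0.0, sigma)
        noise = max(-CLIP_M, min(CLIP_M, noise))
        out.append(d + noise)
    return out


rng = random.Random(0)
commanded = [0.03, 0.0, -0.04]  # metres; the norm is 0.05, so sigma is 0.005
executed = noisy_tcp_delta(commanded, rng)
print(executed)
```

Scaling the standard deviation by the step size keeps the noise proportionally small during fine motions such as the final approach to a grasp, while the hard bound limits the worst-case deviation on large steps.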
Privileged metadata includes 3D bounding boxes, mesh collision flags, grasp information, and phase/attempt tracking for granular policy analysis.
5. Usage, Preprocessing, and Training Protocols
The recommended training mix uses fixed multitask sampling ratios (e.g., Franka: Pick 20%, Pick-and-place 45%, Next-to 20%, Color 15%; RB-Y1: Open 20%, Door-open 20%, Pick 30%, Pick-and-place 30%). Image augmentations include ColorJitter, GaussianBlur, posterization, sharpness, and grayscale, supplemented by natural-language prompt randomization using CLIP and LLM-based referring expressions.
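The multitask sampling ratios translate into a weighted draw of task labels for each batch. The sketch below is a minimal illustration using the quoted ratios; the `TASK_MIX` table layout and the `sample_tasks` helper are hypothetical names, and a real data loader would then fetch trajectories for each sampled task.

```python
import random
from collections import Counter

# Sampling ratios quoted in the text, keyed by platform and then task.
TASK_MIX = {
    "Franka": {"Pick": 0.20, "Pick-and-place": 0.45, "Next-to": 0.20, "Color": 0.15},
    "RB-Y1": {"Open": 0.20, "Door-open": 0.20, "Pick": 0.30, "Pick-and-place": 0.30},
}


def sample_tasks(platform, batch_size, rng):
    """Draw one task label per batch element according to the mixing ratios."""
    mix = TASK_MIX[platform]
    return rng.choices(list(mix), weights=list(mix.values()), k=batch_size)


rng = random.Random(0)
batch = sample_tasks("Franka", 1024, rng)
counts = Counter(batch)
print(counts)
```

Sampling by ratio rather than by raw episode count keeps small tasks such as the colored-receptacle variant from being swamped by the much larger Pick subset.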
Key training practices include:
- Freezing pretrained vision encoders for MolmoBot/MolmoBot-Π₀; end-to-end training for MolmoBot-SPOC with discrete quantile binning.
- Batch size 1,024; 200k steps for static tasks, 100k for mobile.
- Denoising time steps tuned per setting, with sim↔real performance maximized at up to 8 steps.
- Special upsampling of retry/completion phases and successful picks.
- Learning rates: set separately for MolmoBot and MolmoBot-Π₀.
6. Baseline Results and Scaling Behaviors
In zero-shot real-world evaluation (Franka, 15 Hz), MolmoBot achieves 79.2% pick-and-place success across four environments, surpassing π_{0.5} (39.2%) and MolmoBot-Π₀ (46.7%). In held-out simulation, MolmoBot (F=2) attains a mean oracle success of 66.4% across pick, pick-and-place, next-to, and color tasks, as detailed below:
| Model | Avg Sim. Success (%) |
|---|---|
| π_{0.5} (zero-shot) | 20.6 |
| π_{0.5}-Finetuned | 36.6 |
| MolmoBot-Π₀ | 41.4 |
| MolmoBot (F=2) | 66.4 |
| MolmoBot (F=3) | 64.4 |
Scaling analysis indicates:
- Pick task success scales monotonically from 10k to 50k episodes.
- Object diversity is important for simulation benchmarks and less critical for real evaluation with common objects.
- Broad environment diversity shows negligible effect for common pick tasks.
- Absolute joint targets outperform delta representations for real-robot deployment.
- Optimal sim performance at 2 denoising samples; real performance peaks at 3.
7. Implications and Community Access
MolmoBot-Data demonstrates that procedural scale (1.8M trajectories, 11.4k unique pickup objects, 94.3k environments) and task diversity permit robust zero-shot sim-to-real transfer, obviating the need for real-world data acquisition or task-specific finetuning in both static and articulated mobile manipulation (Deshpande et al., 17 Mar 2026). The recommended usage involves freezing pretrained vision encoders, heavy augmentation, prompt and task mixing per the specified ratios, and aggressive phase upsampling.
The dataset, alongside the full pipeline (MolmoBot-Engine, MolmoSpaces), is available for research use to facilitate rigorous evaluation and extension of generalist embodied agents in broad robotics settings.