MolmoAct Dataset: Robotics Action Benchmark

Updated 12 August 2025
  • The MolmoAct Dataset is a curated, open-access benchmark of 10,689 robot trajectories supporting Action Reasoning Models (ARMs) for structured spatial reasoning.
  • It integrates multimodal data—RGB images, depth tokens, natural language instructions, and visual traces—to enable interpretable and steerable robot behavior.
  • Using MolmoAct during training yields significant performance gains, improving single-arm task progression by up to 10% and bimanual task progression by 22.7%.

The MolmoAct Dataset is a curated, open-access benchmark for robotics foundation models, designed to support the development and evaluation of Action Reasoning Models (ARMs) capable of structured spatial reasoning across perception, planning, and control. Comprising over 10,000 high-quality multimodal manipulation trajectories, it facilitates research in interpretable, generalizable, and steerable robot behavior. The dataset is a central component of the MolmoAct model pipeline and is presented as the first publicly released large-scale dataset supporting mid-level spatial planning and depth-aware perception in robot learning (Lee et al., 11 Aug 2025).

1. Dataset Composition and Provenance

The MolmoAct Dataset contains 10,689 robot trajectories, each collected on a single-arm Franka robot. Data collection was performed by five operators under rigorous protocols over a two-month period, with an emphasis on consistency and precise annotation. Each trajectory is long-horizon, averaging 112 timesteps, and includes:

  • RGB visual observations (images at each timestep)
  • Natural language instructions describing the task
  • Depth-aware perception tokens generated from observed scenes
  • Visual reasoning traces encoding planned spatial trajectories as sequences in the image plane
  • Action tokens capturing the robot’s low-level motor commands

This multi-level annotation allows the dataset to explicitly support a three-stage reasoning process: mapping language and visual input to structured intermediate spatial plans, and then to executable actions.
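
For concreteness, a per-trajectory record can be pictured roughly as follows; the field names and types are illustrative assumptions rather than the released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class MolmoActStep:
    """One timestep of a trajectory (fields are illustrative, not the released schema)."""
    rgb: np.ndarray                    # (H, W, 3) uint8 camera frame
    depth_tokens: List[int]            # M depth-codebook indices for this frame
    action: np.ndarray                 # low-level motor command (joint-space or end-effector)


@dataclass
class MolmoActTrajectory:
    """One long-horizon demonstration (~112 timesteps on average)."""
    instruction: str                       # natural language task description
    visual_trace: List[Tuple[int, int]]    # planned 2D polyline, coordinates in [0, 255]
    steps: List[MolmoActStep]              # per-timestep observations and actions
```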

2. Data Structure and Technical Format

Each trajectory in the MolmoAct Dataset consists of tightly coupled multimodal data, specifically structured to enable autoregressive sequence modeling:

| Modality | Format | Purpose |
| --- | --- | --- |
| RGB Images | Per-timestep pixel arrays | Visual context |
| Natural Language | Tokenized instruction string | High-level task goal |
| Depth Tokens | Sequence ⟨DEPTH_START⟩, ⟨DEPTH_{k₁}⟩, …, ⟨DEPTH_{k_M}⟩, ⟨DEPTH_END⟩, with M codes per timestep | Scene geometry representation |
| Visual Traces | Polyline τ = (p₁, …, p_L), pᵢ ∈ [0, 255]² | Spatial anchoring for the plan |
| Action Tokens | One token per low-level motor command per timestep | Executable control |

Depth tokens are derived from a learned codebook and represent a compact categorical summary of depth information for each RGB frame. Visual reasoning traces τ = (p₁, ..., p_L) are explicit 2D polylines denoting planned manipulator trajectories, with positions pᵢ normalized to [0, 255]. Action tokens correspond to real-valued joint-space or end-effector commands, discretized for autoregressive modeling.
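
A minimal sketch of how these token streams might be assembled is given below, assuming a fixed-size depth codebook and uniform action binning; both are assumptions, as the paper's exact tokenization scheme is not reproduced here.

```python
import numpy as np


def depth_token_strings(codes):
    """Wrap one frame's M codebook indices in start/end markers (token format assumed)."""
    return ["<DEPTH_START>"] + [f"<DEPTH_{k}>" for k in codes] + ["<DEPTH_END>"]


def normalize_trace(points_xy, image_hw):
    """Map image-plane polyline points onto the [0, 255] grid used by visual traces."""
    h, w = image_hw
    pts = np.asarray(points_xy, dtype=np.float64)
    pts[:, 0] = pts[:, 0] / (w - 1) * 255.0   # x -> [0, 255]
    pts[:, 1] = pts[:, 1] / (h - 1) * 255.0   # y -> [0, 255]
    return np.round(pts).astype(np.int64)


def discretize_action(action, low, high, n_bins=256):
    """Uniformly bin a real-valued command into integer action tokens (binning scheme assumed)."""
    action = np.clip(np.asarray(action, dtype=np.float64), low, high)
    frac = (action - low) / (high - low)
    return np.minimum((frac * n_bins).astype(np.int64), n_bins - 1)
```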

3. Task Diversity and Scenarios

The dataset spans 93 unique manipulation tasks, grouped into two principal categories:

  • Home Environments: 7,730 trajectories capture long-horizon or compositional tasks across living rooms, kitchens, bathrooms, and bedrooms. Examples include “put the bowl in the dishwasher,” “close the laptop,” “clean the toilet,” and “cover the pot,” with each high-level instruction further decomposed into a sequence of substeps annotated in both language and trajectory space.
  • Tabletop Environments: 2,959 trajectories cover 20 fundamental manipulation actions or motion primitives, such as picking, flipping, and precise placement, using a variety of household objects.

Task instructions exhibit a broad natural verb distribution, with frequent actions including “put,” “turn,” and “close.” This breadth captures both atomic and compositional skills, forming the behavioral basis for advanced generalization studies.
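
As an illustration, the leading verb of each instruction can be tallied directly from the instruction strings; this is a hypothetical snippet and does not assume any particular dataset loading interface.

```python
from collections import Counter

# A few instructions taken from the examples above.
instructions = [
    "put the bowl in the dishwasher",
    "close the laptop",
    "clean the toilet",
    "cover the pot",
]

# The leading word of each instruction is typically its action verb.
verb_counts = Counter(s.split()[0].lower() for s in instructions)
print(verb_counts.most_common(3))  # [('put', 1), ('close', 1), ('clean', 1)]
```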

4. Role in Model Training and Performance Gains

The MolmoAct Dataset is employed during mid-training of the MolmoAct-7B-D model to further specialize generic foundation models toward realistic spatial reasoning and manipulation. Ablation experiments demonstrate that integrating this dataset during mid-training produces a statistically significant average improvement of 5.5% in general task performance over the base model.

Performance increases are particularly pronounced in real-world fine-tuning scenarios:

  • Single-arm tasks: +10% task progression over Pi-0-FAST.
  • Bimanual tasks: +22.7% improvement in progression relative to the same baseline.

These gains are mirrored in simulation benchmarks (LIBERO, SimplerEnv), where MolmoAct models demonstrate enhanced robustness to distribution shift and superior out-of-distribution generalization (outperforming baselines by +23.3%).

5. Technical Contributions and Modeling Innovations

Structurally, the dataset’s composition directly supports the MolmoAct model’s three-stage cascade of perception→planning→action:

  • Stage 1: Language and RGB input are encoded to depth tokens via perceptual transformers with a learned codebook.
  • Stage 2: Mid-level plans are realized as autoregressively predicted visual traces in the image plane, permitting human-in-the-loop editing and supporting interpretability.
  • Stage 3: Visual traces are decoded to fine-grained action tokens representing low-level control.

This facilitates both explainable and steerable policies, since each trajectory’s intermediate spatial anchor (the visual reasoning trace) and final executed action are explicitly linked to the underlying spatial and linguistic context.
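
At inference time, that cascade might be organized as in the following sketch; the function names are placeholders rather than the released MolmoAct API, and the optional edited trace illustrates the human-in-the-loop steering described above.

```python
def molmoact_inference(model, rgb, instruction, edited_trace=None):
    """Run the perception -> planning -> action cascade (placeholder API, not the released one)."""
    # Stage 1: encode language and RGB input into depth tokens summarizing scene geometry.
    depth_tokens = model.predict_depth_tokens(rgb, instruction)

    # Stage 2: autoregressively predict the mid-level plan as a 2D visual trace.
    trace = model.predict_visual_trace(rgb, instruction, depth_tokens)

    # Steerability: a human (or an upstream planner) may overwrite the trace before execution.
    if edited_trace is not None:
        trace = edited_trace

    # Stage 3: decode the (possibly edited) trace into discretized low-level action tokens.
    action_tokens = model.predict_action_tokens(rgb, instruction, depth_tokens, trace)
    return depth_tokens, trace, action_tokens
```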

6. Significance, Accessibility, and Comparison

The open release of the MolmoAct Dataset marks a pivotal advance in robot learning infrastructure. Historically, robot learning has been constrained by limited access to high-quality, multimodal, and semantically dense trajectory data. MolmoAct's substantial scale and structure position it as a reference benchmark for ARMs and related spatial reasoning models.

A notable distinction is its integration of depth-aware perception and explicit plan representations, which are less commonly available in prior datasets. Its structured design supports research into explainable robotics and provides a standardized protocol for benchmarking grounded action reasoning in manipulation.

7. Implications and Future Uses

By making the MolmoAct Dataset openly accessible, the authors set a new standard for reproducible research and scalable robot model training. The dataset not only boosts quantitative performance metrics but also enables qualitative studies into explainability, plan modulation, and semantically grounded control. A plausible implication is accelerated progress in interpretable robot behavior, where each model decision can be audited through its multimodal trace and intermediate representations.

The dataset, its collection protocols, and annotation format may further inform future efforts across related domains, including multi-task, multi-step manipulation, task decomposition, and fine-grained human-robot interaction research.


In summary, the MolmoAct Dataset is a comprehensive, systematically collected, multimodal resource that substantially advances both the training and scientific evaluation of spatial reasoning and action understanding in embodied AI and robot learning (Lee et al., 11 Aug 2025).
