MolmoAct-7B-D: Action Reasoning Model
- MolmoAct-7B-D is an open-source action reasoning model that transforms visual and linguistic inputs into robotic commands using a structured three-stage pipeline.
- It integrates depth perception tokens, mid-level visual trajectory planning, and low-level action tokens to ensure explainable and steerable robotic behavior.
- The model demonstrates competitive performance in both simulation and real-world settings while providing full access to training code, datasets, and model weights.
MolmoAct-7B-D is an open-source action reasoning model within the Action Reasoning Models (ARMs) paradigm, designed for robot learning and vision–language–action understanding. Distinguished by its structured three-stage pipeline, MolmoAct-7B-D fuses perception, planning, and control to execute purposeful, explainable robotic actions. Its performance is substantiated across both simulation and real-world robotics settings, and all code, model weights, and a new robot trajectory dataset are released openly. MolmoAct-7B-D builds on the architectural and developmental principles demonstrated in open models such as Moxin 7B, providing a blueprint for spatially grounded reasoning in next-generation robotic foundation models (Lee et al., 11 Aug 2025, Zhao et al., 8 Dec 2024).
1. Structured Vision–Language–Action Architecture
At its core, MolmoAct-7B-D implements a tripartite pipeline for transforming raw observations and linguistic instructions into robotic actions, ensuring semantic and spatial grounding:
- Stage 1: Depth Perception Tokens. Raw RGB images and textual instructions are encoded into a “depth perception string” via discretized 3D scene understanding: a pretrained VQVAE codebook maps the dense depth map into a sequence of discrete depth tokens, which the model produces autoregressively (a simplified quantization sketch follows this list).
- Stage 2: Visual Reasoning Trace. The model synthesizes a compact trajectory sketch, a polyline whose points are expressed in normalized image coordinates, encoding a mid-level spatial plan. This explicit plan can be interactively edited and is spatially aligned with the input scene.
- Stage 3: Low-Level Action Tokens. Conditioned on the depth tokens and the visual trace, the model outputs executable control commands as action tokens. Writing $o$ for the observation, $\ell$ for the instruction, $d$ for the depth tokens, $\tau$ for the trace, and $a$ for the action tokens, the joint autoregressive model factorizes as

$$p(d, \tau, a \mid o, \ell) = p(d \mid o, \ell)\; p(\tau \mid d, o, \ell)\; p(a \mid \tau, d, o, \ell),$$

guaranteeing that the perception and spatial plan condition the final control sequence.
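For concreteness, the following is a minimal sketch of how a dense depth map could be discretized against a VQVAE-style codebook. It quantizes raw depth patches directly by nearest-neighbor lookup, whereas the actual model quantizes learned encoder latents; the codebook size, patch size, and array shapes are placeholder assumptions.

```python
# Simplified stand-in for VQVAE-style depth tokenization: quantize raw depth
# patches by nearest-neighbor codebook lookup. The real model quantizes
# encoder latents; codebook size, patch size, and shapes are placeholders.

import numpy as np

def depth_to_tokens(depth, codebook, patch=16):
    """depth: (H, W) float array; codebook: (K, patch*patch) array of centroids."""
    H, W = depth.shape
    tokens = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            vec = depth[y:y + patch, x:x + patch].reshape(-1)
            # index of the nearest codebook entry becomes the emitted depth token
            tokens.append(int(np.argmin(np.linalg.norm(codebook - vec, axis=1))))
    return tokens

codebook = np.random.rand(512, 16 * 16)   # placeholder codebook
depth_map = np.random.rand(224, 224)      # placeholder depth estimate
print(depth_to_tokens(depth_map, codebook)[:8])
```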
This compositional structure enforces semantic, spatial, and causal disentanglement between perception, planning, and control, yielding actions that are both interpretable and steerable.
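The factorization above can be made concrete with a short decoding sketch. The `StubVLM` class, prompt encoding, and token strings below are illustrative assumptions rather than the released MolmoAct interface; the point is only the staged conditioning of the trace on the depth tokens and of the actions on both.

```python
# Minimal sketch of MolmoAct-style three-stage autoregressive decoding.
# StubVLM and the per-stage token strings are illustrative assumptions,
# not the released MolmoAct API.

class StubVLM:
    """Stand-in for an autoregressive VLA model; returns canned tokens."""
    def encode(self, image, instruction):
        return [f"<img:{image}>", f"<txt:{instruction}>"]

    def generate(self, context, stage):
        canned = {
            "depth":  ["<d_12>", "<d_87>", "<d_3>"],                  # VQVAE codebook indices
            "trace":  ["(0.21,0.64)", "(0.35,0.58)", "(0.52,0.49)"],  # normalized polyline
            "action": ["<a_gripper_open>", "<a_dx_+0.02>", "<a_dz_-0.01>"],
        }
        return canned[stage]

def decode_arm(model, image, instruction):
    # Factorized generation: p(d|o,l) * p(tau|d,o,l) * p(a|tau,d,o,l)
    ctx = model.encode(image, instruction)          # observation o, instruction l
    d   = model.generate(ctx, "depth")              # Stage 1: depth perception tokens
    tau = model.generate(ctx + d, "trace")          # Stage 2: visual reasoning trace
    a   = model.generate(ctx + d + tau, "action")   # Stage 3: low-level action tokens
    return d, tau, a

depth, trace, actions = decode_arm(StubVLM(), "rgb_frame.png", "pick up the bowl")
print(trace, actions)
```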
2. Training Procedures and Dataset Construction
MolmoAct-7B-D leverages multi-phase training strategies, supported by open-source infrastructure and a novel robot trajectory dataset:
- Multi-phase Training: Pretraining combines diverse visual and linguistic corpora, applying infrastructure and optimization practices from Moxin 7B, such as Colossal-AI for parallelized training, AdamW optimization with weight decay 0.1, and a cosine learning-rate decay schedule (an illustrative configuration is sketched after this list).
- MolmoAct Dataset (Mid-Training): The newly collected MolmoAct Dataset comprises 10,689 robot trajectories over 93 manipulation tasks using the Franka single-arm platform. Approximately 7,730 trajectories were collected in dynamic home environments (73 tasks, 20 verbs), and 2,959 in controlled tabletop scenarios. Integrating this dataset during mid-training yields an average improvement in manipulation performance and generalization.
- Alignment and Fine-Tuning: Post-training steps target enhanced capability alignment, further refining behavioral safety and helpfulness, and adapting the model to ambiguous or open-ended tasks.
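As a point of reference, the described optimizer and schedule map onto a standard PyTorch setup roughly as follows. The peak learning rate, decay horizon, floor learning rate, and stand-in module are placeholder assumptions, not the published hyperparameters.

```python
# Illustrative PyTorch optimizer/scheduler setup for the described recipe
# (AdamW, weight decay 0.1, cosine learning-rate decay). Learning rates,
# step budget, and the tiny stand-in module are placeholder assumptions.

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

backbone = torch.nn.Linear(4096, 4096)                    # stand-in for the 7B backbone
optimizer = AdamW(backbone.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100_000,   # decay over the full run
                              eta_min=3e-5)               # floor learning rate (placeholder)

def training_step(batch):
    optimizer.zero_grad()
    loss = backbone(batch).pow(2).mean()                  # dummy loss for illustration
    loss.backward()
    optimizer.step()
    scheduler.step()                                      # cosine decay applied per step
    return loss.item()
```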
The training pipeline is fully transparent, with all scripts, checkpoints, and datasets released for open research use (Zhao et al., 8 Dec 2024).
3. Performance Metrics and Benchmarking
MolmoAct-7B-D’s capability is demonstrated across standard robotics and embodied AI benchmarks:
| Task or Setting | Metric | Value / Gain |
|---|---|---|
| SimplerEnv Visual Matching | Zero-shot accuracy | 70.5% |
| LIBERO | Avg. success (sim) | 86.6% |
| LIBERO long-horizon | Success gain over ThinkAct | +6.3% |
| Real-world, single-arm | Task progression gain | +10% |
| Real-world, bimanual | Task progression gain | +22.7% |
| Out-of-distribution | Generalization gain | +23.3% |
Performance on the SimplerEnv and LIBERO benchmarks illustrates competitive spatial and object reasoning. Out-of-distribution evaluations confirm robust generalization, outperforming closed-source and open-source state-of-the-art baselines such as Pi-0, GR00T N1, SpatialVLA, and OpenVLA despite training on comparatively fewer samples.
4. Real-World Robotic Application and Steerability
MolmoAct-7B-D is evaluated in both simulated and physical settings, supporting direct deployment and interactive use:
- Task Progression: In experiments with Franka robots, task progression metrics improved by 10% in single-arm and 22.7% in bimanual configurations compared to prior baselines such as Pi-0-FAST.
- Trajectory Steering and Open-Ended Instruction: The model enables user-driven interaction; ambiguous instructions (e.g., “pick up the bowl”) can be disambiguated via trajectory sketch overlays, letting operators steer robot behavior in real time. Mid-level visual planning traces serve as the interface for such edits (see the sketch after this list).
- Preference Metrics: Human preference, measured via Elo ratings, systematically favors MolmoAct-7B-D over competitive models for open-ended instructions and multimodal interaction.
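A hedged sketch of how trace-level steering could look in practice is given below; `decode_actions`, the trace serialization, and the example waypoints are assumptions for illustration, not the released interface.

```python
# Hedged sketch of trajectory steering: a user-edited polyline replaces the
# model's predicted mid-level trace before low-level action decoding.
# decode_actions stands in for Stage 3 of the pipeline; the trace format
# and waypoints are assumptions.

def format_trace(waypoints):
    """Serialize normalized (x, y) waypoints into trace tokens."""
    return [f"({x:.2f},{y:.2f})" for x, y in waypoints]

def steer(decode_actions, context, predicted_trace, user_trace=None):
    """Condition action decoding on the user-edited trace when one is given."""
    trace = user_trace if user_trace is not None else predicted_trace
    return decode_actions(context + format_trace(trace))

# Example: disambiguate "pick up the bowl" by sketching a path toward the left bowl.
left_bowl_path = [(0.20, 0.70), (0.28, 0.55), (0.33, 0.42)]
actions = steer(lambda tokens: ["<a_reach>", "<a_grasp>"],   # stand-in action decoder
                context=["<img:rgb_frame.png>", "<txt:pick up the bowl>"],
                predicted_trace=[(0.60, 0.68), (0.55, 0.50)],
                user_trace=left_bowl_path)
print(actions)
```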
This architecture brings explainability and modularity to robotics, facilitating correction, troubleshooting, and semantic control not available in direct perception-to-action models.
5. Open-Source Contributions and Reproducibility
A central feature of MolmoAct-7B-D is exhaustive open-source publication, fulfilling the Model Openness Framework and supporting robust reproducibility:
- Released Components
- Model weights for MolmoAct-7B-D and variant MolmoAct-7B-O
- Complete training code, from pretraining through fine-tuning and mid-training on the MolmoAct Dataset
- All robot trajectory data, including depth auxiliary labels, visual traces, and trajectory-conditioned action ground truth
- Research Impact
- The open blueprint enables retraining, adaptation, and extension for vision–language–action research.
- Researchers can analyze and replicate data curation (e.g., MinHashLSH deduplication), optimization, and task-specific pipeline formation.
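As an illustration of the kind of near-duplicate filtering referenced above, the following sketch applies MinHashLSH (via the `datasketch` package) to a toy corpus; the shingling scheme and similarity threshold are assumptions, not the released curation pipeline.

```python
# Illustrative MinHashLSH near-duplicate filter of the kind referenced for
# data curation. Shingling and threshold choices are assumptions.
# Requires the `datasketch` package.

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(max(1, len(text) - 4))}:
        m.update(shingle.encode("utf8"))
    return m

def deduplicate(docs, threshold=0.8, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        sig = minhash(doc, num_perm)
        if not lsh.query(sig):          # no near-duplicate already kept
            lsh.insert(str(i), sig)
            kept.append(doc)
    return kept

print(deduplicate(["pick up the red mug", "pick up the red mug!", "open the drawer"]))
```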
A plausible implication is accelerated progress and methodological benchmarking in spatially grounded embodied AI, given the lowered barrier to entry and the completeness of the released materials (Lee et al., 11 Aug 2025, Zhao et al., 8 Dec 2024).
6. Context within Model Innovation and Structured Reasoning
MolmoAct-7B-D advances the field through its architectural separation of perception, planning, and control, in contrast to earlier robotic foundation models that map observations directly to actions. This explicit division is suggested to foster better generalization, interactive steerability, and semantic alignment, as evidenced by the model's performance and human-preference metrics.
The use of depth-encoded perception and editable mid-level planning traces introduces explainability absent in latent-only or end-to-end control paradigms, positioning ARMs as a key direction for interpretable and high-performance embodied AI.
7. Significance, Limitations, and Future Prospects
MolmoAct-7B-D’s comprehensive release of resources and robust empirical performance make it a significant reference point for future vision–language–action reasoning models. Its modular pipeline and real-world evaluation set a precedent for reproducible, customizable, and interpretable robotics foundation models.
Potential limitations include the unexplored upper limits of the model’s scalability in diverse robotic platforms, and the extent of transferability for more complex or fully unstructured environments. A plausible implication is that the underlying design choices—modular depth tokens, trajectory traces, explicit conditioning—will be further refined as fused multimodal perception and action reasoning models evolve. The blueprint established may inform subsequent advances in spatial reasoning, open-ended task interaction, and dataset synthesis within embodied AI.