MolmoAct: Robotic Action Reasoning Model
- MolmoAct is a robotic action reasoning model that explicitly structures depth perception, spatial planning, and control to enhance manipulation transparency.
- It employs depth-aware visual tokens and editable spatial plans to produce mid-level trajectories, resulting in improved performance on simulation and real-world benchmarks.
- The MolmoAct Dataset, featuring over 10,000 annotated trajectories, underpins robust generalization and user-controllable action steering.
MolmoAct is a robotic action reasoning foundation model that integrates perception, planning, and control in a structured pipeline. By explicitly encoding spatial reasoning and supporting mid-level trajectory steering, it advances both the capability and the transparency of robot manipulation systems. The model uses depth-aware visual tokens, editable spatial plans, and sequential action prediction to achieve high performance across simulated and real-world robotics benchmarks, demonstrating robust generalization and user-controllable behavior. The MolmoAct Dataset, a novel collection of annotated robot trajectories, underpins its training and performance improvements.
1. Model Architecture and Structured Reasoning Pipeline
MolmoAct is predicated on an autoregressive sequence-to-sequence formulation that factorizes robotic decision-making into three explicit stages:
- Depth Perception Stage: The initial inputs, visual observations and a task instruction, are encoded by a Vision Transformer (ViT) and projected into an LLM-compatible token space. The model then generates a sequence of tokens representing coarse depth perception ($d$), providing a 2.5D scene representation grounded in spatial metrics.
- Spatial Planning Stage: Conditioned on the depth tokens ($d$), MolmoAct predicts a visual reasoning trace ($p$): a spatial trajectory, typically rendered as a 2D polyline on the input image, that indicates the planned end-effector movement. These traces serve as mid-level manipulation plans.
- Action Control Stage: Guided by both the depth tokens ($d$) and the trace ($p$), the model outputs action tokens ($a$), which represent discretized robot control primitives: spatial translations, rotations, and gripper signals.
The model’s full reasoning-chain conditional probability factorizes as

$$
p(d, p, a \mid o, \ell) = p(d \mid o, \ell) \cdot p(p \mid d, o, \ell) \cdot p(a \mid d, p, o, \ell),
$$

where $o$ denotes observations (images), $\ell$ denotes instruction tokens, and $(d, p, a)$ collectively form a “chain-of-thought” action reasoning trajectory.
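To make this factorization concrete, the following minimal Python sketch decodes the three stages sequentially. The `ActionReasoningPolicy` interface, the stop markers, and the `rollout` helper are hypothetical placeholders chosen for illustration; they are not the released MolmoAct API.

```python
from dataclasses import dataclass
from typing import List, Protocol


class ActionReasoningPolicy(Protocol):
    """Hypothetical interface standing in for the released model API."""
    def encode(self, image, instruction: str) -> List[int]: ...
    def generate(self, prefix: List[int], stop: str) -> List[int]: ...


@dataclass
class ReasoningChain:
    depth: List[int]    # d: coarse depth-perception tokens (2.5D scene representation)
    trace: List[int]    # p: 2D polyline plan over the input image
    action: List[int]   # a: discretized translation / rotation / gripper tokens


def rollout(policy: ActionReasoningPolicy, image, instruction: str) -> ReasoningChain:
    """Decode the chain-of-thought: d, then p given d, then a given (d, p)."""
    prefix = policy.encode(image, instruction)              # encode o and l into LLM token space
    d = policy.generate(prefix, stop="</depth>")            # d ~ p(d | o, l)
    p = policy.generate(prefix + d, stop="</trace>")        # p ~ p(p | d, o, l)
    a = policy.generate(prefix + d + p, stop="</action>")   # a ~ p(a | d, p, o, l)
    return ReasoningChain(depth=d, trace=p, action=a)
```

Because the stages are decoded in order, the intermediate depth and trace tokens can be inspected, or replaced, before the next stage is generated; this is the mechanism behind the steerability described in Section 4.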
2. MolmoAct Dataset: Diversity and Performance Enhancement
The MolmoAct Dataset is central to model performance, comprising over 10,000 robot trajectories collected across home and tabletop scenarios. Each trajectory encodes high-fidelity spatial and temporal information, combining multi-view imagery (external and end-effector-mounted cameras) with diverse manipulation goals (e.g., closing, pouring, cleaning).
Mid-training with this dataset yields an average 5.5% improvement in general performance metrics, including success rate, stability, and out-of-distribution adaptability. Ablation studies attribute this gain to precise depth perception calibration and enhanced trajectory-to-action alignment—features directly supported by the curated variety and annotation quality of the MolmoAct Dataset.
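As a concrete illustration of this structure, the sketch below shows what a single annotated step and trajectory record might look like; the field names and types are assumptions for exposition rather than the published schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class TrajectoryStep:
    external_rgb: np.ndarray             # third-person camera frame (H, W, 3)
    wrist_rgb: np.ndarray                # end-effector-mounted camera frame (H, W, 3)
    depth_tokens: List[int]              # coarse depth annotation for this frame
    trace_2d: List[Tuple[float, float]]  # annotated end-effector polyline in image coordinates
    action: Tuple[float, ...]            # translation, rotation, and gripper command


@dataclass
class AnnotatedTrajectory:
    instruction: str                     # e.g. "close the drawer", "pour the water"
    steps: List[TrajectoryStep]
```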
3. Simulation and Real-World Benchmark Results
MolmoAct-7B-D demonstrates competitive performance on leading simulation and hardware benchmarks:
- SimplerEnv Visual Matching: Achieves a 70.5% zero-shot success rate, surpassing strong baselines such as Pi-0 and GR00T N1.
- LIBERO Suite (Franka Emika Panda arm): Reports 86.6% average success rate across spatial, goal-directed, and long-horizon manipulation tasks, with a +6.3% margin over ThinkAct on long-horizon routines.
- Real-World Fine-Tuning: Yields +10% improvement (single-arm) and +22.7% improvement (bimanual) in task progression over Pi-0-FAST across scenarios such as table bussing, towel folding, and household arrangement.
- Out-of-Distribution Generalization: Delivers +23.3% performance over baselines in settings with linguistic, spatial, and object domain shifts.
Human-preference evaluations place MolmoAct at the forefront for open-ended instruction following and responsive trajectory steering.
4. Explainable, Steerable Action Reasoning
MolmoAct instantiates explainability and steerability at the architectural and operational level:
- Explainability: The intermediate depth tokens decode into spatial maps, and the visual traces are rendered as overlays on the source images, serving as explicit movement plans. This makes both the perception and planning stages auditable; practitioners can inspect whether a planned trace is spatially coherent before execution.
- Steerability: Users may directly modify the trace—either through manual drawing interfaces or via natural language correction—and the model will condition subsequent action tokens on the edited plan. This design facilitates interactive, user-in-the-loop robotic programming, supporting real-time adaptation to ambiguous or underspecified instructions.
Unlike models that map vision directly to control, this chain-of-thought factorization makes every low-level command traceable to interpretable, modifiable mid-level plans and depth estimates.
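The sketch below illustrates this user-in-the-loop steering under the same hypothetical policy interface as in Section 1; `detokenize_trace` and `tokenize_trace` are assumed helpers for converting between trace tokens and image-space polylines, not confirmed API calls.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates


def steer_and_act(policy, image, instruction: str,
                  edit_fn: Optional[Callable[[List[Point]], List[Point]]] = None) -> List[int]:
    """Decode depth and trace, optionally apply a user edit to the trace,
    then decode actions conditioned on the (possibly edited) plan."""
    prefix = policy.encode(image, instruction)
    depth = policy.generate(prefix, stop="</depth>")
    trace = policy.generate(prefix + depth, stop="</trace>")

    if edit_fn is not None:
        polyline = policy.detokenize_trace(trace)  # trace tokens -> [(x, y), ...]
        polyline = edit_fn(polyline)               # manual drawing or language-driven correction
        trace = policy.tokenize_trace(polyline)    # edited plan back into token space

    # Low-level action tokens are conditioned on the edited mid-level plan.
    return policy.generate(prefix + depth + trace, stop="</action>")


# Example edit: nudge the planned path 30 pixels to the right before execution.
# actions = steer_and_act(policy, image, "put the mug in the sink",
#                         edit_fn=lambda pts: [(x + 30.0, y) for x, y in pts])
```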
5. Theoretical and Practical Implications of Structured Reasoning
MolmoAct’s explicit three-stage pipeline provides a formal decomposition of the robot action inference problem:
- Formalism: The model operationalizes a conditional autoregressive sequence factorized along cognitive primitives—first perception, then planning, finally control—mirroring human reasoning in purposeful manipulation.
- Data Efficiency: By leveraging both broad pretraining and the MolmoAct Dataset’s targeted high-quality trajectories, the system attains strong performance without requiring extreme data volume or closed-source resources.
- Blueprint for Future Robotic Reasoning Models: The modularity and interpretability set a clear template for subsequent foundation models in robotics, potentially enabling further integration of perceptual modules, plan editing interfaces, and continuous learning architectures.
A plausible implication for the field is that future robotics models may further generalize and adapt by deepening the mid-level reasoning formalism or extending the range of editable intermediate plans.
6. Comparison to Prior Approaches and Generalization
MolmoAct distinguishes itself from prior robotic foundation models by:
- Employing structured intermediate representations rather than direct perception-to-action mapping, which improves generalization and robustness.
- Surpassing key baselines (Pi-0, GR00T, ThinkAct) in zero-shot, fine-tuned, and OOD benchmarks.
- Offering open weights and an open dataset for robotic action reasoning at this scale for the first time, enabling reproducibility and community-driven progress.
A plausible implication is that research efforts transitioning towards interpretable and steerable architectures, built on curated, diverse trajectory datasets, may yield further gains in adaptability, human-robot interaction, and cross-domain compositional reasoning.
7. Future Directions
MolmoAct’s design suggests several future research avenues:
- Expansion of intermediate spatial reasoning traces to fully 3D representations for more complex manipulation tasks.
- Incorporation of additional sensor modalities (tactile, force, auditory) into the perception tokenization process.
- Development of multi-agent collaborative steering, allowing multiple humans or AI systems to jointly edit or influence robot plans in situ.
- Continuous web-scale dataset collection and incremental learning in real-world environments.
- Open, standardized benchmarks based on MolmoAct’s dataset and model interface to facilitate cross-system comparison.
This suggests a trajectory where robotic reasoning models evolve towards ever greater transparency, user steerability, and semantic grounding, with open resources accelerating both application and theoretical advances in the discipline.