MolmoAct2: Open VLA Action Reasoning Model

Updated 3 July 2026

MolmoAct2 is a fully open vision-language-action model that leverages a modular architecture and high-quality datasets to enable robust embodied reasoning.
It integrates a novel VLM backbone with an open action tokenizer that compresses continuous control data and a flow-matched continuous policy expert that generates continuous actions.
Empirical evaluations show MolmoAct2 outperforms both open and closed baselines in simulation and real-world benchmarks, achieving superior success rates.

MolmoAct2 is a fully open action reasoning model for vision-language-action (VLA) robotic control, engineered to overcome persistent barriers to real-world deployment. Addressing closed-source baselines, hardware exclusivity, prohibitive reasoning latency, and insufficient fine-tuned success rates, MolmoAct2 integrates a novel VLM backbone, open action tokenization, a flow-matched continuous policy expert, and adaptive-depth reasoning. Its modular, multilayered architecture and diverse, high-quality datasets enable robust performance across a wide spectrum of embodied reasoning and control tasks. Empirical results demonstrate consistently superior performance to leading open and closed models across both simulated and real-world benchmarks, with all model weights, code, and data made fully available for unrestricted academic and applied development (Fang et al., 4 May 2026).

1. Model Architecture and Components

1.1 Molmo2-ER Vision–Language Backbone

MolmoAct2 utilizes Molmo2-ER, a variant of Molmo2 (Qwen3–4B, mid-training checkpoint) as its VLM core. The model is refined via a "specialize-then-rehearse" scheme:

Stage 1 (Embodied Specialization): Fine-tune for 20K steps on 3.3M embodied samples spanning six pillars—single-image embodied QA, image pointing/detection, video embodied QA, multi-image and ego–exo correspondence, and two synthetic diagnostics (CLEVR, GRiD-3D)—with an 8% fraction of Tulu-3 text included. Sequence length is 4,200, batch size 64, and learning rates 5e-6 (vision & connector), 1e-5 (LLM).
Stage 2 (Joint Refinement): Continue for 1.5K steps on a 50/50 embodied-to-original mixture (longer seq-len 16,384, batch=1). The embodied weight $p=0.5$ is selected by Pareto performance over 13 embodied-reasoning tasks.

This approach enables Molmo2-ER to exceed the performance of both open and many closed VLMs, achieving 63.8% average accuracy on embodied-reasoning datasets (compared to 61.3% for GPT-5 Thinking and 61.0% for Gemini-ER1.5).

1.2 OpenFAST Action Tokenizer

OpenFAST compresses 1 second of continuous 32-dimensional robot controls into discrete tokens:

Pipeline: Normalize actions to 1–99th percentile, apply a frequency-domain transform (e.g., Fast Fourier), quantize coefficients, and apply BPE tokenization over a 2,048-token vocabulary.
Training Data: 1M randomly subsampled action chunks from five embodiments: 30% each from MolmoAct2-BimanualYAM, SO100/101, DROID, and 3.3% each from BC-Z, BridgeData V2, RT-1.
Compression Ratio: Typically 8–20 tokens per second.
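The tokenizer pipeline above can be illustrated with a minimal sketch for a single action dimension. It rests on several assumptions: it uses an orthonormal DCT as the frequency-domain transform (the text only names a frequency-domain transform), computes percentiles per chunk rather than over a dataset, uses a hypothetical quantization scale of 10, and omits the BPE stage entirely. All function names are illustrative and are not the OpenFAST API.

```python
import math

def percentile(values, q):
    """Linearly interpolated q-th percentile (q in [0, 100])."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def normalize(chunk):
    """Map the 1st/99th percentiles to -1/+1 and clip outliers.
    Assumption: statistics come from the chunk itself and hi > lo."""
    lo, hi = percentile(chunk, 1), percentile(chunk, 99)
    return [max(-1.0, min(1.0, 2.0 * (x - lo) / (hi - lo) - 1.0)) for x in chunk]

def _scale(k, n):
    # Orthonormal DCT-II scaling factors.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def idct2(c):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]

def quantize(coeffs, scale=10.0):
    """Round scaled coefficients to integers; small coefficients become 0."""
    return [round(c * scale) for c in coeffs]

def dequantize(tokens, scale=10.0):
    return [q / scale for q in tokens]

# Example: one second of a smooth 30-step trajectory for a single joint.
n = 30
signal = [math.sin(2 * math.pi * (i + 0.5) / n) for i in range(n)]
norm = normalize(signal)
tokens = quantize(dct2(norm))
recon = idct2(dequantize(tokens))
```

For smooth trajectories most high-frequency coefficients round to zero, which is what makes a subsequent BPE stage effective at shortening the token sequence.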

1.3 Flow-Matching Continuous-Action Expert

Following discrete policy pre-training, a 36-layer DiT-style flow-matched continuous policy ("flow expert") is attached via per-layer key–value (KV) cache conditioning:

Objective: For normalized chunk $a$, noise $\epsilon \sim \mathcal{N}(0, I)$, and $t \in [0,1]$, define:

$x_t = (1-t)\epsilon + t a, \quad u^\star = a - \epsilon$

Minimize the loss:

$\mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{a,\epsilon,t}\left\| m \odot \left( f_\theta(x_t, t, c) - u^\star \right) \right\|_2^2$

where $m$ masks padded time steps/dimensions and $c$ is VLM context.
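A minimal sketch of this objective on flat Python lists may help make the symbols concrete. The functions `flow_pair` and `masked_flow_loss` follow the definitions of $x_t$, $u^\star$, and the masked loss above. `euler_sample` is an assumption: it shows one common way to draw an action from a trained velocity field by integrating from $t=0$ to $t=1$, and the source does not state which sampler MolmoAct2 uses. The conditioning context $c$ is omitted here.

```python
def flow_pair(a, eps, t):
    """Return (x_t, u_star) for the linear path x_t = (1 - t) * eps + t * a."""
    x_t = [(1.0 - t) * e + t * ai for ai, e in zip(a, eps)]
    u_star = [ai - e for ai, e in zip(a, eps)]
    return x_t, u_star

def masked_flow_loss(pred, u_star, mask):
    """Squared L2 norm of m * (pred - u_star) for a single sample.
    Entries with mask 0 (padding) contribute nothing."""
    return sum((m * (p - u)) ** 2 for p, u, m in zip(pred, u_star, mask))

def euler_sample(velocity_fn, eps, steps=10):
    """Integrate dx/dt = f(x, t) from noise (t=0) toward an action (t=1).
    Assumption: plain Euler integration with a fixed step count."""
    x = list(eps)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check with an oracle velocity field equal to u_star everywhere.
a = [0.5, -1.0, 2.0]
eps = [0.1, 0.2, -0.3]
x_half, u_star = flow_pair(a, eps, 0.5)
sampled = euler_sample(lambda x, t: u_star, eps, steps=10)
```

Because the target velocity $a - \epsilon$ is constant along the straight path, the oracle field carries the noise sample exactly onto the action chunk, which is why few integration steps can suffice for this family of models.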

KV-Cache Conditioning: From each VLM layer $\ell$, keys $K^{vlm}_\ell$ and values $V^{vlm}_\ell$ are projected into the expert head space and cross-attended by the corresponding expert block.

[The display equation for this cross-attention step is missing from the source.]

Gradients are detached for knowledge insulation; $\mathcal{L}_{\mathrm{flow}}$ only updates the expert and its adapters.
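The conditioning step can be sketched as single-head cross-attention in which the expert supplies the queries and the VLM layer supplies projected keys and values. This is a generic illustration under stated assumptions, not the paper's formulation: the projection matrices `W_k` and `W_v`, the scaled dot-product form, and the single-head layout are all hypothetical. Gradient detachment cannot be shown in plain Python; in an autodiff framework the VLM keys and values would be treated as constants.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_attend(queries, vlm_keys, vlm_values, W_k, W_v):
    """Expert queries attend over VLM keys/values projected into expert space.
    Assumption: vlm_keys and vlm_values are cached and receive no gradient."""
    keys = [matvec(W_k, k) for k in vlm_keys]
    values = [matvec(W_v, v) for v in vlm_values]
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out

# Two cached VLM tokens with identical keys, so attention is uniform.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attend(
    queries=[[0.5, 0.5]],
    vlm_keys=[[1.0, 0.0], [1.0, 0.0]],
    vlm_values=[[1.0, 0.0], [3.0, 0.0]],
    W_k=identity,
    W_v=identity,
)
```

Reusing the per-layer cache means the VLM forward pass runs once per observation, while the expert can be evaluated repeatedly across flow integration steps.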

Post-Training Joint Objective:

[The display equation for the joint objective is missing from the source.]

1.4 MolmoAct2-Think: Adaptive-Depth Reasoning

MolmoAct2-Think introduces an adaptive-depth interface to mitigate VLM grounding latency:

Depth Prediction: A depth-token prefix encodes a $10 \times 10$ grid of 128 VQ-codes from a pretrained VQ-VAE model.
Adaptive Update: At each timestep, 10×10 RGB patch similarities to prior frames are computed. Only cells whose cosine similarity falls below a threshold are autoregressively re-predicted; unchanged cells replay cached codes, reducing depth decode cost commensurately.
Training: Mixed action-only, depth-only, depth-and-action examples; 10% noise injection to depth codes; learned per-layer depth gate on expert KV conditioning. Fine-tuning on LIBERO yields a +0.9% average success lift.
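The adaptive update can be sketched as a per-cell cache lookup. The sketch below assumes each grid cell is summarized by a flat feature vector (for example, mean RGB), uses a hypothetical threshold of 0.95 since the actual value is not given in the text, and takes a caller-supplied `predict_code` callback in place of the autoregressive depth decoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        # Treat two all-zero patches as identical, otherwise as changed.
        return 1.0 if nu == nv else 0.0
    return dot / (nu * nv)

def update_depth_codes(prev_patches, curr_patches, cached_codes, predict_code,
                       threshold=0.95):
    """Re-predict depth codes only for cells whose appearance changed.

    Returns (codes, n_repredicted). `threshold` is a hypothetical value.
    """
    codes = []
    n_repredicted = 0
    for idx, (prev, curr) in enumerate(zip(prev_patches, curr_patches)):
        if cosine(prev, curr) < threshold:
            codes.append(predict_code(idx))  # expensive decoder call
            n_repredicted += 1
        else:
            codes.append(cached_codes[idx])  # replay the cached code
    return codes, n_repredicted

# 100 cells (a 10x10 grid); three cells change between frames.
prev = [[1.0, 0.0, 0.0] for _ in range(100)]
curr = [list(p) for p in prev]
for changed in (5, 6, 7):
    curr[changed] = [0.0, 1.0, 0.0]
cached = [0] * 100
codes, n = update_depth_codes(prev, curr, cached, predict_code=lambda idx: 127)
```

In this toy case only 3 of 100 cells trigger the decoder, so the decode cost scales with the amount of scene change rather than with the grid size.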

2. Training Data and Dataset Curation

MolmoAct2’s high performance is grounded in a suite of curated, open datasets spanning a range of hardware and task diversity.

Summary of MolmoAct2 Datasets

Dataset	Hours	Demos/Episodes	Key Features
BimanualYAM	720	34.5K demos	<$6k bimanual robot; 3 RGB cams; 28 tasks
SO100/101	184	38,059	1,222 HuggingFace users; rigorous quality filtering
DROID (Franka)	—	74,604	Idle segment filtering, extended annotations
Other (BC-Z, RT-1, etc.)	—	—	Smaller academic and web mixtures

BimanualYAM: Collected on a <$6k bimanual robot, 34,500 demos, 720 hours, over 28 tasks with rich scene variability and quality-controlled protocols.

SO100/101: Aggregated from 1,222 LeRobot HuggingFace users; 38,059 episodes, ~184 hours, with filtering for schema validity, licensing, codebase eligibility, and a TOPReward gate for quality.

DROID: Filtered Franka dataset; 74,604 successful episodes with idle removal and expanded natural language annotations.

Auxiliary Sources: BC-Z, BridgeData V2, RT-1, MolmoAct1 for broader embodiment coverage.

Language re-annotation uses Qwen3.5-27B to increase lexical diversity, roughly doubling unique instruction token frequency from 22% to 46%.

3. Training, Optimization, and Fine-Tuning Procedures

3.1 Pre-Training (MolmoAct2-Pretrain)

Data Loader: 90% robot, 10% multimodal.
Robot Mixture: 30% each BimanualYAM, SO100/101, DROID; remainder from smaller sources.
Optimizations: AdamW with decoupled weight decay; learning rates 5e-6 (vision/connector) and 1e-5 (LLM); 200K updates with a global batch of 128 (64 × 2 nodes × 4 GPUs), ~5,800 GPU-hours.
Sequence Packing: Dynamic, up to 4,200 tokens per sample from text, image, state, and action.
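Sequence packing under a 4,200-token budget can be illustrated with a greedy first-fit sketch. The packing strategy is an assumption: the text says only that packing is dynamic, so the real data loader may order or group samples differently. The function name and return format are illustrative.

```python
def pack_sequences(lengths, max_tokens=4200):
    """Greedy first-fit packing of samples into sequences of at most max_tokens.

    lengths: token count of each sample (text + image + state + action tokens).
    Returns (bins, loads): sample indices per packed sequence and their totals.
    """
    bins, loads = [], []
    for i, n in enumerate(lengths):
        if n > max_tokens:
            raise ValueError(f"sample {i} has {n} tokens, above the {max_tokens} budget")
        for b, load in enumerate(loads):
            if load + n <= max_tokens:
                bins[b].append(i)
                loads[b] += n
                break
        else:
            # No existing sequence has room, so start a new one.
            bins.append([i])
            loads.append(n)
    return bins, loads

bins, loads = pack_sequences([3000, 1500, 1200, 2500, 4200])
# bins  -> [[0, 2], [1, 3], [4]]
# loads -> [4200, 4000, 4200]
```

Packing short samples together reduces padding, so a larger share of each batch's tokens contributes to the loss.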

3.2 Post-Training (MolmoAct2-Post)

Add flow expert and per-layer KV bridge.
Same data mixture but shorter robot sequence length (2,100).
Four flow samples per chunk, trained with the post-training joint objective from Section 1.3.
Expert learning rate 5e-5; 100K updates at global batch 128, ~2,300 GPU-hours.

3.3 Embodiment-Specific Fine-Tuning

Identical architecture; robot-only data; eight flow samples per chunk; gradients unblocked to VLM for adaptation.
YAM: 100K updates @64 GPUs.
DROID: 74,604 episodes, 100K updates @32 GPUs.
SO100/101: randomize camera order, same budget.
LIBERO: front + wrist cams, sampled at 10 Hz, 50K updates @32 GPUs.

4. Empirical Evaluation and Benchmarks

MolmoAct2 sets a new state of the art against both open and closed baselines on VLA and embodied-reasoning benchmarks. All performance metrics are reported as in (Fang et al., 4 May 2026).

4.1 Simulation and Real-World Task Benchmarks

Simulation: MolmoSpaces (single-cycle pick-and-place), MolmoBot (challenging pick-and-place), 1,000 episodes per task.
- MolmoBot avg success: 35.6% vs. 18.1% for the compared baseline
- MolmoSpaces avg: 20.6% vs. 10.0%
Real-World Zero-Shot:
- DROID (5 tasks, 15 trials each): 87.1% vs. 45.2% for one compared baseline and 48.4% for MolmoBot
- SO100/101 (5 tasks, 15 trials each): 56.7% vs. 45.3% for one compared baseline and 2.3% for SmolVLA

4.2 Embodied-Reasoning and Downstream Fine-Tuning

Embodied-Reasoning (13 tasks): Molmo2-ER 63.8% avg, exceeding Gemini-ER1.5 (61.3%) and GPT-5 (57.9%).
LIBERO: MolmoAct2 avg 97.2% vs. 96.9% for one compared baseline and 97.0% for GR00T.
RoboEval: MolmoAct2 44.3% vs. 40.5% for one compared baseline and 30.2% for MolmoAct1.
YAM Real-World Suite (8 tasks): MolmoAct2 50.1% vs. runner-up 35.1%.
Adaptive-Depth (MolmoAct2-Think): LIBERO 98.1% vs. MolmoAct2 97.2%; ablating depth noise and the depth gate lowers this to 97.65% and 97.50%, respectively.

5. Deployment and Practical Integration

5.1 Hardware and Latency

Training: Up to 64 × A100/H100 GPUs.
Inference (1 × H100, action horizon=10):
- MolmoAct2: 23.0 Hz baseline, 27.4 Hz with caching (1.19× over baseline), 55.8 Hz with CUDA Graph (2.04× over caching).
- MolmoAct2-Think: 8.0 Hz baseline, 9.7 Hz with caching, and 12.7 Hz with CUDA Graph.
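The reported speedup factors can be checked with a few lines of arithmetic. The sketch below uses only the rates listed above; the stage labels are illustrative. It shows that the 2.04× factor corresponds to the ratio of 55.8 Hz to 27.4 Hz, that is, the gain of CUDA Graph over the cached configuration, while the cumulative gain over the 23.0 Hz baseline is about 2.43×.

```python
def speedup(new_hz, old_hz):
    """Ratio of two control rates."""
    return new_hz / old_hz

def period_ms(hz):
    """Time budget per policy call in milliseconds."""
    return 1000.0 / hz

def stepwise_speedups(stages):
    """Speedup of each stage relative to the one before it."""
    return [
        (name, speedup(hz, prev_hz))
        for (_, prev_hz), (name, hz) in zip(stages, stages[1:])
    ]

molmoact2 = [("baseline", 23.0), ("caching", 27.4), ("cuda_graph", 55.8)]
think = [("baseline", 8.0), ("caching", 9.7), ("cuda_graph", 12.7)]

steps = stepwise_speedups(molmoact2)           # caching ~1.19x, cuda_graph ~2.04x
cumulative = speedup(55.8, 23.0)               # ~2.43x over the baseline
budget = period_ms(55.8)                       # ~17.9 ms per policy call
think_cumulative = speedup(12.7, 8.0)          # ~1.59x over the Think baseline
```

At 55.8 Hz the per-call budget is just under 18 ms, which is inside the 20 ms period implied by a 50 Hz control loop.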

5.2 Ablations and Best Practices

Per-layer KV conditioning surpasses hidden-state approaches.
Joint discrete and continuous co-training and full fine-tuning yield better results than LoRA or expert-only alternatives.
On-the-fly sequence packing, image augmentations, and randomized camera order promote generalization.
For adaptive depth, per-layer depth gates and 10% train-time noise are recommended.

5.3 Real-World Integration Guidelines

Match camera calibration and order to training metadata; randomize order for each embodiment class.
Adhere to standardized normalization protocols and leverage the released OpenFAST code/data.
Small-batch in-domain fine-tuning (<100K steps) is effective for rapid adaptation.
Exploit cached KV and CUDA Graph APIs to sustain >50 Hz closed-loop control rates.

All model weights, training code, datasets, and OpenFAST components are released at https://github.com/allenai/molmoact2 and on HuggingFace, facilitating reproducible research and deployment (Fang et al., 4 May 2026).

References (1)

Fang et al. (2026). MolmoAct2: Action Reasoning Models for Real-world Deployment.
