MolmoAct2: Open VLA Action Reasoning Model
- MolmoAct2 is a fully open vision-language-action model that leverages a modular architecture and high-quality datasets to enable robust embodied reasoning.
- It integrates a novel VLM backbone, an open action tokenizer that compresses continuous control data into discrete tokens, and a flow-matched continuous policy expert.
- Empirical evaluations show MolmoAct2 achieving higher success rates than both open and closed baselines on simulation and real-world benchmarks.
MolmoAct2 is a fully open action reasoning model for vision-language-action (VLA) robotic control, designed to address four persistent barriers to real-world deployment: closed-source baselines, hardware exclusivity, prohibitive reasoning latency, and low success rates after fine-tuning. It integrates a novel VLM backbone, open action tokenization, a flow-matched continuous policy expert, and adaptive-depth reasoning. Its modular architecture and diverse, high-quality datasets support robust performance across a wide range of embodied reasoning and control tasks. Reported results show it outperforming leading open and closed models on both simulated and real-world benchmarks, with all model weights, code, and data made fully available for unrestricted academic and applied development (Fang et al., 4 May 2026).
1. Model Architecture and Components
1.1 Molmo2-ER Vision–Language Backbone
MolmoAct2 uses Molmo2-ER, a variant of Molmo2 (Qwen3-4B, mid-training checkpoint), as its VLM core. The model is refined via a "specialize-then-rehearse" scheme:
- Stage 1 (Embodied Specialization): Fine-tune for 20K steps on 3.3M embodied samples spanning six pillars—single-image embodied QA, image pointing/detection, video embodied QA, multi-image and ego–exo correspondence, and two synthetic diagnostics (CLEVR, GRiD-3D)—with an 8% fraction of Tulu-3 text included. Sequence length is 4,200, batch size 64, and learning rates 5e-6 (vision & connector), 1e-5 (LLM).
- Stage 2 (Joint Refinement): Continue for 1.5K steps on a 50/50 embodied-to-original mixture with a longer sequence length (16,384) and batch size 1. The embodied weight is selected by Pareto performance over 13 embodied-reasoning tasks.
This approach enables Molmo2-ER to exceed the performance of both open and many closed VLMs, achieving 63.8% average accuracy on embodied-reasoning datasets (compared to 61.3% for GPT-5 Thinking and 61.0% for Gemini-ER1.5).
1.2 OpenFAST Action Tokenizer
OpenFAST compresses 1 second of continuous 32-dimensional robot controls into discrete tokens:
- Pipeline: Normalize actions to 1–99th percentile, apply a frequency-domain transform (e.g., Fast Fourier), quantize coefficients, and apply BPE tokenization over a 2,048-token vocabulary.
- Training Data: 1M randomly subsampled action chunks from five embodiments: 30% each from MolmoAct2-BimanualYAM, SO100/101, DROID, and 3.3% each from BC-Z, BridgeData V2, RT-1.
- Compression Ratio: Typically 8–20 tokens per second.
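The first three pipeline stages can be sketched for a single action channel as follows. This is a minimal illustration, not the released OpenFAST code: the DCT-II transform is an assumption (the text only specifies "a frequency-domain transform"), the quantization scale of 10 is a hypothetical value, all function names are made up for the example, and the BPE stage is omitted.

```python
import math

def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def normalize(chunk, lo, hi):
    """Map [lo, hi] (e.g. 1st/99th percentile bounds) to [-1, 1], clipping outliers.
    Assumes hi > lo."""
    return [max(-1.0, min(1.0, 2.0 * (x - lo) / (hi - lo) - 1.0)) for x in chunk]

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (O(n^2), adequate for short chunks).
    Assumption: OpenFAST's actual transform may differ."""
    n = len(x)
    out = []
    for k in range(n):
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * acc)
    return out

def quantize(coeffs, scale=10.0):
    """Scale-and-round quantization; `scale` is a hypothetical hyperparameter."""
    return [round(c * scale) for c in coeffs]

def encode_channel(chunk, lo, hi, scale=10.0):
    """Normalize -> transform -> quantize. The integer output would then be
    flattened across channels and passed to a BPE tokenizer (not shown)."""
    return quantize(dct2(normalize(chunk, lo, hi)), scale)

# Usage: bounds come from the training data, then each chunk is encoded.
history = [0.01 * i for i in range(101)]          # toy action values in [0, 1]
lo, hi = percentile(history, 1), percentile(history, 99)
codes = encode_channel([0.5] * 8, lo, hi)
```

Smooth trajectories concentrate their energy in the low-frequency coefficients, so most quantized values are zero, which is what makes the subsequent BPE step effective at shortening the sequence.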
1.3 Flow-Matching Continuous-Action Expert
Following discrete policy pre-training, a 36-layer DiT-style flow-matched continuous policy ("flow expert") is attached via per-layer key–value (KV) cache conditioning:
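- Objective: For a normalized action chunk $a$, Gaussian noise $\epsilon$, and flow time $t$, define a noised chunk $a_t$ that interpolates between $\epsilon$ and $a$, together with its target velocity $u$.
Minimize the loss:
$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{a,\epsilon,t}\left[\left\lVert m \odot \left(v_\theta(a_t, t, c) - u\right)\right\rVert_2^2\right]$$
where $v_\theta$ is the velocity predicted by the expert, $m$ masks padded time steps/dimensions, and $c$ is the VLM context.
- KV-Cache Conditioning: From each VLM layer $\ell$, keys $K_\ell$ and values $V_\ell$ are projected into the expert head space and cross-attended by the expert block:
$$\mathrm{CrossAttn}\big(Q^{\text{exp}}_\ell,\ P^{K}_\ell\,\mathrm{sg}(K_\ell),\ P^{V}_\ell\,\mathrm{sg}(V_\ell)\big)$$
where $Q^{\text{exp}}_\ell$ are the expert queries, $P^{K}_\ell$ and $P^{V}_\ell$ are the learned projections, and $\mathrm{sg}(\cdot)$ denotes stop-gradient.
Gradients are detached for knowledge insulation; $\mathcal{L}_{\text{flow}}$ only updates the expert and its adapters.
- Post-Training Joint Objective: The discrete action-token loss and $\mathcal{L}_{\text{flow}}$ are optimized jointly.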
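A masked flow-matching regression loss of the kind described above can be sketched as follows. This assumes one common convention, the linear (rectified-flow) interpolant $a_t = (1-t)\,\epsilon + t\,a$ with target velocity $u = a - \epsilon$; the source does not state which convention MolmoAct2 uses. The chunk is flattened to a 1-D list, `v_pred` stands in for the expert's output, and normalizing by the number of unmasked entries is also an assumption.

```python
def interpolate(a, eps, t):
    """Noised chunk a_t = (1 - t) * eps + t * a (assumed linear interpolant)."""
    return [(1.0 - t) * e + t * x for x, e in zip(a, eps)]

def target_velocity(a, eps):
    """Target velocity u = a - eps for the linear interpolant above."""
    return [x - e for x, e in zip(a, eps)]

def masked_flow_loss(v_pred, a, eps, mask):
    """Mean squared error between predicted and target velocity,
    counting only entries where mask == 1 (padding has mask == 0)."""
    u = target_velocity(a, eps)
    sq = [m * (p - ui) ** 2 for p, ui, m in zip(v_pred, u, mask)]
    denom = sum(mask)
    return sum(sq) / denom if denom else 0.0

# Usage with a toy 3-element chunk whose last entry is padding.
a = [1.0, 2.0, 0.0]
eps = [0.0, 1.0, 5.0]
mask = [1, 1, 0]
a_half = interpolate(a, eps, 0.5)              # input the expert would see at t = 0.5
loss = masked_flow_loss([0.0, 0.0, 0.0], a, eps, mask)
```

In a real training loop, `v_pred` would come from the expert evaluated on `a_half`, `t`, and the detached VLM keys and values, so the loss gradient reaches only the expert and its adapters.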
1.4 MolmoAct2-Think: Adaptive-Depth Reasoning
MolmoAct2-Think introduces an adaptive-depth interface to mitigate VLM grounding latency:
- Depth Prediction: A depth-token prefix encodes a grid of 128 VQ-codes from a pretrained VQ-VAE model.
- Adaptive Update: At each timestep, 10×10 RGB patch similarities to prior frames are computed. Only cells whose cosine similarity falls below a threshold are autoregressively re-predicted; unchanged cells replay cached codes, reducing depth decode cost commensurately.
- Training: Mixed action-only, depth-only, depth-and-action examples; 10% noise injection to depth codes; learned per-layer depth gate on expert KV conditioning. Fine-tuning on LIBERO yields a +0.9% average success lift.
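The adaptive update rule can be illustrated with a small sketch. It assumes each grid cell is summarized by a flat feature vector of its RGB patch; the threshold of 0.9 is a hypothetical value (the source value is not available here), and `predict_code` is a hypothetical stand-in for the model's autoregressive depth decoding.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; two zero vectors count as identical."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0 if nu == nv else 0.0
    return dot / (nu * nv)

def update_depth_codes(prev_patches, cur_patches, cached_codes, predict_code,
                       threshold=0.9):
    """Re-predict depth codes only for cells whose patch changed.

    Returns the new code list and the number of cells that were re-decoded.
    `threshold` is a hypothetical value chosen for illustration.
    """
    new_codes, n_redecoded = [], 0
    for idx, (prev, cur) in enumerate(zip(prev_patches, cur_patches)):
        if cosine_similarity(prev, cur) < threshold:
            new_codes.append(predict_code(idx))   # changed cell: decode again
            n_redecoded += 1
        else:
            new_codes.append(cached_codes[idx])   # unchanged cell: replay cache
    return new_codes, n_redecoded

# Usage: three cells, only the middle one changes between frames.
prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
codes, n = update_depth_codes(prev, cur, [5, 6, 7], lambda i: 100 + i)
```

The decode cost scales with `n` rather than with the full grid size, which is where the latency saving comes from in mostly static scenes.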
2. Training Data and Dataset Curation
MolmoAct2’s performance is grounded in a suite of curated, open datasets spanning diverse hardware and tasks.
Summary of MolmoAct2 Datasets
| Dataset | Hours | Demos/Episodes | Key Features |
|---|---|---|---|
| BimanualYAM | 720 | 34.5K demos | <$6k bimanual robot; 3 RGB cams; 28 tasks |
| SO100/101 | 184 | 38,059 | 1,222 HuggingFace users; rigorous quality filtering |
| DROID (Franka) | — | 74,604 | Idle segment filtering, extended annotations |
| Other (BC-Z, RT-1, etc.) | — | — | Smaller academic and web mixtures |
BimanualYAM: Collected on a <$6k bimanual robot, 34,500 demos, 720 hours, over 28 tasks with rich scene variability and quality-controlled protocols.
SO100/101: Aggregated from 1,222 LeRobot HuggingFace users; 38,059 episodes, ~184 hours, with filtering for schema validity, licensing, codebase eligibility, and a TOPReward gate for quality.
DROID: Filtered Franka dataset; 74,604 successful episodes with idle removal and expanded natural language annotations.
Auxiliary Sources: BC-Z, BridgeData V2, RT-1, MolmoAct1 for broader embodiment coverage.
Language re-annotation uses Qwen3.5-27B to increase lexical diversity, roughly doubling unique instruction token frequency from 22% to 46%.
3. Training, Optimization, and Fine-Tuning Procedures
3.1 Pre-Training (MolmoAct2-Pretrain)
- Data Loader: 90% robot, 10% multimodal.
- Robot Mixture: 30% each BimanualYAM, SO100/101, DROID; remainder from smaller sources.
- Optimizations: AdamW with decoupled weight decay; learning rates 5e-6 (vision/connector) and 1e-5 (LLM); 200K updates with a global batch of 128 (64 × 2 nodes × 4 GPUs), ~5,800 GPU-hours.
- Sequence Packing: Dynamic, up to 4,200 tokens per sample from text, image, state, and action.
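The two-level mixture above implies the following effective sampling weights. This is a sketch under two assumptions: the 30% figures are fractions of the robot portion, and the "remainder" is therefore 10% of the robot portion, lumped here into a single hypothetical `other` bucket. The function names are illustrative only.

```python
import random

TOP_LEVEL = {"robot": 0.90, "multimodal": 0.10}
ROBOT_MIX = {"BimanualYAM": 0.30, "SO100/101": 0.30, "DROID": 0.30, "other": 0.10}

def effective_weights(top=TOP_LEVEL, robot=ROBOT_MIX):
    """Flatten the two-level mixture into per-source sampling probabilities."""
    weights = {name: top["robot"] * w for name, w in robot.items()}
    weights["multimodal"] = top["multimodal"]
    return weights

def sample_sources(n, seed=0):
    """Draw n source names according to the effective weights."""
    rng = random.Random(seed)
    w = effective_weights()
    return rng.choices(list(w), weights=list(w.values()), k=n)

# Usage: each large robot dataset ends up with about 27% of all samples.
weights = effective_weights()
batch_sources = sample_sources(8)
```

Under these assumptions each of the three large robot datasets contributes roughly 27% of training samples, the smaller robot sources 9% combined, and multimodal data 10%.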
3.2 Post-Training (MolmoAct2-Post)
- Add flow expert and per-layer KV bridge.
- Same data mixture but shorter robot sequence length (2,100).
- Four flow samples per chunk, trained with the post-training joint objective from Section 1.3.
- Expert learning rate 5e-5; 100K updates at global batch 128, ~2,300 GPU-hours.
3.3 Embodiment-Specific Fine-Tuning
- Identical architecture; robot-only data; eight flow samples per chunk; gradients unblocked to VLM for adaptation.
- YAM: 100K updates @64 GPUs.
- DROID: 74,604 episodes, 100K updates @32 GPUs.
- SO100/101: randomize camera order, same budget.
- LIBERO: front + wrist cams, sampled at 10 Hz, 50K updates @32 GPUs.
4. Empirical Evaluation and Benchmarks
MolmoAct2 sets a new state of the art on VLA and embodied-reasoning benchmarks relative to both open and closed models. All performance metrics below are as reported in (Fang et al., 4 May 2026).
4.1 Simulation and Real-World Task Benchmarks
- Simulation: MolmoSpaces (single-cycle pick-and-place), MolmoBot (challenging pick-and-place), 1,000 episodes per task.
- MolmoBot avg success: 35.6% vs. 18.1% (baseline)
- MolmoSpaces avg: 20.6% vs. 10.0%
- Real-World Zero-Shot:
- DROID (5 tasks, 15 trials each): 87.1% vs. 45.2% (baseline), 48.4% (MolmoBot)
- SO100/101 (5 tasks, 15 trials each): 56.7% vs. 45.3% (baseline) and 2.3% (SmolVLA)
4.2 Embodied-Reasoning and Downstream Fine-Tuning
- Embodied-Reasoning (13 tasks): Molmo2-ER 63.8% avg, exceeding Gemini-ER1.5 (61.3%) and GPT-5 (57.9%).
- LIBERO: MolmoAct2 avg 97.2% vs. 96.9% (baseline), GR00T 97.0%.
- RoboEval: MolmoAct2 44.3% vs. 40.5% (baseline), MolmoAct1 30.2%.
- YAM Real-World Suite (8 tasks): MolmoAct2 50.1% vs. runner-up 35.1%.
- Adaptive-Depth (MolmoAct2-Think): LIBERO 98.1% vs. 97.2% for MolmoAct2; ablating depth noise or the depth gate lowers this to 97.65% and 97.50%, respectively.
5. Deployment and Practical Integration
5.1 Hardware and Latency
- Training: Up to 64 × A100/H100 GPUs.
- Inference (1 × H100, action horizon=10):
- MolmoAct2: 23.0 Hz baseline, 27.4 Hz with caching (1.19× over baseline), 55.8 Hz with CUDA Graph (a further 2.04×, about 2.43× overall).
- MolmoAct2-Think: 8.0 Hz, 9.7 Hz, and 12.7 Hz after respective optimizations.
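The speedup factors follow directly from the reported rates. The short sketch below recomputes them, treating each listed rate as the result of adding one more optimization to the previous configuration (an interpretation consistent with the 1.19× and 2.04× figures); the helper name is illustrative.

```python
def speedups(rates_hz):
    """Return (step-wise, cumulative) speedup factors for successive rates."""
    step = [b / a for a, b in zip(rates_hz, rates_hz[1:])]
    total = [r / rates_hz[0] for r in rates_hz[1:]]
    return step, total

MOLMOACT2_HZ = [23.0, 27.4, 55.8]   # baseline, + caching, + CUDA Graph
THINK_HZ = [8.0, 9.7, 12.7]         # same sequence for MolmoAct2-Think

step, total = speedups(MOLMOACT2_HZ)
think_step, think_total = speedups(THINK_HZ)

# Per-inference latency budget implied by the fastest configuration.
period_ms = 1000.0 / MOLMOACT2_HZ[-1]
```

For MolmoAct2 the step-wise factors round to 1.19× and 2.04×, about 2.43× cumulatively, and 55.8 Hz corresponds to a period of roughly 18 ms per inference.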
5.2 Ablations and Best Practices
- Per-layer KV conditioning surpasses hidden-state approaches.
- Joint discrete and continuous co-training and full fine-tuning yield better results than LoRA or expert-only alternatives.
- On-the-fly sequence packing, image augmentations, and randomized camera order promote generalization.
- For adaptive depth, per-layer depth gates and 10% train-time noise are recommended.
5.3 Real-World Integration Guidelines
- At deployment, match camera calibration and order to the training metadata; during training, randomize camera order within each embodiment class.
- Adhere to standardized normalization protocols and leverage the released OpenFAST code/data.
- Small-batch in-domain fine-tuning (<100K steps) is effective for rapid adaptation.
- Exploit cached KV and CUDA Graph APIs to sustain >50 Hz closed-loop control rates.
All model weights, training code, datasets, and OpenFAST components are released at https://github.com/allenai/molmoact2 and on HuggingFace, facilitating reproducible research and deployment (Fang et al., 4 May 2026).