MolmoAct2: Open VLA Action Reasoning Model

Updated 3 July 2026
  • MolmoAct2 is a fully open vision-language-action model that leverages a modular architecture and high-quality datasets to enable robust embodied reasoning.
  • It combines a novel VLM backbone, an open action tokenizer that compresses continuous control data into discrete tokens, and a flow-matched continuous policy expert.
  • Empirical evaluations show MolmoAct2 outperforms both open and closed baselines in simulation and real-world benchmarks, achieving superior success rates.

MolmoAct2 is a fully open action reasoning model for vision-language-action (VLA) robotic control, engineered to overcome persistent barriers to real-world deployment. Addressing closed-source baselines, hardware exclusivity, prohibitive reasoning latency, and insufficient fine-tuned success rates, MolmoAct2 integrates a novel VLM backbone, open action tokenization, a flow-matched continuous policy expert, and adaptive-depth reasoning. Its modular, multilayered architecture and diverse, high-quality datasets enable robust performance across a wide spectrum of embodied reasoning and control tasks. Empirical results demonstrate consistently superior performance to leading open and closed models across both simulated and real-world benchmarks, with all model weights, code, and data made fully available for unrestricted academic and applied development (Fang et al., 4 May 2026).

1. Model Architecture and Components

1.1 Molmo2-ER Vision–Language Backbone

MolmoAct2 utilizes Molmo2-ER, a variant of Molmo2 (Qwen3–4B, mid-training checkpoint) as its VLM core. The model is refined via a "specialize-then-rehearse" scheme:

  • Stage 1 (Embodied Specialization): Fine-tune for 20K steps on 3.3M embodied samples spanning six pillars—single-image embodied QA, image pointing/detection, video embodied QA, multi-image and ego–exo correspondence, and two synthetic diagnostics (CLEVR, GRiD-3D)—with an 8% fraction of Tulu-3 text included. Sequence length is 4,200, batch size 64, and learning rates 5e-6 (vision & connector), 1e-5 (LLM).
  • Stage 2 (Joint Refinement): Continue for 1.5K steps on a 50/50 embodied-to-original mixture (longer sequence length of 16,384, batch size 1). The embodied weight $p = 0.5$ is selected by Pareto performance over 13 embodied-reasoning tasks.

This approach enables Molmo2-ER to exceed the performance of both open and many closed VLMs, achieving 63.8% average accuracy on embodied-reasoning datasets (compared to 61.3% for GPT-5 Thinking and 61.0% for Gemini-ER1.5).

1.2 OpenFAST Action Tokenizer

OpenFAST compresses 1 second of continuous 32-dimensional robot controls into discrete tokens:

  • Pipeline: Normalize actions to the 1st–99th percentile range, apply a frequency-domain transform (e.g., a fast Fourier transform), quantize the coefficients, and apply BPE tokenization over a 2,048-token vocabulary.
  • Training Data: 1M randomly subsampled action chunks from five embodiments: 30% each from MolmoAct2-BimanualYAM, SO100/101, DROID, and 3.3% each from BC-Z, BridgeData V2, RT-1.
  • Compression Ratio: Typically 8–20 tokens per second.
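
The first three pipeline stages can be sketched as follows. This is a minimal illustration, not the released OpenFAST code: it assumes a discrete cosine transform as the frequency-domain transform, a hypothetical quantization `scale`, and a chunk length `T` set by the control rate. The BPE stage is omitted, and all function names are illustrative.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix (n x n); its transpose is its inverse."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time index
    m = np.cos(np.pi * (2 * t + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)
    return m

def normalize(chunk: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Map each dimension's [1st, 99th] percentile range to [-1, 1] and clip."""
    return np.clip(2.0 * (chunk - lo) / (hi - lo + 1e-8) - 1.0, -1.0, 1.0)

def encode(chunk: np.ndarray, lo: np.ndarray, hi: np.ndarray,
           scale: float = 10.0) -> np.ndarray:
    """(T, D) actions -> (T, D) integer coefficients (scale is an assumption)."""
    x = normalize(chunk, lo, hi)
    coeffs = dct_matrix(x.shape[0]) @ x   # transform along the time axis
    return np.round(coeffs * scale).astype(np.int64)

def decode(q: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Undo quantization and the DCT; returns normalized actions."""
    return dct_matrix(q.shape[0]).T @ (q / scale)
```

Smooth trajectories concentrate their energy in a few low-frequency coefficients, so most quantized entries are zero or repeated, which is what a subsequent BPE pass can exploit.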

1.3 Flow-Matching Continuous-Action Expert

Following discrete policy pre-training, a 36-layer DiT-style flow-matched continuous policy ("flow expert") is attached via per-layer key–value (KV) cache conditioning:

  • Objective: For a normalized action chunk $a$, noise $\epsilon \sim \mathcal{N}(0, I)$, and $t \in [0,1]$, define:

$$x_t = (1-t)\epsilon + t a, \quad u^\star = a - \epsilon$$

Minimize the loss:

$$\mathcal{L}_{\mathrm{flow}} = \mathbb{E}_{a,\epsilon,t}\left\| m \odot \left( f_\theta(x_t, t, c) - u^\star \right) \right\|_2^2$$

where $m$ masks padded time steps/dimensions and $c$ is the VLM context.
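
A Monte Carlo estimate of this objective for one batch might look like the sketch below. The callable `f_theta`, the array shapes, and the uniform sampling of $t$ are assumptions made for illustration; the source does not state how $t$ is sampled.

```python
import numpy as np

def flow_matching_loss(f_theta, a, mask, context, rng):
    """Masked flow-matching loss for a batch.

    a:    (B, T, D) normalized action chunks
    mask: (B, T, D) with 1 for valid entries and 0 for padding
    f_theta(x_t, t, context) must return a (B, T, D) velocity prediction.
    """
    eps = rng.standard_normal(a.shape)                  # eps ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(a.shape[0], 1, 1))  # one t per sample (assumed uniform)
    x_t = (1.0 - t) * eps + t * a                       # interpolate noise -> data
    u_star = a - eps                                    # target velocity
    residual = mask * (f_theta(x_t, t, context) - u_star)
    return float(np.mean(np.sum(residual ** 2, axis=(1, 2))))
```

Because $x_t$ is a straight line from $\epsilon$ to $a$, the target velocity $u^\star$ is constant along the path, which is why the regression target does not depend on $t$.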

  • KV-Cache Conditioning: From each VLM layer $\ell$, keys $K^{vlm}_\ell$ and values $V^{vlm}_\ell$ are projected into the expert head space and cross-attended by the expert block.

Gradients are detached for knowledge insulation, so the flow-matching loss only updates the expert and its adapters.
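
The conditioning path for a single layer can be sketched as ordinary cross-attention. The single-head form, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are assumptions for illustration, not the paper's exact formulation. NumPy has no autograd, so the gradient detachment is only noted in a comment.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(h_expert, k_vlm, v_vlm, w_q, w_k, w_v):
    """Single-head cross-attention from expert tokens to one VLM layer's cache.

    h_expert: (N, d_e) expert hidden states
    k_vlm, v_vlm: (M, d_v) cached VLM keys/values for this layer
    w_q: (d_e, d_h); w_k, w_v: (d_v, d_h) hypothetical adapter projections
    In an autograd framework, k_vlm and v_vlm would be wrapped in a
    stop-gradient so the VLM receives no gradient from the expert.
    """
    q = h_expert @ w_q
    k = k_vlm @ w_k          # project cached keys into the expert head space
    v = v_vlm @ w_v          # project cached values likewise
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v
```

Reusing the cache means the VLM forward pass runs once per observation, while the expert can attend to it at every denoising step.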

  • Post-Training Joint Objective: The flow-matching loss is trained jointly with the discrete action-token loss.

1.4 MolmoAct2-Think: Adaptive-Depth Reasoning

MolmoAct2-Think introduces an adaptive-depth interface to mitigate VLM grounding latency:

  • Depth Prediction: A depth-token prefix encodes a grid of 128 VQ-codes from a pretrained VQ-VAE model.
  • Adaptive Update: At each timestep, 10×10 RGB patch similarities to prior frames are computed. Only cells whose cosine similarity falls below a threshold are autoregressively re-predicted; unchanged cells replay cached codes, reducing depth decode cost commensurately.
  • Training: Mixed action-only, depth-only, depth-and-action examples; 10% noise injection to depth codes; learned per-layer depth gate on expert KV conditioning. Fine-tuning on LIBERO yields a +0.9% average success lift.
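
The adaptive update can be sketched as a per-cell cache refresh. The threshold value `0.98` is a hypothetical placeholder rather than a value from the paper, raw RGB pixels stand in for whatever patch features the model uses, and `predict_cell` is a stand-in for the autoregressive depth decoder.

```python
import numpy as np

def cell_features(frame: np.ndarray, grid: int = 10) -> np.ndarray:
    """Split an (H, W, 3) frame into grid x grid cells, one flat vector per cell."""
    h, w, _ = frame.shape
    ch, cw = h // grid, w // grid
    cells = frame[: ch * grid, : cw * grid].reshape(grid, ch, grid, cw, 3)
    return cells.transpose(0, 2, 1, 3, 4).reshape(grid * grid, -1).astype(np.float64)

def stale_cells(prev, curr, threshold: float = 0.98, grid: int = 10) -> np.ndarray:
    """Boolean mask of cells whose cosine similarity to the prior frame is low."""
    a, b = cell_features(prev, grid), cell_features(curr, grid)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return (a * b).sum(axis=1) / norms < threshold

def update_depth_codes(cached, prev, curr, predict_cell, threshold: float = 0.98):
    """Re-predict codes only for changed cells; all other cells replay the cache."""
    codes = cached.copy()
    for idx in np.flatnonzero(stale_cells(prev, curr, threshold)):
        codes[idx] = predict_cell(int(idx))   # hypothetical decoder call
    return codes
```

In a mostly static scene only a handful of cells cross the threshold, so the number of decoder calls scales with how much of the image changed rather than with the full grid size.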

2. Training Data and Dataset Curation

MolmoAct2’s high performance is grounded in a suite of curated, open datasets spanning a range of hardware and task diversity.

Summary of MolmoAct2 Datasets

| Dataset | Hours | Demos/Episodes | Key Features |
|---|---|---|---|
| BimanualYAM | 720 | 34.5K demos | <$6k bimanual robot; 3 RGB cams; 28 tasks |
| SO100/101 | 184 | 38,059 | 1,222 HuggingFace users; rigorous quality filtering |
| DROID (Franka) | - | 74,604 | Idle segment filtering, extended annotations |
| Other (BC-Z, RT-1, etc.) | - | - | Smaller academic and web mixtures |

BimanualYAM: Collected on a <$6k bimanual robot, 34,500 demos, 720 hours, over 28 tasks with rich scene variability and quality-controlled protocols.

SO100/101: Aggregated from 1,222 LeRobot HuggingFace users; 38,059 episodes, ~184 hours, with filtering for schema validity, licensing, codebase eligibility, and a TOPReward gate for quality.

DROID: Filtered Franka dataset; 74,604 successful episodes with idle removal and expanded natural language annotations.

Auxiliary Sources: BC-Z, BridgeData V2, RT-1, MolmoAct1 for broader embodiment coverage.

Language re-annotation uses Qwen3.5-27B to increase lexical diversity, roughly doubling the share of unique instruction tokens from 22% to 46%.

3. Training, Optimization, and Fine-Tuning Procedures

3.1 Pre-Training (MolmoAct2-Pretrain)

  • Data Loader: 90% robot, 10% multimodal.
  • Robot Mixture: 30% each BimanualYAM, SO100/101, DROID; remainder from smaller sources.
  • Optimizations: AdamW with decoupled weight decay; learning rates 5e-6 (vision/connector) and 1e-5 (LLM); 200K updates with a global batch of 128 (64 × 2 nodes × 4 GPUs), ~5,800 GPU-hours.
  • Sequence Packing: Dynamic, up to 4,200 tokens per sample from text, image, state, and action.
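
Combining the loader split with the robot mixture gives the effective per-source sampling probabilities. The sketch below works through that arithmetic. The 10% share for the smaller robot sources follows from the three 30% shares, and the key `other_robot` and both function names are illustrative, not names from the released code.

```python
import random

LOADER_SPLIT = {"robot": 0.90, "multimodal": 0.10}
ROBOT_MIXTURE = {
    "BimanualYAM": 0.30,
    "SO100/101": 0.30,
    "DROID": 0.30,
    "other_robot": 0.10,   # remainder from smaller sources (illustrative key)
}

def effective_weights() -> dict:
    """Per-source probability of a training sample under both mixtures."""
    weights = {name: LOADER_SPLIT["robot"] * p for name, p in ROBOT_MIXTURE.items()}
    weights["multimodal"] = LOADER_SPLIT["multimodal"]
    return weights

def sample_source(rng: random.Random) -> str:
    """Draw one data source according to the effective weights."""
    weights = effective_weights()
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
```

Under these assumptions each of the three large robot datasets supplies 27% of training samples, the smaller robot sources 9%, and multimodal data 10%.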

3.2 Post-Training (MolmoAct2-Post)

  • Add flow expert and per-layer KV bridge.
  • Same data mixture but shorter robot sequence length (2,100).
  • Four flow samples per chunk, trained with the joint discrete and flow-matching objective.
  • Expert learning rate 5e-5; 100K updates at global batch 128, ~2,300 GPU-hours.

3.3 Embodiment-Specific Fine-Tuning

  • Identical architecture; robot-only data; eight flow samples per chunk; gradients unblocked to VLM for adaptation.
  • YAM: 100K updates @64 GPUs.
  • DROID: 74,604 episodes, 100K updates @32 GPUs.
  • SO100/101: randomize camera order, same budget.
  • LIBERO: front + wrist cams, sampled at 10 Hz, 50K updates @32 GPUs.

4. Empirical Evaluation and Benchmarks

MolmoAct2 sets a new state of the art against both open and closed models on VLA and embodied-reasoning benchmarks. All performance metrics are reported as in (Fang et al., 4 May 2026).

4.1 Simulation and Real-World Task Benchmarks

  • Simulation: MolmoSpaces (single-cycle pick-and-place), MolmoBot (challenging pick-and-place), 1,000 episodes per task.
    • MolmoBot avg success: 35.6% vs. 18.1% (baseline)
    • MolmoSpaces avg: 20.6% vs. 10.0%
  • Real-World Zero-Shot:
    • DROID (5 tasks, 15 trials each): 87.1% vs. 45.2% (baseline), 48.4% (MolmoBot)
    • SO100/101 (5 tasks, 15 trials each): 56.7% vs. 45.3% (baseline) and 2.3% (SmolVLA)

4.2 Embodied-Reasoning and Downstream Fine-Tuning

  • Embodied-Reasoning (13 tasks): Molmo2-ER 63.8% avg, exceeding Gemini-ER1.5 (61.3%) and GPT-5 (57.9%).
  • LIBERO: MolmoAct2 avg 97.2% vs. baselines at 96.9% and 97.0% (GR00T).
  • RoboEval: MolmoAct2 44.3% vs. a baseline at 40.5% and MolmoAct1 at 30.2%.
  • YAM Real-World Suite (8 tasks): MolmoAct2 50.1% vs. runner-up 35.1%.
  • Adaptive-Depth (MolmoAct2-Think): LIBERO 98.1% vs. 97.2% for MolmoAct2; ablating depth-code noise and the depth gate lowers this to 97.65% and 97.50%, respectively.

5. Deployment and Practical Integration

5.1 Hardware and Latency

  • Training: Up to 64 × A100/H100 GPUs.
  • Inference (1 × H100, action horizon=10):
    • MolmoAct2: 23.0 Hz baseline, 27.4 Hz with caching (1.19× over baseline), and 55.8 Hz with CUDA Graph (a further 2.04×).
    • MolmoAct2-Think: 8.0 Hz baseline, 9.7 Hz with caching, and 12.7 Hz with CUDA Graph.
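
The speedup factors follow directly from the reported rates. The short calculation below makes the stepwise and cumulative gains explicit; it assumes the three rates for each model are listed in the order baseline, caching, CUDA Graph, and the helper name `summarize` is illustrative.

```python
RATES_HZ = {
    "MolmoAct2": [23.0, 27.4, 55.8],        # baseline, +caching, +CUDA Graph
    "MolmoAct2-Think": [8.0, 9.7, 12.7],
}

def summarize(rates):
    """Stepwise and total speedups, plus the cycle time at the final rate."""
    base, cached, graphed = rates
    return {
        "caching_speedup": cached / base,
        "cuda_graph_speedup": graphed / cached,
        "total_speedup": graphed / base,
        "final_period_ms": 1000.0 / graphed,
    }

for name, rates in RATES_HZ.items():
    print(name, {k: round(v, 2) for k, v in summarize(rates).items()})
```

For MolmoAct2 the two optimizations compound to roughly 2.43× over the baseline, while MolmoAct2-Think gains about 1.59× overall.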

5.2 Ablations and Best Practices

  • Per-layer KV conditioning surpasses hidden-state approaches.
  • Joint discrete and continuous co-training and full fine-tuning yield better results than LoRA or expert-only alternatives.
  • On-the-fly sequence packing, image augmentations, and randomized camera order promote generalization.
  • For adaptive depth, per-layer depth gates and 10% train-time noise are recommended.

5.3 Real-World Integration Guidelines

  • Match camera calibration and order to training metadata; randomize order for each embodiment class.
  • Adhere to standardized normalization protocols and leverage the released OpenFAST code/data.
  • Small-batch in-domain fine-tuning (<100K steps) is effective for rapid adaptation.
  • Exploit cached KV and CUDA Graph APIs to sustain >50 Hz closed-loop control rates.

All model weights, training code, datasets, and OpenFAST components are released at https://github.com/allenai/molmoact2 and on HuggingFace, facilitating reproducible research and deployment (Fang et al., 4 May 2026).
