
MoMani: Mobile Manipulation Benchmark

Updated 29 November 2025
  • MoMani is a benchmark and dataset suite that rigorously tests long-horizon mobile manipulation in Vision-Language-Action models using both simulation and real-robot scenarios.
  • It employs an MLLM-driven planning pipeline with expert-level trajectory generation, integrating navigation, manipulation, and memory tasks.
  • MoMani provides precise metrics and an extensible API to enable robust evaluation and sim-to-real transfer for embodied AI systems.

MoMani is a benchmark and dataset suite designed to evaluate long-horizon mobile manipulation capabilities in generalist Vision-Language-Action (VLA) models. Developed to stress test agents on tasks that require intricate coordination among navigation, manipulation, and spatial or semantic memory, MoMani offers an automated, scalable pipeline for generating expert-level trajectories using multimodal LLM (MLLM) planning and refinement, augmented by real-robot demonstrations. It addresses the current gap in embodied AI benchmarks by scaling beyond short, table-top manipulation to scenarios that demand persistent memory and sequential reasoning across hundreds of low-level control actions (Lin et al., 22 Nov 2025).

1. Benchmark Objectives and Task Taxonomy

The principal aim of MoMani is to provide a systematic, extensible evaluation platform for long-horizon mobile-manipulation tasks. Each episode requires a robot to complete a single English-language instruction by executing an interleaved sequence of perception, navigation, grasping, and object-interaction actions (e.g., open/close, push/pull), often with further navigation between manipulation steps. MoMani distinguishes itself from prior embodied AI suites by enforcing trajectory lengths of 600 to 800 primitive actions per episode, in contrast to the ∼278-step RoboCasa baseline. All tasks are instantiated with realistic spatial and semantic variation, both in simulation and on real hardware.

The benchmark categorizes tasks as follows:

  • TOF (Tidy Operation–Fetch): Fetching and tidying operations requiring retrieval and relocation of objects.
  • PnPS2C (Pick-and-Place: Sink → Counter): Pick-and-place from the sink to the counter.
  • PnPC2S (Pick-and-Place: Counter → Sink): Pick-and-place from the counter to the sink.
  • TOS (Tidy Operation–Sort): Sorting operations involving multi-object manipulation and placement.
  • Baselines: RoboCasa (short-horizon “everyday tasks” with ∼278 steps) and Nav-only (pure navigation without manipulation).

Simulation environments use RoboCasa kitchen and living room scenes with ∼50 variations per task category, while real-world settings consist of a lab kitchen with appliances (refrigerator, microwave, cabinet, sink) and 4–8 randomized trials per task (Lin et al., 22 Nov 2025).
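For concreteness, the taxonomy can be captured as a small task registry. The following is a minimal Python sketch; the task codes and the ∼50-variation figure come from the benchmark description, while the dictionary structure and field names are illustrative assumptions:

```python
# Minimal task registry mirroring the MoMani taxonomy above.
# Task codes and variation counts follow the benchmark description;
# the structure itself is an illustrative assumption.
TASKS = {
    "TOF":    {"desc": "tidy operation: fetch and relocate objects", "sim_variations": 50},
    "PnPS2C": {"desc": "pick-and-place, sink to counter",            "sim_variations": 50},
    "PnPC2S": {"desc": "pick-and-place, counter to sink",            "sim_variations": 50},
    "TOS":    {"desc": "tidy operation: sort and place objects",     "sim_variations": 50},
}

BASELINES = {
    "RoboCasa": "short-horizon everyday tasks (~278 steps)",
    "Nav-only": "pure navigation without manipulation",
}
```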

2. Data Generation Pipeline

MoMani employs an automated pipeline that begins with MLLM-guided planning, utilizing an in-house adaptation of GPT-4 Vision as the central planner. Task inputs include both tokenized language instructions (via BPE) and visual observations (RGB images and voxelized point clouds), the latter encoded by a ViT backbone and 3D sparse U-Net to yield per-voxel memory features.
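As a rough sketch of these two input streams, the snippet below BPE-tokenizes an instruction and voxelizes a point cloud. Here tiktoken stands in for the unspecified BPE tokenizer, and the voxel size is an assumption:

```python
import numpy as np
import tiktoken  # stand-in BPE tokenizer; MoMani's actual tokenizer is unspecified

def encode_inputs(instruction: str, points: np.ndarray, voxel_size: float = 0.05):
    """Tokenize the language instruction and voxelize an (N, 3) point cloud."""
    token_ids = tiktoken.get_encoding("cl100k_base").encode(instruction)
    # Quantize points to voxel indices; duplicate points collapse to unique voxels.
    voxels = np.unique(np.floor(points / voxel_size).astype(np.int32), axis=0)
    return token_ids, voxels  # voxel indices would feed the 3D sparse U-Net
```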

High-level sub-goals are generated in sequence (e.g., “Navigate to cabinet → Open door → Grasp cup → Navigate to sink → Place cup”), and each sub-goal is mapped to joint-space actions by a low-level diffusion policy (50 denoising steps per action segment). In simulation, rollouts are performed with full-state access and zero noise to construct “expert” datasets.
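A minimal sketch of how such a diffusion policy could produce an action segment, assuming a standard DDPM-style reverse process with the 50-step schedule noted above; the `denoiser` network, noise schedule, and segment shape are all assumptions:

```python
import numpy as np

def sample_action_segment(denoiser, obs, horizon=16, action_dim=7, n_steps=50):
    """DDPM-style reverse diffusion over a joint-space action segment.

    `denoiser(x, t, obs)` is a hypothetical noise-prediction network; the
    50 denoising steps match the benchmark's low-level policy description.
    """
    betas = np.linspace(1e-4, 2e-2, n_steps)   # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = np.random.randn(horizon, action_dim)   # start from pure Gaussian noise
    for t in reversed(range(n_steps)):
        eps = denoiser(x, t, obs)              # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])  # posterior mean
        if t > 0:                                  # no added noise on final step
            x += np.sqrt(betas[t]) * np.random.randn(horizon, action_dim)
    return x  # denoised joint-space action segment
```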

Refinement is feedback-driven: rollouts are monitored for base or arm collisions, sub-goal completion, and step limit violations. On any failure, the MLLM is re-prompted with diagnostic information (failed sub-goal, latest perception), with at most three attempts per segment allowed (Lin et al., 22 Nov 2025).
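The retry logic can be summarized in a short sketch; the three-attempt budget and the diagnostics (failed sub-goal, latest perception) come from the benchmark description, while the callables and record fields are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # per-segment retry budget from the benchmark description

@dataclass
class SegmentResult:
    success: bool
    failure_reason: Optional[str] = None   # e.g. "collision", "step_limit"
    last_observation: Optional[object] = None

def execute_with_refinement(rollout: Callable[[str], SegmentResult],
                            replan: Callable[[str, SegmentResult], str],
                            subgoal: str) -> SegmentResult:
    """Roll out a sub-goal; on failure, re-prompt the MLLM planner with
    diagnostics and retry, up to MAX_ATTEMPTS total attempts."""
    result = rollout(subgoal)
    for _ in range(MAX_ATTEMPTS - 1):
        if result.success:
            break
        subgoal = replan(subgoal, result)  # MLLM re-prompt with diagnostics
        result = rollout(subgoal)
    return result
```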

For real-robot data, MoMani uses a TidyBot++ platform (holonomic base, 7-DoF arm), recording at 10 Hz (RGB, point cloud) and 20 Hz (joint states). Each expert demonstration is temporally and spatially aligned to simulated layouts using ICP.
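A point-to-point ICP alignment in this spirit can be done with Open3D; this is a generic sketch, since the benchmark does not specify its ICP settings:

```python
import numpy as np
import open3d as o3d

def align_real_to_sim(real_ply: str, sim_ply: str, threshold: float = 0.05):
    """Estimate the rigid transform mapping a real-robot point cloud onto its
    simulated layout via point-to-point ICP (threshold in meters, assumed)."""
    real = o3d.io.read_point_cloud(real_ply)   # e.g. a pc_{t}.ply from a demo
    sim = o3d.io.read_point_cloud(sim_ply)
    reg = o3d.pipelines.registration.registration_icp(
        real, sim, threshold, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return reg.transformation  # 4x4 homogeneous transform, real -> sim frame
```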

3. MoMani Dataset Structure and Statistics

The resulting dataset comprises 4,000 simulated trajectories (~2.7 million steps) and extensive real-world records. Detailed statistics for the four principal tasks and baselines are as follows:

Task       # Trajectories   Avg. steps (sim)   Avg. steps (real)
TOF        1,000            630.99             168.4
PnPS2C     1,000            755.56             191.0
PnPC2S     1,000            703.12             179.3
TOS        1,000            645.22             126.2
RoboCasa   500              278.00             N/A
Nav-only   200              103.62             N/A

The average action type distribution across all simulated trajectories is: Navigation (62%), Grasp/Release (12%), Open/Close (10%), Push/Pull (8%), Place (8%) (Lin et al., 22 Nov 2025).

The dataset format is structured for reproducibility and policy benchmarking; a minimal loader sketch follows the list:

  • Each simulation trajectory: a folder containing per-timestep rgb_{t}.png, pc_{t}.ply, q_{t}.npy (joint states), tau_{t}.npy (joint-space actions), and plan_{t}.json (sub-goal).
  • Real trajectories add odom_{t}.csv from T265 tracking.
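A minimal loader for this layout might look as follows; the file names follow the documented scheme, while the returned dictionary keys are illustrative (the PLY point cloud is omitted, as it would require a reader such as Open3D):

```python
import json
from pathlib import Path
import numpy as np
from PIL import Image

def load_step(traj_dir: Path, t: int) -> dict:
    """Read one timestep of a MoMani simulation trajectory folder."""
    return {
        "rgb": np.asarray(Image.open(traj_dir / f"rgb_{t}.png")),       # HxWx3 image
        "joints": np.load(traj_dir / f"q_{t}.npy"),                     # joint states
        "action": np.load(traj_dir / f"tau_{t}.npy"),                   # joint-space action
        "plan": json.loads((traj_dir / f"plan_{t}.json").read_text()),  # sub-goal
    }
```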

The official MoMani code and dataset are hosted at https://github.com/stanford-robotics-lab/momani.

4. Evaluation Metrics and Protocol

MoMani enforces rigorous performance measurement through four standard metrics, allowing for detailed comparison of navigation, manipulation, and integrated policies:

  • Success Rate (SR):

\text{SR} = \frac{\text{Number of successful episodes}}{\text{Total episodes}}

  • Normalized Path Length (SPL) [Anderson et al. 2018]:

\text{SPL} = \frac{1}{N}\sum_{i=1}^{N} \frac{\ell_i^*}{\max(\ell_i,\, \ell_i^*)}

where $\ell_i$ is the agent's path length in episode $i$ and $\ell_i^*$ is the geodesic shortest-path length.

  • Mobile-Manipulation Reward (MMR):

\text{MMR} = \alpha \cdot \mathrm{SR}_{\mathrm{nav}} + (1 - \alpha) \cdot \mathrm{SR}_{\mathrm{manip}}

with $\alpha = 0.5$ by default.

  • Task-Completion Time (TCT): Average primitive steps required for success, with a +10% penalty for episodes involving a collision.

Episode success requires completion of all sub-goals within an absolute tolerance of ±5 cm (position) and ±5° (orientation), with an explicit collision penalty if any contact exceeds 0.5 s (Lin et al., 22 Nov 2025).
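Putting the four metrics together, a sketch of their computation from per-episode records follows; the formulas match the definitions above, while the record field names are assumptions:

```python
import numpy as np

def compute_metrics(episodes, alpha=0.5, collision_penalty=0.10):
    """Compute SR, SPL, MMR, and TCT from per-episode records.

    Each record is assumed to carry: success, path_len, geodesic_len,
    nav_success, manip_success, steps, collided (field names illustrative).
    """
    n = len(episodes)
    sr = sum(e["success"] for e in episodes) / n
    spl = np.mean([e["geodesic_len"] / max(e["path_len"], e["geodesic_len"])
                   for e in episodes])
    mmr = (alpha * np.mean([e["nav_success"] for e in episodes])
           + (1 - alpha) * np.mean([e["manip_success"] for e in episodes]))
    # TCT: mean primitive steps over successes, +10% if a collision occurred.
    steps = [e["steps"] * (1 + collision_penalty if e["collided"] else 1)
             for e in episodes if e["success"]]
    tct = float(np.mean(steps)) if steps else float("nan")
    return {"SR": sr, "SPL": float(spl), "MMR": float(mmr), "TCT": tct}
```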

5. Usage, Access, and API

MoMani offers standardized access via a Python library (momani_api.py), which supports:

  • Loading and parsing of trajectory folders into observation, action, and plan representations.
  • Single-call evaluation of user policies via evaluate_policy(policy_fn, task="TOF", n_episodes=100), returning relevant metrics (SR, SPL, MMR).

Both simulated and real datasets follow an identical organizational schema to support seamless transfer and benchmarking. Time- and space-aligned real-world demonstrations facilitate sim-to-real analyses and cross-validation of learned VLA models (Lin et al., 22 Nov 2025).
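A typical call, using the documented entry point; the policy signature (observation in, joint-space action out) and the random placeholder are assumptions:

```python
import numpy as np
from momani_api import evaluate_policy  # documented MoMani entry point

def random_policy(obs):
    """Placeholder policy: a random 7-DoF joint-space action (for illustration)."""
    return np.random.uniform(-1.0, 1.0, size=7)

metrics = evaluate_policy(random_policy, task="TOF", n_episodes=100)
print(metrics)  # expected to contain SR, SPL, and MMR
```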

6. Example Episode and Planning Process

A canonical episode (e.g., “Pick red mug from cabinet and place on counter”) illustrates MoMani’s pipeline:

  • Inputs: Language: “Go to the white cabinet, open its lower door, pick up the red mug, navigate to the counter next to the sink, and place the mug upright.” Initial observations include RGB-D of cabinet face and counter.
  • MLLM Planning: Outputs a stepwise sub-goal sequence. Failures (e.g., arm collision at step 52) trigger re-prompting and sub-goal revision (e.g., “Approach door at 30 cm, align heading ±5°”).
  • Feedback Refinement: Each segment continues until the sub-goal criteria are met. In the documented expert trajectory, the final sequence comprises base motion (steps 0–100), door opening (100–150), grasping (150–210), navigation (210–380), and final placement (380–420).

The episode totals 420 primitive steps, with all sub-goals satisfied and no persistent collisions (Lin et al., 22 Nov 2025).
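For this episode, a plan_{t}.json record might resemble the following Python literal; the field names are assumptions, while the step ranges follow the documented trajectory:

```python
# Illustrative sub-goal plan for the episode above (field names assumed).
plan = {
    "instruction": "Pick red mug from cabinet and place on counter",
    "subgoals": [
        {"action": "navigate", "target": "white cabinet",      "steps": [0, 100]},
        {"action": "open",     "target": "lower cabinet door", "steps": [100, 150]},
        {"action": "grasp",    "target": "red mug",            "steps": [150, 210]},
        {"action": "navigate", "target": "counter by sink",    "steps": [210, 380]},
        {"action": "place",    "target": "counter",            "steps": [380, 420]},
    ],
}
```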

7. Context and Significance

MoMani represents a shift in embodied AI evaluation by offering an extensible, memory-centric, and MLLM-driven framework for long-horizon mobile manipulation. The combination of simulated expert rollouts, adaptive feedback refinement, and real-robot alignment enables robust testing for policies that must operate reliably across varied and complex spatial environments.

By supplementing its substantial synthetic dataset with aligned real-world demonstrations, MoMani explicitly supports sim-to-real policy transfer and ablation. Its multi-faceted metrics and API design facilitate direct benchmarking of VLA models, driving further research in memory, planning, and robustness for physically embodied intelligent agents (Lin et al., 22 Nov 2025).
