AIRoA MoMa: Mobile Manipulation Dataset

Updated 1 October 2025
  • AIRoA MoMa Dataset is a large-scale multimodal dataset designed for mobile manipulation research, featuring synchronized sensor streams and hierarchical annotations.
  • It comprises 25,469 episodes over 94 hours at 30 Hz with dual-view RGB images, proprioceptive data, force-torque signals, and teleoperation commands.
  • The dataset enables benchmarking of complex vision-language-action tasks and supports the development of error-resilient, contact-rich, long-horizon autonomous control models.

The AIRoA MoMa Dataset is a large-scale, real-world, multimodal dataset specifically designed for research in mobile manipulation. Collected using the Toyota Human Support Robot (HSR), it enables the investigation and advancement of robust Vision-Language-Action (VLA) models by providing synchronized sensor data streams, hierarchical annotation of complex tasks, and unique resources for benchmarking contact-rich, long-horizon manipulation in unconstrained human environments.

1. Dataset Structure and Modalities

The dataset comprises 25,469 episodes, totaling approximately 94 hours, with all data sampled at 30 Hz. Each episode records multimodal streams that capture the intricacies of mobile manipulation tasks:

  • Visual Data: Dual-view RGB images of shape 480×640×3 (height × width × RGB channels). One camera is head-mounted, providing global scene coverage, and the other is wrist-mounted, providing local manipulation context.
  • Proprioceptive Data: Includes the arm, gripper, lifting torso, and head joint states of the HSR—capturing both joint angles and velocities, essential for precise motion control.
  • Force-Torque Sensing: Six-axis wrist signals (Fx, Fy, Fz, Mx, My, Mz) crucial for contact-rich interaction and haptic reasoning.
  • Teleoperation Signals: Raw operator commands provide ground-truth action trajectories and facilitate work on action representation learning (an illustrative per-frame layout follows this list).
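
To make the per-frame shapes concrete, the following minimal Python sketch shows how one 30 Hz frame of these modalities might be laid out in memory. The key names follow LeRobot-style conventions but are assumptions for illustration; the authoritative field names and dtypes are defined by the dataset's own metadata.

```python
import numpy as np

# Illustrative per-frame layout, assuming one sample of each modality per
# 30 Hz frame. Key names and dtypes are assumptions for this sketch; the
# authoritative schema is the dataset's LeRobot metadata.
frame = {
    "observation.images.head":  np.zeros((480, 640, 3), dtype=np.uint8),  # head-mounted RGB
    "observation.images.wrist": np.zeros((480, 640, 3), dtype=np.uint8),  # wrist-mounted RGB
    "observation.state":        np.zeros(8, dtype=np.float32),            # joint states (arm, gripper, head)
    "observation.wrench":       np.zeros(6, dtype=np.float32),            # Fx, Fy, Fz, Mx, My, Mz
    "action":                   np.zeros(11, dtype=np.float32),           # teleop command (relative form)
}
```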

All data is rigorously defined in terms of state and action spaces:

  • Joint state vector:

$$s = \{ s^{(\text{hsr\_arm\_lift})},\, s^{(\text{hsr\_arm\_flex})},\, s^{(\text{hsr\_arm\_roll})},\, s^{(\text{hsr\_wrist\_flex})},\, s^{(\text{hsr\_wrist\_roll})},\, s^{(\text{hsr\_hand\_motor})},\, s^{(\text{hsr\_head\_pan})},\, s^{(\text{hsr\_head\_tilt})} \} \in \mathbb{R}^8$$

  • Absolute action vector:

$$a_\text{absolute} = \{ s^{(\text{teleop\_arm\_lift})},\, \ldots,\, s^{(\text{teleop\_head\_tilt})} \} \in \mathbb{R}^8$$

  • Actions are also partitioned into arm ($\mathbb{R}^5$), gripper ($\mathbb{R}^1$), head ($\mathbb{R}^2$), and base ($\mathbb{R}^3$) subspaces.
  • Relative actions are defined as:

$$a_\text{relative} = \{ a_\text{arm} - s^{(\text{hsr\_arm})},\, s^{(\text{teleop\_gripper})},\, a_\text{head} - s^{(\text{hsr\_head})},\, a_\text{base} \} \in \mathbb{R}^{11}$$

Synchrony is ensured: RGB and sensor data are resampled to a consistent 30 Hz, with frames marked as stale in case of sensor dropout.
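
The relative-action construction defined above can be written compactly in code. The following sketch (Python/NumPy) assembles the 11-dimensional relative action from an 8-dimensional joint state, an 8-dimensional absolute teleoperation command, and a 3-dimensional base command; the index layout and the pass-through treatment of the base command are assumptions for illustration.

```python
import numpy as np

# Index layout assumed to follow the state definition above:
# 5 arm joints, 1 gripper joint, 2 head joints.
ARM = slice(0, 5)      # hsr_arm_lift ... hsr_wrist_roll
GRIPPER = slice(5, 6)  # hsr_hand_motor / teleop_gripper
HEAD = slice(6, 8)     # hsr_head_pan, hsr_head_tilt

def relative_action(joint_state: np.ndarray,
                    teleop_absolute: np.ndarray,
                    base_command: np.ndarray) -> np.ndarray:
    """Build the 11-D relative action from an 8-D joint state,
    an 8-D absolute teleop command, and a 3-D base command."""
    assert joint_state.shape == (8,) and teleop_absolute.shape == (8,)
    assert base_command.shape == (3,)
    return np.concatenate([
        teleop_absolute[ARM] - joint_state[ARM],    # arm: delta from current pose
        teleop_absolute[GRIPPER],                   # gripper: absolute command
        teleop_absolute[HEAD] - joint_state[HEAD],  # head: delta from current pose
        base_command,                               # base: passed through
    ])  # shape (11,)
```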

2. Hierarchical Annotation Schema

The dataset introduces a two-layer hierarchical annotation schema:

  • Short Horizon Task (SHT): The upper layer captures the high-level, natural language instruction representing the user's goal (e.g., "Bake a Toast"). This serves as the semantic context for the episode.
  • Primitive Actions (PA): Each SHT is decomposed into atomic actions (e.g., "Open Oven," "Pick Bread"). Every PA is annotated with a success/failure marker, facilitating detailed error analysis at both high-level intent and execution layers.

This dual annotation enables hierarchical learning, in which models are trained to decompose and sequence natural-language instructions into primitives, and it supports granular error analysis and recovery.
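
A simple way to picture the two-layer schema is as a nested record: one Short Horizon Task per episode, decomposed into timestamped Primitive Actions, each carrying its own success/failure marker. The field names below are hypothetical and chosen for readability, not the dataset's actual annotation keys.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PrimitiveAction:
    label: str      # e.g., "Open Oven", "Pick Bread"
    start_s: float  # start time within the episode, in seconds
    end_s: float    # end time within the episode, in seconds
    success: bool   # per-primitive success/failure marker

@dataclass
class ShortHorizonTask:
    instruction: str                   # e.g., "Bake a Toast"
    success: bool                      # overall outcome of the episode
    primitives: List[PrimitiveAction]  # ordered decomposition of the task

episode_annotation = ShortHorizonTask(
    instruction="Bake a Toast",
    success=True,
    primitives=[
        PrimitiveAction("Open Oven", 0.0, 4.2, True),
        PrimitiveAction("Pick Bread", 4.2, 9.8, True),
    ],
)
```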

3. Technical Standards and Data Organization

The AIRoA MoMa Dataset is distributed in the LeRobot v2.1 format, a recognized standard in the robotics community, ensuring compatibility with contemporary imitation learning and foundation model toolchains.

Key standardization measures:

  • Data alignment and integrity: All modalities are time-synchronized; joint and force-torque data sampled at 100 Hz are downsampled to match the 30 Hz image streams, and sensor dropout is explicitly marked by staleness flags on frames (see the alignment sketch after this list).
  • Media storage: Visual data is initially captured as PNG images, with post-processing conversion to video where appropriate. Hierarchical annotations are timestamped to define episode boundaries precisely.
  • Action representations: Both absolute and relative action commands are provided, allowing researchers to select control strategies best aligned with their modeling approach.
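
The alignment and staleness flagging described above can be sketched as a nearest-timestamp match: each 30 Hz frame is paired with the closest 100 Hz sensor sample and flagged stale when no sample falls within a tolerance of its timestamp. The tolerance value and matching rule below are illustrative assumptions, not the dataset's exact preprocessing.

```python
import numpy as np

def align_to_frames(sensor_t: np.ndarray, sensor_v: np.ndarray,
                    frame_t: np.ndarray, tol_s: float = 0.02):
    """Return (values_per_frame, stale_mask) for the 30 Hz frame timestamps.

    sensor_t: (N,) sorted sensor timestamps (e.g., 100 Hz joint/force-torque)
    sensor_v: (N, D) sensor values
    frame_t:  (F,) 30 Hz image-frame timestamps
    """
    idx = np.searchsorted(sensor_t, frame_t)              # insertion points
    idx = np.clip(idx, 1, len(sensor_t) - 1)
    left, right = sensor_t[idx - 1], sensor_t[idx]
    use_left = (frame_t - left) <= (right - frame_t)      # pick the nearer neighbour
    nearest = np.where(use_left, idx - 1, idx)
    stale = np.abs(sensor_t[nearest] - frame_t) > tol_s   # dropout -> stale frame
    return sensor_v[nearest], stale
```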

4. Research Applications and Benchmarks

As a benchmark, the AIRoA MoMa Dataset is designed for evaluating and training VLA models encompassing:

  • Mobile manipulation: Combining whole-body navigation and dexterous manipulation in household-like environments.
  • Contact-rich tasks: Incorporating force-torque feedback for learning behaviors involving physical interaction.
  • Long-horizon, structured activities: Supporting the development and assessment of models that plan and execute extended, multi-step tasks.

Given the synchronization of sensory, proprioceptive, linguistic, and command streams, the dataset supports diverse research avenues:

  • Training models that translate language to action sequences in contact-rich contexts.
  • Hierarchical reinforcement learning and imitation learning with ground truth subgoal and failure annotations (see the behavior-cloning sketch after this list).
  • Cross-modal foundation model development (e.g., RT-1, π₀, OpenVLA), enabling rigorous benchmarking of new architectures.
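
As a concrete starting point for the imitation-learning use case, the following sketch regresses the 11-dimensional relative action from image, state, and language features with a plain behavior-cloning (MSE) objective. The network, feature dimensions, and training loop are illustrative assumptions, not a model from the dataset paper.

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """Toy policy head mapping concatenated features to an 11-D relative action."""
    def __init__(self, img_feat=512, state_dim=8, lang_feat=384, act_dim=11):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_feat + state_dim + lang_feat, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, img_feat, joint_state, lang_feat):
        return self.head(torch.cat([img_feat, joint_state, lang_feat], dim=-1))

policy = TinyPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def train_step(img_feat, joint_state, lang_feat, target_relative_action):
    """One behavior-cloning step against the teleoperation ground truth."""
    pred = policy(img_feat, joint_state, lang_feat)
    loss = nn.functional.mse_loss(pred, target_relative_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```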

Table: Hierarchical Structure of Annotations

Layer | Content                           | Success/Failure Markers
------|-----------------------------------|------------------------
SHT   | Natural-language task instruction | Yes
PA    | Sequence of primitive actions     | Yes (per primitive action)

5. Access and Usage

The dataset is publicly accessible at https://huggingface.co/datasets/airoa-org/airoa-moma. While licensing terms are not detailed in the foundational paper, standard usage terms are expected as per repository guidelines. Researchers should consult the repository for the latest licensing, citation, and usage details.
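
For programmatic access, one common pattern is to pull a local snapshot of the repository with the standard huggingface_hub client and then point a LeRobot-compatible loader at the downloaded directory. The sketch below assumes the repository is publicly downloadable; consult the repository card for any access requirements.

```python
from huggingface_hub import snapshot_download

# Download a local copy of the dataset repository from the Hugging Face Hub.
# The resulting directory follows the LeRobot v2.1 layout (tabular data,
# videos, and metadata), which LeRobot-compatible tooling can then read.
local_dir = snapshot_download(
    repo_id="airoa-org/airoa-moma",
    repo_type="dataset",
)
print(local_dir)
```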

6. Significance and Prospects

By integrating multimodal sensor data, direct teleoperation signals, and hierarchically structured annotations, the AIRoA MoMa Dataset constitutes a critical benchmark for error-resilient, contact-rich mobile manipulation research. Its meticulous formatting and comprehensive structure position it as a key resource for the empirical validation of emerging VLA and autonomous robotic control models focused on real-world unstructured environments.

This positions the dataset to expedite research into decomposing natural-language task instructions and to foster the development of robust, generalist agents capable of performing complex, multi-step household operations.
