AIRoA MoMa: Mobile Manipulation Dataset
- AIRoA MoMa Dataset is a large-scale multimodal dataset designed for mobile manipulation research, featuring synchronized sensor streams and hierarchical annotations.
- It comprises 25,469 episodes over 94 hours at 30 Hz with dual-view RGB images, proprioceptive data, force-torque signals, and teleoperation commands.
- The dataset enables benchmarking of complex vision-language-action tasks and supports the development of error-resilient, contact-rich, long-horizon autonomous control models.
The AIRoA MoMa Dataset is a large-scale, real-world, multimodal dataset specifically designed for research in mobile manipulation. Collected using the Toyota Human Support Robot (HSR), it enables the investigation and advancement of robust Vision-Language-Action (VLA) models by providing synchronized sensor data streams, hierarchical annotation of complex tasks, and unique resources for benchmarking contact-rich, long-horizon manipulation in unconstrained human environments.
1. Dataset Structure and Modalities
The dataset comprises 25,469 episodes, totaling approximately 94 hours, with all data sampled at 30 Hz. Each episode records multimodal streams that capture the intricacies of mobile manipulation tasks:
- Visual Data: Dual-view RGB images at a resolution of 480×640×3 pixels. One camera is head-mounted (global scene coverage), the other wrist-mounted (local manipulation context).
- Proprioceptive Data: Includes the arm, gripper, lifting torso, and head joint states of the HSR—capturing both joint angles and velocities, essential for precise motion control.
- Force-Torque Sensing: Six-axis wrist signals (Fx, Fy, Fz, Mx, My, Mz) crucial for contact-rich interaction and haptic reasoning.
- Teleoperation Signals: Raw operator commands provide ground truth action trajectories and facilitate work on action representation learning.
All data are rigorously defined in terms of explicit state and action spaces:
- Joint state vector: $s_t$, concatenating the arm, gripper, torso, and head joint positions and velocities.
- Absolute action vector: $a_t$, the commanded target configuration at each timestep.
- Actions are further partitioned into arm ($a_t^{\text{arm}}$), gripper ($a_t^{\text{grip}}$), head ($a_t^{\text{head}}$), and base ($a_t^{\text{base}}$) subspaces.
- Relative actions are provided as delta commands $\Delta a_t$ derived from the absolute targets, so that both absolute and offset-based control representations are available.
Temporal synchronization is enforced: RGB and sensor streams are resampled to a consistent 30 Hz, and frames are flagged as stale in the event of sensor dropout.
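For concreteness, a single 30 Hz frame can be pictured as the record below; the field names are illustrative placeholders rather than the dataset's exact LeRobot keys.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MoMaFrame:
    """Illustrative per-frame record for one 30 Hz step (field names are hypothetical)."""
    head_rgb: np.ndarray     # (480, 640, 3) uint8, head-mounted camera (global scene)
    wrist_rgb: np.ndarray    # (480, 640, 3) uint8, wrist-mounted camera (local context)
    joint_state: np.ndarray  # s_t: arm, gripper, torso, and head positions and velocities
    wrench: np.ndarray       # (6,) wrist force-torque: Fx, Fy, Fz, Mx, My, Mz
    action_abs: np.ndarray   # a_t: absolute command over arm, gripper, head, base subspaces
    action_rel: np.ndarray   # delta command complementing the absolute target
    is_stale: bool           # True if a sensor dropped out and the last value was carried over
```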
2. Hierarchical Annotation Schema
The dataset introduces a two-layer hierarchical annotation schema:
- Short Horizon Task (SHT): The upper layer captures the high-level, natural language instruction representing the user's goal (e.g., "Bake a Toast"). This serves as the semantic context for the episode.
- Primitive Actions (PA): Each SHT is decomposed into atomic actions (e.g., "Open Oven," "Pick Bread"). Every PA is annotated with a success/failure marker, facilitating detailed error analysis at both high-level intent and execution layers.
This dual-annotation enables hierarchical learning—models can be trained to decompose and sequence natural language instructions into primitives—and supports granular analysis and error recovery.
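A minimal sketch of this two-layer schema, with assumed (not official) field names, might look as follows:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrimitiveAction:
    """Atomic step within a Short Horizon Task (illustrative fields)."""
    label: str         # e.g. "Open Oven", "Pick Bread"
    start_frame: int   # index of the first 30 Hz frame covered by this primitive
    end_frame: int     # index of the last frame
    success: bool      # per-primitive success/failure marker

@dataclass
class ShortHorizonTask:
    """Upper-layer annotation: the natural-language goal and its decomposition."""
    instruction: str                                        # e.g. "Bake a Toast"
    primitives: List[PrimitiveAction] = field(default_factory=list)

    def failed_primitives(self) -> List[PrimitiveAction]:
        # Convenience accessor for error analysis at the execution layer.
        return [p for p in self.primitives if not p.success]
```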
3. Technical Standards and Data Organization
The AIRoA MoMa Dataset is distributed in the LeRobot v2.1 format, which is widely adopted in the robot-learning community and ensures compatibility with contemporary imitation learning and foundation model toolchains.
Key standardization measures:
- Data alignment and integrity: All modalities are time-synchronized; joint and force-torque data sampled at 100 Hz are downsampled to match the 30 Hz image streams. Sensor dropout is explicitly annotated by staleness flags on frames.
- Media storage: Visual data is initially captured as PNG images, with post-processing conversion to video where appropriate. Hierarchical annotations are timestamped to define episode boundaries precisely.
- Action representations: Both absolute and relative action commands are provided, allowing researchers to select control strategies best aligned with their modeling approach.
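The alignment strategy described above can be illustrated with a nearest-timestamp resampler. This is a sketch of the general technique under an assumed staleness threshold, not the dataset's actual conversion pipeline.

```python
import numpy as np

def resample_to_30hz(src_t, src_vals, frame_t, max_age=0.05):
    """Resample a higher-rate sensor stream (e.g. 100 Hz joints or force-torque)
    onto 30 Hz image timestamps, flagging frames whose nearest sample is too old.

    src_t   : (N,) sorted source timestamps in seconds
    src_vals: (N, D) source samples
    frame_t : (M,) 30 Hz frame timestamps
    max_age : staleness threshold in seconds (assumed value)
    """
    # For each frame, take the most recent source sample at or before it.
    idx = np.searchsorted(src_t, frame_t, side="right") - 1
    idx = np.clip(idx, 0, len(src_t) - 1)
    resampled = src_vals[idx]
    # Mark frames stale when the held sample is older than max_age (sensor dropout).
    stale = (frame_t - src_t[idx]) > max_age
    return resampled, stale
```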
4. Research Applications and Benchmarks
As a benchmark, the AIRoA MoMa Dataset is designed for training and evaluating VLA models on:
- Mobile manipulation: Combining whole-body navigation and dexterous manipulation in household-like environments.
- Contact-rich tasks: Incorporating force-torque feedback for learning behaviors involving physical interaction.
- Long-horizon, structured activities: Supporting the development and assessment of models that plan and execute extended, multi-step tasks.
Given the synchronization of sensory, proprioceptive, linguistic, and command streams, the dataset supports diverse research avenues:
- Training models that translate language to action sequences in contact-rich contexts.
- Hierarchical reinforcement learning and imitation learning with ground truth subgoal and failure annotations.
- Cross-modal foundation model development (e.g., RT-1, π₀, OpenVLA), enabling rigorous benchmarking of new architectures.
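As one concrete pattern, the hierarchical annotations can be flattened into (instruction, action-chunk, success) supervision triples for language-conditioned imitation learning. The sketch below reuses the illustrative PrimitiveAction/ShortHorizonTask classes from Section 2 and is not a prescribed training recipe.

```python
def make_training_pairs(episode_actions, sht):
    """Yield (primitive label, action chunk, success flag) triples for one episode.

    episode_actions : (T, D) array of absolute actions for the episode
    sht             : ShortHorizonTask with its PrimitiveAction decomposition
    """
    for pa in sht.primitives:
        chunk = episode_actions[pa.start_frame : pa.end_frame + 1]
        # Low-level policies can condition on the primitive label, while a
        # high-level planner maps sht.instruction to the label sequence.
        yield (pa.label, chunk, pa.success)
```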
Table: Hierarchical Structure of Annotations
| Layer | Content | Success/Failure Markers |
|---|---|---|
| SHT | Natural Language Task Instruction | Yes |
| PA | Sequence of Primitive Actions | Yes |
5. Access and Usage
The dataset is publicly accessible at https://huggingface.co/datasets/airoa-org/airoa-moma. While licensing terms are not detailed in the foundational paper, standard usage terms are expected as per repository guidelines. Researchers should consult the repository for the latest licensing, citation, and usage details.
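One straightforward way to obtain the raw files is with the Hugging Face Hub client; the local directory layout then follows the LeRobot v2.1 convention described in Section 3.

```python
from huggingface_hub import snapshot_download

# Download the full dataset repository (episodes, videos, metadata) to a local cache.
local_path = snapshot_download(
    repo_id="airoa-org/airoa-moma",
    repo_type="dataset",
)
print(f"Dataset files available under: {local_path}")
```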
6. Significance and Prospects
By integrating multimodal sensor data, direct teleoperation signals, and hierarchically structured annotations, the AIRoA MoMa Dataset constitutes a critical benchmark for error-resilient, contact-rich mobile manipulation research. Its meticulous formatting and comprehensive structure position it as a key resource for the empirical validation of emerging VLA and autonomous robotic control models focused on real-world unstructured environments.
This positions the dataset to expedite research into the decomposition of natural-language task instructions into executable primitives and to foster the development of robust, generalist agents capable of performing complex, sequentially organized household operations.