RoboSet: Comprehensive Robotics Dataset
- RoboSet is a comprehensive dataset defined by multi-modal, synchronized, and precisely annotated sensor data across various robotics platforms.
- It integrates diverse sensor suites from aerial and ground robots to support rigorous evaluation protocols in perception, navigation, and manipulation tasks.
- The dataset’s rich annotations and benchmark tasks provide actionable insights for multi-agent learning, safety metrics, and real-world deployment challenges.
A comprehensive dataset, frequently identified as "RoboSet" in multiple major robotics benchmark efforts, refers to a large-scale, multi-modal, and precisely annotated resource enabling advanced research in robot perception, interaction, navigation, and manipulation. Such datasets are defined by dense sensor coverage, rich task diversity, synchronized and calibrated data streams, and support for rigorous evaluation protocols in settings reflecting real-world operational complexity. Recent initiatives—including "CoPeD" for multi-robot collaborative perception (Zhou et al., 2024), "RoboSense" for egocentric navigation (Su et al., 2024), and "RH20T" for multi-skill manipulation learning (Fang et al., 2023)—have set new standards for comprehensiveness along the axes of environment, tasks, sensors, and benchmarking.
1. Platform and Sensor Suite Diversity
Comprehensive datasets target generalizable robotic intelligence by capturing data with heterogeneous sensor suites across a wide array of physical platforms and environments.
- CoPeD (Zhou et al., 2024): Five robots (2 aerial quadrotors, 3 ground vehicles) acquire data in challenging indoor/outdoor scenes. Coverage includes stereo RGB-D, high-res mono/stereo RGB, IMU, barometer, GNSS, 3D LiDAR. Aerial payloads are SWaP-limited (mass ≈ 1.2 kg), altitudes 2–10 m; ground robots (∼100–200 kg) allow heavier arrays, e.g., Ouster OS1-128.
- RoboSense (Su et al., 2024): Mounted on a custom mobile robot ("robosweeper"), the suite comprises 4× pinhole RGB cameras (1920×1080 @ 25 Hz, ∼111.8°×63.2° FoV), 4× fisheye cameras (1280×720 @ 25 Hz, 180°×180° FoV), 4 side LiDARs, 1 top LiDAR (e.g., Hesai Pandar40M), RTK-grade GPS/IMU, and 11 ultrasonic sensors for collision safety.
- RH20T (Fang et al., 2023): Records 110,000+ contact-rich manipulation sequences across 7 robot configurations (UR5, Flexiv, Franka, KUKA arms), four grippers, various force/torque and tactile sensors, with dense RGB-D (static and in-hand), audio, and proprioceptive streams.
All platforms synchronize their sensors via NTP; synchronization granularity is generally ≤100 ms (RoboSense) or sub-millisecond (CoPeD, RH20T), supporting multi-modal fusion.
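The cross-sensor alignment this enables can be sketched as a nearest-neighbor timestamp match within a tolerance. This is a minimal illustration under assumed stream rates, not code from any of the datasets' APIs:

```python
from bisect import bisect_left

def nearest_within(timestamps, t, tol):
    """Index of the timestamp in a sorted, non-empty list closest to t,
    or None if even the nearest sample is farther away than tol seconds."""
    i = bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    return best if abs(timestamps[best] - t) <= tol else None

# Pair 10 Hz LiDAR stamps with 25 Hz camera stamps at a 100 ms tolerance.
cam = [k / 25.0 for k in range(100)]   # 25 Hz camera stream
lidar = [k / 10.0 for k in range(40)]  # 10 Hz LiDAR stream
pairs = [(t, cam[nearest_within(cam, t, 0.1)]) for t in lidar
         if nearest_within(cam, t, 0.1) is not None]
```

With these rates every LiDAR frame finds a camera frame within 20 ms, so all 40 scans are paired; a tighter tolerance would start dropping frames instead of fusing stale data.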
2. Dataset Scale, Structure, and Coverage
Comprehensiveness is reflected in both the scale and diversity along dimensions such as scenario, viewpoint, and agent/task heterogeneity.
| Dataset | Sequences | Keyframes | Tasks / Skills | Modalities |
|---|---|---|---|---|
| CoPeD | 6 (multi-robot) | ≈203,400 cam, 80,000 LiDAR | Multi-robot, indoor/outdoor, perception | RGB-D, LiDAR, GNSS, IMU |
| RoboSense | 7,619×20 s | 133,000+ (1 Hz) | Navigation, detection, 3D MOT, etc. | RGB, Fisheye, LiDAR, GPS |
| RH20T | 110,000+ | ≈220,000 (episodes) | 147 tasks / 42 skill categories | RGB-D, F/T, audio, joints |
- CoPeD: Scenarios span three indoor and three outdoor sites (total ≈2.3h), with each robot contributing synchronized ROS "bag" files per sequence (15–25 GB/seq).
- RoboSense: 42 h of robot navigation across 22 locations, distilled into 7,619 hand-picked segments (each 20 s), yielding 133k annotated frames with 1.4M 3D bounding boxes and 216k object trajectories.
- RH20T: Covers 147 manipulation tasks (from RLBench, MetaWorld, and newly designed tasks), with paired human demonstration videos and plain-text language descriptions, yielding ≈750 episodes per task.
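As a quick sanity check, the density claims quoted in this article reduce to simple arithmetic on the published counts:

```python
# RoboSense: 1.4M 3D boxes over 133k annotated keyframes (figures from the text).
boxes, keyframes = 1_400_000, 133_000
boxes_per_frame = boxes / keyframes        # roughly 10.5 boxes per frame

# Near-field (<=5 m) comparison quoted later in this article:
# 173k RoboSense boxes vs. ~638 in KITTI, i.e. a ~270x density gap.
near_field_ratio = 173_000 / 638
```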
3. Annotation Protocols and Data Structuring
Precision and richness of annotation are central to enabling reproducible, transferable research.
- CoPeD: Delivers raw sensor data in ROS bags, pose estimates in CSV logs (timestamp plus pose expressed in the global ENU frame), and optional high-level annotations:
- 2D instance masks (PNG), COCO-style bounding boxes (JSON), zero-shot depth via ZoeDepth (16-bit PNG / NumPy).
- All robots’ data are spatially indexed by continuous AprilTag pose cues for aerial-to-ground supervision.
- RoboSense: Annotations follow a three-stage process: (1) a pre-trained detector generates proposals; (2) experts label and refine them in 360° point clouds (augmenting near-field occluded objects); (3) boxes are pruned by per-sensor visibility. Each 3D box carries geometry (center, dimensions, heading) plus a unique trajectory ID. Occupancy voxels are labeled “occupied”/“free”/“unknown.”
- Privacy: faces, plates, signs masked in all raw and derived modalities.
- RH20T: For each robot episode, the data directory contains multi-camera RGB-D (1280×720, 10 Hz), depth, IR, F/T CSV (wrist, 100 Hz; fingertip tactile @200 Hz), audio WAV (16 kHz), synchronized joint states, end-effector pose, metadata JSON (description, calibration, quality score), and the matching human demonstration. Language and video alignments are via unique IDs and synchronized markers.
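A loader for one such episode directory might look like the sketch below. The file names and column layout here are illustrative assumptions for exposition, not RH20T's actual schema:

```python
import csv, json, tempfile
from pathlib import Path

def load_episode(root):
    """Read metadata and the wrist F/T log of one episode directory.

    Assumes an illustrative layout: metadata.json plus force_torque.csv
    with columns timestamp,fx,fy,fz,tx,ty,tz (hypothetical names).
    """
    root = Path(root)
    meta = json.loads((root / "metadata.json").read_text())
    with open(root / "force_torque.csv", newline="") as f:
        rows = [(float(r["timestamp"]),
                 [float(r[k]) for k in ("fx", "fy", "fz", "tx", "ty", "tz")])
                for r in csv.DictReader(f)]
    return meta, rows

# Build a tiny mock episode to exercise the loader.
demo = Path(tempfile.mkdtemp())
(demo / "metadata.json").write_text(json.dumps({"task": "pick", "quality": 0.9}))
(demo / "force_torque.csv").write_text(
    "timestamp,fx,fy,fz,tx,ty,tz\n0.00,1,0,0,0,0,0.5\n")
meta, wrenches = load_episode(demo)
```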
Calibration is central: all sensors are extrinsically and intrinsically calibrated. A point in the camera frame maps to the base frame via the rigid transform p_base = R·p_cam + t, with (R, t) the camera-to-base extrinsics. Wrist F/T readings are transformed between frames with the adjoint operator, F_a = Ad(T_ba)ᵀ·F_b, where the wrench F stacks moment and force components.
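Both transforms can be written out in a few lines of plain Python. This is a generic rigid-body sketch under standard conventions, not code shipped with any of the datasets:

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]]

def transform_point(R, t, p):
    """Map a point between frames: p' = R p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def transform_wrench(R, t, force, torque):
    """Re-express a wrench (force, torque) measured in frame A in frame B,
    where (R, t) is the pose of A in B. Equivalent to the adjoint-transpose
    action: f' = R f, tau' = R tau + t x (R f)."""
    zero = [0.0, 0.0, 0.0]
    f = transform_point(R, zero, force)     # rotate only, no translation
    tau = transform_point(R, zero, torque)
    return f, [tau[i] + c for i, c in enumerate(cross(t, f))]

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A 1 N force along x, sensed 1 m away along y, induces a -1 N·m torque about z.
f_b, tau_b = transform_wrench(I3, [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

The torque term t × (R f) is the lever-arm contribution: moving the reference point shifts the moment even though the force itself only rotates.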
4. Benchmark Tasks, Metrics, and Evaluation Protocols
Benchmarks define task objectives and ensure comparability across algorithms, with annotated splits for pretraining, validation, and held-out evaluation.
- CoPeD (Zhou et al., 2024):
- Tasks: (a) Multi-View Monocular Depth Estimation, (b) Multi-Agent Semantic Segmentation.
- Baseline: Isolated single-robot models vs. collaborative GNNs exchanging compressed feature tokens.
- Metrics:
- Depth: RMSE (m), AbsRel error; collaborative: RMSE 0.35 m, AbsRel 0.15 (25% RMSE reduction over the single-robot baseline).
- Segmentation: mean IoU (mIoU); collaborative 0.75 vs. baseline 0.62 (21% improvement).
- RoboSense (Su et al., 2024):
- Tasks: (1) Multi-view camera 3D detection, (2) LiDAR-only 3D detection, (3) Multi-modal fusion, (4) 3D MOT, (5) Motion prediction, (6) Occupancy prediction.
- Metrics:
- Detection: mAP, AOS, ASE.
- Tracking: sAMOTA, AMOTP, ID switches; all use the new Closest-Collision-Point (CCP) matching criterion, which matches predictions to ground truth by the distance between closest collision points rather than between box centers.
- Forecasting: minADE, minFDE, Miss Rate, End-to-end Prediction Accuracy.
- Occupancy: Per-class IoU in 3D and BEV, within fixed ranges.
- RH20T (Fang et al., 2023):
- Tasks: One-shot/few-shot manipulation imitation and transfer, tested on grasp-and-place and multi-stage episodic tasks.
- Metrics: Success Rate (stage-based) and Imitation Loss (MSE over actions), L = (1/T) Σₜ ‖âₜ − aₜ‖², where âₜ is the predicted and aₜ the demonstrated action at step t.
Pretraining on RH20T improves few-shot transfer by 10–20% absolute, and generalization in novel contexts by >15%.
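The forecasting metrics listed above have compact definitions: over K predicted trajectories, minADE takes the candidate with the smallest average displacement from ground truth, and minFDE the one with the smallest final-step displacement. A minimal sketch:

```python
def displacement(traj, gt):
    """Per-step Euclidean distances between a 2-D trajectory and ground truth."""
    return [((x - gx) ** 2 + (y - gy) ** 2) ** 0.5
            for (x, y), (gx, gy) in zip(traj, gt)]

def min_ade(candidates, gt):
    """Minimum-over-K average displacement error."""
    return min(sum(displacement(c, gt)) / len(gt) for c in candidates)

def min_fde(candidates, gt):
    """Minimum-over-K final displacement error."""
    return min(displacement(c, gt)[-1] for c in candidates)

gt = [(0, 0), (1, 0), (2, 0)]
candidates = [[(0, 1), (1, 1), (2, 1)],   # offset by 1 m at every step
              [(0, 0), (1, 0), (2, 2)]]   # exact until the final step
```

Note that the two metrics can prefer different candidates: here the second trajectory wins minADE (average error 2/3 m) while the first wins minFDE (final error 1 m vs. 2 m), which is why both are reported.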
5. Synchronization, Fusion, and Transformations
Achieving spatiotemporal alignment and enabling multi-modal and multi-agent fusion differentiate comprehensive datasets from legacy alternatives.
- CoPeD: NTP-synced clocks (Masterclock GMR1000) hold drift under 3 s/year; all ROS topics are timestamped with sub-ms precision. Sensor streams at distinct rates are fused by nearest-neighbor matching or linear interpolation, x(t) = x(t₁) + ((t − t₁)/(t₂ − t₁))·(x(t₂) − x(t₁)) for neighboring samples t₁ ≤ t ≤ t₂. Fusion across robots occurs only if the timestamp gap |tᵢ − tⱼ| stays within a fixed bound δ, enforcing real-time constraints and emulating intermittent, asynchronous networking.
- RoboSense: All sensors NTP-synced; global timestamps sampled every 100 ms, with per-device alignment yielding coherent 10 Hz streams. Multiple coordinate frames (Ego-vehicle, Camera, LiDAR, Pixel, Global) allow for sensor fusion and map registration.
- RH20T: Centralized timestamping on a workstation aligns visual, audio, force, and proprioceptive data; calibration is validated prior to each acquisition session.
Coordinate transformations use standard rigid-body SE(3) conventions. For CoPeD, a body-frame point p_body maps to the global ENU frame as p_ENU = R·p_body + t, with (R, t) read from each robot's pose log.
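The interpolate-then-gate scheme described in this section can be sketched as follows; the bound δ and stream contents are illustrative choices, not values from the papers:

```python
from bisect import bisect_left

def interpolate(stamps, values, t):
    """Linearly interpolate a scalar stream (sorted stamps, values) at time t,
    clamping to the endpoints outside the recorded range."""
    i = bisect_left(stamps, t)
    if i == 0:
        return values[0]
    if i == len(stamps):
        return values[-1]
    t0, t1 = stamps[i - 1], stamps[i]
    w = (t - t0) / (t1 - t0)
    return values[i - 1] * (1 - w) + values[i] * w

def fuse_allowed(t_i, t_j, delta=0.1):
    """Gate cross-robot fusion: only fuse if |t_i - t_j| <= delta seconds."""
    return abs(t_i - t_j) <= delta
```

In practice the gate runs first: if two robots' observations are too far apart in time, interpolating between them would blend stale state, so the pair is simply skipped.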
6. Comparative Landscape and Research Use Cases
Comprehensive datasets unlock advanced research by improving data density, ecological validity, and evaluation transparency relative to previous resources.
Coverage and Density:
- RoboSense’s near-field (≤5 m) density: 173k annotated boxes vs. KITTI (~638), nuScenes (~9.8k), yielding up to 270× more data for close-proximity navigation and interaction.
- CoPeD features multi-agent, multi-modal, spatially overlapping robot perspectives unavailable in legacy SLAM-focused datasets.
Annotated Tasks and Protocols:
- RoboSense defines six standard tasks with curated splits, public online leaderboard, and privacy-preserving protocols—no faces, plates, signs in released data.
- RH20T forms the most diverse real-world resource for foundation-model and one-shot imitation learning, spanning force, vision, tactile, and natural language aligned to demonstration.
Applications:
- Enables social robotics, last-meter delivery, adaptive manipulation, robust multi-agent perception, safety-critical navigation, and foundation model pretraining.
- Facilitates research in occlusion handling, fusion under asynchronicity, and robust perception under real-world sensor noise and failure.
A plausible implication is that the new matching and occupancy protocols (e.g., RoboSense’s distance-proportional criteria) will recalibrate benchmark goals toward safety and social-awareness in robotics, especially in dense, occluded, and dynamic human-robot environments.
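A distance-proportional criterion of this kind can be sketched schematically: the matching threshold grows with range, so near-field objects must be localized tightly while distant ones are judged more leniently. The constants and function names below are illustrative assumptions, not RoboSense's published formula:

```python
def match_threshold(dist_to_ego, base=0.5, rate=0.05):
    """Schematic distance-proportional matching threshold (meters):
    base tolerance plus a term that grows linearly with range."""
    return base + rate * dist_to_ego

def is_match(pred_error, dist_to_ego):
    """A prediction counts as a true positive only if its localization
    error stays under the range-dependent threshold."""
    return pred_error <= match_threshold(dist_to_ego)
```

Under these example constants, a 0.6 m localization error passes at 5 m range (threshold 0.75 m) but fails at 1 m range (threshold 0.55 m), capturing the safety intuition that errors near the robot matter most.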
7. Limitations and Future Scope
While "comprehensive" datasets markedly increase the realism and scope of robotic learning and evaluation, current limitations include restricted action classes or environment types, annotation framerate (e.g., 1 Hz for RoboSense), and the absence of certain capabilities (e.g., dual-arm or dexterous hand data in RH20T; no HD map in RoboSense).
Established usage protocols recommend dataset pretraining followed by task-specific few-shot or zero-shot evaluation on held-out splits. Access is typically governed by standard open research licenses (e.g., CC-BY-NC-SA), with most resources accompanied by dedicated APIs and documentation for data loading, preprocessing, and calibration.
In summary, comprehensive datasets such as CoPeD, RoboSense, and RH20T provide foundational infrastructure for next-generation research in distributed, multi-modal, collaborative, and socially aware robotics, forming the empirical backbone for advances in perception, planning, policy learning, and safety-driven evaluation (Zhou et al., 2024, Su et al., 2024, Fang et al., 2023).